Compare commits

8 Commits

Author SHA1 Message Date
Ettore Di Giacinto
6746a6fc7e feat(backends): make PreferDevelopmentBackends install the development image as primary
When LOCALAI_PREFER_DEV_BACKENDS is set, install the -development image as the
primary backend URI (keeping the released image reachable as the first
fallback), instead of only reaching development as a download fallback when the
released image is missing. This lets an operator force backends built from the
development branch — e.g. to pick up a fix already on master before a release.

Threads PreferDevelopmentBackends through SystemState so InstallBackend can see
it, and reuses the same development-URI convention as the existing failure-path
fallback (released tag -> branch tag + dev suffix). The unexported developmentURI
helper is covered by a Ginkgo spec.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 23:37:19 +00:00
LocalAI [bot]
d388f874de feat(backends): darwin/Metal build for the privacy-filter backend (#10513)
* feat(backends): darwin/Metal build for the privacy-filter backend (timeboxed try)

The privacy-filter.cpp engine is already Metal-capable on Apple Silicon: it pulls
ggml and never forces GGML_METAL=OFF, and ggml defaults Metal ON on Apple, so a
plain Darwin build is Metal-enabled. grpc++/protobuf resolve from Homebrew via
find_package(... CONFIG). It just had no darwin build path - the existing
package.sh and run.sh are Linux-only and there was no make target / workflow step.

Adds the bespoke darwin path, modeled on the ds4 one:
- scripts/build/privacy-filter-darwin.sh: native make grpc-server, otool -L dylib
  bundling, create-oci-image (no Linux package.sh).
- Makefile: backends/privacy-filter-darwin target (+ .NOTPARALLEL).
- .github/workflows/backend_build_darwin.yml: gated build step for privacy-filter.
- scripts/changed-backends.js: inferBackendPathDarwin special-case -> backend/cpp.
- .github/backend-matrix.yml: includeDarwin entry (lang go, like ds4/llama-cpp).
- backend/index.yaml: metal: capability + metal-privacy-filter(-development) entries.
- backend/cpp/privacy-filter/run.sh: DYLD_LIBRARY_PATH branch on Darwin.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(privacy-filter): macOS proto include + bundle ggml dylibs

Validated natively on an M4 (the build/package/load chain now works with Metal):

- CMakeLists.txt: hw_grpc_proto compiles the generated proto/grpc sources but
  only linked the binary dir, so on macOS it could not find protobuf's headers
  (runtime_version.h) - Homebrew puts them under /opt/homebrew, not /usr/include.
  Link protobuf::libprotobuf + gRPC::grpc++ so their include dirs propagate. No-op
  on Linux (apt headers are already on the default search path).
- privacy-filter-darwin.sh: bundle the ggml shared libs the binary @rpath-links
  (libggml{,-base,-cpu,-blas,-metal}); the otool -L walk only catches on-disk
  absolute deps and missed them. Resolved at runtime by run.sh's DYLD_LIBRARY_PATH.

M4 check: arm64 grpc-server links @rpath/libggml-metal.0.dylib; with the 15 ggml
dylibs + grpc/protobuf bundled, it loads clean (no dyld errors) and prints usage.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 01:18:41 +02:00
LocalAI [bot]
86677495a2 chore: ⬆️ Update ggml-org/llama.cpp to 9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1 (#10514)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-26 01:15:40 +02:00
LocalAI [bot]
253aedff06 chore: ⬆️ Update CrispStrobe/CrispASR to 8f1218141b792b8868861c1af17ba1e361b05dc0 (#10502)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-26 01:08:09 +02:00
LocalAI [bot]
74f07ecc35 fix(backends): quote $CURDIR in run.sh (fixes backends in paths with spaces) (#10519)
fix(backends): quote $CURDIR in run.sh so backends work in paths with spaces

The backend launcher scripts derive their own directory with
CURDIR=$(dirname "$(realpath $0)") and then referenced it unquoted as
$CURDIR (e.g. [ -f $CURDIR/lib/ld.so ], export LD_LIBRARY_PATH=$CURDIR/lib:...,
exec $CURDIR/<binary> "$@"). When a backend is installed under a path that
contains a space - notably macOS's ~/Library/Application Support/... - bash
word-splits the unquoted $CURDIR, so the test builtin fails with
"binary operator expected" and exec tries to run ".../Library/Application",
yielding "No such file or directory". The backend never starts, surfacing as
a gRPC "service not ready" error and an HTTP 500. Quote $CURDIR (and the
realpath "$0") in every affected run.sh; no logic changes.
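
The failure mode can be reproduced in isolation. The following is a minimal sketch, assuming a made-up install directory with a space in its name under a temporary directory (not LocalAI's real layout):

```bash
#!/usr/bin/env bash
# Reproduce the word-splitting failure outside of any backend.
# Assumption: the directory and file names below are illustrative only.
tmp=$(mktemp -d)
CURDIR="$tmp/Application Support/backend"
mkdir -p "$CURDIR/lib"
touch "$CURDIR/lib/ld.so"

# Unquoted: $CURDIR splits into two words, so `[` receives too many
# arguments, reports an error, and the test fails.
if [ -f $CURDIR/lib/ld.so ] 2>/dev/null; then
    unquoted_result="found"
else
    unquoted_result="not found"
fi

# Quoted: the path stays a single word and the file is found.
if [ -f "$CURDIR"/lib/ld.so ]; then
    quoted_result="found"
else
    quoted_result="not found"
fi

echo "unquoted: $unquoted_result"
echo "quoted:   $quoted_result"
rm -rf "$tmp"
```

The same splitting applies to `exec $CURDIR/<binary>`, which is why the fix quotes every use rather than only the tests.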

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 01:02:48 +02:00
LocalAI [bot]
ae0da454a7 chore: pin localrecall to tagged v0.6.3 (#10518)
#10517 pinned the pseudo-version of the postgres connection-timeout fix;
mudler/LocalRecall@v0.6.3 now tags that exact commit. Use the clean release
tag instead of the pseudo-version. No code change.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 01:02:15 +02:00
LocalAI [bot]
179210b970 chore: bump localrecall for postgres per-connection timeouts (#10517)
* chore: bump localrecall for postgres per-connection timeouts

Pulls mudler/LocalRecall#49: sets lock_timeout / idle_in_transaction
(default on) + opt-in statement_timeout on every pooled connection, so a
corrupt/wedged index (e.g. a BM25 insert spinning on a buffer-content lock)
can no longer hold its relation lock forever and head-of-line block the
whole vector store.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(agents): document PostgreSQL connection safety timeouts

Note the POSTGRES_LOCK_TIMEOUT / POSTGRES_IDLE_IN_TRANSACTION_TIMEOUT /
POSTGRES_STATEMENT_TIMEOUT env vars read by the embedded vector store, and
that safe defaults are on automatically.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 00:53:03 +02:00
LocalAI [bot]
6c03e46390 chore: ⬆️ Update ikawrakow/ik_llama.cpp to b84902d2ad27c34f989f23947200c4b91b1568fd (#10515)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-25 23:42:21 +02:00
150 changed files with 412 additions and 19972 deletions

View File

@@ -4963,6 +4963,12 @@ includeDarwin:
 tag-suffix: "-metal-darwin-arm64-sam3-cpp"
 build-type: "metal"
 lang: "go"
+# privacy-filter (PII/NER) is a C++/ggml backend built by a bespoke darwin
+# script (make backends/privacy-filter-darwin); ggml defaults Metal ON on Apple
+# so the build is Metal-enabled. lang=go drives runner/toolchain selection only.
+- backend: "privacy-filter"
+tag-suffix: "-metal-darwin-arm64-privacy-filter"
+lang: "go"
 # LocalVQE has no Metal path; on Apple Silicon it builds CPU-only (GGML_METAL
 # OFF) but is still a native arm64 image. Uses the darwin/metal build profile.
 - backend: "localvqe"

View File

@@ -228,8 +228,17 @@ jobs:
 run: |
 make backends/ds4-darwin
+# privacy-filter is a C++/ggml backend like ds4 - a single grpc-server with
+# otool dylib bundling - so it gets its own bespoke darwin script rather than
+# the generic build-darwin-go-backend path.
+- name: Build privacy-filter backend (Darwin Metal)
+if: inputs.backend == 'privacy-filter'
+run: |
+make protogen-go
+make backends/privacy-filter-darwin
 - name: Build ${{ inputs.backend }}-darwin
-if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
+if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4' && inputs.backend != 'privacy-filter'
 run: |
 make protogen-go
 BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend

View File

@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin
 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -1129,6 +1129,10 @@ backends/ds4-darwin: build
 bash ./scripts/build/ds4-darwin.sh
 ./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"
+backends/privacy-filter-darwin: build
+bash ./scripts/build/privacy-filter-darwin.sh
+./local-ai backends install "ocifile://$(abspath ./backend-images/privacy-filter.tar)"
 build-darwin-python-backend: build
 bash ./scripts/build/python-darwin.sh

View File

@@ -1,5 +1,5 @@
-IK_LLAMA_VERSION?=d5507e33ae7ee2b7b41475f08044d3bde3b839ee
+IK_LLAMA_VERSION?=b84902d2ad27c34f989f23947200c4b91b1568fd
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
 CMAKE_ARGS?=

View File

@@ -2,7 +2,7 @@
 set -ex
 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath $0)")
+CURDIR=$(dirname "$(realpath "$0")")
 cd /
@@ -13,28 +13,28 @@ grep -e "flags" /proc/cpuinfo | head -1
 # ik_llama.cpp requires AVX2 — default to avx2 binary
 BINARY=ik-llama-cpp-avx2
-if [ -e $CURDIR/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
+if [ -e "$CURDIR"/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
 echo "CPU: AVX2 NOT found, using fallback"
 BINARY=ik-llama-cpp-fallback
 fi
 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
-#export DYLD_FALLBACK_LIBRARY_PATH=$CURDIR/lib:$DYLD_FALLBACK_LIBRARY_PATH
+export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
+#export DYLD_FALLBACK_LIBRARY_PATH="$CURDIR"/lib:$DYLD_FALLBACK_LIBRARY_PATH
 else
-export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
+export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
 fi
 # If there is a lib/ld.so, use it
-if [ -f $CURDIR/lib/ld.so ]; then
+if [ -f "$CURDIR"/lib/ld.so ]; then
 echo "Using lib/ld.so"
 echo "Using binary: $BINARY"
-exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
+exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
 fi
 echo "Using binary: $BINARY"
-exec $CURDIR/$BINARY "$@"
+exec "$CURDIR"/$BINARY "$@"
 # We should never reach this point, however just in case we do, run fallback
-exec $CURDIR/ik-llama-cpp-fallback "$@"
+exec "$CURDIR"/ik-llama-cpp-fallback "$@"

View File

@@ -1,14 +1,6 @@
-LLAMA_VERSION?=8be759e6f70d629638a7eb70db3824cbdcea370b
+LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
-# LLAMA_PAGED controls whether the vendored paged-attention patch series
-# (patches/paged/) is applied on top of the pinned llama.cpp. Default on; set
-# LLAMA_PAGED=off to build a clean-against-upstream backend (e.g. to unblock a
-# dep-bump if an upstream change breaks a paged hook - the paged carry is then
-# fixed independently). Runtime behaviour stays gated by the LLAMA_KV_PAGED env
-# regardless, so an LLAMA_PAGED=on build is byte-identical to stock until that
-# env is set.
-LLAMA_PAGED?=on
 CMAKE_ARGS?=
 BUILD_TYPE?=
@@ -177,28 +169,14 @@ llama.cpp:
 git remote add origin $(LLAMA_REPO) && \
 git fetch --all --tags && \
 git checkout -b build $(LLAMA_VERSION) && \
-git submodule update --init --recursive --depth 1 --single-branch && \
-for p in $(CURRENT_MAKEFILE_DIR)patches/0*.patch; do \
-[ -e "$$p" ] || continue; \
-echo "applying llama.cpp patch: $$p"; \
-git apply --verbose "$$p" || { echo "patch failed: $$p"; exit 1; }; \
-done && \
-if [ "$(LLAMA_PAGED)" = "off" ]; then \
-echo "LLAMA_PAGED=off: skipping paged-attention patch series"; \
-else \
-for p in $(CURRENT_MAKEFILE_DIR)patches/paged/0*.patch; do \
-[ -e "$$p" ] || continue; \
-echo "applying llama.cpp PAGED patch: $$p"; \
-git apply --verbose "$$p" || { echo "paged patch failed: $$p"; exit 1; }; \
-done; \
-fi
+git submodule update --init --recursive --depth 1 --single-branch
 llama.cpp/tools/grpc-server: llama.cpp
 mkdir -p llama.cpp/tools/grpc-server
-LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
+bash prepare.sh
 rebuild:
-LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
+bash prepare.sh
 rm -rf grpc-server
 $(MAKE) grpc-server

View File

@@ -749,97 +749,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.kv_unified = false;
}
// --- paged KV cache (experimental, off by default) ---
// Enables the on-demand paged KV-cache engine (vendored PagedKVManager
// + paged placement/gather/alloc seams). The engine is gated inside
// llama.cpp by the LLAMA_KV_PAGED env var, evaluated once at first use;
// here we expose it as a per-server model option instead of forcing the
// operator to export a process-wide env. When enabled we set the env
// BEFORE the model/context is created (later in this handler), so the
// engine latches on. When the option is absent we touch nothing, so an
// externally exported LLAMA_KV_PAGED still works as an escape hatch.
// Note: the engine's env check is process-wide and latches on first
// use, so enabling it for one model enables it for the worker process;
// LocalAI runs one model per llama.cpp worker, so this maps cleanly to
// per-server configuration. `kv_paged_debug` turns on the per-slot
// [paged-alloc]/free trace (LLAMA_KV_PAGED_DEBUG).
//
// The continuous-batching serving loop (update_slots) drives paged KV
// transparently through the existing kv-cache seams: each slot's
// sequence allocates paged blocks on arrival (find_slot placement) and
// returns them on slot release (the seq_rm free seam). This is
// token-identical to stock under both the unified and per-sequence
// caches. The per-slot allocate/free capacity benefit, however, only
// materialises with a per-sequence cache, since paged block ownership
// is keyed by stream and the unified cache collapses every slot onto a
// single stream. Operators who want that benefit should pair this with
// `kv_unified:false`; we do NOT flip kv_unified here, to keep the
// default serving behaviour (and the idle-slot prompt cache) unchanged.
} else if (!strcmp(optname, "kv_paged") || !strcmp(optname, "paged_kv") || !strcmp(optname, "paged_attention")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
setenv("LLAMA_KV_PAGED", "1", 1);
}
} else if (!strcmp(optname, "kv_paged_debug") || !strcmp(optname, "paged_kv_debug")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
setenv("LLAMA_KV_PAGED_DEBUG", "1", 1);
}
// --- chunked-prefill QoS budget (experimental, off by default) ---
// Caps the number of prompt tokens any single slot may prefill per
// update_slots iteration, so a large prompt cannot monopolise the batch
// and freeze the in-flight decoders. The serving loop reads this budget
// from the LLAMA_PREFILL_BUDGET env var (set BEFORE context init, like
// kv_paged above) and splits oversized prompts across iterations,
// interleaving decode steps for the other slots. A 6k-token prefill that
// stalled 8 decoders ~3.4s drops to ~780ms at budget=512 (4.8x stall
// cut) with zero TTFT cost and no steady-state regression. Unset or a
// non-positive value leaves the env untouched, so the stock unbounded
// prefill behaviour is preserved (an externally exported
// LLAMA_PREFILL_BUDGET still works as an escape hatch).
} else if (!strcmp(optname, "max_prefill_tokens") || !strcmp(optname, "mpt") || !strcmp(optname, "prefill_budget")) {
if (optval != NULL) {
try {
int budget = std::stoi(optval_str);
if (budget > 0) {
setenv("LLAMA_PREFILL_BUDGET", std::to_string(budget).c_str(), 1);
}
} catch (const std::exception& e) {
// If conversion fails, leave the budget unset (stock behaviour)
}
}
// --- dynamic decode-first prefill budget (patch 0016, continuous-batch P1) ---
// Supersedes max_prefill_tokens (the static patch-0013 cap) with the dynamic
// T - D budget read by update_slots(): a single total per-step token budget T
// (max_batch_tokens / mbt, the vLLM max_num_batched_tokens analogue) of which
// decode claims its live load D first and prefill gets the leftover, plus an
// optional per-slot prompt-chunk cap (prefill_cap, the long_prefill_token_
// threshold analogue). Both are set BEFORE context init, like kv_paged /
// max_prefill_tokens above. Unset leaves the env untouched, so the engine stays
// byte-identical to stock (an externally exported LLAMA_MAX_BATCH_TOKENS /
// LLAMA_PREFILL_CAP still works as an escape hatch). When max_batch_tokens is set
// it takes precedence over max_prefill_tokens: the engine honours the legacy
// LLAMA_PREFILL_BUDGET only when the dynamic knob is unset.
} else if (!strcmp(optname, "max_batch_tokens") || !strcmp(optname, "mbt")) {
if (optval != NULL) {
try {
int mbt = std::stoi(optval_str);
if (mbt > 0) {
setenv("LLAMA_MAX_BATCH_TOKENS", std::to_string(mbt).c_str(), 1);
}
} catch (const std::exception& e) {
// If conversion fails, leave the budget unset (stock behaviour)
}
}
} else if (!strcmp(optname, "prefill_cap")) {
if (optval != NULL) {
try {
int cap = std::stoi(optval_str);
if (cap > 0) {
setenv("LLAMA_PREFILL_CAP", std::to_string(cap).c_str(), 1);
}
} catch (const std::exception& e) {
// If conversion fails, leave the per-slot cap unset (engine default)
}
}
} else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
if (optval != NULL) {
try {
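
The removed comment on `max_batch_tokens` / `prefill_cap` describes a decode-first split: decode claims its live load D out of the per-step budget T, and prefill gets the remainder, optionally clamped by a per-slot cap. A rough sketch of that arithmetic for a single prefilling slot follows. The function name is hypothetical, and clamping to zero when D exceeds T is an assumption; this is not the removed scheduler code.

```bash
# prefill_budget TOTAL DECODE [CAP]
# Prints how many prompt tokens one slot may prefill this step.
# Hypothetical helper for illustration; not part of llama.cpp or LocalAI.
prefill_budget() (
    total=$1; decode=$2; cap=${3:-0}
    left=$(( total - decode ))          # leftover after decode is seated
    [ "$left" -lt 0 ] && left=0         # assumption: never go negative
    if [ "$cap" -gt 0 ] && [ "$left" -gt "$cap" ]; then
        left=$cap                       # optional per-slot chunk cap
    fi
    echo "$left"
)

b1=$(prefill_budget 2048 8)        # 8 live decoders, no cap      -> 2040
b2=$(prefill_budget 2048 8 512)    # same load, cap of 512        -> 512
b3=$(prefill_budget 4 8)           # decode alone fills the step  -> 0
echo "$b1 $b2 $b3"
```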

View File

@@ -1,7 +0,0 @@
tests/test_free_block_queue
tests/test_block_pool
tests/test_paged_kv_manager
tests/test_prefix_cache
tests/test_ggml_paged_rw
tests/test_ggml_paged_attn
paged-bench

View File

@@ -1,105 +0,0 @@
# Blackwell (GB10 / sm_121) kernel gaps — measured + the corrected strategy
Supersedes the "greenfield tcgen05 FP4 grouped GEMM" framing in `FP4_GROUPED_MOE_KERNEL.md`. Research +
profiling reframed the problem: the kernels we need **already exist in ggml**; they're just **untuned for
Blackwell**. And the parity target is far lower than the headline vLLM number implied.
## 1. The parity target was wrong — it's ~3,300 t/s single-stream, not 24,444
vLLM's dense "24,444 t/s" is **aggregate concurrent-batch** throughput, not single-sequence. The GB10
compute roofline caps **single-stream** Qwen3-32B prefill at **~3,300 t/s (BF16/INT8 ceiling)** / **~6,600
(FP4 ceiling)**. So: don't chase 24,444 with one kernel. Aggregate parity = (a kernel at the ceiling) +
(batched-prefill scheduling). The *kernel* job is to reach ~3,300 (matches vLLM, which on GB10 also runs at
the BF16 ceiling) or ~6,600 (beats it, via FP4).
## 2. GB10 per-precision DENSE peaks (measured, not spec)
| precision | dense peak | vs BF16 |
|---|---|---|
| BF16 / FP16 | ~213 TFLOP/s | 1.0× |
| INT8 | ~215 TOPS | **1.0×** |
| FP4 (MXFP4/NVFP4) | ~427-500 TFLOP/s | **2.0×** |
Memory: ~273 GB/s LPDDR5X (the bottleneck for *decode*; prefill is compute-bound). **Critical:** GB10 is
**1:1:2** (BF16:INT8:FP4), NOT datacenter Blackwell's 1:2:4 — **INT8 gives ZERO speedup over BF16 here.** So
int8-MMQ has no precision advantage; only FP4 does. (NVIDIA spec sheets still claim 1:2:4 — contradicted by
direct GB10 measurement; on-the-record discrepancy.)
## 3. Measured gaps (nsys, GB10)
| path | kernel | % of prefill | achieved | % of ceiling |
|---|---|---|---|---|
| **Dense** Q4_K_M | `mul_mat_q<Q4_K/Q6_K>` (int8 MMQ) | 80% | ~46 TFLOP/s | **~21% of 215** |
| **MoE** MXFP4 | `mul_mat_q<MXFP4>` (FP4 MMA) | 37% | ~22 TFLOP/s | **~4-5% of 500** (or ~10% of BF16) |
Both kernels are **engaged correctly but untuned for Blackwell** — llama.cpp's MMQ was "tuned primarily for
RTX 3000/4000" (Ampere/Ada). The headroom (4-5×) is recoverable; it's not an architectural ceiling.
## 4. ggml's current quantized-matmul paths (what exists)
- **MMQ** (int8): quantizes activations to Q8_1, int8 `mma.sync`/`dp4a`. Prefill path. **Untuned for sm_12x.**
- **FP4 MMA** (#17906, merged): native MXFP4/NVFP4 `m16n8k64` block-scaled FP4 mma for cc≥12.0. Works on GB10
for MoE (we measured 3441 t/s MXFP4 prefill) — but underutilized (~5% of FP4 peak). On **sm_121** it's hit
by build-flag (`120f`) + nvcc `-O3` miscompile (#18331) + capability-gating issues.
- **dequant→cuBLAS-FP16**: unfused fallback (materializes FP16 weights, round-trips memory). Not a fused
Marlin. (Our `GGML_CUDA_FORCE_CUBLAS` no-op = this didn't even engage for Q4_K.)
- **NO fused Marlin-style W4A16 kernel** (dequant 4-bit→BF16 in-shared-mem → BF16 tensor cores). Real gap.
## 5. Strategy — match vs beat (this replaces the tcgen05-greenfield plan)
**To MATCH vLLM (~3,300 single-stream): FP4 is NOT required.** Because INT8 == BF16 on GB10, a tuned MMQ and
a BF16 Marlin kernel share the *same* ceiling — and vLLM hits parity via W4A16 Marlin (BF16), since its FP4
is also broken on sm_121.
Ranked, by effort:
1. **Probe: tune the existing int8 MMQ for Blackwell** (dense). Cheapest. We're at 21% of the ceiling —
recover via tile sizes, async copy (`cp.async`), double-buffered shared-mem pipeline, occupancy. Caveat:
the `nwarps*tile_C::I==mmq_y` static_assert (found earlier) couples the constants; and the Q8_1
activation-quant overhead caps pure-MMQ tuning. Bounded upside, but a fast experiment.
2. **Build a Marlin-style W4A16 BF16 GEMM** (dense) — the robust path to ~3,300 (4.3× over today's 765).
Dequant 4-bit→BF16 in shared memory, MMA on BF16 tensor cores, `cp.async` multi-buffer, offline weight
reshuffle. Mirrors vLLM's actual GB10 path; keeps activations BF16 (better quality than int8 MMQ); fills a
genuine ggml gap. **This is the recommended kernel to MATCH.**
**To BEAT vLLM (~6,600, 2×): fix — don't rewrite — the FP4 path on sm_121.**
3. **Get the existing FP4 MMA (#17906/#20644) fully working + tuned on sm_121.** It already works on sm_120
(RTX 5090: +43-68% prefill) and on GB10 for MoE. The blockers are the `120f` arch flag, the `-O3`
miscompile (#18331), capability gating — **build/compiler fixes, not a new kernel.** Then tune the FP4 MMQ
(it's at ~5% of FP4 peak). This is where upstream momentum already is, and the only route past vLLM.
**Dropped:** the from-scratch tcgen05/CUTLASS grouped GEMM (the old scaffold). It aimed past the matchable
ceiling, duplicates work the FP4-MMA path already does, and FP4 on sm_121 is a *fix* problem not a *write*
problem. The `fp4-grouped-moe.cu` scaffold/hook stays as a useful dispatch seam, but the kernel behind it
should be one of (1)/(2)/(3), not a greenfield CUTLASS collective.
## 6. Cheap experiment — RESULT: MXFP4 dense = free 1.44×, but not parity (kernel still untuned)
Requantized Qwen3-32B dense → MXFP4 (forced attn+ffn to mxfp4 via `--tensor-type`, `--allow-requantize`,
speed-only test) and benched prefill:
| quant | kernel | pp512 | pp2048 | vs Q4_K |
|---|---|---|---|---|
| Q4_K_M | int8-MMQ | 765 | 763 | 1.0× |
| **MXFP4** | **FP4-MMA** | **1099** | **1153** | **1.44×** |
**Findings:**
- **MXFP4 dense is a real, free 1.44× over Q4_K** — just a requantize, the existing FP4-MMA path engages for
dense weights on GB10. Worth shipping as a **Blackwell dense-quant recommendation** in the gallery (no kernel).
- **But it is NOT parity.** 1153 t/s = **~17% of the FP4 ceiling (~6,600)** / ~35% of the BF16 ceiling. So the
**FP4-MMA kernel is itself untuned** (consistent with the MoE measurement, ~5% of FP4 peak). MXFP4 moves dense
from the int8 path (765) onto the FP4 path (1153), but the FP4 kernel leaves ~4-6× on the table.
- **So the kernel work is confirmed and now precise: tune the FP4-MMA kernel** (it's the highest-value, since it
serves both dense-MXFP4 and MoE, and FP4 is the only path that can *beat* vLLM). Strategy item (3) — fix +
tune the existing FP4-MMA on sm_121 — is the priority; a Marlin-style W4A16 BF16 kernel (2) is the alternative
to *match* on the BF16 ceiling if FP4 tuning stalls.
Conclusion: the cheap test did NOT collapse the kernel problem (the kernels are untuned, not just the quant), but
it (a) gives a free 1.44× to ship now, and (b) sharpens the target to **tuning the FP4-MMA kernel**.
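
The percentages quoted in this document are plain ratios of the measured figures against the stated ceilings. A quick check of that arithmetic, as a sketch: the helper name `pct` is made up for illustration, and the inputs are the numbers from the tables above.

```bash
# pct A B -> round(100 * A / B), using integer arithmetic only.
# Hypothetical helper for illustration.
pct() { echo $(( (100 * $1 + $2 / 2) / $2 )); }

dense_mmq=$(pct 46 215)        # int8 MMQ TFLOP/s vs INT8 peak      -> 21
fp4_vs_fp4=$(pct 1153 6600)    # MXFP4 pp2048 t/s vs FP4 ceiling    -> 17
fp4_vs_bf16=$(pct 1153 3300)   # MXFP4 pp2048 t/s vs BF16 ceiling   -> 35
speedup=$(pct 1099 765)        # MXFP4 vs Q4_K_M at pp512, 144 = 1.44x
echo "$dense_mmq% $fp4_vs_fp4% $fp4_vs_bf16% $speedup%"
```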
## Sources
GB10 peaks (measured): forums.developer.nvidia.com/t/351993, /360142, /373618. Marlin: github.com/IST-DASLab/marlin,
arxiv 2408.11743, developers.redhat.com Marlin/Machete. MMQ untuned: llama.cpp docs/build.md, discussions/16578,
DandinPower/llama.cpp_bench. FP4 landing/sm121: llama.cpp PR #17906/#20644, issues #19662/#18331. Roofline:
vllm.ai/blog/2026-06-01-vllm-dgx-spark, lmsys.org DGX Spark.
> **Correction (measured):** the earlier `GGML_CUDA_FORCE_CUBLAS` env test was a no-op because it's a *compile-time* `#ifdef`, not a runtime flag — cuBLAS never engaged. A real rebuild with `-DGGML_CUDA_FORCE_CUBLAS=ON` shows cuBLAS is **slower** than MMQ for dense Q4 (pp2048 690 vs 750) and runs an **Ampere `cutlass_80_tensorop` FP16 kernel** — cuBLAS-13.0 has no sm_121-tuned GEMM and falls back to sm_80. So *both* MMQ and cuBLAS sit at ~46 TFLOP/s (~21% of the 213 BF16 peak); there is **no library shortcut** to the ceiling on GB10 — a hand-tuned sm_120a kernel (Marlin-style) is required.

View File

@@ -1,334 +0,0 @@
# Chunked prefill + n_batch/n_ubatch decouple — implementation plan
Scope: LocalAI's llama.cpp backend (`backend/cpp/llama-cpp/`). Companion to
`PHASED_VLLM_PARITY_PLAN.md` Phase 3. This document is the concrete, file-cited
plan for what the brief called "chunked prefill".
Line numbers below are from two trees:
- LocalAI: `backend/cpp/llama-cpp/grpc-server.cpp`, `core/backend/options.go`,
`backend/backend.proto`, `core/backend/hardware_defaults.go` — exact.
- Vendored upstream scheduler: `llama.cpp/tools/server/server-context.cpp`. The
build copies `llama.cpp/tools/server/*` into `tools/grpc-server/` (`prepare.sh`
lines 15-17) and only overrides `grpc-server.cpp` + `CMakeLists.txt`. So
`update_slots()` is **inherited upstream code, not LocalAI code**. Line numbers
cited for it are from a same-era checkout (`d12cc3d`, 2026-04-09); the pin is
`f3e1828` (Makefile line 2). The structure is identical; exact lines may drift
a few rows at the pin — match on the quoted comment strings, not the integers.
---
## TL;DR — the headline finding
**Chunked prefill with prefill/decode interleaving is ALREADY implemented** in the
llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
this version. `update_slots()` in `server-context.cpp`:
1. **Adds ongoing decode tokens first** — "first, add sampled tokens from any
ongoing sequences" (≈ line 2088). Every `SLOT_STATE_GENERATING` slot gets its
one sampled token into the shared `llama_batch` before any prefill is added.
2. **Then fills the remaining `n_batch` budget with prompt (prefill) tokens**:
"next, batch any pending prompts without exceeding n_batch" (≈ line 2166),
gated by `params_base.cont_batching` (LocalAI sets `cont_batching = true` by
default, `grpc-server.cpp:547`). The per-slot prefill fill loop
(≈ line 2552) is `while (slot.prompt.n_tokens() < slot.task->n_tokens() &&
batch.n_tokens < n_batch)` — i.e. it caps each slot's prefill contribution to
the **remaining** budget and defers the rest to the next iteration.
3. **Decodes the combined batch in one pass** (≈ line 2728-2741): decode tokens
and prefill-chunk tokens go through the **same `llama_decode`**, which then
splits internally into `n_ubatch` physical sub-batches.
This is exactly the behavior the abandoned-looking draft **upstream PR #10718**
("server : chunked prefill support") asked for — "the first task is no longer
blocked by the second long prompt processing task." That PR is still marked OPEN
but its goal was absorbed into the natural evolution of `update_slots()`; we do
**not** need to port it. A long prefill no longer stalls the decode batch: decode
slots are serviced first every iteration, prefill consumes only the leftover
budget.
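
The three steps above can be modelled with a toy loop: each iteration seats the decode tokens first, then lets one pending prompt take whatever is left of `n_batch`. This is only a sketch under assumed numbers (one long prompt, a fixed number of decoders); it is not `update_slots()` itself, and all variable names are illustrative.

```bash
# Toy model of decode-first scheduling with a shared n_batch budget.
n_batch=512
decoders=8           # slots in the generating state, one token each
prompt_left=1200     # tokens still to prefill for one long prompt
iter=0
chunks=""

while [ "$prompt_left" -gt 0 ]; do
    iter=$(( iter + 1 ))
    batch=$decoders                      # decode tokens are added first
    room=$(( n_batch - batch ))          # leftover budget for prefill
    chunk=$prompt_left
    [ "$chunk" -gt "$room" ] && chunk=$room
    batch=$(( batch + chunk ))
    prompt_left=$(( prompt_left - chunk ))
    chunks="$chunks $chunk"
    echo "iter $iter: decode=$decoders prefill=$chunk batch=$batch"
done
```

With these numbers the prompt is consumed in three chunks (504, 504, 192), and the eight decoders get a token in every iteration.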
**Therefore: do not re-implement chunked prefill.** The real LocalAI gap is
narrow and is the rest of this plan:
- **Phase A (the actual gap): the `n_batch`/`n_ubatch` decouple.** LocalAI ties
the scheduler token budget (`n_batch`) to the physical forward width
(`n_ubatch`) at `grpc-server.cpp:515` + `:519`. This forces
`n_batch == n_ubatch`, so the logical scheduling window can never be wider than
one physical ubatch. You cannot keep `n_ubatch` at the Blackwell GEMM sweet
spot (2048) while widening `n_batch` so concurrent prefills + decodes co-batch
into a larger logical window. There is no first-class `batch:`/`ubatch:` split
on the Go side, and there is only a one-directional `ubatch` override on the C++
side (you can shrink ubatch below the coupled value, never grow n_batch above
it).
- **Phase B (optional policy lever): a decode-headroom prefill cap.** Upstream
caps prefill at the full `n_batch` shared with decode. Under heavy mixed load
one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter
to the decoders sharing that forward. vLLM exposes
`long_prefill_token_threshold` / `max_num_partial_prefills` for this. A
LocalAI-specific per-iteration prefill cap (a patch to vendored `update_slots`)
bounds that jitter. This is genuinely not in upstream and is the only place a
scheduler-policy change is warranted.
---
## 1. Current behavior — precise citations
### 1.1 The scheduler is upstream, inherited verbatim
- `prepare.sh:15-17` copies all of `llama.cpp/tools/server/*` into the
`grpc-server` build dir; `grpc-server.cpp` (LocalAI) replaces only the HTTP/gRPC
service + `params_parse` + `parse_options`. `update_slots()`, the slot state
machine, and the batch builder are **upstream `server-context.cpp`**, untouched
by LocalAI today.
- Slot states: `server-context.cpp:36-42`
`SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT /
GENERATING`.
### 1.2 Decode-first, then prefill-fill, one shared batch
- `common_batch_clear(batch)` (≈ 2078) — one batch per `update_slots` iteration.
- Decode phase (≈ 2088-2156): for each `SLOT_STATE_GENERATING` slot,
`common_batch_add(batch, slot.sampled, …, /*logits=*/true)` adds exactly one
token. Decode is guaranteed a seat before prefill runs.
- Budget fetch (≈ 2158-2160): `n_batch = llama_n_batch(ctx)`,
`n_ubatch = llama_n_ubatch(ctx)`.
- Prefill phase (≈ 2166): `if (params_base.cont_batching || batch.n_tokens == 0)`
→ with cont_batching ON, prefill is added to the **same** batch as decode.
- Per-slot prefill fill (≈ 2552-2597):
`while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch)`
— adds prompt tokens until the slot is done **or** the shared budget is hit.
Whatever does not fit stays for the next iteration (the slot remains
`SLOT_STATE_PROCESSING_PROMPT`).
- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed
it flips to `SLOT_STATE_DONE_PROMPT`, sets `batch.logits[last] = true`, inits
the sampler. Next iteration it becomes `GENERATING`.
- Budget break (≈ 2693-2695): `if (batch.n_tokens >= n_batch) break;`.
- Decode (≈ 2728-2741): loops `batch_view` slices of `min(n_batch, remaining)` and
calls `llama_decode`; the physical `n_ubatch` split happens inside
`llama_decode`.
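The budget arithmetic in the list above can be sketched as a toy model. This is not the upstream `update_slots()` code: `ToySlot`, `ToyIteration` and `toy_update_slots` are hypothetical names, and the slot state machine, sampling and KV handling are left out. It assumes every generating slot contributes exactly one token and that prefill slots are visited in order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-in for a server slot (hypothetical, not the upstream struct).
struct ToySlot {
    bool generating;        // true: slot is decoding; false: still prefilling
    int  prompt_remaining;  // prompt tokens not yet fed (prefill slots only)
};

// Token counts placed into the shared batch during one iteration.
struct ToyIteration {
    int decode_tokens  = 0;
    int prefill_tokens = 0;
};

// Seats one token per generating slot first, then fills the leftover
// n_batch budget with prompt tokens, slot by slot.
inline ToyIteration toy_update_slots(std::vector<ToySlot> & slots, int n_batch) {
    assert(n_batch > 0);
    ToyIteration it;
    for (const ToySlot & s : slots) {
        if (s.generating) {
            it.decode_tokens += 1;       // decode always gets its seat
        }
    }
    int used = it.decode_tokens;
    for (ToySlot & s : slots) {
        if (s.generating || s.prompt_remaining == 0) {
            continue;
        }
        const int room = std::max(0, n_batch - used);
        const int take = std::min(s.prompt_remaining, room);
        s.prompt_remaining -= take;      // the rest waits for the next iteration
        it.prefill_tokens  += take;
        used               += take;
        if (used >= n_batch) {
            break;                       // shared budget exhausted
        }
    }
    return it;
}
```

With three decoding slots, one 8000-token prompt and `n_batch = 2048`, each call seats 3 decode tokens and 2045 prompt tokens, so the long prompt drains over several iterations while the decoders keep advancing.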
### 1.3 The chunking is gated by `can_split()`
- `server-context.cpp:225-231`: `can_split()` returns true unless the task needs
embeddings with non-LAST pooling. So **completion/generation tasks always
chunk-and-interleave**; only embeddings/rerank force the whole prompt into one
ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch
size" — this is exactly why LocalAI bumped `n_ubatch` for rerank, see below).
### 1.4 LocalAI ties n_batch to n_ubatch (the gap)
- `grpc-server.cpp:515`: `params.n_batch = request->nbatch();`
- `grpc-server.cpp:519`: `params.n_ubatch = request->nbatch();` with the comment
that this fixes reranking being capped at the 512 default `n_ubatch`.
- `grpc-server.cpp:781-784` — the **only** decouple knob today: an `n_ubatch` /
`ubatch` option that overrides `n_ubatch` alone (added for embeddings/rerank).
There is **no** `batch` / `n_batch` option parse, so `n_batch` cannot be raised
above the coupled value from a model config. Confirmed: `grep -E '"n_batch"|"batch"'`
in `grpc-server.cpp` returns nothing.
- Options arrive via `request->options(i)` parsed as `optname:optval`
(`grpc-server.cpp:584-585`); these come from `ModelOptions.Options`, which is
filled from `c.Options` (`core/backend/options.go:221`).
### 1.5 Go side sends a single batch number
- `backend/backend.proto:341`: `int32 NBatch = 4;` is the only batch field; there
is **no** `NUBatch`.
- `core/backend/options.go:108-129` `EffectiveBatchSize`: returns `c.Batch` if set,
else context size for single-pass (score/embed/rerank), else
`hardwareDefaultBatchSize(512)`.
- `core/backend/options.go:228`: `NBatch: int32(b)` (single value to the
backend; becomes both `n_batch` and `n_ubatch` via 1.4).
- `core/backend/hardware_defaults.go:28,37-40`: `BlackwellBatchSize = 2048`;
on Blackwell an unset batch defaults to 2048, so today
`n_batch == n_ubatch == 2048` there.
---
## 2. Why the decouple matters for serving (not just rerank)
Invariant: `n_ubatch <= n_batch`. `n_ubatch` is the physical forward-pass GEMM
width (compute efficiency; GB10 sweet spot ≈ 2048). `n_batch` is the per-iteration
**scheduler token budget** — the logical window shared by decode + prefill chunks,
analogous to vLLM's `max_num_batched_tokens`.
With `n_batch == n_ubatch` (today), the scheduling window cannot exceed one
physical ubatch. Consequences:
- Under concurrency, the combined (decode + multiple prefill chunks) logical batch
is capped at the physical ubatch, so aggregate prefill cannot grow past one
ubatch worth of tokens per iteration even when more slots have prompts queued.
- A user who shrinks `batch:` for memory also shrinks the physical ubatch,
degrading prefill GEMM efficiency — and vice versa.
Decoupling lets us hold `n_ubatch = 2048` (efficient GEMM) while setting a larger
`n_batch` (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
logical window, lifting aggregate prefill under mixed load — `llama_decode` still
tiles the physical work at 2048.
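To make the logical/physical split concrete, here is a minimal sketch of how a logical batch breaks into physical tiles. It assumes a plain sequential split into chunks of at most `n_ubatch` tokens; the real split inside `llama_decode` depends on the batch allocator, and `toy_ubatch_tiles` is a hypothetical helper, not a llama.cpp function.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sizes of the physical sub-batches for a logical batch of n_tokens,
// assuming a simple sequential split (an assumption of this sketch).
inline std::vector<int> toy_ubatch_tiles(int n_tokens, int n_ubatch) {
    assert(n_ubatch > 0);
    std::vector<int> tiles;
    for (int done = 0; done < n_tokens; done += n_ubatch) {
        tiles.push_back(std::min(n_ubatch, n_tokens - done));
    }
    return tiles;
}
```

Under this model a full `n_batch = 4096` window with `n_ubatch = 2048` runs as two 2048-token forwards, whereas the coupled `n_batch == n_ubatch == 2048` configuration can never schedule more than one tile per iteration.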
---
## 3. Phased implementation
### Phase 0 — Verification harness (do first; TDD red)
Bite-sized, no code change to the scheduler.
- **0.1 Token-identical greedy under mixed load.** Script: start the backend with
`n_parallel >= 4`, greedy sampling (temp 0, fixed seed). Fire (a) several short
decode streams and (b) one ~8k-token prompt concurrently (the exact repro from
PR #10718's body works). Capture each stream's full token id sequence. Re-run
with the prefill request absent. **Assert the short streams' token ids are
byte-identical** in both runs — proves interleaving does not perturb decode
numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo
spec under the backend e2e suite.
- **0.2 Mixed-workload throughput baseline.** Use `llama-batched-bench` (built from
the same tree) or a small driver hitting `/v1/chat/completions`: measure
aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams,
under the mixed workload. Record numbers for the current `n_batch==n_ubatch`
config. This is the before of Phase A/B.
Expected result of Phase 0: 0.1 already passes (interleave is correct today);
0.2 gives the baseline the decouple must beat.
### Phase A — Decouple n_batch from n_ubatch
Goal: let model config set the physical ubatch independently of the logical batch,
defaulting to today's behavior (no regression).
- **A.1 C++: accept a `batch`/`n_batch` option (and keep `ubatch`).**
In `grpc-server.cpp`, after the existing `ubatch` branch (`:781-784`), add a
sibling branch:
```cpp
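} else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
    // Logical scheduler budget; mirrors the existing ubatch branch.
    if (optval != NULL) {
        try { params.n_batch = std::stoi(optval_str); } catch (...) {}
    }
    // the next `} else if` (or the chain's closing brace) follows here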
```
This is the missing direction (raise `n_batch` above the coupled value). Order
matters: both `:515/:519` run first (coupling as default), then option parsing
overrides either independently. Note the clamp: if a user sets
`n_ubatch > n_batch`, llama.cpp clamps `n_ubatch` down to `n_batch`; log a warning. Keep the
`:519` aliasing for backward compat (rerank still works with no options).
- **A.2 Proto: add an explicit physical ubatch field.**
`backend/backend.proto:341` add `int32 NUBatch = <next free tag>;` (do not reuse
4). Regenerate with `make protogen-go` + the C++ proto build.
- **A.3 C++: honor `NUBatch` when present.**
In `grpc-server.cpp` `params_parse`, after `:519`, add:
```cpp
if (request->nubatch() > 0) {
    params.n_ubatch = request->nubatch();
}
```
so an explicit physical ubatch wins over the `n_batch` alias, with the `ubatch`
string-option as a third path for users who only edit `options:`.
- **A.4 Go: config surface + plumbing.**
- Add `UBatch *int` (yaml `ubatch`) to the llama config struct alongside `Batch`
(search `core/config` for the `Batch` field; mirror it).
- In `core/backend/options.go`: add `EffectiveUBatchSize(c)` mirroring
`EffectiveBatchSize` (return `c.UBatch` if set, else
`min(EffectiveBatchSize(c), BlackwellBatchSize-or-512)` so the physical ubatch
stays at the hardware sweet spot while `n_batch` may be larger). Set
`NUBatch: int32(EffectiveUBatchSize(c))` next to `NBatch:` (`:228`).
- Keep the default such that when neither is set, `NUBatch == NBatch` ⇒
byte-identical to today.
- **A.5 Serving default (the lever).**
In `hardware_defaults.go`, introduce `BlackwellLogicalBatch = 4096` (or a
measured value) and let `EffectiveBatchSize` return it for **multi-slot serving**
configs (when `n_parallel > 1` and the model is a completion model), while
`EffectiveUBatchSize` stays at `BlackwellBatchSize = 2048`. Gate behind the same
Blackwell detection already used at `:37-40`. Single-stream/embedding/rerank
paths keep `n_batch == n_ubatch`. This is the only behavioral change shipped by
Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
- **A.6 Tests.** Extend `hardware_defaults_internal_test.go` with
`EffectiveUBatchSize` cases; add a `grpcModelOpts` test asserting
`NUBatch <= NBatch` and that unset config yields `NUBatch == NBatch`. Re-run
0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or
neutral ITL) at `n_batch=4096, n_ubatch=2048`.
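The precedence that A.1 through A.3 set up on the C++ side can be sketched as a pure function, which is also a convenient shape for a unit test. Everything here is illustrative: `ToyBatchParams` and `toy_resolve_batch` are hypothetical names, `0` stands for "unset", and applying the string options after the proto fields is an assumption based on option parsing running after the `:515/:519` assignments.

```cpp
#include <algorithm>
#include <cassert>

// Resolved logical/physical batch sizes (hypothetical helper type).
struct ToyBatchParams {
    int n_batch;
    int n_ubatch;
};

// nbatch:     existing proto field (NBatch)
// nubatch:    proposed proto field (NUBatch), 0 = unset
// opt_batch:  value of a `batch` / `n_batch` option, 0 = unset
// opt_ubatch: value of a `ubatch` / `n_ubatch` option, 0 = unset
inline ToyBatchParams toy_resolve_batch(int nbatch, int nubatch,
                                        int opt_batch, int opt_ubatch) {
    assert(nbatch > 0);
    ToyBatchParams p{nbatch, nbatch};        // today's coupling is the default
    if (nubatch > 0) {
        p.n_ubatch = nubatch;                // explicit proto field beats the alias
    }
    if (opt_batch > 0) {
        p.n_batch = opt_batch;               // string option, parsed later
    }
    if (opt_ubatch > 0) {
        p.n_ubatch = opt_ubatch;             // string option, parsed later
    }
    p.n_ubatch = std::min(p.n_ubatch, p.n_batch);  // keep n_ubatch <= n_batch
    return p;
}
```

With nothing set beyond `NBatch = 2048` the result is `{2048, 2048}`, matching today's behavior; `NBatch = 4096, NUBatch = 2048` yields the serving recipe `{4096, 2048}`.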
### Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
one change that touches the inherited scheduler, so it lives as a patch in
`backend/cpp/llama-cpp/patches/` (applied by `prepare.sh:6-11` / Makefile
`:141-145`), never as an edit to a checked-in upstream file.
Policy (pseudocode; insert into `update_slots()` prefill fill loop, the
`while (… && batch.n_tokens < n_batch)` at ≈ `server-context.cpp:2552`):
```
# token budget for THIS iteration, decode already seated:
n_decode_in_batch = batch.n_tokens        # set after the decode phase
prefill_budget    = n_batch               # default == today
# max_prefill_per_iter == 0 disables the cap; a natural setting is n_ubatch
# (one physical tile of prefill while decode is active)
if serving_mode and max_prefill_per_iter > 0 and n_decode_in_batch > 0:
    # leave room so decoders are not starved/jittered by one giant prefill chunk
    prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
# fill loop guard becomes:
while slot.prompt.n_tokens() < slot.task->n_tokens()
      and batch.n_tokens < prefill_budget:
    ...
```
- `max_prefill_per_iter` is a new `common_params` field surfaced as an
`options:` knob (`max_prefill_tokens` / `mpt`) parsed in `grpc-server.cpp`
exactly like A.1, default `0` = disabled = today's behavior.
- Semantics mirror vLLM `long_prefill_token_threshold`: cap the prefill share so
ongoing decodes keep a steady cadence; the remaining prompt rides the next
iteration (already supported by the state machine — slot stays
`PROCESSING_PROMPT`).
- **Correctness:** unchanged KV/position path — chunk boundaries already advance
`slot.prompt.tokens.pos_next()` per added token (≈ 2570) and the slot resumes
from `slot.prompt.n_tokens()` next iteration. Capping the budget only changes
*how many* tokens are added this iteration, not *which* positions, so 0.1 must
remain token-identical.
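The trade-off behind the cap is easy to quantify with a small sketch: a tighter cap means smaller prefill chunks per iteration but more iterations to finish a prompt. The helpers below are hypothetical, treat `max_prefill_per_iter == 0` as "disabled" per the default described above, omit the `serving_mode` switch, and assume the number of decoding slots stays constant while the prompt drains.

```cpp
#include <algorithm>
#include <cassert>

// Value the prefill fill loop would compare batch.n_tokens against.
inline int toy_prefill_budget(int n_batch, int n_decode_in_batch,
                              int max_prefill_per_iter) {
    assert(n_batch > 0);
    if (max_prefill_per_iter <= 0 || n_decode_in_batch <= 0) {
        return n_batch;  // cap disabled, or nothing is decoding
    }
    return std::min(n_batch, n_decode_in_batch + max_prefill_per_iter);
}

// Iterations needed to feed a whole prompt, or -1 if prefill cannot progress.
inline int toy_iterations_to_prefill(int prompt_tokens, int n_batch,
                                     int n_decode_in_batch,
                                     int max_prefill_per_iter) {
    const int budget   = toy_prefill_budget(n_batch, n_decode_in_batch,
                                            max_prefill_per_iter);
    const int per_iter = budget - n_decode_in_batch;  // prompt tokens per iteration
    if (per_iter <= 0) {
        return -1;
    }
    return (prompt_tokens + per_iter - 1) / per_iter;  // ceiling division
}
```

For an 8000-token prompt with 8 active decoders and `n_batch = 4096`, the uncapped scheduler finishes prefill in 2 iterations of up to 4088 prompt tokens each; a 2048-token cap stretches that to 4 iterations, each sharing the forward pass with at most 2048 prompt tokens.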
### Phase C — Docs + defaults rollout
- Document `batch` / `ubatch` (and `max_prefill_tokens` if B ships) in
`docs/content/` model-config reference, with the serving recipe
(`n_parallel>1`, `n_batch=4096`, `ubatch=2048`).
- Note the orthogonality to paged KV (below) in
`PHASED_VLLM_PARITY_PLAN.md` Phase 3.
---
## 4. Risk / correctness
- **KV-cache & positions across chunks:** already handled upstream. Each prefill
token added advances `pos_next()` (≈ 2570) and is pushed to `slot.prompt.tokens`
(≈ 2573); the next iteration resumes from `slot.prompt.n_tokens()`. Chunk
boundaries are transparent to the KV cache because positions are absolute, not
per-chunk. Phase A changes only budgets, not positions; Phase B changes only the
per-iteration count. The 0.1 token-identical test is the guardrail.
- **Unified KV cache (LocalAI default, `n_parallel` slots share one cache):**
unaffected — co-batching prefill+decode across slots is what the unified cache is
for; positions are per-`seq_id` (`{ slot.id }` in `common_batch_add`).
- **`n_ubatch > n_batch`:** invalid; A.4 clamps `EffectiveUBatchSize <=
EffectiveBatchSize` and A.1 logs a warning if options violate it.
- **Embeddings / rerank:** must keep `n_ubatch >= prompt length` (single pass,
`can_split()==false`). The existing `:519` alias + `EffectiveBatchSize`
context-sizing for single-pass usecases (`options.go:119-124`) must be preserved
— do not let the serving `BlackwellLogicalBatch` default leak into single-pass
configs (A.5 gates on completion + `n_parallel>1`).
- **Turboquant fork:** the fork lacks some `common_params` fields (see
`LOCALAI_LEGACY_LLAMA_CPP_SPEC` precedent at `grpc-server.cpp:755`). `n_batch` /
`n_ubatch` are ancient fields and safe; if Phase B adds `max_prefill_per_iter`,
guard the new field behind a `#ifndef` like the checkpoint block does.
## 5. Orthogonality to paged KV (Phase 2)
Keep them independent. Paged KV (the `-kvp` / block-manager effort, draft #22569,
and `paged/`) changes **where** KV blocks live (allocation/utilization). Chunked
prefill / this decouple changes **how many tokens per iteration** the scheduler
batches (the `n_batch` budget and decode/prefill interleave). They compose: paged
KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
scheduling window to feed those slots; neither touches the other's data structures.
The only contact point is `update_slots()` — if both ship a vendored patch to it,
land them as separate, ordered patches in `patches/` and keep the hunks disjoint
(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
budget).
---
## 6. Bottom line
- Chunked prefill + decode interleave: **already present and correct** on the
pinned llama.cpp — verify (Phase 0.1), do not rebuild.
- Real work: the **n_batch/n_ubatch decouple** (Phase A) — small, additive,
default-preserving — plus an **optional decode-headroom prefill cap** (Phase B)
if measurements show ITL jitter. Both are LocalAI-side: A in `grpc-server.cpp`
+ proto + `options.go`; B as a vendored `patches/` hunk.


@@ -1,215 +0,0 @@
# llama.cpp multi-user decode overhead on DGX Spark (GB10, sm_121)
Investigation of the Qwen3-32B concurrent-decode throughput gap (llama.cpp ~547 t/s
vs vLLM ~667 t/s) on the GB10 box, build `~/llama.cpp-pr24423/build` (Release,
sm_121, `LLAMA_MAX_SEQ=256`, flash-attn on), model
`~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
## TL;DR (the result overturns the brief's premise)
On **this** build the prime suspect is wrong and the host-overhead premise does not
hold:
1. **CUDA graphs are NOT disabled at high concurrency.** At npl=128, 94 of 98
decode `graph_compute` calls **replay a captured CUDA graph** (0 resets, stable
key, no property churn post-warmup). The keyed-warmup gate works.
2. **There is no ~170ms/step host hotspot here.** The GPU is **~96% active during
decode with graphs ON and ~96% active with graphs OFF**. Decode at npl=128 is
**GPU-compute-bound**, not host-bound.
3. The brief's "20% GPU util / 66ms GPU / 170ms host per step" was measured on a
different/earlier build (mainline without these graph fixes). It is not
reproducible on `llama.cpp-pr24423`.
4. Because the GPU is the bottleneck, re-enabling graphs cannot lift the number:
the clean A/B shows graphs ON vs OFF = **+1.5% at npl=128** (and +2.9% at
npl=32 - the benefit shrinks as concurrency rises and the GPU saturates).
5. The real gap to vLLM is the **quantized decode GEMM kernel**: `mul_mat_q`
(Q4_K + Q6_K) is ~68% of decode GPU time and runs ~2.1x above the GB10
memory-bandwidth floor. Closing the gap requires Marlin/Machete-style int4
GEMM kernels, not host-side work. This is a kernel project (the direction the
prior session's uncommitted `marlin-w4a16.cu` / `fp4-grouped-moe.cu` already
started, though those target w4a16/GPTQ-int4, not the K-quants this GGUF uses).
## 1. Why CUDA graphs are (not) disabled - exact code + measurement
### The gate (code)
PR24423 refactored the CUDA-graph path into a keyed, warmup-based scheme in
`~/llama.cpp-pr24423/ggml/src/ggml-cuda/ggml-cuda.cu`:
- `ggml_cuda_graph_get_key(cgraph)` (~L3343) keys the cached CUDA graph by
`cgraph->nodes[0]` (first-node pointer).
- `ggml_cuda_graph_check_compability(cgraph)` (~L3301) disables graphs only for:
- **split buffers** (`ggml_backend_buft_is_cuda_split`), and
- **`GGML_OP_MUL_MAT_ID`** when `src0` is non-quantized **or**
`ne[2] > get_mmvq_mmid_max(...)` (MoE expert routing needs a stream sync).
Qwen3-32B is **dense** -> no `MUL_MAT_ID` -> this condition never fires.
- `ggml_backend_cuda_graph_compute` (~L4514) warmup gate: a graph is used only
after **2 consecutive calls with no property change** (`warmup_complete`); any
property change resets warmup. `ggml_cuda_graph_update_required` (~L3347)
detects change by `memcmp` of the full `ggml_tensor` struct + per-src
data-ptr/ne/nb, with a fast path when `cgraph->uid` is unchanged.
### Why it stays enabled across decode steps
The graph stays stable because llama.cpp's host-side graph reuse holds during
decode, so node pointers/props (and `cgraph->uid`) do not churn:
- `llama_kv_cache::get_n_kv` (`src/llama-kv-cache.cpp` L1223-1233) **pads n_kv to
a multiple of 256** ("so that the graph remains constant across batches and can
be reused"). For ntg<=256 within the first KV block, n_kv is constant.
- `can_reuse_kq_mask` (`src/llama-graph.cpp` L43) keeps the KQ-mask dims stable:
`ne=[n_kv, n_tokens/n_stream, 1, n_stream]` = `[256,1,1,128]` every decode step
at npl=128.
- `can_reuse` (`src/llama-context.cpp` L1283) therefore returns true, so the
scheduler is **not** reset/re-split. `graph->uid` is only reassigned inside
`ggml_backend_sched_split_graph` (`ggml/src/ggml-backend.cpp` L1033, L1485),
which is skipped on the reuse path -> stable uid -> CUDA graph replays.
### Measurement (instrumented build, npl=128, ntg=96)
Env-gated counters added to `ggml_backend_cuda_graph_compute` /
`ggml_cuda_graph_update_required` (since `GGML_LOG_DEBUG` is compiled out in
Release / NDEBUG). End-of-run summary:
```
[GTRACE-SUMMARY] calls=98 notenab=0 warming=3 warmdone=1 RESET=0 USED=94 incompat=0 distinct_keys=1
```
94/98 decode `graph_compute` calls **replayed** a captured CUDA graph; **0**
warmup resets; a **single** distinct graph key for the whole decode; no node
property churn after warmup. Graphs are fully engaged at npl=128.
(The instrumentation was reverted afterwards; the checkout is back to its
pre-task state and the `.so` rebuilt clean.)
## 2. The per-step CPU "hotspot" - there isn't one on this build
GPU utilization during npl=128 decode (ntg=256):
- **Graphs ON** - `nvidia-smi` sampled every 0.7s through the decode phase:
steady **96% GPU util**, SM clock **2184 MHz** (not throttled), 45-47 W.
- **Graphs OFF** (`GGML_CUDA_DISABLE_GRAPHS=1`) - nsys CUDA trace, 8s window:
total GPU kernel time = `3,983,292,128 ns / 0.516` (the Q4_K `mul_mat_q` sum divided by its 51.6% share in the breakdown below) = **~7.72s of the 8s
window = ~96% GPU-active**. Even with every kernel launched individually from
the host, the GPU is still ~96% busy. There are essentially **no host gaps**.
Per-step wall = 60.6s / 256 steps = **~237 ms/step**, and the sum of one decode
graph's kernel times (nsys, graphs-on capture) is ~244 ms -> GPU kernel time per
step ~= wall time per step. The host work between steps is in the low single-digit
ms (the ~4% idle), consistent with graphs ON giving only +1.5% at npl=128.
This directly contradicts the brief's 66ms-GPU / 170ms-host split, which must have
come from a pre-graphs build.
### Per-step GPU breakdown (nsys, npl=128 decode, graphs off, 8s window)
| Kernel | % GPU time | ~ms/step |
|--------|-----------:|---------:|
| `mul_mat_q` Q4_K (type 12) | 51.6 | ~118 |
| `flash_attn_ext_f16` | 19.3 | ~44 |
| `mul_mat_q` Q6_K (type 14) | 16.2 | ~37 |
| `unary_gated` silu | 4.1 | ~9 |
| mmq stream-k fixup + quantize_q8_1 | ~5 | ~12 |
| rms_norm / rope / set_rows / add | ~4 | ~10 |
Quantized matmul = **~68%** of decode GPU time (~155 ms/step). Attention ~19%.
`perf` could not profile the host (kernel `perf_event_paranoid=4`), but it is moot:
the host is ~4% of the wall, so there is no ~170ms host hotspot to chase.
## 3. Fix attempt + measured result
### The requested fix (re-enable graphs / pad the decode batch) is a no-op here
Graphs are already enabled and the batch is already stable (n_kv padded to 256,
kq_mask dims constant). The clean cold A/B (cooldowns between every run):
| npl | graphs ON (t/s) | graphs OFF (t/s) | delta |
|----:|----------------:|-----------------:|------:|
| 32 | 242.60 | 235.75 | +2.9% |
| 64 | 398.59 | 389.06 | +2.5% |
| 128 | 543.95 | 535.71 | +1.5% |
Baseline (separate cold runs, original non-instrumented build):
npl=32 243.9, npl=64 397.1, **npl=128 544.95** (matches the ~546 baseline).
Graphs help, but the benefit **monotonically shrinks** as concurrency rises and
the GPU saturates. At npl=128 there is only ~1.5% of host launch overhead left to
remove, and GPU util is ~96% in both columns. **You cannot lift npl=128 decode
toward 667 by working on graphs/host overhead - the GPU is the bottleneck.**
### Where the number actually is, and the real lever
- vLLM 667 t/s at this concurrency = **192 ms/step**; llama.cpp 547 = **237
ms/step**. The ~45 ms/step gap maps almost entirely onto the quantized matmul.
- GB10 memory-bandwidth floor for a 32B Q4_K_M (~19.8 GB of weights, read once
per step and shared across the 128 sequences) at ~273 GB/s is **~72 ms/step**.
llama.cpp's `mul_mat_q` spends ~155 ms/step on matmul = **~2.1x the bandwidth
floor**. vLLM's Marlin/Machete int4 GEMMs run much closer to the floor; that
efficiency difference is the ~547 -> 667 gap.
- The Q6_K matmul (`mul_mat_q` type 14) also shows pathological tail latency
(median 0.89 ms, max 5.5 ms) - the MMQ kernel is not well-tuned for the skinny
n=128 decode shape.
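The per-step figures above follow from two one-line formulas, sketched here with hypothetical helper names. The sketch assumes one decode step emits one token per sequence and that the weights are streamed exactly once per step; the inputs (128 sequences, 19.8 GB, 273 GB/s) are the values quoted in the text.

```cpp
#include <cassert>

// Wall time of one decode step when n_seqs sequences each emit one token.
inline double toy_step_ms(int n_seqs, double tokens_per_s) {
    assert(tokens_per_s > 0.0);
    return 1000.0 * n_seqs / tokens_per_s;
}

// Lower bound on step time if the weights are read once at full bandwidth.
inline double toy_bandwidth_floor_ms(double weight_gb, double bw_gb_per_s) {
    assert(bw_gb_per_s > 0.0);
    return 1000.0 * weight_gb / bw_gb_per_s;
}
```

`toy_step_ms(128, 667.0)` gives about 192 ms and `toy_bandwidth_floor_ms(19.8, 273.0)` about 72.5 ms, so 155 ms of matmul is roughly 2.1x the floor. Deriving the llama.cpp step from 547 t/s gives about 234 ms; the 237 ms quoted above comes from the measured wall time (60.6 s over 256 steps).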
**The lever to beat 547 is a faster quantized decode GEMM**, i.e. a Marlin-style
int4 kernel for the decode shapes. This is exactly the direction of the prior
session's uncommitted `ggml/src/ggml-cuda/marlin-w4a16.cu` and
`fp4-grouped-moe.cu` (already wired via
`if (!split && ggml_cuda_w4a16_mul_mat(...)) return;` in `ggml_cuda_mul_mat`).
Note those target **w4a16 / GPTQ-int4**, while this GGUF is **K-quant (Q4_K/Q6_K)**,
so they are inert for this model - a Marlin path for K-quants (or shipping the
model in a Marlin-friendly int4 format) would be required. That is a multi-day
kernel effort, out of scope for this session, but it is the only lever that can
move the number.
### Why the "bump LLAMA_MAX_SEQ to 1024 -> 377" data point is consistent
`llama_batch_allocr` keeps `seq_cpl` as an `LLAMA_MAX_SEQ x LLAMA_MAX_SEQ` table
(`src/llama-batch.cpp`), so per-batch seq bookkeeping scales ~O(MAX_SEQ^2). At
MAX_SEQ=1024 that host cost becomes large enough (~70 ms/step) to dominate and
drop decode to 377. At MAX_SEQ=256 the same term is ~4.4 ms/step (the ~1.5% that
graphs reclaim); lowering to 128 would save ~3 ms/step (~1%). So MAX_SEQ tuning
confirms the host term is real but tiny at 256 - not a path to 667.
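The scaling argument can be checked with a short sketch that assumes the host cost is purely quadratic in `LLAMA_MAX_SEQ` and anchors it at the ~70 ms/step figure for 1024. `toy_seq_cost_ms` is a hypothetical helper, not something in llama.cpp.

```cpp
#include <cassert>

// Scales a reference per-step host cost by (max_seq / ref_max_seq)^2.
inline double toy_seq_cost_ms(double ref_cost_ms, int ref_max_seq, int max_seq) {
    assert(ref_max_seq > 0);
    const double ratio = static_cast<double>(max_seq) / ref_max_seq;
    return ref_cost_ms * ratio * ratio;
}
```

Under that assumption `toy_seq_cost_ms(70.0, 1024, 256)` is about 4.4 ms and `toy_seq_cost_ms(70.0, 1024, 128)` about 1.1 ms, which matches the ~3 ms/step saving quoted above.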
## How this would land in LocalAI
- **No host/graph patch is warranted** for this build: graphs already engage and
the decode is GPU-bound. A "pad the decode batch / force graph capture" patch
would change nothing measurable at high concurrency.
- The actionable upstream/vendored work is a **Marlin-style int4 decode GEMM**
(extend the prior `marlin-w4a16.cu` to cover K-quants, or quantize the served
model into a Marlin-friendly int4 layout). That is where the ~547 -> 667+ lives.
- If a small host win is still wanted, keep `LLAMA_MAX_SEQ` no larger than the max
concurrency actually used (the per-batch `seq_cpl` table is O(MAX_SEQ^2)).
## Reproduction
```
# baseline / A/B (cold, 30s cooldowns)
llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -npp 16 -ntg 128 -npl 32,64,128 \
-ngl 99 -b 2048 -ub 2048 -fa on # graphs on
GGML_CUDA_DISABLE_GRAPHS=1 ...same... # graphs off
# GPU util (graphs on): sample nvidia-smi during decode -> ~96%, 2184 MHz
# GPU active (graphs off): nsys profile -t cuda --delay=6 --duration=8 ...
# nsys stats --report cuda_gpu_kern_sum -> sum/0.516 ~= 7.72s of 8s = ~96%
```
## UPDATE: NVFP4 closes most of the decode gap (no Marlin-for-K-quants needed)
The diagnosis above said the lever is "a more bandwidth-efficient int4 decode GEMM"
and feared a multi-day Marlin-for-K-quants kernel. But the FP4-MMA path is already
that kernel. Measured (npl=128, cold A/B, npp=16 ntg=128):
| quant | decode S_TG (t/s) | vs Q4_K | vs vLLM 667 |
|---|---|---|---|
| Q4_K_M | 547 (548/546) | - | 82% |
| **NVFP4** | **619 (617/622)** | **+13%** | **93%** |
NVFP4's `mul_mat_q<NVFP4>` runs closer to the GB10 bandwidth floor at the thin n=128
decode shape than Q4_K's int8-MMQ (which ran ~2.1x above it). So shipping the model
as NVFP4 narrows the decode gap from ~18% to ~7% (82% -> 93% of vLLM) AND wins prefill (1209 vs Q4 767 /
vLLM 800). Net on GB10: llama.cpp+NVFP4 is ahead on prefill (1.5x) and within ~7% on
decode. The remaining ~7% would be incremental FP4-MMA decode-kernel tuning, NOT a
from-scratch Marlin kernel - a much smaller, optional effort. NVFP4 is the answer to
both the prefill and the decode gap.


@@ -1,253 +0,0 @@
# Closing the vLLM Gap on Blackwell (GB10 / DGX Spark) — Living Plan & Results
Target hardware: NVIDIA **GB10** (Grace-Blackwell, `sm_121a`, 119 GiB unified LPDDR5X), `dgx.casa`.
Model under test: **Qwen3-Coder-30B-A3B-Instruct** (MoE, 128 experts, top-8, ~3B active).
Engines: llama.cpp (CUDA, `~/llama.cpp-pr24423`, build `7a6ddc5`, `CMAKE_CUDA_ARCHITECTURES=121`) vs vLLM 0.23.0 (`~/vllm-bench`, torch 2.11.0+cu130).
> This is a working document. Each phase appends measured numbers, what was learned, and what's next.
> Methodology: `llama-bench` (single-stream pp/tg, built-in reps) and `llama-batched-bench` (`-npl` sweep,
> decode-phase aggregate `S_TG`, prefill aggregate `S_PP`); vLLM via `~/bench/vllm_conc.py` (decode-phase
> aggregate matched to `S_TG`). Same model/prompt/seed. Precision matched where possible.
---
## Baseline results (established)
### Single-stream (B=1), matched ~8-bit
| Engine / precision | prefill pp512 (t/s) | decode tg128 (t/s) |
|---|---|---|
| llama.cpp **Q8_0** | 2215 ± 15 | **54.8 / 62.2** * |
| llama.cpp **F16** | 700 ± 24 | 32.9 ± 0.05 |
| vLLM **FP8** | 9155 ± 308 | 52.45 ± 0.05 |
\* two sessions; ~55 right after worker-stop (clocks settling), ~62 steady state. Both ≥ vLLM → **single-stream parity holds**.
### Concurrency sweep (decode-phase aggregate `S_TG`, prefill aggregate)
| B | llama Q8 prefill | vLLM FP8 prefill | llama Q8 decode | vLLM FP8 decode |
|---|---|---|---|---|
| 1 | 1080 | 9644 | 60.1 | 48.0 |
| 8 | 2189 | 33373 | 160.8 | 312.4 |
| 32 | 2198 | 99398 | 357.1 | 1171 |
| 64 | 2194 | 151990 | 519.2 | 2064 |
llama F16 prefill also flat: B=1 452 → B=8 723 → B=32 778. **Prefill flat at both precisions = kernel-throughput ceiling.**
### Our paged patch (LLAMA_KV_PAGED) — concurrency effect: NONE
Same Q8 binary, paged branch confirmed firing (137 placements at B=8), throughput identical within noise:
| | B=1 | B=8 | B=32 |
|---|---|---|---|
| stock decode | 61.2 | 171.7 | 377.0 |
| paged decode | 62.7 | 170.8 | 376.8 |
Patch is placement-only correctness prototype; doesn't implement concurrency mechanics. Single-stream-neutral, concurrency-neutral.
---
## Root-cause diagnosis (nsys + code audit)
- **74.5% of GPU compute = `mul_mat_q`** (Q8_0 int8 MMQ GEMM, the MoE experts). Only cutlass kernel seen is `cutlass_80_tensorop` = **Ampere (sm_80)**, not Blackwell.
- ggml-cuda has **NO FP8 path** (no e4m3/e5m2 GEMM, no cuBLASLt FP8). Q8_0 runs the **Ampere-class int8 `mma.sync s8.s8.s32`** even on GB10 (`mma.cuh:924`, dispatched unconditionally `mmq.cu:307`).
- ggml-cuda **DOES** have a **native Blackwell FP4 path** (MXFP4 + NVFP4, `mma...kind::mxf4...e2m1`, `mma.cuh:1126`, gated `BLACKWELL_MMA_AVAILABLE`). Merged via #17906/#20644/#21074.
- **No fused MoE grouped GEMM**, no tcgen05/wgmma (warp-level `mma.sync` only).
- **Small per-expert GEMMs**: 512-tok ubatch → ~32 tok/expert (128 exp, top-8) → thin GEMMs, memory-bound, can't fill tensor-core tiles. vLLM processes 8192 tok/step → ~512 tok/expert → compute-bound + FP8.
- **The 45-69× gap is partly apples-to-oranges**: we compared llama Q8 (Ampere int8) vs vLLM FP8 (Blackwell). Upstream/NVIDIA benches put the *real* FP4-vs-FP8 prefill gap at **~25-50% long-context**, not 45-69×.
Key upstream refs: discussion #22042 (FP8 design: `ggml_mul_mat_ext` + scale tensors), #17906 (native MXFP4), #18250 (NVFP4-MoE closed not-planned).
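The per-expert token counts quoted in the diagnosis come from a simple average, sketched below. It assumes perfectly uniform routing across experts, which real routers only approximate, and `toy_tokens_per_expert` is a hypothetical helper.

```cpp
#include <cassert>

// Average tokens each expert sees when every token is routed to top_k of
// n_experts experts, assuming uniform routing (integer average).
inline int toy_tokens_per_expert(int n_tokens, int top_k, int n_experts) {
    assert(n_experts > 0);
    return n_tokens * top_k / n_experts;
}
```

For 128 experts with top-8 routing, a 512-token ubatch averages 32 tokens per expert, a 2048-token ubatch 128, and an 8192-token step 512, which is why the larger step fills tensor-core tiles far better.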
---
## The levers (cheap → expensive) — execution log
### Lever 1 — NVFP4/MXFP4 model (use existing Blackwell FP4 path) + ubatch bump
Status: **IN PROGRESS** — single-stream done, concurrency next.
Quant: `llama-quantize F16 -> MXFP4_MOE` (type 38), 15.9 GiB / 4.47 BPW. (No NVFP4 in llama-quantize; MXFP4_MOE puts experts in MXFP4 = Blackwell FP4 MMA.)
Single-stream (llama-bench), MXFP4 vs Q8 vs vLLM-FP8:
| metric | llama Q8 | **llama MXFP4** | vLLM FP8 |
|---|---|---|---|
| prefill pp512 (ub512) | 2215 | **3061 ± 22** | 9155 |
| prefill pp2048 (ub512) | ~2200 | 3137 ± 7 | — |
| prefill pp2048 (**ub2048**) | — | **3441 ± 14** | — |
| decode tg128 | 62.2 | **86.4 ± 0.3** | 52.45 |
Findings:
- **MXFP4 decode 86.4 beats vLLM FP8 52.45 by 1.65×** (4-bit = less memory traffic; decode is memory-bound). llama wins decode outright.
- MXFP4 prefill +38% over Q8; **ub2048 lifts prefill +10%** (3137→3441). Single-stream prefill gap to vLLM: 4.1× (Q8) → **2.7× (MXFP4)**.
- Caveat: MXFP4 is 4-bit vs vLLM FP8 8-bit — not precision-matched. Fair match = vLLM NVFP4 (4-bit); pending.
Concurrency (decode-phase aggregate `S_TG`, ub2048), MXFP4 vs Q8 vs vLLM-FP8:
| B | Q8 dec | **MXFP4 dec** | vLLM dec | Q8 pp | **MXFP4 pp** | vLLM pp |
|---|---|---|---|---|---|---|
| 1 | 60.1 | **83.4** | 48.0 | 1080 | 1625 | 9644 |
| 8 | 160.8 | **267.4** | 312.4 | 2189 | 3634 | 33373 |
| 32 | 357.1 | **551.2** | 1171 | 2198 | 3651 | 99398 |
| 64 | 519.2 | **770.2** | 2064 | 2194 | 3648 | 151990 |
**Lever-1 verdict:** MXFP4 is a large, free win: decode +50-66% over Q8, prefill plateau +66% (2200→3650). MXFP4 decode **wins at B=1, near-parity at B=8** vs vLLM; only falls behind at high concurrency. **Prefill still plateaus (~3650)**: the MoE prefill GEMM doesn't scale with batch (no fused grouped GEMM; ubatch-limited). That plateau is the real remaining structural gap → Levers 2-3. Quality caveat unchanged (MXFP4 4-bit vs vLLM FP8 8-bit; quality not yet evaluated).
### Lever 2 — `n_ubatch` / `n_batch` tuning (standalone)
Status: **DONE + SHIPPED (auto-default implemented)**
MXFP4 pp4096 vs ubatch: ub512=2994, **ub2048=3316**, ub4096=2820(noisy), ub8192=3180.
**Verdict:** prefill saturates at ub=2048; larger ubatch gives nothing. The ~3300-3650 ceiling is the **MoE GEMM kernel**, not batch size. → No more free config wins; the rest is kernel work (Levers 3-5).
**Implemented:** in `core/backend/hardware_defaults.go`, `EffectiveBatchSize` now defaults the physical batch
(n_batch→n_ubatch alias) to **2048 on Blackwell** (`xsysinfo.IsNVIDIABlackwell`, cc≥12 / sm_120/121) when the
config leaves `batch:` unset; explicit `batch:` always wins. Detection is a shared Go helper; placed at the
common ModelOptions builder so it covers the C++ llama.cpp backend too. Tests: `hardware_defaults_internal_test.go`.
### Lever 1b — Standard Q4 vs MXFP4 (what's actually MXFP4-specific)
**Q4_K_M** (17.3 GiB) vs **MXFP4** (15.9 GiB), ub2048:
| metric | Q4_K_M | MXFP4 | Q8 |
|---|---|---|---|
| decode tg128 | **93.5** | 86.4 | 62.2 |
| prefill pp512 | 2164 | **3061** | 2215 |
| prefill pp2048 | 2953 | **3441** | ~2200 |
**Verdict:** the **decode win is just "4-bit"** — plain Q4_K_M matches/beats MXFP4 on decode (both memory-bound).
MXFP4's *only* real edge is **prefill (+41% over Q4_K_M)** via Blackwell FP4 tensor cores. So for shipping,
**"4-bit quant + ubatch=2048" captures most of the win portably**; MXFP4 is a Blackwell-only prefill extra.
### Lever 3 — Fused FP4/FP8 MoE grouped GEMM (+ activation-quant fusion)
Status: **DESIGNED + PROFILED, not built** (multi-week kernel R&D). The single biggest remaining prefill win.
**Decisive measurements:**
- Prefill does NOT scale with bigger single prompts (attention O(N²) confounds): MXFP4 pp2048=3295, pp8192=1524,
pp16384=2051. So the plateau is not a batch-size fix.
- Real gap is batched many-sequence prefill: B=32 llama 3651 vs vLLM 99398 = **27×**. llama.cpp MoE prefill runs
at only **~22 effective TFLOP/s** on the GB10 — far below the GPU. Large headroom.
- **nsys (MXFP4 pp2048):** `mul_mat_q<type39>` (MoE FP4 GEMM) = **37.2%**, `quantize_mmq_mxfp4` (act-quant) = 8.0%,
`mul_mat_q<type8>` (dense/attn, still Q8) = 10.1%, flash_attn = 8.8%. The native FP4 MMA *is* engaged — the
inefficiency is the **per-expert thin-tile MMQ scheduler** + **un-fused activation quant**.
**Target (precise):** the ~45% in `mmq.cu`'s grouped MoE path (`ggml_cuda_mul_mat_q` + `ids`, `mmid.cu`). Replace
the per-expert thin-tile scheduler with a CUTLASS-style grouped GEMM (full tiles regardless of tokens/expert) and
fuse `quantize_mmq_mxfp4` into the permute/gather. Dense Q8 matmuls (10%) are the separate Lever-4 (FP8) target.
Problem (measured): the prefill ceiling is the MoE expert GEMM. Today `ggml_cuda_mul_mat_q` with `ids`
(`mmq.cu:127`) launches one grouped MMQ over a 3D grid (z = expert), but each expert's tile is thin
(~tokens/expert columns) so int8/FP4 tensor cores run underfilled; throughput is memory-bound on weight
streaming and flat vs batch.
Approach:
- Replace the per-expert thin-tile scheduler with a **CUTLASS-style grouped GEMM** that concatenates all
experts' token-blocks into one problem with per-group offsets, so tiles are always full (m16n8k64 FP4 /
m16n8k32 FP8) regardless of per-expert token count. Mirrors vLLM's `fused_moe` + cutlass grouped GEMM.
- **Fuse activation quantization into the permute/gather** (the `quantize_mmq_q8_1`/FP4 quantize is currently a
separate 3.3% kernel) so the routed activations are quantized as they're scattered into expert order.
- Files: new kernel under `ggml/src/ggml-cuda/` (e.g. `moe-grouped-gemm.cu`) + dispatch hook in
`ggml_cuda_mul_mat_id` (`ggml-cuda.cu:2622`); reuse `mmid.cu` routing/`expert_bounds`.
- Effort: high (2–4 wks expert CUDA). Risk: numerics + sm_121 tile tuning. Expected payoff: the bulk of the
prefill gap (vLLM's MoE prefill advantage is mostly this). Upstream: #18250 (NVFP4-MoE) was closed
not-planned, so this would be a LocalAI patch or a fresh upstream proposal.
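
The "one problem with per-group offsets" idea can be sketched on the host side. This is a toy illustration in Python, assuming top-1 routing and hypothetical names (`group_by_expert`, `perm`, `bounds`); it is not the `mmid.cu` code, although `bounds` plays the role this note assigns to `expert_bounds`.

```python
from itertools import accumulate


def group_by_expert(ids, n_expert):
    """Return (perm, bounds) for a list of per-token expert ids.

    perm   : token indices reordered so each expert's tokens are contiguous
    bounds : prefix sums; expert e owns perm[bounds[e]:bounds[e + 1]]
    """
    counts = [0] * n_expert
    for e in ids:
        counts[e] += 1
    bounds = [0] + list(accumulate(counts))

    cursor = bounds[:-1].copy()  # next free slot inside each expert's range
    perm = [0] * len(ids)
    for tok, e in enumerate(ids):
        perm[cursor[e]] = tok
        cursor[e] += 1
    return perm, bounds


perm, bounds = group_by_expert([2, 0, 2, 1, 0], n_expert=3)
# perm   == [1, 4, 3, 0, 2]
# bounds == [0, 2, 3, 5]
```

With the activations gathered in `perm` order, each expert's rows are contiguous, so a grouped kernel can schedule tiles for all experts from a single launch and use `bounds` to select the weight matrix for each tile.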
### Lever 4 — FP8 (e4m3) GEMM for dense layers
Status: **DESIGNED, not built** (blocked on a core ggml API change).
Problem: ggml-cuda has no FP8 matmul (only int8/FP4). vLLM runs qkv/o_proj/lm_head in FP8 on Blackwell
tensor cores. Our dense layers run int8-MMQ or f16-cuBLAS.
Approach (two options):
- (a) **cuBLASLt FP8**: route dense `mul_mat` through `cublasLtMatmul` with `CUDA_R_8F_E4M3` A/B and FP32
compute + scale pointers. Lowest kernel effort; gets library-tuned Blackwell FP8 immediately. Needs the
scale-tensor plumbing below.
- (b) **Hand-written sm_121 `mma.sync ...e4m3.e4m3.f32`** kernels in `mma.cuh`/`mmf.cu`. More control, more work.
- Prerequisite (both): the **`ggml_mul_mat_ext` / scale-tensor API** from upstream discussion #22042:
per-tensor FP8 scales don't fit the block-scaled quant struct; `MUL_MAT`/`MUL_MAT_ID` must accept optional
scale tensors. This is a cross-cutting ggml change (graph + ops + all backends' fallbacks).
- Effort: high (API change is the hard part; cuBLASLt path is then moderate). Payoff: closes dense-layer
prefill/compute gap; complements Lever 3. Note: for *this* MoE model the experts dominate, so Lever 3 > 4.
### Lever 5 — tcgen05 / wgmma-class kernels for large-prefill tiles
Status: **DESIGNED, not built** (very high effort; last increment).
Problem: ggml's tensor-core path is warp-level `mma.sync` only (no `wgmma`/`tcgen05`). Blackwell's
tensor-memory `tcgen05` MMA (what CUTLASS uses) extracts substantially more throughput at large prefill tiles.
Approach: introduce warpgroup/tcgen05 GEMM main-loops for the FP4/FP8 paths (effectively adopting CUTLASS
3.x collective mainloops for sm_120/121), used when tile size is large enough (prefill). Decode (thin) keeps
`mma.sync`.
- Effort: very high (CUTLASS-class engineering). Payoff: the final slice of large-prefill throughput; only
worth it after Levers 3–4 land. Realistically: depend on/upstream CUTLASS kernels rather than hand-roll.
---
## Paged attention — complete implementation (after kernels are fair)
The placement prototype is insufficient (measured: zero concurrency benefit). A real implementation needs all
four gaps. CPU foundation already built & verified (`PagedKVManager` P0–P3, `README.md`); the in-model parts
are unbuilt. **Build order and concrete design:**
1. **On-demand block allocation from a shared pool** (capacity win — more concurrent seqs before OOM).
- Replace `find_slot`'s ring-buffer (`llama-kv-cache.cpp:818`) with `PagedKVManager` block allocation; the
KV tensor becomes a shared block pool `[n_embd, block_size*num_blocks]`, sequences draw blocks on demand
(already prototyped on CPU: `paged_kv_manager.{h,cpp}`, `test_ggml_paged_rw.cpp`).
- The win shows up where it counts: max concurrent sequences before OOM (not yet benchmarked; the benchmark needs this step built first).
2. **Gather-read** so each seq attends only its own blocks (`get_k`/`get_v` `:1145/1165` → `ggml_get_rows`
gather into scratch, then existing attention). Numerically proven on CPU (`test_ggml_paged_attn.cpp`,
7.5e-08 vs reference). Needs `build_attn_paged` branch in `llama-graph.cpp` + Gate 0 in a real model.
3. **Continuous batching / scheduler** (no head-of-line blocking on mixed-length traffic). New scheduler in
the server slot path; admit/evict at block granularity; the dimension where paging beats llama.cpp's
current static batching. This is where the *real* concurrency win lives (vs our synthetic uniform test).
4. **Automatic prefix sharing** (block-hash dedup; `PagedKVManager::{compute_block_hashes,get_computed_blocks}`
already implemented & tested). Cross-tenant shared system prompts reuse physical blocks.
Status: design in `2026-06-19-paged-attention-llamacpp-design.md`; CPU P0–P3 done; in-model #1–#4 unbuilt.
**Then** measure concurrency in paging's real scenarios — **memory-pressured (max seqs before OOM)** and
**mixed-length continuous batching** — on the MXFP4 (fair-quant) footing, not the uniform/over-provisioned
test that (correctly) showed no benefit.
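
Items 1 and 4 of the build order can be sketched in a few lines. This is a simplified Python model with hypothetical names (`BlockPool`, `append_token`, `block_hashes`); it is not the `PagedKVManager` API, and the chained-hash scheme is an assumption modeled on the usual vLLM-style design.

```python
class BlockPool:
    """Shared pool of fixed-size KV blocks; sequences draw blocks on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (block_id, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # current block is full (or none yet)
            if not self.free:
                raise MemoryError("KV block pool exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


def block_hashes(tokens, block_size):
    """Chained hashes of the full blocks of a prompt.

    Two prompts share a physical block only if every token up to and
    including that block matches, because each hash folds in its parent.
    """
    hashes, parent = [], None
    full = len(tokens) - len(tokens) % block_size
    for i in range(0, full, block_size):
        parent = hash((parent, tuple(tokens[i:i + block_size])))
        hashes.append(parent)
    return hashes
```

A sequence of 3 tokens with `block_size=2` occupies two blocks rather than a worst-case contiguous reservation, which is the capacity effect described above; two prompts with a common system prefix produce identical leading entries in `block_hashes`, which is what block-level dedup keys on.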
> Reality check from this session's data: paged attention is a **capacity + scheduling** win, not a per-token
> speed win. On GB10 with 119 GB unified memory and uniform requests we are not memory-bound at B≤64, so the
> placement prototype showed nothing. Paging's value appears under memory pressure (many/long sequences) and
> bursty mixed-length traffic. The per-token throughput gap is a **kernel** problem (Levers 1–3), separate
> from paging.
---
## Implementation plan A — Lever 3: FP4 MoE GEMM to vLLM parity
Goal: lift batched MoE prefill from ~3.65k t/s (B=32) toward vLLM's ~99k. Root cause (profiled):
`mul_mat_q<MXFP4>` runs at ~22 effective TFLOP/s — warp-level `mma.sync`, not Blackwell tcgen05.
Cheap knobs are exhausted (ubatch saturates at 2048; `GGML_CUDA_FORCE_CUBLAS` is a no-op 3419↔3423;
tile width already full at mmq_x=128). So parity needs kernel work, done iteratively on the DGX
(`~/llama.cpp-pr24423`, editable + rebuildable; diffs captured as `patches/`).
Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build --target llama-bench` →
`llama-bench` MXFP4 pp/concurrency → record):
1. **Cheap kernel tweaks (low confidence, fast).** nwarps (occupancy), `mmq_y` tile, stream-k on/off,
FP4 load-tile path. Measure each. Likely small (<1.3x) — these don't change the warp-MMA ceiling.
- **Result (nwarps):** DEAD END. `nwarps` is locked by `static_assert(nwarps*tile_C::I == mmq_y)`
(mmq.cuh:3234) → nwarps=8 for mmq_y=128. Can't raise occupancy without co-scaling mmq_y to 256
(nwarps=16), which blows Blackwell shared-memory limits. The MMQ constants are tightly coupled;
it is not freely tunable. Confirms parity needs the kernel rewrite (phase 3), not knobs.
2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel +
a global round-trip. Tractable, ~1.1x.
- **Result:** NOT AVAILABLE as a cheap patch. `quantize_mmq_fp4_cuda` (mmq.cu:200) *already* takes
`ids_src1` — the gather is already fused into the quant. The only remaining fusion is quantize-on-load
*inside* the GEMM hot loop (intricate, ~8% ceiling, risky). ORippler's #24481 fuses the decode (MMVQ)
post-scale and intends a "BS>1" (prefill) follow-up — unwritten. Marginal; skip.
**Upstream survey (2026-06):** there is NO tcgen05/CUTLASS grouped-GEMM MoE kernel in ggml — not merged,
not in-flight, not a draft (Discussion #18369 is talk, no PR; #18250 closed not-planned). CUTLASS is not a
dependency (the profile's `cutlass_80_tensorop` is cuBLAS-internal). No fork has a portable MoE kernel
(croll83/llama.cpp-dgx is GatedDeltaNet-focused). Maintainer signal (woachk on #17906): "the path forward
is to wait for cuTile C++." So **nothing to cherry-pick; phase 3 is genuinely from-scratch.**
3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a
CUTLASS 3.x collective-mainloop grouped GEMM (sm_120a, `e2m1` block-scaled, tcgen05 tensor-memory MMA),
one problem over all experts with per-group offsets, fused act-quant. This is what vLLM/FlashInfer use.
Multi-week; the honest path to parity. Prefer **upstream ggml** (issue drafted) over a private patch.
4. **Full-model low precision.** Quantize dense layers (qkv/o_proj/lm_head, the 10% Q8) to FP4/FP8 too so
the whole prefill runs on FP4 tensor cores, not int8-MMQ.
Exit per phase: measured t/s recorded here; stop a phase when it's a dead end (recorded as such).
Matching vLLM realistically requires phase 3; phases 1–2 are the warm-up + de-risking.
## Implementation plan B — Complete paged attention (the pivot)
CPU foundation done (P0–P3, `README.md`): vLLM-parity block manager + ggml write/gather + attention
numerics + placement Gate 0 (token-identical in-model). Remaining = make it deliver the multi-tenant wins.
Phases:
1. **On-demand shared-block pool** — replace `find_slot` ring buffer (`llama-kv-cache.cpp:818`) with
`PagedKVManager` block allocation; KV tensor = `[n_embd, block_size*num_blocks]` shared pool. Win:
fit more concurrent seqs before OOM. Test: max concurrent seqs at fixed budget vs contiguous.
2. **Gather-read** (`get_k/get_v` `:1145/1165` → `ggml_get_rows` into scratch) + `build_attn_paged` branch
in `llama-graph.cpp`. Numerically proven on CPU (7.5e-08). Gate 0: token-identical multi-seq.
3. **Continuous batching / scheduler** — admit/evict at block granularity in the server slot path. The
real concurrency win on mixed-length traffic (where the placement prototype showed nothing).
4. **Automatic prefix sharing** — block-hash dedup (`PagedKVManager::{compute_block_hashes,get_computed_blocks}`
already implemented + tested). Cross-tenant shared system prompts reuse physical blocks.
Then benchmark in paging's real regimes — **memory-pressured** + **mixed-length continuous batching** — on
the MXFP4 (fair-quant) footing. Note: GB10's 119 GB unified memory means win-1 needs genuine pressure
(long/many seqs) to show; the win is capacity + scheduling, not per-token speed.
## Honest scope note
Levers 3–5 and the complete paged implementation are each substantial (weeks of expert CUDA/systems work). This doc tracks what is **measured** vs **designed** vs **not-yet-built**, and never claims a number that wasn't run on the box.

@@ -1,59 +0,0 @@
# FP4 grouped-GEMM MoE kernel (Lever 3) — scaffold + implementation plan
The one piece of work that actually closes the vLLM gap on Blackwell (GB10/sm_121). Both phases are
bottlenecked by the same kernel: `mul_mat_q<MXFP4>` (warp-level `mma.sync` grouped MMQ, ~22 TFLOP/s) is
**37%** of prefill and **54.6%** of decode-at-B=64 GPU time (`BENCHMARKS.md`). Paged attention can't touch
it (proven). The fix is a CUTLASS-3.x collective-mainloop grouped GEMM with block-scaled `e2m1` operands via
tcgen05 tensor-memory MMA — what vLLM/FlashInfer/TRT-LLM use.
## Scaffold (DONE — builds clean, default byte-identical)
Lives in the DGX checkout `~/llama.cpp-pr24423/ggml/src/ggml-cuda/` (to be rebased onto the pin as a patch /
upstreamed). Captured diff: `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`.
- `fp4-grouped-moe.{cuh,cu}` — entry `ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst) -> bool`
(true = handled, false = fall back to MMQ). Gated behind env `GGML_CUDA_FP4_GROUPED`. Currently always
returns false → **default build unchanged**.
- Hook in `ggml_cuda_mul_mat_id` (the MoE dispatch), before the `ggml_cuda_mul_mat_q(...ids...)` call:
`if (ggml_cuda_fp4_grouped_moe(...)) return;`. Builds via the `file(GLOB "*.cu")` (re-run cmake configure
after adding the file — GLOB is configure-time).
This is the integration seam. The kernel fills the stub.
## Implementation phases (each: build on GB10 → numerical parity vs `mul_mat_q<MXFP4>` → bench)
1. **Reference grouped GEMM (correctness first, slow OK).** Per-expert problem sizes + offsets from `ids`;
dequant `e2m1`+scales → BF16; loop CUTLASS (or cuBLAS) per group. Gate: output matches MMQ within fp tol
on a 2-expert toy + the real model (token-identical greedy). Establishes the harness + the data plumbing.
2. **CUTLASS GemmGrouped, sm_120a, BF16 operands.** Replace the loop with one `cutlass::gemm::device::
GemmGrouped` launch over all experts (per-group offsets). Measures the grouping win alone.
3. **Block-scaled FP4 operands (the real lever).** `e2m1` A/B with `e8m0`(MX)/`e4m3`(NV) block scales via the
Blackwell scaled-MMA collective (tcgen05 tensor-memory). This is where the TFLOP/s jumps. Needs CUTLASS
3.x + sm_120a; verify the block-scale layout matches ggml's MXFP4/NVFP4 packing.
4. **Fuse activation quant** (the F32→FP4 of src1) into the gather/permute prologue.
5. **Enable by default** on sm_120/121 when parity holds + faster; keep the env as an escape hatch.
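
The phase-1 gate (grouped result must match the existing path on a 2-expert toy) can be prototyped before any CUDA is written. The following is a pure-Python sketch under stated assumptions: top-1 routing, plain float weights instead of dequantized `e2m1`, and hypothetical function names. It only demonstrates the shape of the parity harness, not the kernel.

```python
def matvec(W, x):
    """Multiply a row-major matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def moe_reference(experts, xs, ids):
    """Per-token path: each token picks its expert independently."""
    return [matvec(experts[e], x) for x, e in zip(xs, ids)]


def moe_grouped(experts, xs, ids):
    """Grouped path: sort tokens by expert, then process one group at a time."""
    order = sorted(range(len(ids)), key=lambda t: ids[t])  # stable sort
    out = [None] * len(xs)
    start = 0
    while start < len(order):
        e = ids[order[start]]
        end = start
        while end < len(order) and ids[order[end]] == e:
            end += 1
        W = experts[e]                 # loaded once per group
        for t in order[start:end]:
            out[t] = matvec(W, xs[t])  # scattered back to token order
        start = end
    return out


def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


experts = [[[1.0, 0.0], [0.0, 2.0]],   # expert 0
           [[0.5, 0.5], [-1.0, 1.0]]]  # expert 1
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ids = [1, 0, 1]
assert max_abs_diff(moe_reference(experts, xs, ids),
                    moe_grouped(experts, xs, ids)) < 1e-6
```

In the real harness the reference side would be `mul_mat_q<MXFP4>` and the tolerance would have to account for the BF16 dequant path; the tolerance value used here is arbitrary.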
## Dependencies / decisions
- **CUTLASS is not currently a ggml dependency** (the profile's `cutlass_80_tensorop` is cuBLAS-internal).
Adding it = submodule/fetch + include dir, gated to CUDA sm_120+. Float the approach with ggml maintainers
early (Discussion #18369 is the home; JohannesGaessler asked to discuss arch before big kernel work).
- Target sm_120a/121a (consumer Blackwell). Datacenter Blackwell (sm_100) is a separate tile config.
- Risk: needs ncu-driven iteration on the GB10; this is multi-week, expert-CUDA. No upstream base to fork
(exhaustive search confirmed). Net-new value upstream.
## DENSE scope — RESOLVED (TODO #28, benchmarked): dense needs an FP4 GEMM too
Benchmarked Qwen3-32B dense, vLLM W4A16 vs llama.cpp Q4_K_M (`BENCHMARKS.md`). **Dense prefill is 7.6–32×
behind** (llama int8-MMQ plateaus ~765 t/s; vLLM FP4 scales to 24.4k); decode ~parity at B=1, 2.2× at B=64.
So the kernel track is **two kernels, not one**:
- **(a) Dense FP4 GEMM** — a plain non-grouped CUTLASS/tcgen05 block-scaled FP4 GEMM. **Simpler than grouped;
land this FIRST** — it's the easier first kernel, benefits every dense model, and de-risks the FP4 collective
before the grouped variant. Hook: the non-MoE `ggml_cuda_mul_mat_q` (no `ids`) path.
- **(b) MoE grouped FP4 GEMM** — the scaffold above (`ggml_cuda_fp4_grouped_moe`), per-expert offsets.
Both share the same block-scaled `e2m1` collective; (a) is (b) with one group. Suggested order: build (a),
prove the FP4 collective + parity harness, then generalize to (b). (Aside: full W4A4 NVFP4 doesn't run on
GB10 today — FlashInfer ships no FP4 cubins for sm_121, so the dense `mm_fp4` kernel hangs/returns zeros; the
W4A16 Marlin path is the fast, correct one and is the fair comparison. See `BENCHMARKS.md` for the root cause.)

@@ -1,140 +0,0 @@
# MXFP4-dense vs Q4_K_M quality check (Qwen3, GB10 / DGX Spark)
## Question
MXFP4-quantized **dense** Qwen3-32B is measurably faster on GB10 (Blackwell) than
Q4_K_M: ~1.58x concurrent prefill, ~1.2x decode, for free (just a requantize that
routes onto the FP4-MMA kernel). Before LocalAI recommends MXFP4-dense as a Blackwell
default, we must confirm its **quality is acceptable versus Q4_K** (Q4_K is normally the
stronger 4-bit format).
Critical caveat going in: the pre-existing `~/bench/q3-32b-mxfp4-dense.gguf` was built
with `--allow-requantize`, so it was suspected to be **double-quantized** (Q4_K_M ->
MXFP4), which would unfairly penalize MXFP4. The goal here was a *fair* answer.
## Verdict
**Do NOT recommend MXFP4-dense as a quality-equivalent replacement for Q4_K on
Blackwell.** A clean apples-to-apples test (same BF16 source, both 4-bit, no imatrix)
shows MXFP4-dense carries a **large** quality penalty that Q4_K does not:
- Q4_K_M costs **+2.6%** perplexity vs the BF16 baseline.
- MXFP4-dense costs **+30.8%** perplexity vs the BF16 baseline (i.e. **+27.5% worse
than Q4_K**).
The double-quant suspicion was correct but it was **not** the main culprit: even a clean
MXFP4-from-BF16 is dramatically worse than Q4_K. The ~1.58x prefill / ~1.2x decode
speedup is real, but it is not free on quality. MXFP4-dense output is still coherent (not
gibberish), so it is usable where raw throughput dominates and a quality hit is
acceptable, but it must not be presented as a drop-in, quality-neutral Q4_K replacement.
## Evidence
### 1. Provenance of the existing 32B MXFP4 (it is double-quant)
`~/dense_mxfp4.sh` (mtime matches the `q3-32b-mxfp4-dense.gguf` mtime, Jun 20 09:47)
created it:
```
SRC=$HOME/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf # <-- source is Q4_K_M, not F16/BF16
OUT=$HOME/bench/q3-32b-mxfp4-dense.gguf
$QB --allow-requantize --tensor-type "attn=mxfp4" --tensor-type "ffn=mxfp4" \
"$SRC" "$OUT" MXFP4_MOE
```
Confirmed **double-quantized** (Q4_K_M -> MXFP4). Any PPL measured on this file
overstates MXFP4's true penalty, so the 32B number below is a loose upper bound, not the
fair answer.
### 2. 32B quick read (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99)
`llama-perplexity`, PR build `~/llama.cpp-pr24423/build` (sm_121):
| 32B model | PPL | vs Q4_K |
|---|---|---|
| Qwen3-32B-Q4_K_M | **7.3865** +/- 0.177 | - |
| q3-32b-mxfp4-dense (double-quant) | **8.4638** +/- 0.206 | +14.6% |
MXFP4 is much worse than Q4_K here, **and** it is double-quant, so the quick read is
unfair -> escalated to a clean small-model comparison.
### 3. Fair comparison: clean small dense model (Qwen3-4B BF16)
The MXFP4-vs-Q4_K delta is a *format* property and roughly model-size-independent, so a
small model gives a fast, clean answer. Downloaded `Qwen3-4B-BF16.gguf` (unsloth, ~7.7
GiB) and quantized it **from that same BF16 source** to both formats with the identical
recipe used for the 32B (no `--allow-requantize` needed, no imatrix on either side):
```
llama-quantize q3-4b-bf16.gguf q3-4b-q4km.gguf Q4_K_M
llama-quantize --tensor-type attn=mxfp4 --tensor-type ffn=mxfp4 \
q3-4b-bf16.gguf q3-4b-mxfp4.gguf MXFP4_MOE
```
Perplexity (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99):
| Qwen3-4B | size | PPL | vs BF16 | vs Q4_K |
|---|---|---|---|---|
| BF16 (baseline) | 7672 MiB | **13.3188** +/- 0.416 | - | - |
| Q4_K_M | 2497 MiB | **13.6605** +/- 0.426 | **+2.57%** | - |
| MXFP4 (clean) | 2236 MiB (4.66 BPW) | **17.4183** +/- 0.561 | **+30.78%** | **+27.5%** |
This is the apples-to-apples quality answer: **clean MXFP4-from-BF16 is ~12x more lossy
than Q4_K relative to the BF16 baseline** (30.8% vs 2.6%). Notably the clean-4B MXFP4-vs-
Q4_K gap (+27.5%) is *wider* than the 32B double-quant gap (+14.6%), consistent with
smaller models being more quantization-sensitive - the double-quant did not invent the
problem, it is intrinsic to the format as quantized by `llama-quantize`.
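
The percentages in the table follow directly from the three measured perplexities. A short Python check (the helper name `pct_increase` is just for this sketch):

```python
ppl = {"BF16": 13.3188, "Q4_K_M": 13.6605, "MXFP4": 17.4183}


def pct_increase(value, baseline):
    """Relative increase of `value` over `baseline`, in percent."""
    return (value / baseline - 1.0) * 100.0


q4_vs_bf16 = pct_increase(ppl["Q4_K_M"], ppl["BF16"])   # about 2.6
mx_vs_bf16 = pct_increase(ppl["MXFP4"], ppl["BF16"])    # about 30.8
mx_vs_q4 = pct_increase(ppl["MXFP4"], ppl["Q4_K_M"])    # about 27.5
loss_ratio = mx_vs_bf16 / q4_vs_bf16                     # about 12

print(f"Q4_K_M vs BF16: +{q4_vs_bf16:.2f}%")
print(f"MXFP4  vs BF16: +{mx_vs_bf16:.2f}%")
print(f"MXFP4  vs Q4_K: +{mx_vs_q4:.1f}%")
print(f"MXFP4 loses ~{loss_ratio:.0f}x more than Q4_K_M relative to BF16")
```

The "~12x" figure is the ratio of the two BF16-relative penalties, not a ratio of perplexities; the error bars in the table are not propagated here.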
### 4. Coherence spot-check (32B, llama-simple, n=60)
MXFP4-dense 32B is fully coherent, not degraded gibberish:
- "The capital of France is" -> MXFP4: "...Paris, is located near the Seine River..."
(correct); Q4_K similar.
- "Q: What is 17 multiplied by 23? A:" -> MXFP4 reasons via the distributive property
(sound); Q4_K answers 391 directly (correct).
- "def fibonacci(n):" -> both emit valid Python.
So the quality cost shows up as measurably higher perplexity (and would surface on harder
/ longer tasks), not as obviously broken text at short generation lengths.
## Why
`MXFP4_MOE` is a 4-bit float format (E2M1 values, shared E8M0 scale per block of 32,
round-to-nearest) designed for MoE expert tensors (gpt-oss et al.) with a coarse
per-block scale. Q4_K uses 6-bit superblock scales plus per-sub-block mins - materially
better for dense attention/FFN weights. Forcing MXFP4 onto dense layers to reach the FP4
kernel trades ~1.58x prefill for a large accuracy loss. The FP4-MMA speed path is real,
but the weights it accepts (MXFP4 here) are lossy for dense.
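
To make the format concrete, here is a minimal Python sketch of MX-style FP4 block quantization: one shared power-of-two scale per block and round-to-nearest onto the E2M1 magnitudes. The scale rule `floor(log2(amax)) - 2` follows the OCP MX convention as an assumption; it is not a transcription of ggml's `llama-quantize` code, and tie-breaking is left to Python's `min`.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is a separate bit).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Return (shared_exponent, quantized values in units of 2**exponent)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # 6.0 = 1.5 * 2**2, so the element format's top binade has exponent 2.
    exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exp
    q = []
    for v in block:
        # Nearest representable magnitude; values above 6*scale saturate at 6.
        mag = min(E2M1, key=lambda g: abs(abs(v) / scale - g))
        q.append(math.copysign(mag, v))
    return exp, q


def dequantize_block(exp, q):
    return [v * 2.0 ** exp for v in q]


# A 32-value block: seven values already on the grid, one (2.4) that is not.
block = [0.5, -1.0, 1.5, 2.0, 3.0, -4.0, 6.0, 2.4] + [0.0] * 24
exp, q = quantize_block(block)
restored = dequantize_block(exp, q)
# exp == 0; grid values survive exactly; 2.4 snaps to 2.0.
```

The only per-block freedom is the power-of-two exponent, so every weight in a 32-wide block shares the same coarse grid, which is the property contrasted with Q4_K's finer scales and mins above.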
## Caveat, stated precisely
This measures **llama.cpp's `llama-quantize` MXFP4** (OCP MX FP4, RTN, **no imatrix**)
against **llama.cpp's Q4_K_M** (k-quant superblocks, also no imatrix here). It is a fair
format-vs-format comparison of exactly what LocalAI would ship if it routed a requantize
through this path. It does **not** claim FP4 is fundamentally unviable on Blackwell:
- An imatrix-aware MXFP4, or a better FP4 format with two-level scaling
(**NVFP4** - there are already `q3-32b-nvfp4` / `q3-32b-nvfp4a16` dirs on the box),
may close much of this gap and is the more promising Blackwell FP4 path to evaluate.
- The result is for Qwen3 dense; other families may differ in magnitude but the
format-level disadvantage of plain MXFP4 RTN vs Q4_K is expected to hold.
## Recommendation
- **Do not** ship a blanket "use MXFP4-dense on Blackwell" recommendation as a Q4_K
quality equivalent. The ~1.58x prefill / ~1.2x decode win comes with a real ~30% PPL
inflation (vs ~2.6% for Q4_K). Q4_K_M stays the right dense default on Blackwell.
- If exposing MXFP4-dense at all, gate it as an explicit **throughput-over-quality**
option with the perplexity caveat surfaced, not a default.
- MXFP4/FP4 remains correct where the model is trained for it (MoE / gpt-oss-style).
Pursue **NVFP4** (and/or imatrix-aware FP4) as the quality-competitive Blackwell FP4
format before making any FP4-dense recommendation.
## Reproduction (DGX Spark, GB10, build `~/llama.cpp-pr24423/build`, sm_121)
- Dataset: `~/wikitext-2-raw/wiki.test.raw` (wikitext-2-raw-v1 test).
- 32B: `~/ppl32b.sh` -> `~/ppl32b.out`; coherence `~/coh32b.sh` -> `~/coh32b.out`.
- Clean 4B: `~/fair4b.sh` -> `~/fair4b.out` (quantize + 3x perplexity).
- All runs `-ngl 99`, `--chunks 50`, `-c 512`. GB10 thermal-throttles but PPL is a
correctness metric, so thermal state does not affect these numbers.

@@ -1,41 +0,0 @@
CXX ?= g++
CXXFLAGS ?= -std=c++17 -O2 -Wall -Wextra -I.
TESTS = test_free_block_queue test_block_pool test_paged_kv_manager test_prefix_cache
BINS = $(addprefix tests/,$(TESTS))
all: $(BINS)
tests/%: tests/%.cpp paged_kv_manager.cpp paged_kv_manager.h
	$(CXX) $(CXXFLAGS) -o $@ $< paged_kv_manager.cpp
check: all
	@for t in $(BINS); do echo "== $$t =="; ./$$t || exit 1; done
paged-bench: paged-bench.cpp paged_kv_manager.cpp paged_kv_manager.h
	$(CXX) $(CXXFLAGS) -o $@ paged-bench.cpp paged_kv_manager.cpp
bench: paged-bench
	./paged-bench
# --- Optional ggml integration test (Phase 1: paged write/gather mechanism) ---
# Requires a built ggml. Override these to point at your checkout / build:
# make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build>
GGML_SRC ?= ../../llama-cpp-fallback-build/llama.cpp/ggml
GGML_BUILD ?= /tmp/ggml-build
GGML_LIBDIR = $(GGML_BUILD)/src
GGML_TESTS = test_ggml_paged_rw test_ggml_paged_attn
GGML_BINS = $(addprefix tests/,$(GGML_TESTS))
tests/test_ggml_%: tests/test_ggml_%.cpp paged_kv_manager.cpp paged_kv_manager.h
	$(CXX) $(CXXFLAGS) -I$(GGML_SRC)/include -o $@ $< paged_kv_manager.cpp \
		-L$(GGML_LIBDIR) -lggml -lggml-base -lggml-cpu -Wl,-rpath,$(GGML_LIBDIR)
ggml-check: $(GGML_BINS)
	@for t in $(GGML_BINS); do echo "== $$t =="; ./$$t || exit 1; done
clean:
	rm -f $(BINS) $(GGML_BINS) paged-bench
.PHONY: all check ggml-check clean

@@ -1,114 +0,0 @@
# NVFP4-dense on DGX Spark (GB10, sm_121): is it the quality-preserving FP4 win MXFP4 wasn't?
Test rig: DGX Spark GB10 (sm_121), `~/llama.cpp-pr24423/build` (PR #24423, FP4 MMA + NVFP4
kernel), wikitext-2-raw, clean BF16 source `q3-4b-bf16.gguf` (the same source used for the
established MXFP4 / Q4_K fair test). NVFP4 and all comparison quants were produced clean from
BF16, no imatrix.
## Verdict (short)
YES on all the load-bearing questions, with one honest caveat:
1. llama.cpp CAN produce an NVFP4 GGUF.
2. NVFP4 quality is Q4_K-class, NOT MXFP4-class: +7.4% PPL vs BF16 (MXFP4 was +30.8%). It is
slightly behind Q4_K (+4.8% relative) but in the same ballpark, not on the quality cliff.
3. NVFP4 routes onto the FP4 MMA kernel and gets the FP4 prefill speedup: ~1.29x Q4_K on the
4B, tracking MXFP4 to within 5% (MXFP4 hit 1.58x on the 32B; NVFP4 should track it there too).
4. Output is coherent.
Bottom line: NVFP4-dense IS the quality-preserving FP4 win MXFP4 wasn't. It delivers
essentially the full FP4 prefill speedup at roughly Q4_K quality, where MXFP4 paid a 27% quality
tax for the same speed. LocalAI can support/recommend NVFP4-dense on Blackwell for prefill-bound
workloads, with the caveat that it is marginally (~5%) behind Q4_K on perplexity; an imatrix-guided
NVFP4 quant would likely close most of that remaining gap.
## 1. Feasibility: can llama-quantize produce an NVFP4 GGUF? YES
- The type exists with a full quantize path, not just a kernel:
- `GGML_TYPE_NVFP4 = 40` (`ggml.h`), `GGML_FTYPE_MOSTLY_NVFP4 = 26`
- `quantize_nvfp4` / `quantize_row_nvfp4_ref` / `dequantize_row_nvfp4` registered in `ggml.c`
- type_name is `"nvfp4"`, block `QK_NVFP4` (per-16 FP8/E4M3 block scale + global scale)
- NVFP4 is NOT a top-level `llama-quantize` ftype (no `NVFP4` entry in the allowed-types list,
no reference in `tools/quantize/quantize.cpp` or `src/llama-quant.cpp`), BUT
`--tensor-type name=nvfp4` resolves it: `parse_ggml_type` matches the arg against
`ggml_type_name(...)`, which returns `"nvfp4"`. This is the exact same mechanism that produced
MXFP4-dense.
- Recipe used (mirrors the MXFP4-dense GGUF byte-for-byte in structure: token_embd Q8_0, all
norms F32, all 2D attn+ffn weights to FP4):
```
llama-quantize --tensor-type "attn=nvfp4" --tensor-type "ffn=nvfp4" \
q3-4b-bf16.gguf q3-4b-nvfp4.gguf Q8_0
```
Result: `q3-4b-nvfp4.gguf`, 2343.93 MiB, 4.89 BPW, ~5 s. (MXFP4-dense was 2350 MiB; same shape.)
Every `blk.N.attn_*` and `blk.N.ffn_*` reported `converting to nvfp4`; token_embd Q8_0; norms F32.
The on-box `~/bench/q3-32b-nvfp4*` dirs are vLLM HF safetensors (already 4-bit), not GGUF, and
do not feed llama.cpp - confirmed and irrelevant.
## 2. Quality (decisive): NVFP4 is Q4_K-class, not MXFP4-class
`llama-perplexity -f wiki.test.raw --chunks 50 -c 512 -ngl 99`, all clean from the same BF16 4B:
| Quant | PPL | vs BF16 | vs Q4_K |
|---------|--------|----------|----------|
| BF16 | 13.32 | - | - |
| Q4_K_M | 13.66 | +2.6% | - |
| NVFP4 | 14.31 | +7.4% | +4.8% |
| MXFP4 | 17.42 | +30.8% | +27.5% |
(NVFP4 measured this run: Final PPL = 14.3097 +/- 0.4457.)
NVFP4 lands much closer to Q4_K (gap 0.65 PPL) than to MXFP4 (gap 3.11 PPL). MXFP4's finer
sibling delivers: the two-level scaling (per-16 FP8 block scale + global scale) recovers almost
all of the quality MXFP4's coarse per-32 E8M0 scale threw away. It is not quite Q4_K, but it is
firmly in the "acceptable 4-bit" regime, not the lossy one.
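
The scale-granularity argument can be illustrated with a hand-built block. The Python sketch below is an idealisation with hypothetical helper names: the "fine" scale is an unrestricted float (real NVFP4 stores an FP8/E4M3 block scale plus a global scale), and the block is deliberately chosen so its values sit on the finer grid. It shows the mechanism only; it does not reproduce the measured PPL gap.

```python
import math

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 E2M1 magnitudes


def fake_quant(block, scale):
    """Round each value to the nearest E2M1 magnitude times `scale`."""
    return [math.copysign(min(E2M1, key=lambda g: abs(abs(v) / scale - g)), v)
            * scale for v in block]


def pow2_scale(block):
    """MXFP4-style power-of-two scale (assumes a non-zero block)."""
    amax = max(abs(v) for v in block)
    return 2.0 ** (math.floor(math.log2(amax)) - 2)


def fine_scale(block):
    """Idealised NVFP4-style scale: maps the block maximum onto 6.0."""
    return max(abs(v) for v in block) / 6.0


def max_err(block, scale):
    return max(abs(a - b) for a, b in zip(block, fake_quant(block, scale)))


block = [1.8, -0.9, 0.45, 0.15] * 4  # 16 values, amax = 1.8

coarse = max_err(block, pow2_scale(block))  # scale 0.25: 1.8 saturates to 1.5
fine = max_err(block, fine_scale(block))    # scale ~0.3: every value on grid
```

With the power-of-two scale the block maximum lands at 7.2 grid units and is clipped to 6, giving an error of about 0.3; with a scale that can track the maximum, the same block is represented almost exactly.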
## 3. Speed: NVFP4 routes onto the FP4 MMA kernel
No clean BF16 32B was on the box (only the vLLM NVFP4 safetensors and the Q4_K/MXFP4 32B GGUFs),
so per the brief this is the 4B speed signal - a 3-way cold A/B on the SAME 4B model, 45 s
cooldowns between runs (`-npp 512 -ntg 128 -npl 8,32,64 -b 2048 -ub 2048 -ngl 99`):
Prefill S_PP (t/s):
| B | Q4_K | NVFP4 | MXFP4 | NVFP4 / Q4_K | NVFP4 / MXFP4 |
|-----|--------|--------|--------|--------------|---------------|
| 8 | 4862 | 6313 | 6602 | 1.30x | 0.96x |
| 32 | 5020 | 6497 | 6836 | 1.29x | 0.95x |
| 64 | 5031 | 6490 | 6831 | 1.29x | 0.95x |
- NVFP4 prefill is within ~5% of MXFP4 at every batch size -> both land on the same FP4 MMA
kernel. NVFP4 does NOT fall back to a slow path.
- NVFP4 beats Q4_K's int8-MMQ prefill by ~1.29x on the 4B. The established 32B figures were
Q4_K S_PP ~767 and MXFP4 ~1209 (1.58x); since NVFP4 tracks MXFP4 to within 5%, NVFP4 on the
32B should likewise approach ~1.5x. (The 4B shows a smaller multiplier than the 32B because a
smaller model spends proportionally less time in the matmul the FP4 kernel accelerates.)
- Token-gen (S_TG) is comparable across all three (memory-bound), as expected.
## 4. Coherence
`llama-simple` (llama-cli hangs - avoided), NVFP4 4B:
- "The capital of France is" -> "...Paris. ...Germany is in Berlin. ...Italy is in Rome.
...Spain is in Madrid. ...Netherlands is in Amsterdam." (all correct)
- "Q: What is 17 plus 25? A:" -> "42." (correct)
Coherent and factually accurate.
## Recommendation for LocalAI on Blackwell
Support and recommend NVFP4-dense as the FP4 prefill option on Blackwell (sm_120/121), produced
via `--tensor-type attn=nvfp4 --tensor-type ffn=nvfp4` over a BF16 source (token_embd Q8_0,
norms F32). It gives ~the full FP4 prefill speedup (FP4 MMA kernel, ~1.3x Q4_K on 4B and
expected ~1.5x on larger models) at roughly Q4_K quality (+7.4% PPL vs BF16). This is the win
MXFP4 failed to deliver: MXFP4 paid a +30.8% quality tax for the same speed and was rejected.
Caveats / follow-ups:
- NVFP4 is still ~4.8% behind Q4_K on PPL. For quality-first deployments where the prefill win
does not matter, Q4_K_M remains the better pick.
- These NVFP4/Q4_K numbers are clean (no imatrix). An imatrix-guided NVFP4 quant is the obvious
next step and would likely close most of the remaining gap to Q4_K - worth measuring before a
blanket recommendation.
- A direct 32B NVFP4-vs-Q4_K speed run (needs a clean BF16 32B GGUF, not on the box) would
confirm the projected ~1.5x; the 4B signal plus the MXFP4-tracking already make this very likely.

@@ -1,115 +0,0 @@
# Paged KV at high concurrency on a single GB10 - the datacenter-scale test
Closes the open question left by `PR22569_EVAL.md`: that eval could not test the
"paged KV unlocks thousands of sequences" thesis because **both** KV paths hit the
`LLAMA_MAX_SEQ=256` compile cap, and the 32B-dense model it used is compute-bound
(plateaus by npl=128 for an unrelated reason). This run removes both confounders:
**recompiled `LLAMA_MAX_SEQ=2048`** and used a **bandwidth-bound model (Qwen3-1.7B-Q8_0)**
where decode aggregate is free to keep climbing with concurrency.
Hardware: NVIDIA GB10 (sm_121, 119 GiB unified LPDDR5X, ~273 GB/s). Build:
`~/llama.cpp-pr22569` (PR #22569 paged path + the reshape fix), `LLAMA_MAX_SEQ=2048`,
sm_121 Release. Contiguous = `llama-batched-bench` (unified KV) `S_TG`. Paged =
`llama-paged -kvp --fit off` `aggregate tps`. `npp=16, ntg/n_predict=128, b=ub=2048,
-ngl 99`. Cold runs, 12 s cooldowns.
## TL;DR for the decision
**On a single GB10, paged KV does NOT deliver a throughput or concurrency win - the
aggregate-decode ceiling is set by the hardware, not the KV layout, and contiguous KV
already reaches it.** Measured across two model regimes and concurrency up to 2048
sequences:

- Aggregate decode **plateaus** once the GPU saturates - for both KV layouts:
  - 32B-dense (compute-bound): ~540 t/s, flat from npl=128 (prior eval).
  - 1.7B (bandwidth-bound): ~3,200-3,700 t/s, flat from npl=512 (this run).
- Paged and contiguous land at the **same ceiling**; PR #22569's paged op was 12-13%
  *slower* than the mature contiguous flash-attention path at equal concurrency on 32B.
- Pushing concurrency past the plateau is **actively harmful to UX**: per-sequence
  throughput collapses (23 -> 1.9 tok/s) and TTFT explodes (0.6 s -> 4.3 s avg, **64 s
  max**) while aggregate stays flat.

**vLLM's ~24k aggregate headline is unreachable on a single GB10 with these models
regardless of KV layout** - it needs aggregate memory bandwidth / compute that one GB10
does not have (i.e. many GPUs). Paged KV is a **memory-capacity / anti-fragmentation /
prefix-sharing** feature, not a single-node throughput-ceiling feature. The static
single-model benchmark deliberately does not create the memory-pressure regime where
paging pays off, which is exactly why no win appears.
## The numbers
### Aggregate decode vs concurrency, Qwen3-1.7B-Q8_0 (bandwidth-bound), `LLAMA_MAX_SEQ=2048`
| npl | contiguous `S_TG` (t/s) | paged `aggregate tps` (t/s) | paged per-seq tps | paged TTFT avg / max |
|----:|------------------------:|----------------------------:|------------------:|---------------------:|
| 128 | 2,643 | 2,887 | 23-25 | - |
| 256 | 2,925 | - | - | - |
| 512 | 3,215 | 3,637 | 7.2-7.8 | 0.57 s / 0.90 s |
| 1024 | 3,118 | 3,695 | 3.7-4.2 | 1.17 s / 2.37 s |
| 2048 | (not run) | 3,608 | 1.9-14.6 | 4.28 s / **63.8 s** |

Both paths flatten by npl~512. 8x more concurrency (128->1024) buys contiguous only
**+18%** and paged **+28%**, then both stop. (The two tools meter slightly differently -
`llama-paged` aggregate vs `batched-bench` decode-only `S_TG` - so the small
paged-vs-contiguous offset is not a real paged advantage; the prior apples-to-apples
32B eval had paged 12-13% *behind*.)
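
The two percentages can be re-derived from the table. A quick check, using only the measured aggregate values (the helper name `gain_pct` is ours):

```python
# Aggregate decode (t/s) from the table above, Qwen3-1.7B-Q8_0.
contiguous = {128: 2643, 256: 2925, 512: 3215, 1024: 3118}
paged = {128: 2887, 512: 3637, 1024: 3695, 2048: 3608}

def gain_pct(series, lo, hi):
    """Percent change in aggregate t/s going from concurrency lo to hi."""
    return 100.0 * (series[hi] / series[lo] - 1.0)

contig_gain = gain_pct(contiguous, 128, 1024)
paged_gain = gain_pct(paged, 128, 1024)
print(f"contiguous 128->1024: {contig_gain:+.0f}%")
print(f"paged      128->1024: {paged_gain:+.0f}%")
```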
### Why it plateaus (the hardware ceiling, not the KV layout)
Decode is memory-bandwidth-bound: each step reads the model weights once and shares that
read across the whole batch. Once concurrency is high enough that the shared weight-read
is amortized, the per-step cost is dominated by KV reads + attention + host work, none of
which paging makes cheaper. The GB10's ~273 GB/s sets the floor on step time; at the plateau the GPU
is ~saturated. Adding sequences past that point cannot raise aggregate - it only divides
the same throughput across more users (per-seq tps falls, TTFT rises). The 32B-dense case
plateaus even earlier (npl=128) because it saturates on **compute** (weight matmuls), not
bandwidth - the kernel decomposition is in `VLLM_DECOMPOSITION.md`.
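
A toy cost model makes the plateau mechanism concrete. This is a sketch under assumed constants: `WEIGHT_READ_S` and `PER_SEQ_S` are hypothetical illustration values, not fitted to the measurements above. Each decode step pays one shared weight read plus a per-sequence cost, so aggregate throughput approaches `1 / PER_SEQ_S` however many sequences are added, while per-sequence throughput keeps falling.

```python
def aggregate_tps(n_seq, weight_read_s, per_seq_s):
    """Tokens/s for a batch of n_seq, one token per sequence per step."""
    step_s = weight_read_s + n_seq * per_seq_s
    return n_seq / step_s

# Hypothetical constants, for illustration only.
WEIGHT_READ_S = 0.008    # shared weight read per step
PER_SEQ_S = 0.00025      # KV read + attention + host work per sequence

for n in (1, 128, 512, 2048):
    agg = aggregate_tps(n, WEIGHT_READ_S, PER_SEQ_S)
    print(f"npl={n:5d} aggregate={agg:7.0f} t/s per-seq={agg / n:6.1f} t/s")
```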
## What paged KV is actually for (the honest, deliverable value)
Paging never helps a static, uniform-length, single-model benchmark on a GPU with memory
to spare - there is no fragmentation and no over-reservation to reclaim. Its real wins,
which require the regime this hardware+benchmark does not exercise, are:
1. **Concurrent-tenant capacity under memory pressure.** Block KV fits more *diverse*
in-flight sequences (variable, dynamically arriving/leaving contexts) without the
contiguous path's per-slot reservation/fragmentation. Pays off when KV memory, not
compute/bandwidth, is the binding constraint - i.e. at multi-GPU datacenter scale or
with very long/variable contexts.
2. **Cross-request prefix sharing.** A chained-hash block cache shares identical system
prompts / RAG preambles across requests (vLLM's `block_pool.py` + block-hash map). A
real token-budget win for shared-prefix workloads; PR #22569 defers this to a
non-existent Phase 2 (our from-scratch P0 has the machinery).
These are measured as **max concurrent distinct tenants** and **KV memory saved**, not as
aggregate tok/s on one model. They do not move the single-GB10 throughput ceiling.
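
The chained-hash idea behind prefix sharing can be sketched in a few lines. This is a minimal illustration, not vLLM's or PR #22569's implementation: `BLOCK`, `block_hashes`, and the choice of SHA-256 are assumptions made for the example. Each full block's key depends on its own tokens and on the previous block's key, so two requests share a block only if their whole prefix up to that block matches.

```python
import hashlib

BLOCK = 4  # tokens per KV block (hypothetical; real block sizes are larger)

def block_hashes(tokens):
    """Return one chained hash per *full* block of the token sequence."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + b"|" + chunk).digest()
        hashes.append(prev)
    return hashes

system = [1, 2, 3, 4, 5, 6, 7, 8]      # shared system prompt (2 blocks)
req_a = system + [10, 11, 12, 13]      # unique suffix A
req_b = system + [20, 21, 22, 23]      # unique suffix B

ha, hb = block_hashes(req_a), block_hashes(req_b)
shared = sum(1 for x, y in zip(ha, hb) if x == y)
print(f"blocks: a={len(ha)} b={len(hb)} shared_prefix_blocks={shared}")
```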
## Recommendation
- **Do not pitch paged KV as a single-GB10 throughput lever** - it is measured flat to
the contiguous ceiling (and PR #22569 is slower). Doing so would not survive a
benchmark.
- **The single-GB10 throughput story is already strong without paging:** llama.cpp is
ahead of vLLM single-stream (MXFP4 1153 > 800) and at ~70-81% of vLLM aggregate at
npl<=128 with a near-identical batching multiplier (`VLLM_DECOMPOSITION.md`). Ship the
MXFP4/NVFP4-dense prefill win (`NVFP4_TEST.md`) - that is the cheap, real, defensible
Blackwell number.
- **If datacenter-scale (thousands of concurrent tenants) is the genuine target,** the
lever is **multiple GPUs** plus paged KV's **capacity + prefix-sharing** features -
framed and measured as concurrent-tenant capacity and KV memory saved, on a
variable-context / shared-prefix workload. A single GB10 cannot produce the ~24k
aggregate regardless of KV layout; that is a fleet-level result.
## Reproduction (DGX, `~/llama.cpp-pr22569`, `LLAMA_MAX_SEQ=2048`)
```sh
M=~/bench/draft17/Qwen3-1.7B-Q8_0.gguf
# contiguous
for NPL in 128 256 512 1024; do
  ./build/bin/llama-batched-bench -m "$M" -npp 16 -ntg 128 -npl "$NPL" -ngl 99 \
    -b 2048 -ub 2048 -fa on -c $((NPL*160))
done
# paged
for NPL in 512 1024 2048; do
  ./build/bin/llama-paged -m "$M" -kvp --fit off -ngpub 32768 -ncpub 128 \
    -np "$NPL" -ns "$NPL" -n 128 -b 2048 -ub 2048 -ngl 99
done
```


@@ -1,170 +0,0 @@
# Paged KV: target-readiness (correctness, dynamic benchmark, 2xH200 projection)
Target hardware: **~2x H200** (281 GB HBM3e total, ~4.8 TB/s per GPU). The GB10 box is
the *test* rig, not the target - and several earlier "no win" findings are GB10-specific
artifacts (low bandwidth caps throughput before KV memory ever binds). This document
delivers the three things needed to push paged KV toward the real target:
1. **Correctness** of the paged path - verified (and a blocking bug found + fixed).
2. **A dynamic-load benchmark** that actually exercises where paging wins (`paged-loadgen.cpp`).
3. **A projection** of the paged-KV payoff on 2x H200, grounded in measured GB10 numbers.
---
## 1. Correctness: PASS (after fixing the auto-fit OOM)
`test-paged-kv-e2e` checks the paged decode path against the contiguous reference
(greedy argmax + top-5 set overlap >= 4). On the box it was previously **unverified** -
it aborted at context creation. Root cause found:
- `common_fit_paged_kv_blocks` (`common/common.cpp:1144`) **unconditionally overrides**
`n_gpu_blocks` from `ggml_backend_dev_memory`, which **over-reports free VRAM on the
GB10 integrated/unified device** (it sized **~245 GB of KV on a 119 GB box** ->
`cudaMalloc` OOM -> `GGML_ASSERT` abort in `llama-kv-cache-paged.cpp:74`). The test's
explicit `n_gpu_blocks=64` was being clobbered because `params.fit_params` defaults on.
**Fix (item-1 patch, applied on the box):**
```diff
--- a/tests/test-paged-kv-e2e.cpp
+++ b/tests/test-paged-kv-e2e.cpp
@@ run_paged()
     params.kv_paged = true;
+    params.fit_params = false; // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
     params.n_gpu_blocks = 64;
```
**Result (Qwen3-0.6B-Q8_0, GB10):**
```
test-paged-kv-e2e: top-5 argmax match: ref=3743 paged=3743
test-paged-kv-e2e: top-5 set overlap: 5/5 (require >= 4)
test-paged-kv-e2e: PASSED
```
The paged op is **numerically greedy-equivalent to the contiguous path**. The fix for the
reshape bug from `PR22569_EVAL.md` (decoupled head_dim) is already applied in the checkout.
**Target-readiness caveat (the durable fix, not just the test):** the auto-fit itself is
brittle and must be hardened before it runs on a real serving box - even though
`ggml_backend_dev_memory` reports correctly on a discrete H200, the function should still
(a) early-return when `!params.fit_params`, (b) **clamp** the computed `n_gpu_blocks` so
`n_gpu_blocks * block_bytes <= free_vram - margin` using the *actual* KV element size, and
(c) not override an explicitly-set value. One-screen change in `common_fit_paged_kv_blocks`.
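
The three hardening points (a)-(c) can be sketched as follows. This is a Python illustration of the intended logic only: the function name, its parameters, and the `margin_bytes` default are hypothetical, and the real change would be made in C++ inside `common_fit_paged_kv_blocks`.

```python
def fit_gpu_blocks(fit_params, explicit_blocks, free_vram_bytes,
                   block_bytes, margin_bytes=1 << 30):
    """Pick n_gpu_blocks without clobbering explicit values or over-allocating.

    explicit_blocks: user-set value, or None when unset.
    block_bytes: bytes per KV block, derived from the actual KV element size.
    """
    # (a) fitting disabled or (c) explicit value set: keep the caller's value.
    if not fit_params or explicit_blocks is not None:
        return explicit_blocks
    # (b) clamp so n_gpu_blocks * block_bytes <= free_vram - margin.
    usable = max(0, free_vram_bytes - margin_bytes)
    return usable // block_bytes

# The test's explicit 64 blocks survive even with fitting enabled.
print(fit_gpu_blocks(True, 64, 119 << 30, 1 << 20))
```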
---
## 2. Dynamic-load benchmark - `paged-loadgen.cpp`
**Why the existing tools show no paged win:** `llama-batched-bench` and the stock
`examples/paged/paged.cpp` both run **fixed-length, all-arrive-at-once, single-prompt**
load. That has no over-reservation and no fragmentation, so contiguous KV is already
memory-optimal and paging has nothing to reclaim (`PAGED_KV_HIGH_CONCURRENCY.md`). The
paged win only exists under **variable lengths + continuous arrival + shared prefixes** -
the real serving regime. No tool in the tree creates it.
`paged-loadgen.cpp` (committed here) does, via the confirmed `llama_paged_scheduler_*`
API:
- **shared system prefix** (`LG_PREFIX` tokens) prepended to every request -> exercises
cross-request prefix sharing,
- **variable prompt length** (`LG_SUFMIN..LG_SUFMAX` unique suffix),
- **bimodal generation length** (`LG_GENLONG` for `LG_LONGPCT`% of requests, else
`LG_GENSHORT`) - the over-reservation driver,
- **continuous arrival**: keeps `LG_INFLIGHT` requests live, admitting a new one each time
one finishes.
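
The request mix described above can be sketched as a small generator. The parameter names mirror the `LG_*` knobs, but the default values and the function itself are illustrative assumptions, not the contents of `paged-loadgen.cpp`; continuous arrival is not modeled here, only the length distribution.

```python
import random

def make_request(rng, prefix=256, suf_min=32, suf_max=512,
                 gen_short=64, gen_long=2048, long_pct=15):
    """Return (prompt_tokens, gen_tokens) for one synthetic request."""
    prompt = prefix + rng.randint(suf_min, suf_max)   # shared prefix + unique suffix
    is_long = rng.random() * 100 < long_pct           # bimodal generation length
    return prompt, (gen_long if is_long else gen_short)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
reqs = [make_request(rng) for _ in range(10_000)]
long_share = sum(1 for _, g in reqs if g == 2048) / len(reqs)
mean_ctx = sum(p + g for p, g in reqs) / len(reqs)
print(f"long share={long_share:.2%} mean held context={mean_ctx:.0f} tokens")
```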
It reports the load-bearing number for the buy decision - the **capacity ratio**:
```
paged peak KV = sum over live seqs of ceil(used/block)*block * kv_bytes_per_token
contiguous reserve = peak_inflight * max_ctx * kv_bytes_per_token (worst-case per slot)
CAPACITY RATIO = contiguous_reserve / paged_peak (+ prefix sharing on top)
```
`kv_bytes_per_token = 2 * n_layer * n_head_kv * head_dim * sizeof(f16)` - confirmed against
`llama-kv-cache-paged.cpp` (e.g. Qwen3-32B: 2*64*8*128*2 = **256 KiB/token**).
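
Both formulas are simple enough to check by hand. The sketch below reproduces the Qwen3-32B figure and evaluates the capacity ratio for a made-up live set; the block size of 16, the `max_ctx` of 4096, and the sequence lengths are hypothetical inputs, and `len(used_tokens)` stands in for `peak_inflight`.

```python
import math

def kv_bytes_per_token(n_layer, n_head_kv, head_dim, elem_bytes=2):
    """K and V (factor 2) for every layer and KV head; f16 is 2 bytes."""
    return 2 * n_layer * n_head_kv * head_dim * elem_bytes

def capacity_ratio(used_tokens, max_ctx, block=16):
    """contiguous_reserve / paged_peak for a set of live sequences.

    kv_bytes_per_token cancels out, so the ratio is computed in tokens.
    """
    paged_peak = sum(math.ceil(u / block) * block for u in used_tokens)
    contiguous_reserve = len(used_tokens) * max_ctx
    return contiguous_reserve / paged_peak

qwen3_32b = kv_bytes_per_token(n_layer=64, n_head_kv=8, head_dim=128)
print(qwen3_32b // 1024, "KiB/token")

# Hypothetical live set: 85 short sequences and 15 long ones.
live = [600] * 85 + [2600] * 15
print(f"capacity ratio = {capacity_ratio(live, max_ctx=4096):.2f}x")
```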
**How to run (on the target):** drop into PR #22569's `examples/paged/`, add to its
CMakeLists next to `llama-paged`, build, then e.g.
`LG_INFLIGHT=2048 LG_LONGPCT=15 paged-loadgen -m <model> -kvp --fit off -ngpub <N> -ncpub <M> -ngl 99`.
Sweep `LG_INFLIGHT` to the throughput plateau and read the capacity ratio at that point.
It is written to run on the target (2x H200) where the regime exists; on GB10 it runs but
the ratio is uninteresting because throughput plateaus before memory binds (see below).
---
## 3. Projection to 2x H200 (grounded in measured GB10 numbers)
### Measured on GB10 (this work)
| model | decode plateau (aggregate) | plateau concurrency | bound by |
|---|---|---|---|
| Qwen3-32B-Q4_K_M (dense) | ~540 t/s | npl ~128 | compute |
| Qwen3-1.7B-Q8_0 | ~3,200 t/s | npl ~512 | bandwidth |
### Hardware ratios (per GPU, then 2x TP at ~85% scaling)
| | GB10 | H200 | per-GPU x | 2x H200 (TP) x |
|---|---|---|---|---|
| mem bandwidth | 273 GB/s | ~4.8 TB/s | 17.6 | ~30 |
| BF16 compute | ~213 TFLOP/s | ~989 TFLOP/s | 4.6 | ~8 |
| memory capacity | 119 GB (unified LPDDR5X) | 141 GB (HBM3e) | 1.18 | 2.4 (281 GB) |
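
Decode is treated here as bandwidth-bound, so **both the aggregate ceiling and the
concurrency at which it is reached are scaled with bandwidth (~30x on 2x H200)**. The
32B-dense plateau on GB10 was compute-bound, so for that model ~30x is the upper bound
and the ~8x compute ratio is the conservative multiplier: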
- **32B-dense aggregate decode ceiling:** 540 x 30 ~= **16,000 t/s**, reached at
~128 x 30 ~= **3,800 concurrent sequences**.
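
The scaling arithmetic, spelled out. The inputs are the spec bandwidths and the measured GB10 plateau from the tables above; the 0.85 tensor-parallel efficiency is the assumption stated in the heading, not a measurement.

```python
GB10_BW_GBS = 273      # GB/s
H200_BW_GBS = 4800     # GB/s (~4.8 TB/s)
TP_SCALING = 0.85      # assumed 2-way tensor-parallel efficiency

per_gpu = H200_BW_GBS / GB10_BW_GBS
two_gpu = 2 * per_gpu * TP_SCALING

plateau_tps, plateau_npl = 540, 128   # measured 32B-dense plateau on GB10
proj_tps = plateau_tps * two_gpu
proj_npl = plateau_npl * two_gpu
print(f"per-GPU x{per_gpu:.1f}, 2x H200 x{two_gpu:.0f}")
print(f"projected ceiling ~{proj_tps:,.0f} t/s at ~{proj_npl:,.0f} sequences")
```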
### Why paged KV becomes the binding lever on 2x H200 (and didn't on GB10)
To reach that ~16k t/s ceiling you must hold **~3,800 sequences** of KV. The memory math:
- 32B weights (FP8) ~= 32 GB, sharded over 2 GPUs -> ~250 GB HBM free for KV.
- 32B KV = 256 KiB/token. At an avg held context of ~2,000 tokens (2,048), **per seq = 512 MiB**.
- Contiguous unified KV (reserve for the live set) fits ~250 GB / 512 MiB ~= **~490
sequences** - **8x short of the 3,800 needed to reach the throughput ceiling.**
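
The memory math can be reproduced as below, taking the held context as 2,048 tokens so that per-sequence KV is exactly 512 MiB. Depending on whether "250 GB" is read as decimal or binary, the result lands between roughly 465 and 500 sequences, consistent with the ~490 quoted above; the sketch uses the decimal reading.

```python
KIB, MIB = 1024, 1024 ** 2

kv_per_token = 256 * KIB          # Qwen3-32B, f16 KV
per_seq = kv_per_token * 2048     # bytes of KV per held sequence
free_for_kv = 250e9               # ~250 GB left after FP8 weights (decimal GB)

fits = free_for_kv / per_seq
needed = 3800                     # sequences at the projected throughput ceiling
shortfall = needed / fits
print(f"per seq = {per_seq / MIB:.0f} MiB, fits ~{fits:.0f} sequences, "
      f"shortfall ~{shortfall:.1f}x")
```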
So on 2x H200 **KV memory is the binding constraint at the throughput-optimal concurrency**,
and contiguous KV strands most of the bandwidth (you'd run at a fraction of 16k t/s). This
is the gap paged KV closes. On GB10 it never appeared because GB10's 30x-lower bandwidth
caps decode at npl ~128, whose KV fits in memory trivially - the constraint order is
inverted on the real target.
### Magnitude of the paged win
Paging recovers concurrency two ways, both multiplicative on achievable throughput:
1. **No over-reservation.** Contiguous must back `max_ctx` per slot; paging uses
`ceil(actual/block)`. For a realistic bimodal workload (most generations short, ~15%
long, prompts ~512) the average held context is several-fold below `max_ctx` ->
`paged-loadgen` capacity ratio typically **~4-10x** (it measures the exact number for
your workload's length distribution).
2. **Cross-request prefix sharing** of shared system prompts / RAG preambles - additional,
workload-dependent (chained-hash block cache; vLLM's `block_pool.py`).
Net: on 2x H200, paged KV is plausibly the difference between serving **~500 and ~3,800**
concurrent 32B sequences in HBM, i.e. between a fraction of and ~all of the **~16k t/s**
decode ceiling. **That is the datacenter payoff, and it is real on the target even though
GB10 cannot exhibit it.**
### Honest caveats for the buy case
- These are **projections** from GB10 + spec ratios; the capacity multiplier depends on the
workload's context-length distribution (more variable -> bigger paged win) and TP
efficiency. `paged-loadgen` measures it directly once you have target-GPU time.
- The **paged op itself still needs work**: PR #22569's `ggml_paged_attn` was 12-13%
*slower* than the mature contiguous flash-attention path at equal concurrency
(`PR22569_EVAL.md`), lacks prefix sharing (deferred to a non-existent Phase 2), and has
the fit-robustness bug above. Adopting paged KV for the target means either hardening
#22569 or finishing the from-scratch P4 - the capacity win above assumes a *correct,
competitive* op, which is the remaining engineering.
- Prefill on either KV layout is compute-capped, not a paged concern.
**Bottom line for the decision:** paged KV **is** the right lever for the 2x H200 target -
the GB10 "no win" result is a bandwidth artifact, not a verdict. The paged path is now
**correctness-verified**, the **benchmark to size the win exists**, and the projection
says the payoff is **~4-10x concurrent-tenant capacity -> several-fold higher aggregate
decode** on the target. The remaining work is hardening/finishing the paged op, not
proving the thesis.


@@ -1,55 +0,0 @@
# Making llama.cpp/LocalAI a viable vLLM alternative — phased plan
Goal: close the practical gap to vLLM for both single-user *speed* and multi-user *throughput*, while keeping
quality (no lossy quant). Grounded in measured benchmarks + research (`BENCHMARKS.md`, `BLACKWELL_KERNEL_GAPS.md`,
`VLLM_THROUGHPUT_GAP.md`). The gap is NOT one thing — each phase targets a distinct, independent lever.
## Where vLLM actually leads (measured, GB10 / Qwen3-32B)
- **Single-user decode:** ~parity (10.2 vs 11.7) — bandwidth-bound. vLLM's edge is **spec-dec** (lossless).
- **Multi-user decode:** gap grows to ~2.2× at B=64 (kernel + scheduler).
- **Prefill aggregate:** llama plateaus ~765, vLLM scales to 24k — **paged KV + chunked prefill + kernel**.
- Note: on GB10 vLLM's FP4 trump card is *broken* (falls back to Marlin); llama.cpp runs reliably — a real
viability point. vLLM is structurally ahead mainly via **paged KV, chunked prefill, cross-request prefix cache**.
## Phases
### Phase 1 — Hardware-tuned config (PR #10411) — DONE
Folded into the hardware-defaults path (`core/config/hardware_defaults.go`):
- Blackwell physical batch (n_ubatch) = 2048.
- **VRAM-scaled `n_parallel` default** (>=32GiB→8, >=8→4, >=4→2): turns on concurrency + continuous batching,
which the backend leaves OFF at its `n_parallel=1` default. Unified KV → slots share the budget (no extra
KV memory). Single-host (local GPU) + distributed router (per node). Already-good defaults confirmed:
flash-attn=auto, context=4096.
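
The VRAM-to-`n_parallel` mapping above amounts to a threshold table. A minimal sketch of that rule in Python: the real logic lives in Go in `core/config/hardware_defaults.go`, and the function name here is hypothetical.

```python
GIB = 1024 ** 3

def default_n_parallel(vram_bytes):
    """VRAM-scaled default: >=32 GiB -> 8, >=8 GiB -> 4, >=4 GiB -> 2, else 1."""
    for threshold_gib, slots in ((32, 8), (8, 4), (4, 2)):
        if vram_bytes >= threshold_gib * GIB:
            return slots
    return 1  # the backend's own default: concurrency off

for gib in (2, 4, 8, 24, 32, 119):
    print(f"{gib:4d} GiB -> n_parallel={default_n_parallel(gib * GIB)}")
```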
### Phase 2 — Paged / block KV cache ← biggest structural multi-user lever
vLLM's PagedAttention lifts KV utilization ~20-38% → ~96%. llama.cpp's own A10G data (draft PR #22569):
contiguous OOMs at 26 seqs / 496 t/s → paged 247 seqs / 1256 t/s (**~9.5× concurrency, 2.5× aggregate**).
- Build on / complete **upstream draft PR #22569** (`-kvp`, block manager + paged-attn ggml op, FCFS scheduler)
rather than the from-scratch series we prototyped (`paged/`). Our CPU-verified block manager + gather-read
design informs the review/port; the upstream momentum is the place to land it.
- Phase 2b: cross-request prefix sharing (block-hash dedup) — our `PagedKVManager` already implements it.
### Phase 3 — Prefill amortization (chunked prefill + n_batch/n_ubatch split)
llama aggregate prefill plateaus because (a) one prompt saturates compute, (b) the per-forward GEMM M-dim is
capped at `n_ubatch`=512, (c) no scheduler chunked prefill (draft #10718 abandoned).
- Split logical `n_batch` from physical `n_ubatch` (LocalAI ties them today) so concurrent prefills batch into
a larger logical batch while keeping ubatch at the Blackwell sweet spot (2048).
- Chunked prefill + prefill/decode co-batching in the server slot scheduler.
### Phase 4 — Batched-GEMM kernel tuning (the decode 2.2× + prefill height)
Per `BLACKWELL_KERNEL_GAPS.md`: dense int8-MMQ at ~21% of ceiling, MoE FP4-MMA at ~5%. Both untuned for
Blackwell. To MATCH: tune MMQ or a Marlin-style W4A16 BF16 GEMM (FP4 not required — GB10 is INT8==BF16). To
BEAT (2×): fix+tune the existing FP4-MMA on sm_121 (build-flag/`-O3`-miscompile, not greenfield).
### Phase 5 — Backend GPU sampling
CPU per-sequence sampling caps GPU util ~60% beyond n_parallel ~8-16 (upstream PR #17004). Track/adopt.
### Cross-cutting — Speculative decoding (single-user speed, quality-preserving)
Dense ≥14B: lossless ~1.8-3×. llama.cpp has `-md`/`--spec-draft-*`. Wire a draft-model field in the model
config + ship Qwen3 target+draft (1.7B) pairs in the gallery. NOT for MoE-A3B (nothing to amortize).
## Sequencing rationale
Phase 1 (config) ships now — biggest immediate multi-user win for zero kernel work (concurrency was OFF).
Phase 2 (paged KV) is the highest-leverage structural build and has upstream momentum. Phases 3-4 are deeper
(scheduler + kernel). Spec-dec is independent and can land any time for single-user speed.


@@ -1,90 +0,0 @@
# PR #17004 (backend / GPU sampling) evaluation on DGX Spark (GB10, sm_121)
Date: 2026-06-21. Hardware: NVIDIA GB10 (sm_121), CUDA 13.0, cmake 3.28.
Model: `Qwen3-32B-Q4_K_M.gguf`. LocalAI pin: `LLAMA_VERSION=f3e182816421c648188b5eab269853bf1531d950` (2026-06-17).
## TL;DR (clean negative)
1. **PR #17004 is MERGED and is ALREADY present in our pinned llama.cpp `f3e1828`.** There is nothing to apply / cherry-pick / patch. The `-bs/--backend-sampling` CLI arg, the `llama_set_sampler` / `llama_get_sampled_*` API, and the GPU argsort/top-k/cumsum/softmax kernels are all in the pin.
2. **The prescribed benchmark cannot test the fix.** `llama-batched-bench` does ZERO sampling - it feeds random tokens (`std::rand() % n_vocab`). Its ~540 t/s plateau is therefore **not** sampling-bound, and enabling backend sampling does nothing to it. The valid tool is `llama-batched` (examples/batched), which the PR updated to drive per-sequence sampler chains and which actually exercises `-bs`.
3. **In a controlled real-sampling A/B (same `llama-batched` harness, CPU vs GPU sampler), GPU sampling gave only +25% at np=32, +3% at np=64, and CRASHED (`GGML_ASSERT(obj_new)`, graph-context alloc) at np=128 and np=256** - exactly the multi-user regime the investigation cares about.
4. **nsys at np=64: GPU kernel profile and GPU-busy time are essentially identical with and without the fix** (CPU 392.5 t/s / GPU 404.2 t/s; total GPU kernel+memop time ~4.05 s in both). Sampling kernels do not even appear among the top GPU contributors. GPU utilization did **not** rise.
5. **Conclusion: PR #17004, in the state shipped by our pin, does NOT break the ~540 plateau and does not move decode aggregate toward the ~2700 GPU-bound ceiling or past vLLM's 667.** It is modest at low parallelism and unusable (crash) at the high parallelism in question. The PR's own guidance ("recommended `--parallel 1`", "will take time to mature") matches what we measured.
## 1. What PR #17004 does + state
- Title: "sampling : add support for backend sampling". **State: MERGED** into `master` (PR head branch `gpu-sampling`). 44 files, +4133/-296.
- `libllama`: new `llama_context_params.samplers` / `n_samplers`, `llama_set_sampler`, `llama_get_sampled_*`, `llama_sampler_seq_config`, updated `llama_sampler_i`. Sampler chain can now run inside the compute graph on the backend (GPU) instead of on the CPU after `llama_decode`.
- CUDA: optimized/new `argsort`, `top-k`, `cumsum`, `softmax` kernels; CMake option `-DGGML_CUDA_CUB_3DOT2=ON` (builds a CCCL v3.2 prerelease for faster top-k).
- Tools: new `-bs, --backend-sampling` arg in `common/arg.cpp` (line 1921); server (`server-context.cpp`) per-slot wiring; `examples/batched/batched.cpp` updated.
- Supported backend samplers: `top-k`, `top-p`, `min-p`, `temp` (+ dist). **Limitations (from the PR): not compatible with grammar sampling; single output per sequence per batch; no save/load of sampling state; recommended only with `--parallel 1` and CUB_3DOT2.** Open follow-ups: #18547 (avoid graph reallocations), #18550 (skip inactive samplers in parallel decode).
- It DOES target the CPU-side per-sequence sampling stall we hypothesised - the mechanism is correct. Maturity is the problem.
Note: the GitHub API reports `mergedAt: 2026-01-04`, but the PR contains June 2026 upstream-merge commits and the feature is verified present in our 2026-06-17 pin, so treat the date field as a metadata quirk. What matters: the code is in `f3e1828`.
## 2/3. Apply + build
No apply needed (already in pin). Built from a clean `git worktree` at `f3e1828` (`~/llama-pr17004`), to avoid disturbing the existing diffusion build:
```sh
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 -DLLAMA_MAX_SEQ=256 \
  -DGGML_CUDA_CUB_3DOT2=ON -DLLAMA_CURL=OFF
cmake --build build --target llama-batched llama-batched-bench -j20
```
**Build: SUCCESS** (CUB_3DOT2=ON FetchContent fetched and compiled despite flaky net; sm_121; LLAMA_MAX_SEQ=256). `-bs/--backend-sampling` confirmed present in `llama-batched --help`.
## 4. Decode aggregate: fix vs baseline vs vLLM
### 4a. `llama-batched-bench` (NO sampling - reconfirms the plateau, unaffected by the fix)
`-npp 16 -ntg 128 -npl 32,64,128,256 -c 40960 -b 2048 -ub 2048`
| npl | S_TG t/s |
|-----|----------|
| 32 | 241.8 |
| 64 | 395.1 |
| 128 | 542.6 |
| 256 | 567.2 |

This reproduces the ~540 plateau. Because this tool never samples, `-bs` is irrelevant here - the plateau is decode/host-overhead-bound, not sampling-bound.
### 4b. `llama-batched` real-sampling A/B (CPU sampler vs `-bs` GPU sampler, identical harness)
`-kvu -n 128 -np {32,64,128,256} -c 40960 --seed 1` (samplers: top-k 40 / top-p 0.95 / temp 0.8)
| np | CPU sampling t/s | GPU `-bs` sampling t/s | delta |
|-----|------------------|------------------------|-------|
| 32 | 174.1 | 217.5 | +25% |
| 64 | 390.5 | 403.4 | +3.3% |
| 128 | 497.9 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
| 256 | 396.7 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |

(`llama-batched` absolute t/s is lower than `batched-bench` because it does real sampling plus per-token detokenize/string/stream work; the A/B *within* this harness isolates the sampler cost.)
**Does the fix break the plateau? No.** GPU sampling helps only at low parallelism and the gain shrinks as np rises (+25% -> +3%), then the path crashes at np>=128 - i.e. it fails in exactly the multi-user regime where the plateau matters. It does not approach the ~2700 ceiling and does not pass vLLM's 667. The CPU-sampling curve itself peaks at np=128 (498) and *drops* at np=256 (397), confirming CPU sampling is a scaling wall - but PR #17004 as shipped does not lift it because the GPU path is unstable there.
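
The deltas in the A/B table follow directly from the measured throughputs. A quick check (the helper name `delta_pct` is ours; `None` marks a run that crashed):

```python
# (CPU sampling t/s, GPU `-bs` t/s) from the table; None marks a crashed run.
results = {32: (174.1, 217.5), 64: (390.5, 403.4),
           128: (497.9, None), 256: (396.7, None)}

def delta_pct(cpu, gpu):
    """Percent change of GPU sampling over CPU sampling, or None on crash."""
    return None if gpu is None else 100.0 * (gpu / cpu - 1.0)

for n_par, (cpu, gpu) in results.items():
    d = delta_pct(cpu, gpu)
    shown = "CRASH" if d is None else f"{d:+.1f}%"
    print(f"np={n_par:3d} cpu={cpu:6.1f} t/s delta={shown}")
```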
## 5. GPU-utilization mechanism (nsys, np=64, the highest np where `-bs` survives)
`nsys profile -t cuda ... -n 96 -np 64`
| mode | decode t/s | total GPU kernel+memop time | top GPU contributors |
|------|-----------|------------------------------|----------------------|
| CPU sampling | 392.5 | ~4.07 s | mul_mat_q (55%+17%), flash_attn (5.7%), mul_mat_vec (2%) |
| GPU `-bs` | 404.2 | ~4.04 s | identical set; sampling kernels not in top contributors |

GPU-busy time and the kernel mix are **essentially unchanged** between modes. The argsort/top-k/cumsum/softmax sampling kernels are negligible in the timeline; the only visible difference is H2D memcpy *instances* rising 1,495 -> 7,076 (pinned-memory sampler transfers) at ~unchanged total memcpy time. **GPU utilization did not rise.** This directly refutes the idea that, at this workload, GPU idle time is dominated by CPU sampler arithmetic - moving the sampler onto the GPU barely changed throughput (+3%) and did not raise GPU occupancy. The ~80% idle measured elsewhere is dominated by something other than the sampler math (host-side batch construction / synchronization / detokenize), which PR #17004 does not address.
(np=256 nsys "with fix" could not be captured: `-bs` aborts there. Fixing the crash needs the unmerged follow-ups #18547/#18550, not in our pin.)
## LocalAI adoption path
**The code arrives transparently with a version bump; enabling it is not transparent.**
- `backend/cpp/llama-cpp/prepare.sh` copies all of upstream `llama.cpp/tools/server/*` (including the #17004-modified `server-context.cpp` / `server-task.cpp` / `server-common.cpp`) into `tools/grpc-server/`, and `grpc-server.cpp` `#include`s them. So once `LLAMA_VERSION` points at a commit containing #17004 (our pin `f3e1828` already does), the backend-sampling machinery compiles into `grpc-server` automatically. **No vendored patch in `patches/` is required for the code.**
- The vendored `server-context.cpp` already does the per-slot wiring (around line 1615): `backend_sampling &= task.params.sampling.backend_sampling`, also disabled for speculative decode and for pre-sampling logits (`n_probs>0`), then `llama_set_sampler(ctx_tgt, slot.id, common_sampler_get(slot.smpl))`.
- **But it is OFF unless `task.params.sampling.backend_sampling == true`.** LocalAI's `grpc-server` builds `params` itself from the gRPC request and never sets this flag (and does not pass the upstream `--backend-sampling` CLI arg). So as-is, LocalAI compiles the feature but never uses it. **A small grpc-server change is needed**: read a LocalAI model option / env and set `params.sampling.backend_sampling = true` (global or per-request).
- For performant CUDA top-k, add `-DGGML_CUDA_CUB_3DOT2=ON` to the llama-cpp CUDA `CMAKE_ARGS` in the Makefile (optional; a non-CUB fallback exists).
- **Caveats that blunt the benefit for LocalAI specifically:** grammar-constrained requests (JSON-schema / tool calls - a large share of LocalAI traffic), `logprobs`/`n_probs>0`, and speculative decoding all fall back to CPU sampling by the gating above; and the GPU path crashes at np>=128 in this pin. So even after wiring the flag, the multi-user throughput case would not benefit (and would crash) until the follow-up PRs (#18547/#18550) land and stabilise high-parallelism backend sampling.
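
The gating described in the bullets above reduces to one predicate. The sketch below is a simplified Python model of it: the `SamplingRequest` type and its field names are hypothetical stand-ins, not the upstream C++ structures.

```python
from dataclasses import dataclass

@dataclass
class SamplingRequest:
    backend_sampling: bool = False   # the flag LocalAI would need to set
    grammar: str = ""                # JSON-schema / tool-call grammar, if any
    n_probs: int = 0                 # logprobs requested
    speculative: bool = False        # draft-model decoding

def uses_gpu_sampler(req: SamplingRequest) -> bool:
    """True only when every condition for backend sampling holds."""
    return (req.backend_sampling and not req.grammar
            and req.n_probs == 0 and not req.speculative)

print(uses_gpu_sampler(SamplingRequest()))                       # flag never set
print(uses_gpu_sampler(SamplingRequest(backend_sampling=True)))  # plain request
print(uses_gpu_sampler(SamplingRequest(backend_sampling=True, grammar="root ::= ...")))
```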
### Recommendation
Do **not** adopt PR #17004 as the multi-user throughput fix yet. It is already in the tree but is immature at the parallelism that matters (crashes at np>=128, modest gains below). The measured bottleneck at this workload is not the sampler arithmetic (nsys shows GPU-busy unchanged when sampling moves to GPU). Re-evaluate after #18547/#18550 merge into a future pin; revisit the host-side decode/batch-construction overhead as the more likely real lever.


@@ -1,136 +0,0 @@
# Evaluation: llama.cpp PR #22569 (paged KV cache, `-kvp`) on DGX Spark (GB10, sm_121)
Question: is upstream draft PR #22569 the right base to give LocalAI vLLM-class
high-concurrency GPU throughput, or should we finish our own from-scratch P4
(`backend/cpp/llama-cpp/paged/`)?
Date: 2026-06-21. Hardware: NVIDIA GB10 (compute 12.1 / sm_121), 122502 MiB unified
memory, CUDA 13.0, gcc 13.3. Models: `Qwen3-32B-Q4_K_M.gguf` (18.4 GB, 64 layers,
n_head 64 / n_head_kv 8 / head_dim 128 / n_embd 5120) and `Qwen3-0.6B-Q8_0.gguf` for
the correctness gate.
## TL;DR verdict: DO NOT adopt #22569. Finish our own P4.
On GB10 with a 32B dense model, PR #22569 delivers **no throughput win and no concurrency
win** - it is ~12% *slower* than the existing contiguous path and hits the *same*
256-sequence ceiling. The "scale to thousands of sequences like vLLM" premise does not
hold for this PR or this hardware/model. On top of that it is broken out of the box,
wired to the wrong integration surface, and a contested draft.
## 1. Builds? Correct?
- **Builds: YES.** Cloned `matiaslin/llama.cpp@paged_attention` (PR #22569, single commit
`0b0f7bd...`, base = current master). Clean CUDA build for sm_121
(`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`).
`llama-paged`, `llama-batched-bench`, `test-paged-kv`, `test-paged-kv-e2e` all link.
It is self-contained (ships its own CPU+CUDA `ggml_paged_attn` op) and does **not**
depend on the competing CUDA PR #17579 (ericcurtin, `--pagedattention`).
- **Runs out of the box: NO.** `llama-paged -kvp` on Qwen3-32B *and* Qwen3-0.6B crashes
at context creation:
`build_attn(llm_graph_input_attn_kv_paged*) -> ggml_reshape_2d ->`
`GGML_ASSERT(ggml_nelements(a) == ne0*ne1)` (src/llama-graph.cpp:2556). Same crash with
`--fit off` (so it is the real graph, not just the memory probe).
**Root cause:** the paged path hardcodes `ggml_reshape_2d(cur, hparams.n_embd, ...)`,
wrong for any model where `n_head*head_dim != n_embd`. Qwen3 decouples head_dim:
32B = 64*128 = **8192** vs n_embd 5120; 0.6B = 16*128 = **2048** vs 1024. The PR's
"qwen3 verified" claim does **not** hold against current Qwen3 GGUFs. Fix is ~1 line
(use the real attention width `cur->ne[0]*cur->ne[1]`); applied for the rest of the eval.
- **`fit_params` (`-ngpub` auto-sizing) also crashed on GB10** in the same reshape path
during the device-memory probe (before the fix). After the reshape fix, paged
auto-fit works (sized 96624 GPU blocks on the 0.6B from 85 GiB free).
- **Correctness after the reshape fix:** paged decode runs and produces **coherent**
output on Qwen3-32B (sensible mercury / miso-soup / Starry-Night answers across 128 and
256 concurrent sequences), indicating the `ggml_paged_attn` op is functionally roughly
correct. The PR's own greedy/top-K equivalence test (`test-paged-kv-e2e`, top-K argmax +
top-5 overlap >= 4 + first-4-token greedy match vs non-paged) on Qwen3-0.6B did
**not** reach a PASS/FAIL verdict on GB10: its paged auto-fit grabs ~88 GiB
(96531 blocks) and the run then stalls at cache init (a third GB10 fit-robustness
issue, distinct from the reshape bug). So the formal greedy-equivalence gate is
**unverified on this box**, but the qualitative evidence (coherent multi-sequence 32B
output with explicit small `-ngpub`) indicates the fixed op is roughly correct. This
does not change the verdict, which is decided by throughput below.
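
The reshape root cause above comes down to one inequality: the attention output width is `n_head * head_dim`, which equals `n_embd` only for models that do not decouple the head dimension. A small check using the model shapes quoted in this evaluation (the helper name `attn_width` is ours):

```python
MODELS = {
    # name: (n_head, head_dim, n_embd), values quoted in this evaluation
    "Qwen3-32B": (64, 128, 5120),
    "Qwen3-0.6B": (16, 128, 1024),
}

def attn_width(n_head, head_dim):
    """Width of the attention output before the output projection."""
    return n_head * head_dim

for name, (n_head, head_dim, n_embd) in MODELS.items():
    width = attn_width(n_head, head_dim)
    status = "OK" if width == n_embd else "reshape to n_embd would assert"
    print(f"{name}: attn width {width} vs n_embd {n_embd} -> {status}")
```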
## 2. Throughput: paged vs contiguous on GB10 (Qwen3-32B-Q4_K_M)
Contiguous = `llama-batched-bench` (unified KV, continuous batching), S_TG decode tok/s.
Paged = `llama-paged -kvp --fit off` (its scheduler-driven continuous-batching loop),
`aggregate tps`. Both `npp~16, ntg/n_predict=128, n_batch=n_ubatch=2048, -ngl 99`.
| npl | contiguous (S_TG t/s) | paged `-kvp` (agg t/s) | outcome |
|------|----------------------|------------------------|---------|
| 128 | **537** (S 553) | **477** | both run; paged ~12% slower |
| 256 | **541** (S 550) | **471** | both run; paged ~13% slower; neither gains over 128 |
| 512 | FAIL | FAIL | **both** die: `n_seq_max must be <= 256` |
| 1024 | FAIL | FAIL | **both** die: `n_seq_max must be <= 256` |
### The decisive facts
1. **PR #22569 does NOT lift the 256-sequence ceiling.** Both contiguous and paged fail
identically at npl 512/1024 with `n_seq_max must be <= 256` (llama.cpp's compile-time
`LLAMA_MAX_SEQ`). It is **not** an OOM - GB10 has 119 GiB and at npl=256 contiguous KV
is only 16 GiB. Paging gives **zero** concurrency headroom over contiguous here. The
"paged unlocks thousands of seqs" premise is false for this PR.
2. **Paged is slower, not faster.** The fresh `ggml_paged_attn` op (477/471 t/s) loses to
the mature CUDA flash-attention contiguous path (537/541 t/s) by ~12-13% at equal
concurrency. The PR's A10G "2.5x" came entirely from contiguous OOMing at 26 seqs on a
24 GiB card; that lever does not exist on GB10's 119 GiB.
3. **The 32B dense model is compute-bound and plateaus by npl=128 on GB10.** Aggregate is
flat from 128->256 (contiguous 537->541; paged 477->471). Doubling concurrency buys
nothing because the GPU is already saturated on the 32B weight matmuls. Even if we
recompiled with a larger `LLAMA_MAX_SEQ`, aggregate would not climb - so vLLM-class
~24k aggregate is **unreachable for 32B-dense on a single GB10 regardless of KV
layout**. The throughput gap to vLLM at this model/hardware is a compute/bandwidth
problem, not a KV-fragmentation problem.
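The two ratios behind facts 2 and 3 can be recomputed directly from the table. The snippet below is a minimal sketch: the dictionary layout and function names are illustrative, and only the four throughput numbers come from the measurements above.

```python
# Decode throughput (tok/s) copied from the table above; the layout is illustrative.
results = {
    128: {"contiguous": 537, "paged": 477},
    256: {"contiguous": 541, "paged": 471},
}

def paged_deficit(contiguous, paged):
    """Fraction by which paged trails contiguous (0.10 means 10% slower)."""
    return (contiguous - paged) / contiguous

def scaling(table, layout, lo=128, hi=256):
    """Throughput ratio when concurrency goes from `lo` to `hi` sequences."""
    return table[hi][layout] / table[lo][layout]

for npl, row in results.items():
    print(f"npl={npl}: paged is {paged_deficit(**row):.1%} slower than contiguous")
for layout in ("contiguous", "paged"):
    print(f"{layout}: x{scaling(results, layout):.3f} going from npl=128 to npl=256")
```

A scaling factor close to 1.0 for both layouts is what "plateaus by npl=128" means in practice: doubling the number of sequences leaves aggregate decode essentially unchanged.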
## 3. Verdict and reasoning: finish our own P4
**Do not adopt #22569 as the base.** Reasons:
- **No win on target hardware.** Even fully completed, on GB10 + 32B it is slower than
what we already have and capped at the same 256 seqs. There is no throughput or
concurrency dividend to harvest here.
- **Wrong integration surface.** Paged is driven only by a brand-new parallel C API
(`llama_paged_scheduler_init/add_request/prepare_batch/get_batch_info/update/...`) and a
bespoke `examples/paged` loop. `-kvp`/`--kv-paged` is gated to `LLAMA_EXAMPLE_PAGED`
only - it is NOT wired into `llama-server`/`batched-bench`/`parallel`, i.e. NOT the path
LocalAI's grpc-server derives from. Adopting it means rewriting LocalAI's serving loop
around the new scheduler API.
- **Broken / restricted.** Crashes out of the box on all current Qwen3 (and any
decoupled-head-dim model); fit_params crashed; Phase-1 restrictions enforced at context
creation: single CUDA device, full offload only, `n_batch == n_ubatch`, no SWA
(gemma3/llama4/etc. unsupported), no CoW / prefix-caching, no
`seq_cp`/`seq_keep`/`seq_div`/`seq_add`, no state save/load.
- **Contested draft.** Unmerged; the author is openly asking maintainers whether the C
API is even the right design; maintainers are skeptical of paged for single-node use.
**What P4 should actually target (re-scoped by this data).** The aggregate-throughput
gap to vLLM on a compute-bound dense model on one GB10 is not addressable by paged KV.
The durable, real LocalAI wins from paging are the ones our from-scratch P0 already
implements the machinery for and that #22569 explicitly omits:
- **on-demand KV sizing** (fit more *diverse* concurrent tenants without per-seq
over-reservation), and
- **automatic cross-tenant prefix sharing** (chained-hash block cache - shared system
prompts / RAG preambles), which #22569 defers to a non-existent Phase 2.
Finish our own P4 (CPU gather-read + a CUDA gather-read) against these capacity/
prefix-sharing objectives - measured as max concurrent *distinct* tenants and KV memory
saved, not single-model aggregate tok/s. To chase raw aggregate, the levers are lifting
`LLAMA_MAX_SEQ` and smaller/MoE models in memory-bandwidth-bound regimes - orthogonal to
paged attention. The ~1-line reshape fix found here (and the GB10 fit_params crash) are
worth upstreaming to #22569 regardless, but the PR is not our base.
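To make the prefix-sharing objective concrete, here is a minimal sketch of chained block hashing, the idea behind a chained-hash block cache. It is not the code in `paged_kv_manager.cpp`: the block size, the SHA-256 choice, and all function names are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumed for this sketch)

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash every full block together with the hash of the block before it."""
    hashes, parent = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

def shared_prefix_blocks(cache, tokens):
    """Count leading blocks already present in the cache; stop at the first miss."""
    hits = 0
    for h in block_hashes(tokens):
        if h not in cache:
            break
        hits += 1
    return hits

system_prompt = list(range(48))          # three full blocks shared by both tenants
tenant_a = system_prompt + [1001] * 20   # hypothetical tenant-specific suffixes
tenant_b = system_prompt + [2002] * 20

cache = set(block_hashes(tenant_a))      # tenant A populates the cache
print(shared_prefix_blocks(cache, tenant_b))
```

Because each hash folds in its parent, two blocks with identical tokens at different positions (or after different prefixes) get different hashes, so a cache hit always implies the whole prefix matches.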
### Reproduction (DGX, `~/llama.cpp-pr22569`)
```sh
export PATH=/usr/local/cuda/bin:$PATH
# contiguous
./build/bin/llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -npp 16 -ntg 128 \
-npl 128 -c 20480 -b 2048 -ub 2048 # 256/512/1024 -> n_seq_max must be <= 256
# paged (needs the src/llama-graph.cpp:2556 reshape fix: hparams.n_embd -> cur->ne[0]*cur->ne[1])
./build/bin/llama-paged -m Qwen3-32B-Q4_K_M.gguf -kvp --fit off -ngpub 2048 -ncpub 128 \
-np 128 -ns 128 -n 128 -b 2048 -ub 2048 -ngl 99 # 512/1024 -> n_seq_max must be <= 256
```

@@ -1,95 +0,0 @@
# Paged Attention for llama.cpp (vLLM-parity), CPU-first
A from-scratch port of vLLM V1's paged KV-cache model into the llama.cpp / ggml
world, built CPU-first and verified incrementally. The host-side block manager is
a faithful port of vLLM; the compute stays in ggml (no new op — the read path
gathers blocks with `ggml_get_rows` and feeds the existing attention ops).
Design: `docs/superpowers/specs/2026-06-19-paged-attention-llamacpp-design.md`
Plan: `docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md`
## Status
| Phase | What | State |
|------|------|-------|
| P0 | vLLM-parity host block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache) | ✅ verified — `make check`, 4/4 suites |
| P1 | ggml paged write/gather mechanism (`set_rows` by slot_mapping → `get_rows` gather) | ✅ verified — `make ggml-check`, non-contiguous blocks `[2,1,5]` round-trip + isolation |
| P2 (core) | attention over gathered paged KV matches independent host reference | ✅ verified — max abs err **7.5e-08** |
| P3 (partial) | capacity & prefix-sharing wins | ✅ measured — `make bench`: **9.2×** more concurrent seqs, **11.3×** less KV memory |
| **P3 (in-model placement)** | **paged, non-contiguous block KV placement in the real model** | ✅ **Gate 0 PASSED** — Qwen3-0.6B token-identical (`patches/0001-paged-kv-block-placement.patch`) |
| P4 (in-model compute) | gather-read (`build_attn_paged`, read only a seq's blocks) + win-2 throughput + multi-seq | ⛔ remaining |
The design's central risk, *does paged (non-contiguous) KV produce correct attention?*,
is **retired at two levels**: (1) at the ggml-op level (P2, 7.5e-08 vs reference) and
(2) **in a real model** (P3): with KV physically scattered across permuted, non-contiguous
blocks (cells `0-15, 144-159, 32-47, …`), Qwen3-0.6B greedy generation is **token-for-token
identical** to the contiguous cache. Reproduce:
```sh
# from backend/cpp/llama-cpp-fallback-build/llama.cpp (patch applied, CPU build)
B=build-cpu/bin/llama-simple; M=<Qwen3-0.6B.Q4_K_M.gguf>; P="...long prompt..."
"$B" -m "$M" -n 40 "$P" > base.txt
LLAMA_KV_PAGED=1 "$B" -m "$M" -n 40 "$P" > paged.txt
diff base.txt paged.txt && echo TOKEN-IDENTICAL
# LLAMA_KV_PAGED_DEBUG=1 prints the permuted physical cells per step
```
This proves the **storage/placement** layer of paged attention in-model. What remains (P4)
is the **compute** optimization that yields the throughput win: a gather-read that attends
only a sequence's own blocks (instead of scanning `[0,n_kv)` with a mask), plus the
multi-sequence driver to measure tok/s vs concurrency. The patch is single-sequence scope.
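The placement layer reduces to a block table plus a slot mapping. The sketch below is illustrative only (plain Python lists stand in for ggml tensors, and the function names are made up); the block table `[0, 9, 2]` with 16-cell blocks is inferred from the cell ranges `0-15, 144-159, 32-47` quoted above.

```python
BLOCK_SIZE = 16  # cells per block, inferred from the cell ranges quoted above

def slot_mapping(block_table, n_tokens, block_size=BLOCK_SIZE):
    """Physical cell index for each logical token position."""
    return [block_table[pos // block_size] * block_size + pos % block_size
            for pos in range(n_tokens)]

def paged_write(pool, slots, rows):
    """Scatter rows into the pool by slot (the role `set_rows` plays in P1)."""
    for slot, row in zip(slots, rows):
        pool[slot] = row

def paged_gather(pool, slots):
    """Read rows back in logical order (the role `get_rows` plays in P1)."""
    return [pool[slot] for slot in slots]

block_table = [0, 9, 2]                  # non-contiguous physical blocks
n_tokens = 40
slots = slot_mapping(block_table, n_tokens)

pool = [None] * (10 * BLOCK_SIZE)        # 10 physical blocks
kv_rows = [f"kv{pos}" for pos in range(n_tokens)]
paged_write(pool, slots, kv_rows)

assert paged_gather(pool, slots) == kv_rows   # round-trip preserves logical order
print(slots[15], slots[16], slots[32])
```

The round-trip assertion is the toy analogue of the token-identical check: as long as reads use the same mapping as writes, physical scatter is invisible to attention.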
## Build & test
```sh
make check # P0 host-manager unit suites (pure C++, no deps)
make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build> # P1/P2 ggml tests
make bench # P3 capacity + prefix-sharing numbers
```
`ggml-check` needs a built ggml. To build one CPU-only from a llama.cpp checkout:
`cmake -S <llama.cpp>/ggml -B /tmp/ggml-build -DGGML_CUDA=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build /tmp/ggml-build -j`
(if it complains about a missing `ggml.pc.in`, add a minimal pkg-config stub).
## Files
- `paged_kv_manager.{h,cpp}` — the vLLM-parity block manager (no ggml/llama dep).
- `tests/test_free_block_queue.cpp` — intrusive LRU free list.
- `tests/test_block_pool.cpp` — alloc/touch/free/evict/cache.
- `tests/test_paged_kv_manager.cpp` — allocate/block_table/slot_mapping/free.
- `tests/test_prefix_cache.cpp` — chained block hashing + first-miss cache hit.
- `tests/test_ggml_paged_rw.cpp` — paged write/gather through real ggml ops.
- `tests/test_ggml_paged_attn.cpp` — attention over paged KV vs host reference.
- `paged-bench.cpp` — capacity (win 1) + prefix-sharing (win 3) measurements.
## Remaining work — integration map (for the next session)
Target: a paged read path active behind a flag, producing **token-identical** greedy
output vs the contiguous cache on a real model (Gate 0), then `paged-bench` win 2.
Exact seams in the vendored llama.cpp (`backend/cpp/llama-cpp-fallback-build/llama.cpp`,
the pinned build fetches `LLAMA_VERSION=f3e182816421…`):
1. **Memory type**: `src/llama-model.cpp:2070` `create_memory()` constructs `llama_kv_cache`.
Add a paged variant (or a flag on the existing cache) implementing `llama_memory_i`
(`src/llama-memory.h`), backed by `PagedKVManager`.
2. **Allocation**: `src/llama-kv-cache.cpp:818` `find_slot()` produces `slot_info.idxs`.
Replace the ring-buffer scan with block-aligned allocation from `PagedKVManager`.
3. **Read path**: `src/llama-kv-cache.cpp:1145/1165` `get_k`/`get_v` return a contiguous
`[0,n_kv)` view. For paged, gather the sequence's blocks (`ggml_get_rows`) into scratch.
The new branch lives alongside `build_attn` in `src/llama-graph.cpp` (`build_attn_mha`).
4. **Mask**: `src/llama-graph.cpp` `build_attn_inp_kq_mask` sizes the mask to the gathered
length per sequence.
5. **Gate 0 driver**: `build-cpu/bin/llama-simple` (greedy argmax) on
`Qwen3-0.6B.Q4_K_M.gguf`; assert paged output == contiguous output token-for-token.
### Honest caveats (from the maintainer discussion + reading `find_slot`)
- llama.cpp's **unified cache already shares one KV pool** across sequences and already
tolerates non-contiguous slots. So win-1 vs *unified* is smaller than vs per-seq
reservation (stream mode). The durable LocalAI wins are **on-demand sizing** and
**automatic cross-tenant prefix sharing** (P0 implements the block-hash machinery).
- vLLM's classic `paged_attention_v1/v2` CUDA kernel is **deprecated**; the live path is
FlashAttention/FlashInfer over a block table. The port targets that pattern, not the
old kernel. Upstream draft PRs #22569 (new `ggml_paged_attn` op) and #17579 (CUDA) are
unmerged; maintainers are skeptical for single-user use.

@@ -1,78 +0,0 @@
# Upstream ggml issue draft: MXFP4 MoE prefill underutilizes Blackwell (GB10) — ~22 TFLOP/s, ~27× behind vLLM
**Title:** CUDA: MXFP4 MoE prefill runs the Ampere-class warp `mma.sync`, far below Blackwell FP4 peak (GB10 / sm_121)
## Summary
On a GB10 (DGX Spark, sm_121), MXFP4 MoE prefill for Qwen3-Coder-30B-A3B is bottlenecked by
`mul_mat_q<MXFP4>` (the per-expert grouped MMQ), which runs at only **~22 effective TFLOP/s** — a small
fraction of the GPU's FP4 capability. Batched prefill plateaus at ~3.65k tok/s (B=32) vs vLLM FP8 ~99k
on the same box (~27×). The native FP4 block-scaled `mma.sync` path (PR #17906 et al.) *is* engaged — the
limit is that it's a warp-level MMA kernel, not a tcgen05/CUTLASS-class grouped GEMM.
## Hardware / build
- NVIDIA GB10, compute capability 12.1, 119 GiB unified LPDDR5X.
- llama.cpp built `-DCMAKE_CUDA_ARCHITECTURES=121` (sm_121a/compute_121a confirmed in cubins).
- Model: Qwen3-Coder-30B-A3B-Instruct, `MXFP4_MOE` (15.9 GiB, 4.47 BPW).
## Measurements
Single-stream (`llama-bench`, ub2048):
| metric | Q8_0 | MXFP4 | vLLM FP8 |
|---|---|---|---|
| prefill pp2048 | ~2200 | 3441 | — |
| decode tg128 | 62 | 86 | 52 |
Batched (decode-phase aggregate `S_TG`; prefill aggregate `S_PP`):
| B | llama MXFP4 prefill | vLLM FP8 prefill | llama MXFP4 decode | vLLM FP8 decode |
|---|---|---|---|---|
| 1 | 1625 | 9644 | 83 | 48 |
| 8 | 3634 | 33373 | 267 | 312 |
| 32 | 3651 | 99398 | 551 | 1171 |
| 64 | 3648 | 151990 | 770 | 2064 |
Decode is competitive (we win at B=1). **Prefill plateaus and is the gap.**
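The ~27× prefill gap follows directly from the batched table. The text does not say how the ~22 TFLOP/s figure was obtained; the sketch below shows one back-of-envelope reading that lands nearby, under the loudly stated assumptions of ~3B active parameters per token and 2 FLOPs per active parameter. Treat both constants as hypothetical.

```python
# (llama.cpp MXFP4, vLLM FP8) batched prefill in tok/s, copied from the table above.
prefill = {1: (1625, 9644), 8: (3634, 33373), 32: (3651, 99398), 64: (3648, 151990)}

ACTIVE_PARAMS = 3.0e9   # ASSUMPTION: ~3B active parameters per token ("A3B")
FLOPS_PER_PARAM = 2     # ASSUMPTION: one multiply-add per active weight

def gap(batch):
    """How many times faster vLLM prefill is than llama.cpp at this batch size."""
    llama, vllm = prefill[batch]
    return vllm / llama

def effective_tflops(tok_per_s):
    """Rough model-level throughput implied by a prefill rate."""
    return tok_per_s * ACTIVE_PARAMS * FLOPS_PER_PARAM / 1e12

for batch, (llama, _) in prefill.items():
    print(f"B={batch}: vLLM/llama = {gap(batch):.1f}x, "
          f"llama ~{effective_tflops(llama):.1f} TFLOP/s")
```

Under these assumptions the llama.cpp figure stays flat from B=8 onward while the ratio keeps growing, which is the plateau the section describes.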
## Profiling (nsys, MXFP4 pp2048 kernel time)
| kernel | % |
|---|---|
| `mul_mat_q<(ggml_type)39>` (MXFP4 MoE GEMM) | **37.2** |
| `mul_mat_q<(ggml_type)8>` (dense/attn, still Q8) | 10.1 |
| `flash_attn_ext_f16` | 8.8 |
| `quantize_mmq_mxfp4` (activation quant) | 8.0 |
Only cutlass kernel present is `cutlass_80_tensorop` (Ampere). No tcgen05 / wgmma anywhere.
## What we ruled out (so it's the kernel, not config)
- **ubatch**: saturates at 2048 (pp4096: ub512 2994 → ub2048 3316 → ub8192 3180).
- **tile width**: `mmq_x` already selects the full 128-wide tile at ub2048 (~128 tokens/expert).
- **cuBLAS fallback**: `GGML_CUDA_FORCE_CUBLAS` is a no-op (3419 ↔ 3423 t/s) — dequant→cuBLAS-FP16 neither
helps nor hurts, i.e. the FP4 MMQ kernel isn't worse than FP16 cuBLAS, both hit a common ceiling.
- prefill does **not** scale with bigger single prompts (attention O(N²) confounds): pp2048 3295, pp8192
1524, pp16384 2051 — so it's the many-sequence batched MoE GEMM, not batch size.
## Proposal
A tcgen05 / CUTLASS-3.x grouped-GEMM path for FP4 (MXFP4 + NVFP4) MoE on sm_120/121:
- One grouped GEMM over all experts with per-group token offsets (full tiles regardless of tokens/expert),
vs today's per-expert MMQ scheduler.
- Block-scaled `e2m1` operands via tcgen05 tensor-memory MMA (`mma.sync.aligned.kind::mxf4…` is the
warp-level form; the collective-mainloop/tcgen05 form is what extracts Blackwell throughput at prefill
tile sizes).
- Fuse activation quantization (`quantize_mmq_mxfp4`, ~8%) into the permute/gather.
- Optionally extend to dense layers (qkv/o_proj/lm_head) so full-model prefill is FP4/FP8.
This mirrors what vLLM/FlashInfer/TensorRT-LLM do for Blackwell MoE. Happy to test iterations on the GB10.
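
The "per-group token offsets" in the first proposal item are an exclusive prefix sum over per-expert token counts. The host-side sketch below illustrates that bookkeeping only; it is not ggml or CUTLASS code, and the function name and the example routing are hypothetical.

```python
def group_tokens_by_expert(expert_ids, n_experts):
    """Return (offsets, order) for a grouped GEMM over all experts.

    offsets[e]:offsets[e+1] is the slice of `order` routed to expert e;
    `order` lists token indices sorted by expert, stable within an expert.
    """
    counts = [0] * n_experts
    for e in expert_ids:
        counts[e] += 1

    offsets = [0] * (n_experts + 1)          # exclusive prefix sum of counts
    for e in range(n_experts):
        offsets[e + 1] = offsets[e] + counts[e]

    cursor = offsets[:-1].copy()             # next free position per expert
    order = [0] * len(expert_ids)
    for token, e in enumerate(expert_ids):
        order[cursor[e]] = token
        cursor[e] += 1
    return offsets, order

# Hypothetical routing of six tokens to three experts.
offsets, order = group_tokens_by_expert([2, 0, 2, 1, 0, 2], n_experts=3)
print(offsets, order)
```

With the tokens permuted this way, one launch can walk all groups back to back, so tile occupancy depends on the total token count rather than on how many tokens each expert received.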
## Repro
```sh
llama-quantize qwen3coder-f16.gguf qwen3coder-mxfp4.gguf MXFP4_MOE
llama-bench -m qwen3coder-mxfp4.gguf -ngl 99 -p 2048 -n 0 -ub 2048
llama-batched-bench -m qwen3coder-mxfp4.gguf -ngl 99 -c 45056 -b 2048 -ub 2048 -npp 512 -ntg 128 -npl 1,8,32,64
```

@@ -1,83 +0,0 @@
# What makes vLLM fast on GB10 — kernel vs scheduler (code-grounded, measured)
Decisive analysis (vLLM v0.23.0, torch 2.11+cu130, sm_121, model `RedHatAI/Qwen3-32B-NVFP4A16`, source at tag
`v0.23.0`). **Answer: it's the scheduler, not the kernel.** This closes the kernel track and opens the
scheduler track.
## The decomposition (measured on the DGX, prefix-cache OFF, unique prompts)
| | vLLM W4A16 Marlin | llama.cpp | verdict |
|---|---|---|---|
| **single-stream prefill** | ~800 t/s (~52 TFLOPS) | 718 MMQ / **1153 MXFP4** | **tied; llama.cpp MXFP4 wins** |
| decode batch-1 | 11.8 t/s | ~similar | bandwidth-bound (≈190/273 GB/s); no kernel helps |
| **aggregate decode** | 328 (N32) / 569 (N64) / **667 (N128)** | the gap | **~56× multiplier = scheduler** |
vLLM's single-stream Marlin is **not** at the roofline — it's in the same ~4×-under regime as MMQ. The 24k
headline is entirely the aggregate decode multiplier.
## The kernel vLLM actually runs on sm_121 (W4A16, forced)
Dispatch (vLLM v0.23.0): `compressed_tensors.py:704` (NVFP4 + no input-quant → `W4A4Fp4(use_a16=True)`) →
`compressed_tensors_w4a4_nvfp4.py:28` → `kernels/linear/__init__.py:894` (`if use_a16: force_kernel =
MarlinNvFp4LinearKernel`, **unconditional, no cc gate**) → `nvfp4/marlin.py` → `marlin_utils_fp4.py:182` →
`ops.marlin_gemm(b_q_type=float4_e2m1f)`, activations FP16/BF16. csrc: `csrc/quantization/marlin/marlin.cu`
+ `marlin_template.h` + `marlin.cuh`.
Techniques = **exactly the playbook we proved loses on GB10**: XOR shared swizzle (`marlin_template.h:722
^ (row%8)`), 4-stage cp.async pipeline (`marlin.cu:396 stages=4`, `cp_async_wait<stages-2>`), ldmatrix+mma,
FP16/BF16 acts. Native FP4 (`FlashInferB12xNvFp4LinearKernel`) needs `Sm120BlockScaledDenseGemm` cubins absent
on GB10 → W4A4 hangs → forced W4A16 Marlin fallback. **Nothing to port; vLLM's kernel is occupancy-blocked too.**
## The scheduler (the real multiplier) — what llama.cpp lacks
- **Paged KV cache** (`vllm/v1/core/kv_cache_manager.py`, `block_pool.py`): block KV, no fragmentation → very
high concurrent batch. **llama.cpp: NO** (contiguous per-slot KV → fragmentation caps real concurrency).
- **Chunked prefill** (`config/scheduler.py:84 enable_chunked_prefill=True`, default ON): interleaves prefill
chunks with decode so decode batches stay full. **llama.cpp: NO** (a long prefill stalls the decode batch).
- **Continuous batching** (`v1/core/sched/scheduler.py`): per-step admit/evict. **llama.cpp: YES** (`n_parallel`,
rudimentary — we enabled VRAM-scaled slots in #10411).
## Sizing the scheduler gap — MEASURED (llama.cpp aggregate, the surprise)
`llama-batched-bench` Qwen3-32B-Q4_K_M, npp=128 ntg=128, npl scaling (DGX):
| npl | S_PP (agg prefill) | **S_TG (agg decode)** | vLLM decode | llama % of vLLM |
|---|---|---|---|---|
| 1 | 628 | 10.2 | 11.8 | 86% |
| 8 | 773 | 59.8 | - | - |
| 32 | 763 | **235** | **328** | **72%** |
| 64 | 761 | **391** | **569** | **69%** |
| 128 | 762 | **540** | **667** | **81%** |
**The "30x gap" headline is wrong for realistic concurrency.** llama.cpp's continuous batching already
captures **~70-81% of vLLM's aggregate decode** at npl<=128, with a near-identical multiplier (10.2 -> 540 =
**53x**, vs vLLM's 56x). And it is still climbing at 128 (391 -> 540 from npl=64; sublinear, but not plateaued). Combined with llama.cpp being
*ahead* single-stream (MXFP4 1153 > vLLM 800), **llama.cpp is already broadly competitive with vLLM on GB10 at
self-hosted concurrency.**
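The percentages and multipliers quoted here can be re-derived from the table. This is a small illustrative script; the variable and function names are made up, and only rows with a vLLM measurement are included.

```python
# Aggregate decode (tok/s) from the table above, for rows with a vLLM measurement.
llama = {1: 10.2, 32: 235, 64: 391, 128: 540}
vllm = {1: 11.8, 32: 328, 64: 569, 128: 667}

def share(npl):
    """llama.cpp aggregate decode as a fraction of vLLM's at the same concurrency."""
    return llama[npl] / vllm[npl]

def multiplier(series, lo=1, hi=128):
    """Aggregate-decode gain from `lo` to `hi` concurrent sequences."""
    return series[hi] / series[lo]

for npl in llama:
    print(f"npl={npl}: llama.cpp at {share(npl):.0%} of vLLM")
print(f"llama.cpp multiplier: {multiplier(llama):.1f}x")
print(f"vLLM multiplier:      {multiplier(vllm):.1f}x")
print(f"llama.cpp 64 -> 128:  {multiplier(llama, 64, 128):.2f}x")
```

The last line shows how much the final doubling of concurrency still buys, which is a quick way to judge how close the curve is to flattening.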
Two real findings remain:
1. **Aggregate prefill is flat ~760** regardless of npl - but that is the **GB10 compute roofline** (vLLM single-
stream is ~800; neither can prefill faster aggregate, it is compute-bound). So prefill is **not a throughput
gap**; chunked prefill is a **latency/TTFT** win (stop a long prefill stalling the decode batch), not a
throughput one.
2. **vLLM's ~24k headline lives at thousands-of-sequences concurrency**, which **paged KV** unlocks (block KV,
no fragmentation). llama.cpp's contiguous KV caps how far npl can scale before memory/fragmentation bite. So
paged KV is the **high-concurrency (datacenter) lever**, not a moderate-concurrency one.
## Recommendation
**Pivot to the scheduler; treat the GEMM kernel as good-enough / roofline-blocked on GB10.**
Now that the gap is measured, ROI-ordered:
1. **Ship the MXFP4-dense win** — 1153 t/s single-stream beats vLLM's 800; a Blackwell dense-quant
recommendation (requantize, no kernel work). Already documented in `BLACKWELL_KERNEL_GAPS.md` §6. Cheapest.
2. **Chunked prefill** — the tractable scheduler win: interleave prefill chunks with decode so a long prompt
doesn't stall the decode batch. Payoff is **latency/TTFT under mixed load** (and steadier decode batches),
not aggregate prefill throughput (that's GB10-compute-capped at ~760-800 for both engines). A grpc-server
scheduler change; no KV-layout rewrite.
3. **Paged KV** — the **high-concurrency (thousands-of-seqs) lever** that unlocks vLLM's 24k regime. Heavy
(block KV manager; contested upstream PR #22569 / vendored `patches/`). Worth it only if datacenter-scale
concurrency is a target; at self-hosted concurrency (npl<=128) llama.cpp is already ~70-81% of vLLM.
**Reframed expectation:** llama.cpp on GB10 is NOT 30x behind vLLM. It is ahead single-stream (MXFP4) and
~70-81% of vLLM aggregate at npl<=128. The genuine differentiator vLLM still has is **scaling to very high
concurrency via paged KV**. Kernel tracks (W4A16 178 t/s; FP4-MMA) stay **banked** - not the lever.

@@ -1,59 +0,0 @@
# Where vLLM beats llama.cpp on a DGX Spark (GB10), and how to close it — keeping quality
The question: "vLLM is faster at the end — what do we improve, while keeping good quality?" Answer: the
gap is **three independent things**, and the biggest *per-user, quality-preserving* one is **speculative
decoding**, which llama.cpp already supports.
## Decomposition (measured + researched)
| vLLM advantage | helps single user? | llama.cpp answer | quality cost | status |
|---|---|---|---|---|
| **Per-user decode speed** | **yes** | **speculative decoding** (Qwen3 draft / EAGLE3) | **none** (target-verified, lossless) | mature in llama.cpp; **the main lever** |
| Prefill / TTFT | no (it's first-token latency) | tune FP4-MMA / Marlin W4A16 kernel | none | hard; `BLACKWELL_KERNEL_GAPS.md` |
| Aggregate throughput @ concurrency | no (per-user = 0) | continuous batching (paged engine) | none | also kernel-bound |
Key measured fact: **single-user decode is already at parity** (Qwen3-32B: llama 10.2 vs vLLM 11.7 t/s) —
both hit GB10's ~273 GB/s bandwidth wall (~15 t/s ceiling) **without** spec-dec. So vLLM's real per-user
speed edge is spec-dec, not architecture.
## Why spec-dec is THE lever here (and quality-safe)
- **Lossless:** the 32B target verifies every drafted token (accept/reject) — output distribution is
identical to no-drafting. So you keep **Q4_K_M quality** (no lossy MXFP4 needed) *and* get speed.
- **GB10 is best-case for it:** decode is bandwidth-bound (one ~17 GB weight-read per token) with huge idle
compute. Spec-dec verifies K drafted tokens in **one** weight-read → converts the loop to compute-bound,
where GB10 has headroom. Realized speedup ≈ mean accepted length.
- **Measured (others, same model class):** llama.cpp Qwen2.5-32B dense + 0.5B draft = **2.9×** (13→38 t/s);
vLLM EAGLE3 on Qwen3-32B = ~1.8-2.5× general, up to ~3× code/structured. **Competitive.**
- **Regime caveat:** spec-dec gives **~nothing for MoE-A3B** models (only ~3B active → not bandwidth-bound,
nothing to amortize). It shines for **dense** 27-32B models, the opposite regime. So this lever is *dense-model*
specific.
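
"Realized speedup ≈ mean accepted length" can be made quantitative with the standard speculative-decoding estimate: with a per-token acceptance probability `alpha` (assumed independent across positions) and `k` drafted tokens, one target pass yields `(1 - alpha**(k+1)) / (1 - alpha)` tokens on average. The sketch below applies that formula; the acceptance rate and relative draft cost are hypothetical inputs, not measurements from this document.

```python
def expected_tokens(alpha, k):
    """Mean tokens produced per target verification pass.

    alpha: probability that a drafted token is accepted (assumed i.i.d.).
    k:     number of tokens drafted per step.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def speedup(alpha, k, draft_cost):
    """Estimated decode speedup over plain autoregressive decoding.

    draft_cost: cost of one draft-model step relative to one target pass.
    """
    return expected_tokens(alpha, k) / (1.0 + k * draft_cost)

# HYPOTHETICAL inputs: 80% acceptance, 8 drafted tokens, draft step = 5% of a target pass.
print(f"tokens per target pass: {expected_tokens(0.8, 8):.2f}")
print(f"estimated speedup:      {speedup(0.8, 8, 0.05):.2f}x")
```

The formula also shows why a higher-acceptance draft can beat a cheaper one on a bandwidth-bound box: when `draft_cost` is small, the numerator dominates.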
## Qwen3-32B specifics
- **No native MTP head** (MTP is a Qwen3-*Next*/MoE feature). Options: a **same-family draft**
(Qwen3-0.6B or **1.7B** — same tokenizer, llama.cpp vocab check passes) or an external **EAGLE3 head**
(RedHatAI/AngelSlim Qwen3-32B-eagle3, accept length 2.15-2.49).
- Draft pick: **lean Qwen3-1.7B** (0.6B had ~60% lower acceptance in AWS's test; on a bandwidth-bound box the
32B weight-read dwarfs the draft cost, so maximize acceptance). `--spec-draft-n-max 5-8`.
## Recommended LocalAI actions (quality-preserving, ranked)
1. **Make speculative decoding easy/recommended for dense ≥14B models on Blackwell** — a draft-model field in
the model config (`-md` / `--spec-draft-*`), with a suggested Qwen3-1.7B draft for the Qwen3 family. This
is the biggest per-user speed win, lossless, available **now** (no kernel). Gallery: ship target+draft pairs.
2. Kernel work (FP4-MMA tuning / Marlin W4A16) — improves **prefill/TTFT**, separate metric.
3. Continuous batching (paged engine) — **aggregate** concurrency only; per-user = 0.
## Honesty / status
The research conclusion is solid (sources below). **Our own empirical spec-dec run on the DGX is pending**:
the box rebooted mid-session and `llama-cli` now hangs at 0% GPU (while `llama-bench` works), plus the network
is dropping ssh mid-command. Drafts (Qwen3-0.6B/1.7B Q8) are downloaded and the spec-dec flags are confirmed;
re-run `llama-cli -m Qwen3-32B-Q4_K_M -md Qwen3-1.7B-Q8_0 -ngl 99 -ngld 99 --spec-draft-n-max 8` when the box
is stable to confirm the ~2× locally. The conclusion does not depend on it (it's measured-reproducible by
others on this exact model class), but we should bank our own number.
Sources: llama.cpp Discussion #10466 (Qwen2.5-32B+0.5B = 2.9×), #16578 (DGX Spark), DandinPower/llama.cpp_bench
(32B = 10.7 t/s, bandwidth-bound); vLLM MTP docs + Red Hat EAGLE3 article (lossless, up to 2.5×); AWS spec-dec
blog (Qwen3-32B+1.7B up to 3×, 0.6B ~60% lower accept); RedHatAI/AngelSlim Qwen3-32B-eagle3 heads.

@@ -1,176 +0,0 @@
# W4A16 Marlin-style GEMM for ggml-cuda on Blackwell (sm_120/121) — implementation plan
> **STOPPED (2026-06-21): the kernel is NOT the lever — validated by a code-grounded vLLM analysis.**
> Measured on the DGX: vLLM's single-stream W4A16 prefill on GB10 = **~800 t/s (~52 TFLOPS), statistically TIED
> with llama.cpp MMQ (718/47)** — and vLLM uses the *exact* XOR-swizzle + 4-stage cp.async Marlin we proved
> collapses GB10 occupancy (vLLM even warns at load that Marlin "may degrade performance for compute-heavy
> workloads"). There is no kernel trick to port. Moreover llama.cpp's **MXFP4 path (1153 t/s) already BEATS
> vLLM single-stream (800)** — vLLM has no FP4 cubins on sm_121 and falls back to slower W4A16 Marlin, so
> llama.cpp is *ahead* on the kernel. **vLLM's entire 24k headline is the aggregate decode multiplier (~56×)
> from paged KV + chunked prefill + continuous batching — a SCHEDULER win.** llama.cpp lacks paged KV +
> chunked prefill. **Effort pivots to the scheduler** (see the paged-attention work). This kernel work is
> banked + resumable (178 t/s, P0/P1/P2/P3/P3b committed) but is not the throughput lever on GB10. Detail:
> `VLLM_DECOMPOSITION.md`.
The committed multi-week kernel. Goal: get 4-bit-weight dense matmul to the GB10 **BF16 ceiling (~213
TFLOP/s ≈ ~3,300 t/s prefill on Qwen3-32B)**, ~4.3× over today's 765. This is the *match-vLLM* path; vLLM's
own GB10 dense throughput runs on W4A16 Marlin (its FP4 path is broken on sm_121).
## Why a custom kernel (validated, not assumed)
On GB10 (sm_121), measured: **both** llama-MMQ (int8, Ampere-tuned) **and** cuBLAS-FP16 sit at ~46 TFLOP/s
(~21% of peak). cuBLAS falls back to an Ampere `cutlass_80_tensorop` kernel (CUDA-13 has no sm_121 GEMM for
these shapes); rebuilt with `-DGGML_CUDA_FORCE_CUBLAS=ON` it's *slower* than MMQ (690 vs 750). **No library
path reaches the ceiling on consumer Blackwell** — a hand-tuned sm_120a kernel is required. `mmapeak` measures
the 213 BF16 peak as reachable, and vLLM's Marlin hits it, so the ceiling is real; the work is reaching it.
## What Marlin does (the design we mirror)
Weights stored 4-bit, **dequantized in-register/shared-mem** in-flight; GEMM math on **FP16/BF16 tensor
cores** (`mma.sync m16n8k16`). Speed comes from: `cp.async` global→shared with a **multi-stage double-buffered
pipeline**, **offline weight reshuffle** into the MMA-friendly layout, activations kept resident in registers,
and **Stream-K** partitioning. Sources: IST-DASLab/marlin, arXiv 2408.11743, vLLM machete (Hopper successor).
## Phases (each ends with: numerical parity vs MMQ + a prefill benchmark)
### P0 — Harness + baseline — DONE
- **Correctness gate (GREEN):** `test-backend-ops test -o MUL_MAT -b CUDA0` → **1103/1103 passed** (CUDA vs CPU
reference, covers Q4_0/Q4_K at the real FFN shapes m=4096,k=14336,n=1..512). This is *the* parity check the
W4A16 kernel must keep green at every phase — it tests the CUDA MUL_MAT path the kernel will hook. The
`not supported` lines are `type_b=f16` combos (irrelevant; prefill uses f32 activations).
- **Perf baseline:** `llama-bench` dense Q4_K prefill = **~750 t/s (pp512 718 / pp2048 750) ≈ 46 TFLOP/s ≈ 21%
of the 213 BF16 ceiling**. The kernel must beat this toward ~3,300. (`test-backend-ops perf -o MUL_MAT` gives
per-shape GFLOPS too; build it once with the harness.)
- **Op-level baseline (the canonical kernel target), `test-backend-ops perf -o MUL_MAT`, m=4096 k=14336 (FFN):**
| n (tokens) | q4_0 | q4_K | regime |
|---|---|---|---|
| 1 | 817 GFLOPS | 761 GFLOPS | decode / mat-vec (memory-bound) |
| 8 | 5.77 TFLOPS | 4.11 TFLOPS | small-batch |
| **512** | **49.5 TFLOPS** | **47.1 TFLOPS** | **prefill GEMM — ~22% of the 213 ceiling** |
So the prefill GEMM target: lift q4_K n=512 from **47 → toward ~213 TFLOPS** (~4.5×). This per-shape number
is cleaner than end-to-end for kernel iteration.
- **Harness script:** `~/p0harness.sh` on the DGX (build test-backend-ops + correctness + perf). Reusable each
phase: `test-backend-ops test -o MUL_MAT -b CUDA0` must stay 1103/1103; the q4_K n=512 perf must climb from 47.
- test-backend-ops needed `-DLLAMA_BUILD_TESTS=ON`; now built in `~/llama.cpp-pr24423/build`.
### P1 — Dispatch seam (no behavior change) — DONE
- `marlin-w4a16.{cuh,cu}` + a gated hook in `ggml_cuda_mul_mat` (dense, non-ids path), behind
`GGML_CUDA_W4A16` + sm_120/121 (`cc >= GGML_CUDA_CC_BLACKWELL`) + type∈{Q4_0,Q4_K} + f32 activations.
Returns false → falls back to MMQ. Source + apply instructions: `kernel/w4a16/` (`HOOK.md`).
- **Verified on GB10:** clean build; `test-backend-ops MUL_MAT` = **1103/1103** (byte-identical default);
`llama-bench` dense Q4 pp512 unchanged (717.77 default / 718.26 with flag); `GGML_CUDA_W4A16=1` reaches the
seam (stderr `[w4a16] ... P1 seam - using MMQ`) and falls back. The empty frame P2/P3 fills.
### P2 — Correctness-first kernel (slow OK) — DONE
- **Kernel:** `marlin-w4a16.cu` replaces the P1 TODO with a real W4A16 GEMM. In-kernel dequant Q4→BF16 into
shared mem, `mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32` via ggml's `mma.cuh` tile abstractions
(`tile<16,8,nv_bfloat162>` A, `tile<8,8,nv_bfloat162>` B, `tile<16,8,float>` C), F32 accumulate, F32 write.
One warp per 16(M)x8(N) output tile, K looped in steps of 16. Both src0 (weights, row m) and src1 (acts,
row n) are row-major `[row][k]`, so A and B load symmetrically via `load_generic`; the mma does the dot over k.
- **Types handled:** Q4_0 and Q4_K. Q4_0 dequant `w=d*(q-8)` inline; Q4_K via the superblock decode mirrored
from `convert.cu` (`get_scale_min_k4`, 8x32 sub-blocks, `d*q-m`).
- **Shape classes handled:** contiguous 2D GEMM (the prefill path), `ne2==ne3==1`, f32 activations, K%16==0
(always true: Q4_0 K%32, Q4_K K%256). **Falls back to MMQ (returns false)** for batched (bs!=[1,1]),
broadcast (nr!=[1,1]), permuted / non-contiguous (per!=[0,1,2,3]), and any non-f32 activation (e.g. f16) -
keeps the gate green. M / N boundaries are zero-padded in-kernel (handles M not %16, N not %8).
- **Parity (the gate):** `GGML_CUDA_W4A16=1 test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103 passed**
(the Q4_0/Q4_K f32 contiguous shapes run the kernel and match the CPU reference; batched/permuted/f16 fall
back). Default (flag-unset) build still **1103/1103** (byte-identical, seam returns false).
- **Model sanity / P2 perf:** `GGML_CUDA_W4A16=1 llama-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -p 512 -n 16
-ub 2048` runs clean: **pp512 = 31.75 t/s**, tg16 = 6.28 t/s. Slow as expected (naive 1-warp/tile, weights
re-dequantized per n-tile, no pipeline) - this is the correctness checkpoint; P3 brings the speedup. The real
Q4_K model matmul path engages the kernel without error.
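
For reference, the Q4_0 rule `w=d*(q-8)` looks like this on one block. This is a host-side Python sketch, not the CUDA kernel; it assumes the usual ggml `block_q4_0` layout (an fp16 scale `d` followed by 16 bytes holding 32 nibbles, low nibbles for elements 0..15 and high nibbles for elements 16..31), and the function name is illustrative.

```python
import struct

QK4_0 = 32  # weights per Q4_0 block

def dequantize_q4_0(block: bytes):
    """Expand one 18-byte Q4_0 block into 32 floats using w = d * (q - 8)."""
    d, qs = struct.unpack("<e16s", block)        # fp16 scale + 16 packed bytes
    low = [d * ((b & 0x0F) - 8) for b in qs]     # elements 0..15
    high = [d * ((b >> 4) - 8) for b in qs]      # elements 16..31
    return low + high

# Build a test block: scale 0.5, low nibbles 0..15, every high nibble = 8 (decodes to 0).
packed = bytes((8 << 4) | i for i in range(16))
block = struct.pack("<e16s", 0.5, packed)

weights = dequantize_q4_0(block)
print(len(weights), weights[0], weights[15], weights[16])
```

The kernel described above does the same arithmetic per element, but writes BF16 into shared memory instead of returning floats.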
### P3 — The Marlin pipeline (the speedup) — STEP 1 + SKEW-PAD/TILING LANDED; PREPACK + PIPELINE + STREAM-K DEFERRED
Goal: `cp.async` double/triple-buffered global->shared; offline weight reshuffle (a one-time repack of the Q4
tensor into the mma+pipeline layout); register-resident activation tiles; Stream-K split for the prefill M.
Target: >=150 TFLOP/s (>=~2,300 t/s), then ~213. **MMQ baseline to beat: 47.1 TFLOPS (q4_K n=512) / pp512 718.**
**Kernel structure now (committed P3b):** block-tiled multi-warp GEMM with a CONFLICT-FREE shared feed via skew
padding. `blockDim=(32, WM*WN)` so `threadIdx.x` is the warp lane (required by `mma.cuh` get_i/get_j) and
`threadIdx.y` is the warp index; the original 1-warp P2 launch put 128 threads on `threadIdx.x` and exploded
`get_j` into an out-of-bounds shared read (found via compute-sanitizer). `WM*WN` warps compute a
`BM(=WM*FM*16) x BN(=WN*FN*8)` output tile; each warp owns an `FM x FN` grid of m16n8k16 mma fragments
accumulated in F32. Per k-step (16-deep): all warps cooperatively dequant the `BM x 16` Q4 weight strip + load
the `BN x 16` f32->bf16 activation strip into shared, one `__syncthreads`, then `ldmatrix.x4` (A) / `ldmatrix.x2`
(B) fragments + `FM*FN` mmas. The shared rows hold 8 bf162 of data but are stored at a PADDED stride of 12 bf162
(`W4A16_SPAD`): ldmatrix's per-lane address is `row*stride`, and the natural stride 8 (a divisor of the
32-bank / 128-byte cycle) collides rows 0,4,8,12 into a 2-way bank conflict; skewing to 12 (a multiple of 4 bf162 = 16 bytes, so
ldmatrix's 16-byte alignment holds) makes `{r*12 mod 32}` hit 8 distinct bank-quads for r in 0..7, so both
halves of ldmatrix are conflict-free at only +50% on the small staged tile (~12 KB at the shipping tile).
Shipping config `WM=4,WN=4,FM=2,FN=4` -> `BM=128, BN=128`, 16 warps, 8 m16n8 C-tiles per warp (keeping
register pressure low is what lets BN grow without an occupancy cliff). M/N tails zero-padded in-kernel; still
gated to contiguous 2D Q4_0/Q4_K f32 prefill, else falls back to MMQ.
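The bank arithmetic and tile sizes in this paragraph can be checked on the host. The sketch below models shared memory as 32 four-byte banks, treats one bf162 as one bank, and treats each 16-byte ldmatrix row as one bank-quad; the function names are illustrative and the model ignores everything except row start addresses.

```python
BANKS = 32   # 4-byte shared-memory banks per 128-byte cycle
QUAD = 4     # one 16-byte ldmatrix row covers 4 consecutive banks

def bank_quads(stride_bf162, offset=0, rows=8):
    """Bank-quad hit by each of 8 rows; one bf162 (4 bytes) occupies one bank."""
    return [((r * stride_bf162 + offset) % BANKS) // QUAD for r in range(rows)]

def conflict_free(stride_bf162):
    """True if both 16-byte halves of a row strip touch 8 distinct bank-quads."""
    return all(len(set(bank_quads(stride_bf162, offset))) == 8 for offset in (0, QUAD))

def tile(wm, wn, fm, fn):
    """(BM, BN, warps, C-tiles per warp) from the formulas in the text."""
    return wm * fm * 16, wn * fn * 8, wm * wn, fm * fn

def staged_bytes(bm, bn, stride_bf162=12):
    """Shared bytes for one k-step: (BM + BN) rows at the padded stride."""
    return (bm + bn) * stride_bf162 * 4

print("stride 8 :", bank_quads(8), conflict_free(8))
print("stride 12:", bank_quads(12), conflict_free(12))
bm, bn, warps, c_tiles = tile(wm=4, wn=4, fm=2, fn=4)
print(bm, bn, warps, c_tiles, staged_bytes(bm, bn))
```

Stride 8 maps the eight rows onto only four bank-quads (a 2-way conflict), while stride 12 spreads them over all eight, at the cost of 12/8 = 1.5× the staged bytes.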
**Per-step results (q4_K n=512 via `test-backend-ops perf`; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M):**
| step | q4_K n=512 | q4_0 n=512 | pp512 | pp2048 | vs MMQ 47 / 718 | notes |
|---|---|---|---|---|---|---|
| P2 (1 warp/tile) | ~2 TFLOPS | - | 31.75 | - | 0.04x | correctness checkpoint |
| Step 1: block tiling (load_generic, BM64/4w) | 6.63 (cold) | 7.53 | 119 | 123 | 0.14x | original committed kernel |
| P3b-1: skew-pad ldmatrix + BM128/8w | 8.50 (cold) | 10.56 | 148.5 | 153.9 | 0.18x | +28% q4_K, +40% q4_0 over step 1 |
| **P3b-2: + BN128/16w (current)** | **9.92 (cold)** | **11.68** | **177.6** | **185.0** | **0.21x** | +17% q4_K, +20% pp512 over P3b-1 (+49% pp512 over step 1) |
Parity gate **1103/1103** at every step, flag set and unset (byte-identical when unset). All P3b numbers above
are from thermally-bracketed cold A/B sessions (the committed kernel was measured immediately before AND after
each candidate, matching both times -> the deltas are real, not thermal). P3b-1 cold A/B (q4_K/q4_0): 6.63/7.53
vs 8.52/10.49. P3b-2 cold A/B (q4_K/q4_0): BN64/8w 8.50/10.56 then 8.45/10.51 (bracket) vs BN128/16w 9.92/11.68.
**What landed / what was tried (honest):**
- **P3b - LANDED (committed).** Two combined changes lift the prior committed kernel: (1) **skew-pad
conflict-free ldmatrix** (shared row stride 8->12 bf162; makes `ldmatrix.x4`/`.x2` bank-conflict-free at near
zero occupancy cost) and (2) **bigger tile / more warps** (`BM=128, BN=64`, 8 warps). Cold A/B: q4_K
6.63->8.52 (+28%), q4_0 7.53->10.49 (+40%), pp512 119->148.5 (+25%). **Still ~5.5x under MMQ (47) per-op and
~4.8x under pp512 718 - does NOT beat MMQ.** This is forward progress, not the finish line.
- **The XOR-swizzle-FIRST plan was tested and is WRONG for this GPU - documented so it is not re-tried.** A
wide-row (BK=64, 128-byte rows) XOR swizzle `seg ^ (row&7)` IS conflict-free, but the 16 KB shared it needs
collapsed occupancy and dropped q4_K n=512 to **2.84 TFLOPS** (worse than the unswizzled 6.63) - the same
occupancy cliff P3 hit with a 32 KB pipeline. The conflict-free feed must be bought WITHOUT widening shared:
skew padding (above) does exactly that (6 KB), which is why it is the committed form. Lesson: on GB10 occupancy
dominates bank-conflict latency; never trade occupancy for a conflict-free layout.
- **Conflict-free feed alone did NOT beat the unswizzled kernel - the limiter moved.** At the SAME BM64/4w tile,
skew-pad ldmatrix (6.70) ~= load_generic (6.63): removing bank conflicts bought ~nothing. The win came only
when the tile grew (BM128/8w). A 5-config tile sweep then split the two quant types:
- **q4_0 SCALES with warps/tiles** (7.7 -> 10.5 -> **15.8 TFLOPS at BM128/16w**): feed/global-traffic bound,
helped by cutting redundant activation re-reads (more BM = fewer M-blocks each re-reading the act column).
- **q4_K is largely DEQUANT-COMPUTE bound** (the BM64/16w tile gives q4_0=15.8 but q4_K=6.8 - they diverge
hard). This **refines P3's "within 12%" finding**: that held only in the low-throughput memory-bound regime;
once the feed is unblocked, q4_K's per-element 6-bit superblock decode (`get_scale_min_k4` + superblock
indexing, redone every k-step AND re-done by every N-block) becomes the wall. BM256 regressed both (too few
blocks / register pressure).
- **Growing BN partly relieves the q4_K dequant wall (P3b-2).** Because every N-block re-decodes the same
weight strip, halving the N-block count (BN 64->128) halves that redundant q4_K decode - but only when BN is
spread across MORE WARPS (16w, 8 C-tiles/warp), not more fragments-per-warp: the FN=8 / FM=4 variants (16
C-tiles/warp) regressed to ~6.6 on register pressure, while WM=4,WN=4,FM=2,FN=4 (16w, 8 tiles/warp) lifted
q4_K 8.5->9.9 and q4_0 10.6->11.7 cold. BN=256 was no better and costs more shared. **BN128/16w is the
shipping tile.**
- **Next blocker (the remaining q4_K unlock) = offline prepack.** BN growth only divides the redundant decode by
the N-block count; it cannot remove the per-k-step decode itself. The full fix is the **one-time offline
repack** - decode the Q4 tensor ONCE into a cached device buffer keyed off the tensor data pointer, in a layout
with the scale/min pre-applied (store reshuffled 4-bit + per-subblock bf16 d,m, ~1.25x the q4 size, NOT a full
bf16 blow-up which would be ~4x), so the in-kernel path becomes a cheap `q*d - m` with coalesced loads. Then
`cp.async` multi-stage (sized to NOT widen shared past the occupancy cliff) and **Stream-K** over M. These
remain the multi-week core; **prepack is the highest-value next step for q4_K specifically** (it should let
q4_K join q4_0 on the feed-bound scaling curve instead of plateauing at ~10).
- **Methodology note (unchanged):** the box thermally throttles under sustained perf+bench runs (identical code
~8.8 cold vs ~6.6 hot earlier), so only same-session A/Bs are trustworthy. The P3b deltas above were taken in
one bracketed cold session for exactly this reason.
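The scale half of the prepack idea above can be sketched on the host. This is a minimal sketch under stated assumptions: `PrepackedSub`, `prepack_superblock_scales` and `dequant_prepacked` are hypothetical names, `float` stands in for the bf16 the plan proposes, and the nibble reshuffle is not shown. The 6-bit unpacking mirrors the kernel's `w4a16_scale_min_k4`.

```cpp
#include <cassert>
#include <cstdint>

// Same 6-bit unpacking as the kernel's w4a16_scale_min_k4, run on the host.
inline void scale_min_k4(int j, const uint8_t * q, uint8_t & d, uint8_t & m) {
    if (j < 4) {
        d = q[j] & 63;
        m = q[j + 4] & 63;
    } else {
        d = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        m = (q[j + 4] >> 4)  | ((q[j]     >> 6) << 4);
    }
}

// Hypothetical prepacked form: one ready-to-use (d, m) pair per 32-element
// sub-block. float is used here for clarity; the plan proposes bf16.
struct PrepackedSub { float d; float m; };

// Decode the 12 packed scale bytes of one 256-element superblock exactly once.
inline void prepack_superblock_scales(const uint8_t scales[12], float dall, float dmin,
                                      PrepackedSub out[8]) {
    for (int j = 0; j < 8; ++j) {
        uint8_t sc, mn;
        scale_min_k4(j, scales, sc, mn);
        out[j] = { dall * sc, dmin * mn };
    }
}

// What the hot loop reduces to once the scales are pre-applied.
inline float dequant_prepacked(const PrepackedSub & s, int q) { return q * s.d - s.m; }
```

With the decode hoisted out like this, the per-k-step work in the kernel would be a table lookup plus one multiply-subtract per weight instead of the bit unpacking.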
### P4 — Tune
- Tile (mmq_x/y analogues), warps, pipeline depth, occupancy. We have nsys (throughput) but **not ncu** on the
DGX — tuning is empirical (sweep configs, measure t/s). Note ncu would need sudo/driver perms we lack.
### P5 — Enable
- Default on for sm_120/121 + Q4_0/Q4_K dense when parity holds + faster; keep the flag as an escape hatch.
Ship as a LocalAI llama.cpp patch (the patches/ series) and/or upstream (ggml has no Marlin-equivalent —
issue #1519 — so it's net-new upstream value; float it with maintainers first).
## Risks / notes
- **Multi-week, expert-CUDA, DGX-only** (GB10 is the only sm_121). The session's network flakiness +
`llama-cli` hang make `llama-bench`/`test-backend-ops` the reliable verification tools (both work).
- Quantization correctness: Q4_K's superblock structure (256-elem, 6-bit scales) is more complex to dequant
in-kernel than Q4_0; consider landing Q4_0 first, then Q4_K.
- **Beat-path follow-on:** the FP4-MMA path (`mul_mat_q<MXFP4>`, ~5% of FP4 peak) tuned/fixed on sm_121 reaches
~6,600 (2× BF16). Separate track; this W4A16 kernel is the match-path foundation.
- Reuse ggml's `mma.cuh` tile abstractions (MMQ already uses them) rather than raw PTX where possible.
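To make the Q4_K layout concern concrete, here is a small host-side sketch of where one element of a 256-element superblock lives. It follows the same index split the kernel's `w4a16_dq_q4_K` uses; `Q4KPos` and `q4k_locate` are illustrative names only.

```cpp
#include <cassert>

// Where element e (0..255) of a Q4_K superblock lives.
struct Q4KPos {
    int  qs_byte; // index into the 128-byte qs[] array
    bool high;    // true -> high nibble, false -> low nibble
    int  sub;     // which of the 8 (scale, min) pairs applies
};

inline Q4KPos q4k_locate(int e) {
    assert(e >= 0 && e < 256);
    const int il   = e / 64;        // 64-element group, 0..3
    const int half = (e % 64) / 32; // first 32 use low nibbles, next 32 high nibbles
    const int pos  = e % 32;        // byte within the group's 32 bytes
    return { 32 * il + pos, half == 1, 2 * il + half };
}
```

Each group of 64 elements shares 32 bytes: the first 32 elements take the low nibbles and the next 32 take the high nibbles, each half with its own scale/min pair. Q4_0 needs none of this, which is why it is the simpler type to land first.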

@@ -1,31 +0,0 @@
# W4A16 seam — how to apply to a llama.cpp / ggml-cuda checkout
Two source files + two one-line edits to `ggml/src/ggml-cuda/ggml-cuda.cu`. The build picks up the
new `.cu` via the existing `file(GLOB)` after a `cmake -S . -B build` reconfigure (no CMakeLists edit).
## Files (copy into `ggml/src/ggml-cuda/`)
- `marlin-w4a16.cuh`
- `marlin-w4a16.cu`
## Edit `ggml/src/ggml-cuda/ggml-cuda.cu`
1. **Include** — after the existing `#include "ggml-cuda/fp4-grouped-moe.cuh"` (sibling-header style):
```cpp
#include "ggml-cuda/marlin-w4a16.cuh"
```
2. **Dispatch hook** — immediately before the dense dispatch chain, i.e. before
`if (!split && use_mul_mat_vec_f) {` in `ggml_cuda_mul_mat(...)` (after `const int cc = ...`):
```cpp
if (!split && ggml_cuda_w4a16_mul_mat(ctx, src0, src1, dst)) { return; }
```
## Verify (P1 acceptance — met)
- `cmake --build build --target test-backend-ops llama-bench` → builds clean.
- `test-backend-ops test -o MUL_MAT -b CUDA0` → **1103/1103** (byte-identical default).
- `llama-bench` dense Q4 pp512 → unchanged (~718, MMQ).
- `GGML_CUDA_W4A16=1 llama-bench` → unchanged + stderr `[w4a16] ... P1 seam - using MMQ` (seam reached,
gating passes on sm_121, falls back).
The kernel body (P2 correctness → P3 Marlin pipeline) replaces the `TODO(P2/P3)` block in `marlin-w4a16.cu`
and returns `true` once parity holds.

@@ -1,66 +0,0 @@
# W4A16 kernel - subagent dispatch briefs (P3, P4, P5)
**Dispatch strategy.** Each phase = one fresh **Opus-4.8** subagent handed a complete zero-context brief.
Phases are **sequential** (P3 needs P2's correct kernel; P4 needs P3's pipeline; P5 needs P4's tuned kernel),
so dispatch phase N+1 only after phase N's commit lands, and before dispatching, splice phase N's *actual*
deliverable (final kernel shape, configs, fallback set) into the next brief. P2's brief (already dispatched)
is the template; reuse the COMMON section below verbatim in every dispatch.
---
## COMMON (paste into every phase brief)
- **Kernel dev is on the remote DGX** (GB10, sm_121): `ssh -o ConnectTimeout=25 -o ServerAliveInterval=10 -o ServerAliveCountMax=10 dgx.casa '<cmd>'`. Network is FLAKY (re-poll on drop; nohup jobs survive). `llama-cli` HANGS - never use it. Only `llama-bench` + `test-backend-ops` work.
- Checkout `~/llama.cpp-pr24423`, build `~/llama.cpp-pr24423/build` (sm_121, `-DLLAMA_BUILD_TESTS=ON`). Kernel file `ggml/src/ggml-cuda/marlin-w4a16.cu`. Build auto-GLOBs it; no CMakeLists edits. Hook already in `ggml-cuda.cu`, gated behind env `GGML_CUDA_W4A16`.
- Dense test model: `~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
- **Builds run detached + poll** (never blocking foreground): write a `~/pN.sh` that builds `--target test-backend-ops llama-bench`, echoes `RC=$?`, runs the gate, echoes `PN_DONE`; `nohup` it; poll `for i in $(seq 1 90); do grep -q PN_DONE ~/pN.out && break; sleep 20; done; tail ~/pN.out`.
- **GPU hygiene:** check `docker ps | grep local-ai` + `nvidia-smi`; `docker stop` a running localai worker if present (authorized); never pkill native procs; never start model servers.
- **Parity gate (must stay green every step):** `GGML_CUDA_W4A16=1 CUDA_VISIBLE_DEVICES=0 ./build/bin/test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103**; and flag-unset stays 1103/1103 (byte-identical). A wrong result is worse than a fallback - return false for any shape you can't do correctly.
- **Perf measurement:** `test-backend-ops perf -o MUL_MAT -b CUDA0` (per-shape GFLOPS; the canonical target is q4_K m=4096 k=14336 **n=512**, baseline **47.1 TFLOPS**, ceiling ~213) + `llama-bench -m <model> -ngl 99 -p 512,2048 -n 0 -ub 2048` (baseline pp512 ~718).
- **LocalAI repo (commit here; you do NOT inherit cwd - `cd` explicitly):** `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`. Plan: `backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md`. Source mirror: `backend/cpp/llama-cpp/paged/kernel/w4a16/`. After a phase passes: fetch the final `marlin-w4a16.cu` from the DGX (`ssh ... 'cat ...'`), overwrite the mirror, update the plan (mark the phase DONE with numbers), `git commit -s` (DCO sign-off; user is Ettore Di Giacinto <mudler@localai.io>). **No `Co-Authored-By`. No em-dashes anywhere. Trailer `Assisted-by: Claude:opus-4.8 [Claude Code]`. Do NOT push.**
- Final message = the result (gate ?/1103, the perf delta, blockers + resolutions, commit hash). A precise partial result beats a vague success claim.
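For reading the perf numbers, a minimal sketch of the throughput arithmetic, assuming the conventional count of 2*M*N*K floating-point operations per dense matmul (the helper names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a dense M x N x K matmul is counted as 2*M*N*K floating-point ops.
inline int64_t matmul_flops(int64_t m, int64_t n, int64_t k) { return 2 * m * n * k; }

// Achieved throughput for one op that took `seconds`.
inline double tflops(int64_t m, int64_t n, int64_t k, double seconds) {
    assert(seconds > 0.0);
    return (double) matmul_flops(m, n, k) / seconds / 1e12;
}

// Time one op takes at a given sustained throughput.
inline double seconds_at(int64_t m, int64_t n, int64_t k, double tflops_rate) {
    assert(tflops_rate > 0.0);
    return (double) matmul_flops(m, n, k) / (tflops_rate * 1e12);
}
```

For the canonical shape (m=4096, k=14336, n=512) this gives 60,129,542,144 operations, so under this counting the 47.1 TFLOPS baseline corresponds to roughly 1.28 ms per op.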
---
## P3 brief - the Marlin pipeline (the speedup)
**Goal.** Take P2's correct-but-slow kernel past the MMQ baseline (~47 TFLOPS) toward ~150+ TFLOPS (then ~213) on the q4_K n=512 prefill GEMM, **without ever breaking parity**. This is the Marlin design: the math is the same BF16 mma; the speed comes from feeding the tensor cores without stalling.
**Implement, incrementally (re-run the parity gate after each):**
1. **`cp.async` multi-stage pipeline** - double/triple-buffer global->shared loads of both the Q4 weight tiles and the activation tiles so dequant+mma on stage k overlaps the load of stage k+1. (Study `mma.cuh` + how `mmq.cu`/`mmf.cu` stage shared memory; ggml already uses `cp.async`/`__pipeline_*`.)
2. **Offline weight reshuffle** - repack the Q4 weights once into the mma+pipeline-friendly layout (Marlin's interleave) so loads are coalesced and the mma fragment maps directly. Do this as a load-time transform of src0 (a new prepacked buffer keyed off the tensor) - NOT per-call. Document where the repack lives + its memory cost.
3. **Register-resident activation tiles + Stream-K** split of the M dimension across blocks for the prefill (large-M) case so all SMs stay busy.
**Acceptance.** Parity gate stays **1103/1103** at every commit; `test-backend-ops perf` q4_K n=512 climbs materially above 47 TFLOPS (target >=150) and `llama-bench` pp512 climbs above ~718. Report the TFLOPS + t/s after each of the 3 steps so the contribution of each is visible. If a step regresses parity, revert it and report why.
**Reference.** IST-DASLab/marlin (github), arXiv 2408.11743, vLLM machete. Mirror `mmf.cu`'s BF16 GEMM structure; Marlin = that + Q4 dequant-on-load + the pipeline/reshuffle.
**Splice before dispatch:** P2's final kernel structure (tile sizes, which types/shapes it handles vs falls back, helper functions it defined).
---
## P4 brief - tune to the ceiling
**Goal.** Drive the P3 kernel as close to the ~213 TFLOPS ceiling as empirical tuning allows. **No `ncu` on this box** (no driver perms) - tune by throughput: `test-backend-ops perf` + `llama-bench` + `nsys` (throughput only).
**Do.** Parametrize the kernel (template params / constants) over: tile M/N/K, warps per block, pipeline depth (stages), and occupancy (regs, shared-mem budget). Sweep systematically (a script that rebuilds + benches each config, logs q4_K n=512 TFLOPS + pp512/pp2048 t/s), pick the best, hard-set it (with a short comment on the sweep). Check both prefill shapes (n=512 and n=2048) and confirm decode (n=1) didn't regress (it should still route to mat-vec, not this kernel - verify the gating).
**Acceptance.** Best config maximizes q4_K n=512 TFLOPS (stretch ~150-213) with parity **1103/1103** intact; the sweep table (config -> TFLOPS/t-s) is recorded in the plan's P4 section. Report the chosen config + the final pp512/pp2048 t/s vs the 718/750 baseline and vs vLLM's ~3300 single-stream target.
**Splice before dispatch:** P3's pipeline structure + the perf it reached + which knobs are already fixed vs free.
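A sweep script can discard hopeless configs before paying for a rebuild. The sketch below derives the block tile, warp count, per-warp accumulator tiles and staged shared bytes from the four template parameters, using the same formulas as the kernel. `TileCfg`, `describe` and `prune` are hypothetical helpers, and the budgets passed to `prune` are for the caller to pick.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct TileCfg { int WM, WN, FM, FN; };

struct TileInfo {
    int BM, BN;          // block tile: BM = WM*FM*16, BN = WN*FN*8
    int warps;           // WM*WN
    int ctiles_per_warp; // FM*FN m16n8 accumulator tiles held in registers
    size_t shared_bytes; // both staged strips at the padded stride
};

inline TileInfo describe(const TileCfg & c, int spad_bf162 = 12) {
    TileInfo t;
    t.BM = c.WM * c.FM * 16;
    t.BN = c.WN * c.FN * 8;
    t.warps = c.WM * c.WN;
    t.ctiles_per_warp = c.FM * c.FN;
    t.shared_bytes = (size_t)(t.BM + t.BN) * (size_t) spad_bf162 * 4; // 4 bytes per bf162
    return t;
}

// Hypothetical pre-filter: drop configs that exceed a shared-memory or
// per-warp accumulator budget before spending a rebuild + bench on them.
inline std::vector<TileCfg> prune(const std::vector<TileCfg> & all,
                                  size_t max_shared, int max_ctiles) {
    std::vector<TileCfg> keep;
    for (const TileCfg & c : all) {
        const TileInfo t = describe(c);
        if (t.shared_bytes <= max_shared && t.ctiles_per_warp <= max_ctiles) {
            keep.push_back(c);
        }
    }
    return keep;
}
```

For example, `describe({4, 4, 2, 4})` yields `BM=128, BN=128`, 16 warps, 8 C-tiles per warp and 12288 staged bytes.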
---
## P5 brief - enable + package + (maybe) upstream
**Goal.** Make W4A16 the default dense-Q4 path on Blackwell and ship it through LocalAI.
**Do.**
1. **Flip the gate:** default-ON for sm_120/121 + Q4_0/Q4_K dense when faster, keep an opt-out env (e.g. `GGML_CUDA_W4A16=0`) as an escape hatch. The existing return-false-on-unhandled-shape path is the correctness safety net; keep it. Verify the default (no env) build now runs W4A16 for dense Q4, gate green, faster than the old MMQ baseline.
2. **Package as a LocalAI llama.cpp patch:** produce `backend/cpp/llama-cpp/paged/patches/kernel/0002-w4a16-marlin.patch` (the new files + the `ggml-cuda.cu` hook + the gate flip) that applies cleanly to the pinned llama.cpp, mirroring the existing `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`. Confirm LocalAI's `make backends/llama-cpp` build path can consume it (read `.agents/llama-cpp-backend.md` + the build memory: `make -C backend/cpp/llama-cpp clean` before rebuilds).
3. **Docs:** update `BLACKWELL_KERNEL_GAPS.md` + the plan with the shipped result; add a short note to the LocalAI docs if there's a Blackwell/performance page.
4. **Upstream decision (do NOT open without surfacing first):** ggml has no Marlin-equivalent (issue #1519) so this is net-new upstream value. Draft (do not submit) an upstream PR description + note the sm_121 build-flag caveats; report it for the user to decide.
**Acceptance.** Default Blackwell build uses W4A16 for dense Q4, parity 1103/1103, measurably faster than MMQ; the patch applies + the LocalAI llama-cpp backend builds with it (verify or, if the full backend build is too heavy, document the exact build command + that the patch applies cleanly). Report the end-to-end LocalAI dense-Q4 prefill number vs the start-of-project 765 t/s.
**Splice before dispatch:** P4's final kernel + config + the measured ceiling reached; the exact enable condition decided.
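One way the gate flip in step 1 could be expressed is a tri-state parse of the environment value. This is a sketch under assumptions: `w4a16_gate` is a hypothetical helper, and the exact opt-out semantics are still to be decided.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical tri-state gate: unset or empty -> use the build default,
// "0" -> force off, anything else -> force on.
// `value` is what std::getenv("GGML_CUDA_W4A16") would return.
inline bool w4a16_gate(const char * value, bool default_on) {
    if (value == nullptr || value[0] == '\0') {
        return default_on;
    }
    return std::strcmp(value, "0") != 0;
}
```

The current `w4a16_enabled()` only checks that the variable is set, so `GGML_CUDA_W4A16=0` enables the kernel today; a parse like this would be needed before `=0` can act as the escape hatch.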

@@ -1,258 +0,0 @@
#include "marlin-w4a16.cuh"
#include "mma.cuh"
#include <cstdio>
#include <cstdlib>
#include <cuda_bf16.h>
// W4A16 Marlin-style GEMM.
//
// In-kernel dequantize Q4 weights -> BF16, multiply against BF16-converted F32
// activations using mma.sync m16n8k16 BF16 tensor-core ops, accumulate in F32,
// write F32 output. Handles only the contiguous 2D GEMM (prefill) case for
// Q4_0 / Q4_K; everything else returns false and falls back to MMQ.
//
// ggml MUL_MAT convention: dst[m,n] = sum_k src0[k,m] * src1[k,n].
// src0 (weights): ne0=K (contiguous), ne1=M -> row m is K contiguous quants.
// src1 (acts,f32): ne0=K (contiguous), ne1=N -> row n is K contiguous floats.
// dst (f32): ne0=M (contiguous), ne1=N -> element (m,n) at m + n*M.
// Both operands are row-major [row][k]; m16n8k16 computes C[m,n] += sum_k A[m,k]*B[n,k].
//
// Thread layout: blockDim = (32, WM*WN). threadIdx.x is the warp lane (0..31,
// required by mma.cuh get_i/get_j), threadIdx.y is the warp index.
//
// P3b step 1 - conflict-free shared layout via SKEW PADDING:
// - WM*WN warps compute a BM(=WM*FM*16) x BN(=WN*FN*8) output tile; each warp
// owns an FM x FN grid of m16n8k16 mma fragments accumulated in F32.
// - Per 16-deep k-step the warps cooperatively dequant the BM x 16 Q4 weight
// strip + load the BN x 16 f32->bf16 activation strip into shared, then feed
// the tensor cores with ldmatrix.x4 (A) / ldmatrix.x2 (B).
// - The shared rows are PADDED to SPAD(=12) bf162 instead of the natural 8.
// ldmatrix's per-lane address is row*stride; with the natural stride 8 (a
// divisor of the 32-bank / 128-byte cycle) rows 0,4,8,12 collide -> 2-way
// bank conflict on every fragment load (this is why P3 measured a plain
// ldmatrix swap as neutral). Skewing the stride to 12 (4-byte aligned, so
// ldmatrix's 16-byte alignment holds) makes {r*12 mod 32} hit 8 distinct
// bank-quads for r in 0..7, so both halves of ldmatrix.x4 and ldmatrix.x2 are
// conflict-free. The pad costs only +50% on the small staged tile (8 KB -> 12 KB
// at the BM=128/BN=128 launch config below), so unlike a 128-byte-row XOR swizzle
// it does NOT collapse occupancy on GB10 (a wide-row swizzle pushed shared to
// 16 KB and dropped this to ~2.8 TFLOPS).
//
// Dead-ends already proven (do not re-try): a double-buffered KSTAGE=64 cp.async
// pipeline collapsed occupancy (32 KB shared -> 2.7 TFLOPS); a plain ldmatrix on
// the UNpadded layout was neutral (bank conflicts); a wide-row (BK=64) XOR swizzle
// was conflict-free but occupancy-starved (16 KB shared -> 2.8 TFLOPS). Skew
// padding gets the conflict-free feed at near-zero occupancy cost.
using namespace ggml_cuda_mma;
typedef tile<16, 8, nv_bfloat162> tile_A; // 16(M) x 16(K)
typedef tile< 8, 8, nv_bfloat162> tile_B; // 8(N) x 16(K)
typedef tile<16, 8, float> tile_C; // 16(M) x 8(N)
// bf162 columns actually live per shared row (16 k-values = 8 bf162) ...
#define W4A16_KP 8
// ... padded to this stride to bank-skew the ldmatrix row addresses.
#define W4A16_SPAD 12
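// Opt-in gate: true whenever GGML_CUDA_W4A16 is set, whatever its value
// (so "0" also enables); read once and cached for the process lifetime.
static bool w4a16_enabled() {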
static const bool en = (std::getenv("GGML_CUDA_W4A16") != nullptr);
return en;
}
// 6-bit packed scale/min decode for Q4_K (mirrors convert.cu get_scale_min_k4).
static __device__ __forceinline__ void w4a16_scale_min_k4(int j, const uint8_t * q, uint8_t & d, uint8_t & m) {
if (j < 4) {
d = q[j] & 63; m = q[j + 4] & 63;
} else {
d = (q[j+4] & 0xF) | ((q[j-4] >> 6) << 4);
m = (q[j+4] >> 4) | ((q[j-0] >> 6) << 4);
}
}
// Dequantize a single Q4_0 weight at column k of a row.
static __device__ __forceinline__ float w4a16_dq_q4_0(const char * row, int k) {
const block_q4_0 * blk = (const block_q4_0 *) row + (k / QK4_0);
const int j = k % QK4_0;
const float d = __half2float(blk->d);
const int q = (j < QK4_0/2) ? (blk->qs[j] & 0xF) : (blk->qs[j - QK4_0/2] >> 4);
return (q - 8) * d;
}
// Dequantize a single Q4_K weight at column k of a row.
static __device__ __forceinline__ float w4a16_dq_q4_K(const char * row, int k) {
const block_q4_K * blk = (const block_q4_K *) row + (k / QK_K);
const int e = k % QK_K;
const int il = e / 64; // 0..3
const int within = e % 64;
const int half = within / 32; // 0..1
const int pos = within % 32;
const int ir = pos / 4; // 0..7
const int l = pos % 4; // 0..3
const int is = 2*il + half;
const float dall = __low2half (blk->dm);
const float dmin = __high2half(blk->dm);
uint8_t sc, mn;
w4a16_scale_min_k4(is, blk->scales, sc, mn);
const float d = dall * sc;
const float m = dmin * mn;
const uint8_t qb = blk->qs[32*il + 4*ir + l];
const int q = (half == 0) ? (qb & 0xF) : (qb >> 4);
return d * q - m;
}
template <bool IS_Q4_K, int WM, int WN, int FM, int FN>
static __global__ void __launch_bounds__(WM*WN*32, 1)
w4a16_gemm_kernel(
const char * __restrict__ src0,
const char * __restrict__ src1,
float * __restrict__ dst,
const int M, const int N, const int K,
const int64_t nb01, const int64_t nb11, const int64_t dst_ne0) {
constexpr int KP = W4A16_KP; // 8 bf162 = 16 k per row
constexpr int SPAD = W4A16_SPAD; // padded row stride (bank skew)
constexpr int BM = WM*FM*16;
constexpr int BN = WN*FN*8;
constexpr int NTH = WM*WN*32;
const int m0 = blockIdx.x * BM;
const int n0 = blockIdx.y * BN;
const int warp_id = threadIdx.y; // 0 .. WM*WN-1
const int warp_n = warp_id % WN;
const int warp_m = warp_id / WN;
const int tid = threadIdx.y*32 + threadIdx.x;
__shared__ nv_bfloat162 sW[BM*SPAD]; // [m][kpair], padded row stride SPAD
__shared__ nv_bfloat162 sB[BN*SPAD]; // [n][kpair], padded row stride SPAD
tile_C C[FM][FN]; // zero-initialized accumulators
for (int k0 = 0; k0 < K; k0 += 16) {
// Dequantize the BM x 16 weight strip once; reused across the block's BN span.
#pragma unroll
for (int idx = tid; idx < BM*KP; idx += NTH) {
const int m = idx / KP;
const int kk = idx % KP;
const int k = k0 + 2*kk;
float w0 = 0.0f, w1 = 0.0f;
if (m0 + m < M) {
const char * row = src0 + (int64_t)(m0 + m) * nb01;
if (IS_Q4_K) { w0 = w4a16_dq_q4_K(row, k); w1 = w4a16_dq_q4_K(row, k + 1); }
else { w0 = w4a16_dq_q4_0(row, k); w1 = w4a16_dq_q4_0(row, k + 1); }
}
sW[m*SPAD + kk] = __floats2bfloat162_rn(w0, w1);
}
// Load the BN x 16 activation strip (f32 -> bf16).
#pragma unroll
for (int idx = tid; idx < BN*KP; idx += NTH) {
const int n = idx / KP;
const int kk = idx % KP;
const int k = k0 + 2*kk;
float a0 = 0.0f, a1 = 0.0f;
if (n0 + n < N) {
const float * arow = (const float *)(src1 + (int64_t)(n0 + n) * nb11);
a0 = arow[k]; a1 = arow[k + 1];
}
sB[n*SPAD + kk] = __floats2bfloat162_rn(a0, a1);
}
__syncthreads();
tile_A Af[FM];
tile_B Bf[FN];
#pragma unroll
for (int fm = 0; fm < FM; ++fm) {
const int mrow = (warp_m*FM + fm) * 16;
load_ldmatrix(Af[fm], sW + mrow*SPAD, SPAD);
}
#pragma unroll
for (int fn = 0; fn < FN; ++fn) {
const int ncol = (warp_n*FN + fn) * 8;
load_ldmatrix(Bf[fn], sB + ncol*SPAD, SPAD);
}
#pragma unroll
for (int fm = 0; fm < FM; ++fm) {
#pragma unroll
for (int fn = 0; fn < FN; ++fn) {
mma(C[fm][fn], Af[fm], Bf[fn]);
}
}
__syncthreads();
}
#pragma unroll
for (int fm = 0; fm < FM; ++fm) {
#pragma unroll
for (int fn = 0; fn < FN; ++fn) {
const int mbase = m0 + (warp_m*FM + fm) * 16;
const int nbase = n0 + (warp_n*FN + fn) * 8;
#pragma unroll
for (int l = 0; l < tile_C::ne; ++l) {
const int m = mbase + tile_C::get_i(l);
const int n = nbase + tile_C::get_j(l);
if (m < M && n < N) {
dst[(int64_t)n * dst_ne0 + m] = C[fm][fn].x[l];
}
}
}
}
}
bool ggml_cuda_w4a16_mul_mat(
ggml_backend_cuda_context & ctx,
const ggml_tensor * src0,
const ggml_tensor * src1,
ggml_tensor * dst) {
if (!w4a16_enabled()) {
return false;
}
if (src0->type != GGML_TYPE_Q4_0 && src0->type != GGML_TYPE_Q4_K) {
return false;
}
if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
return false;
}
const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
if (!GGML_CUDA_CC_IS_NVIDIA(cc) || cc < GGML_CUDA_CC_BLACKWELL) {
return false; // consumer Blackwell (sm_120/121) only
}
if (src0->ne[2] != 1 || src0->ne[3] != 1 ||
src1->ne[2] != 1 || src1->ne[3] != 1 ||
dst->ne[2] != 1 || dst->ne[3] != 1) {
return false;
}
if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(src1) || !ggml_is_contiguous(dst)) {
return false;
}
const int64_t K = src0->ne[0];
const int64_t M = src0->ne[1];
const int64_t N = src1->ne[1];
if (src1->ne[0] != K || dst->ne[0] != M || dst->ne[1] != N) {
return false;
}
if (K % 16 != 0) {
return false;
}
cudaStream_t stream = ctx.stream();
// Block tile config: WM*WN warps compute BM(=WM*FM*16) x BN(=WN*FN*8).
constexpr int WM = 4, WN = 4, FM = 2, FN = 4; // BM=128, BN=128, 16 warps
constexpr int BM = WM*FM*16;
constexpr int BN = WN*FN*8;
const dim3 grid((unsigned)((M + BM - 1) / BM), (unsigned)((N + BN - 1) / BN), 1);
const dim3 block(32, WM*WN, 1);
if (src0->type == GGML_TYPE_Q4_K) {
w4a16_gemm_kernel<true, WM, WN, FM, FN><<<grid, block, 0, stream>>>(
(const char *) src0->data, (const char *) src1->data, (float *) dst->data,
(int) M, (int) N, (int) K, src0->nb[1], src1->nb[1], dst->ne[0]);
} else {
w4a16_gemm_kernel<false, WM, WN, FM, FN><<<grid, block, 0, stream>>>(
(const char *) src0->data, (const char *) src1->data, (float *) dst->data,
(int) M, (int) N, (int) K, src0->nb[1], src1->nb[1], dst->ne[0]);
}
return true;
}

@@ -1,14 +0,0 @@
#pragma once
#include "common.cuh"
// W4A16 Marlin-style BF16 GEMM for NVIDIA Blackwell consumer GPUs (sm_120/121).
// Dense (non-MoE) 4-bit-weight matmul run on BF16 tensor cores, the path that
// reaches the GB10 BF16 ceiling where MMQ (int8, Ampere-tuned) and cuBLAS (sm_80
// fallback) both plateau at ~22% of it. Returns true if it handled the op; false
// to fall back to MMQ. Gated behind GGML_CUDA_W4A16 until correct + faster.
bool ggml_cuda_w4a16_mul_mat(
ggml_backend_cuda_context & ctx,
const ggml_tensor * src0, // 4-bit weights (Q4_0/Q4_K)
const ggml_tensor * src1, // F32 activations
ggml_tensor * dst); // F32 output

@@ -1,129 +0,0 @@
// paged-bench: quantify the multi-tenant wins of paged KV allocation that are
// properties of the host-side block model (vLLM-parity), independent of the
// in-model compute path.
//
// Win 1 (capacity): on-demand block allocation vs contiguous per-seq
// reservation, under a fixed KV block budget.
// Win 3 (prefix sharing): automatic cross-tenant prefix dedup via block
// hashing.
//
// Win 2 (throughput) is intentionally NOT here: it requires the paged read
// path wired into llama-graph.cpp (Gate 0). Measuring it at this layer would
// be dishonest, so it is reported as pending.
#include "paged_kv_manager.h"
#include <cstdint>
#include <cstdio>
#include <vector>
#include <numeric>
using namespace paged;
// A deterministic LCG so sequence lengths vary without Math.random-style nondeterminism.
struct Lcg {
uint64_t s;
explicit Lcg(uint64_t seed) : s(seed) {}
uint32_t next() { s = s * 6364136223846793005ULL + 1442695040888963407ULL; return (uint32_t)(s >> 33); }
int range(int lo, int hi) { return lo + (int)(next() % (uint32_t)(hi - lo + 1)); }
};
static size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
int main() {
const int block_size = 16;
const int n_ctx = 2048; // max context a sequence could use
const int num_blocks = 512; // fixed KV budget: 512 blocks * 16 = 8192 cells
printf("paged-bench (block_size=%d, n_ctx=%d, budget=%d blocks = %d cells)\n\n",
block_size, n_ctx, num_blocks, num_blocks * block_size);
// ---------------------------------------------------------------------
// WIN 1: concurrency capacity. Sequences have realistic, VARYING lengths
// (most short, a few long) - the regime where reserving n_ctx per seq
// wastes the most. Count how many fit under the same block budget.
// ---------------------------------------------------------------------
{
Lcg rng(12345);
const int blocks_per_ctx = (int) cdiv(n_ctx, block_size); // contiguous reserves this per seq
// Contiguous (stream-style) reservation: every seq reserves n_ctx worth.
int contiguous_fit = num_blocks / blocks_per_ctx;
// Paged on-demand: draw real lengths until the pool is exhausted.
PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
int paged_fit = 0;
long total_tokens = 0;
for (int seq = 0; ; ++seq) {
// 80% short (8-128 tok), 20% long (up to n_ctx)
int len = (rng.range(0, 99) < 80) ? rng.range(8, 128) : rng.range(128, n_ctx);
if (!m.allocate(seq, (size_t) len)) break;
paged_fit++;
total_tokens += len;
}
printf("WIN 1 concurrency capacity @ %d-block budget\n", num_blocks);
printf(" contiguous (reserve n_ctx/seq): %d sequences\n", contiguous_fit);
printf(" paged (on-demand blocks): %d sequences (avg %ld tok/seq)\n",
paged_fit, paged_fit ? total_tokens / paged_fit : 0);
printf(" --> paged fits %.1fx more concurrent sequences\n\n",
contiguous_fit ? (double) paged_fit / contiguous_fit : 0.0);
}
// ---------------------------------------------------------------------
// WIN 3: cross-tenant prefix sharing. N tenants share a long system
// prompt / RAG context, then diverge. Compare physical blocks consumed
// with prefix caching on vs off.
// ---------------------------------------------------------------------
{
const int n_tenants = 32;
const int shared_len = 1024; // shared system prompt (64 blocks)
const int distinct_len = 64; // per-tenant suffix (4 blocks)
// Shared prefix token ids (identical across tenants -> identical block hashes).
std::vector<int> shared(shared_len);
for (int i = 0; i < shared_len; ++i) shared[i] = 1000 + i;
// --- prefix caching OFF: every tenant pays for the whole prefix ---
long blocks_off = 0;
{
PagedKVManager m(num_blocks * 8, block_size, /*enable_caching=*/false);
for (int t = 0; t < n_tenants; ++t) {
m.allocate(t, (size_t) (shared_len + distinct_len));
blocks_off += m.block_table(t).size();
}
}
// --- prefix caching ON: shared blocks are deduped to one physical copy ---
long blocks_on = 0;
{
PagedKVManager m(num_blocks * 8, block_size, /*enable_caching=*/true);
// tenant 0 fills + caches the shared prefix
auto h = m.compute_block_hashes(shared);
m.allocate(0, (size_t) (shared_len + distinct_len));
m.cache_blocks(0, h, (size_t) shared_len);
long physical = m.block_table(0).size();
// tenants 1..N-1 hit the cached prefix; only their distinct suffix is new
for (int t = 1; t < n_tenants; ++t) {
size_t cached_tokens = m.get_computed_blocks(h); // shared blocks reused
size_t new_tokens = (shared_len - cached_tokens) + distinct_len;
m.allocate(t, (size_t) (shared_len + distinct_len));
// physically new blocks = only what wasn't already resident
physical += (long) cdiv(new_tokens, block_size);
}
blocks_on = physical;
}
printf("WIN 3 cross-tenant prefix sharing (%d tenants, %d-tok shared prefix)\n",
n_tenants, shared_len);
printf(" prefix-cache OFF: %ld physical blocks\n", blocks_off);
printf(" prefix-cache ON: %ld physical blocks\n", blocks_on);
printf(" --> %.1fx less KV memory for the shared workload\n\n",
blocks_on ? (double) blocks_off / blocks_on : 0.0);
}
printf("WIN 2 aggregate throughput under load: PENDING\n");
printf(" Requires the paged gather-read path wired into llama-graph.cpp\n");
printf(" (Gate 0) to measure tok/s vs concurrency. Not measurable at the\n");
printf(" allocation layer; not reported here to avoid overclaiming.\n");
return 0;
}

@@ -1,169 +0,0 @@
// paged-loadgen: a dynamic-load benchmark for paged KV that actually exercises the
// regime where paging wins - variable prompt lengths, variable generation lengths,
// staggered (continuous) arrival, and a shared system prefix. The stock
// examples/paged/paged.cpp adds all requests up front with a fixed n_predict from a
// 20-prompt pool, so it never creates KV-memory pressure or fragmentation and
// therefore never shows a paged advantage (see PAGED_KV_HIGH_CONCURRENCY.md).
//
// Build: drop into PR #22569's examples/paged/ and add to its CMakeLists.txt next to
// llama-paged (it uses the same llama_paged_scheduler_* API). Run on the TARGET GPU
// (e.g. 2xH200) where bandwidth lets decode scale to thousands of sequences and KV
// memory becomes the binding constraint - that is where paged KV pays off and where
// this harness produces a meaningful number. On a low-bandwidth box (GB10) throughput
// plateaus long before memory binds, so the win is not observable there regardless.
//
// Metrics reported:
// - goodput (decode tokens/s aggregate) under the dynamic load
// - peak concurrent in-flight sequences actually sustained
// - paged peak KV bytes used vs the contiguous reservation a unified cache needs
// (n_seq_peak * max_ctx), i.e. the capacity ratio = the headroom paging unlocks
//
// The capacity ratio is the load-bearing number for the buy decision: it is how many
// more concurrent tenants a fixed HBM budget serves with paging than without.
#include "common.h"
#include "llama.h"
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <random>
#include <string>
#include <vector>
// ---- workload knobs (env-overridable so the harness is sweepable without rebuilds) ----
static int env_int(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }
struct workload_cfg {
int total_requests = env_int("LG_TOTAL", 2000); // total requests to serve
int target_inflight = env_int("LG_INFLIGHT", 256); // continuous-batching concurrency target
int prefix_tokens = env_int("LG_PREFIX", 512); // shared system-prompt prefix (prefix-cache target)
int suffix_min = env_int("LG_SUFMIN", 16); // per-request unique prompt suffix range
int suffix_max = env_int("LG_SUFMAX", 768);
int gen_short = env_int("LG_GENSHORT", 32); // bimodal generation: most short...
int gen_long = env_int("LG_GENLONG", 1024); // ...some long (the over-reservation driver)
int gen_long_pct = env_int("LG_LONGPCT", 15); // % of requests that are long
int block_size = env_int("LG_BLOCK", 16); // must match -kvbls
unsigned seed = (unsigned) env_int("LG_SEED", 1234);
};
// Per-request plan drawn from the workload distribution.
struct req_plan { int prompt_len; int gen_len; };
int main(int argc, char ** argv) {
common_params params;
params.n_predict = -1; // per-request, controlled by the plan below
if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_PAGED)) {
fprintf(stderr, "usage: %s -m <model> -kvp --fit off -ngpub N -ncpub M -ngl 99\n", argv[0]);
return 1;
}
params.kv_paged = true;
common_init_result init = common_init_from_params(params);
llama_model * model = init.model.get();
llama_context * ctx = init.context.get();
if (!model || !ctx) { fprintf(stderr, "load failed\n"); return 1; }
const llama_vocab * vocab = llama_model_get_vocab(model);
workload_cfg cfg;
std::mt19937 rng(cfg.seed);
std::uniform_int_distribution<int> suf(cfg.suffix_min, cfg.suffix_max);
std::uniform_int_distribution<int> pct(1, 100);
// KV bytes/token = 2(K,V) * n_layers * n_head_kv * head_dim * sizeof(f16). Confirmed
// against llama-kv-cache-paged.cpp (block_bytes formula). Used for the capacity ratio.
const int n_layers = llama_model_n_layer(model);
const int n_head_kv = llama_model_n_head_kv(model);
const int head_dim = llama_model_n_embd(model) / llama_model_n_head(model);
const size_t kv_bytes_per_token = (size_t)2 * n_layers * n_head_kv * head_dim * sizeof(uint16_t);
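// A long shared system prefix that every request reuses (the prefix-cache target).
// NOTE: the string is cfg.prefix_tokens characters long; how many tokens it becomes
// depends on the tokenizer, so prompt_len in the plans below is a nominal length.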
std::vector<llama_token> prefix = common_tokenize(ctx, std::string(cfg.prefix_tokens, 'x'), true);
// Pre-draw all request plans so paged peak usage and the contiguous reservation are
// computed from the SAME workload.
std::vector<req_plan> plans(cfg.total_requests);
int max_ctx = 0;
for (auto & p : plans) {
p.prompt_len = cfg.prefix_tokens + suf(rng);
p.gen_len = (pct(rng) <= cfg.gen_long_pct) ? cfg.gen_long : cfg.gen_short;
max_ctx = std::max(max_ctx, p.prompt_len + p.gen_len);
}
llama_paged_scheduler * sched = llama_paged_scheduler_init(ctx);
if (!sched) { fprintf(stderr, "scheduler init failed\n"); return 1; }
// ---- continuous-arrival loop: keep ~target_inflight requests live at all times ----
int next_req = 0, done = 0, inflight = 0, peak_inflight = 0;
long total_decoded = 0;
size_t peak_kv_bytes_paged = 0; // peak of ceil(live_used_tokens/block)*block*kv_bytes (aggregate rounding)
size_t live_used_tokens = 0; // running sum of actual KV tokens held by live seqs
auto admit = [&](int rid) {
const req_plan & p = plans[rid];
std::vector<llama_token> toks = prefix; // shared prefix...
std::vector<llama_token> suff = common_tokenize(ctx, std::string(p.prompt_len - cfg.prefix_tokens, 'y'), false);
toks.insert(toks.end(), suff.begin(), suff.end()); // ...+ unique suffix
if (llama_paged_scheduler_add_request(sched, toks.data(), toks.size(), rid)) {
inflight++; peak_inflight = std::max(peak_inflight, inflight);
live_used_tokens += p.prompt_len;
}
};
const int64_t t0 = ggml_time_us();
for (int i = 0; i < cfg.target_inflight && next_req < cfg.total_requests; ++i) admit(next_req++);
llama_batch batch = {};
std::vector<llama_token> sampled; std::vector<int8_t> stop_flags;
while (done < cfg.total_requests) {
if (!llama_paged_scheduler_prepare_batch(sched, &batch)) break;
const llama_paged_batch_info * info = llama_paged_scheduler_get_batch_info(sched);
sampled.assign(info->n_seq, 0); stop_flags.assign(info->n_seq, 0);
// (decode is done inside the scheduler/update path in PR #22569; greedy here)
for (int i = 0; i < info->n_seq; ++i) {
const int rid = info->seq_ids[i];
llama_paged_seq_state st{};
llama_paged_scheduler_get_seq_state(sched, rid, &st);
// greedy argmax from the i-th row of logits
const float * lg = llama_get_logits_ith(ctx, i);
int best = 0; float bv = lg[0];
for (int t = 1; t < llama_vocab_n_tokens(vocab); ++t) if (lg[t] > bv) { bv = lg[t]; best = t; }
sampled[i] = best;
const bool stop = llama_vocab_is_eog(vocab, best) || st.n_decoded + 1 >= plans[rid].gen_len;
stop_flags[i] = stop ? 1 : 0;
if (!stop) { total_decoded++; live_used_tokens++; }
if (stop) {
done++; inflight--;
live_used_tokens -= (plans[rid].prompt_len + st.n_decoded);
if (next_req < cfg.total_requests) admit(next_req++); // continuous arrival
}
}
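// paged peak KV: the allocator hands out ceil(used/block) blocks per live seq. This
// approximates the footprint by rounding the aggregate live_used_tokens up to a block
// boundary, which understates per-sequence rounding by at most (block_size - 1)
// tokens per live sequence.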
const size_t paged_now = (size_t)std::ceil((double)live_used_tokens / cfg.block_size)
* cfg.block_size * kv_bytes_per_token;
peak_kv_bytes_paged = std::max(peak_kv_bytes_paged, paged_now);
llama_paged_scheduler_update(sched, &batch, sampled.data(), stop_flags.data());
}
const double secs = (ggml_time_us() - t0) / 1e6;
// Contiguous unified-KV reservation needed to serve the SAME peak concurrency without
// mid-generation eviction: every live slot must be backed for the worst-case context.
const size_t contig_reserve = (size_t)peak_inflight * max_ctx * kv_bytes_per_token;
printf("\n==== paged-loadgen ====\n");
printf("requests served : %d (target inflight %d, peak inflight %d)\n", done, cfg.target_inflight, peak_inflight);
printf("goodput (decode) : %.1f tok/s (%ld tokens / %.2f s)\n", total_decoded / secs, total_decoded, secs);
printf("kv bytes / token : %zu (n_layer=%d n_head_kv=%d head_dim=%d f16)\n", kv_bytes_per_token, n_layers, n_head_kv, head_dim);
printf("paged peak KV : %.2f GiB (allocated on demand)\n", peak_kv_bytes_paged / 1073741824.0);
printf("contiguous reserve : %.2f GiB (peak_inflight * max_ctx %d)\n", contig_reserve / 1073741824.0, max_ctx);
printf("CAPACITY RATIO : %.2fx <- tenants-per-HBM paging unlocks\n",
peak_kv_bytes_paged ? (double)contig_reserve / peak_kv_bytes_paged : 0.0);
printf(" (plus cross-request prefix sharing of the %d-token shared prefix, not counted above)\n", cfg.prefix_tokens);
llama_paged_scheduler_free(sched);
return 0;
}
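
The capacity arithmetic the harness prints can be checked by hand. The sketch below is not part of the harness: the helper names are hypothetical, and the model shape (32 layers, 8 KV heads, head_dim 128, f16) is an illustrative assumption. It rounds each sequence up to a block boundary on its own, which is what a per-sequence block allocator would do, and compares that with the contiguous worst-case reservation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes of KV cache one token occupies: K and V, per layer, per KV head.
// elem_size is 2 for f16, matching the formula used in the harness.
static size_t kv_bytes_per_token(size_t n_layers, size_t n_head_kv, size_t head_dim, size_t elem_size) {
    return 2 * n_layers * n_head_kv * head_dim * elem_size;
}

// Tokens backed by blocks when every live sequence rounds up independently.
static size_t paged_tokens_per_seq(const std::vector<size_t> & used, size_t block_size) {
    assert(block_size > 0);
    size_t total = 0;
    for (size_t u : used) {
        total += (u + block_size - 1) / block_size * block_size;
    }
    return total;
}

// Tokens a contiguous cache must reserve: every live slot sized for the worst case.
static size_t contiguous_tokens(size_t n_live, size_t max_ctx) {
    return n_live * max_ctx;
}
```

With three live sequences holding 17, 1 and 16 tokens and a block size of 16, per-sequence rounding backs 64 tokens, while rounding the aggregate (34 tokens) would report only 48. Against a worst-case context of 2304 tokens (512 prefix + 768 suffix + 1024 generated, the harness defaults), the contiguous reservation is 6912 tokens, a ratio of 108x for this toy case.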


@@ -1,296 +0,0 @@
#include "paged_kv_manager.h"
#include <cassert>
#include <stdexcept>
namespace paged {
// ---------------------------------------------------------------------------
// FreeBlockQueue (port of kv_cache_utils.py FreeKVCacheBlockQueue)
// ---------------------------------------------------------------------------
FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
num_free_blocks = blocks.size();
for (size_t i = 0; i < blocks.size(); ++i) {
if (i > 0) blocks[i]->prev_free = blocks[i - 1];
if (i + 1 < blocks.size()) blocks[i]->next_free = blocks[i + 1];
}
if (!blocks.empty()) {
fake_head.next_free = blocks.front();
blocks.front()->prev_free = &fake_head;
fake_tail.prev_free = blocks.back();
blocks.back()->next_free = &fake_tail;
} else {
fake_head.next_free = &fake_tail;
fake_tail.prev_free = &fake_head;
}
}
KVCacheBlock* FreeBlockQueue::popleft() {
KVCacheBlock* first = fake_head.next_free;
if (first == &fake_tail || first == nullptr) {
assert(num_free_blocks == 0);
throw std::runtime_error("No free blocks available");
}
fake_head.next_free = first->next_free;
first->next_free->prev_free = &fake_head;
first->prev_free = first->next_free = nullptr;
num_free_blocks--;
return first;
}
std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
std::vector<KVCacheBlock*> ret;
if (n == 0) return ret;
assert(num_free_blocks >= n);
num_free_blocks -= n;
KVCacheBlock* curr = fake_head.next_free;
ret.reserve(n);
for (size_t i = 0; i < n; ++i) {
assert(curr != nullptr);
ret.push_back(curr);
KVCacheBlock* last = curr;
curr = curr->next_free;
last->prev_free = last->next_free = nullptr;
}
if (curr != nullptr) {
fake_head.next_free = curr;
curr->prev_free = &fake_head;
}
return ret;
}
void FreeBlockQueue::remove(KVCacheBlock* block) {
if (!block->prev_free || !block->next_free)
throw std::runtime_error("remove() called on an invalid block");
block->prev_free->next_free = block->next_free;
block->next_free->prev_free = block->prev_free;
block->prev_free = block->next_free = nullptr;
num_free_blocks--;
}
void FreeBlockQueue::append(KVCacheBlock* block) {
KVCacheBlock* last = fake_tail.prev_free;
last->next_free = block;
block->prev_free = last;
block->next_free = &fake_tail;
fake_tail.prev_free = block;
num_free_blocks++;
}
void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
if (blocks.empty()) return;
KVCacheBlock* last = fake_tail.prev_free;
for (KVCacheBlock* b : blocks) {
b->prev_free = last;
last->next_free = b;
last = b;
}
last->next_free = &fake_tail;
fake_tail.prev_free = last;
num_free_blocks += blocks.size();
}
void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
if (blocks.empty()) return;
KVCacheBlock* first = fake_head.next_free;
KVCacheBlock* prev = &fake_head;
for (KVCacheBlock* b : blocks) {
b->prev_free = prev;
prev->next_free = b;
prev = b;
}
prev->next_free = first;
first->prev_free = prev;
num_free_blocks += blocks.size();
}
std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
std::vector<KVCacheBlock*> ret;
const KVCacheBlock* curr = fake_head.next_free;
while (curr && curr->next_free != nullptr) {
ret.push_back(const_cast<KVCacheBlock*>(curr));
curr = curr->next_free;
}
return ret;
}
// ---------------------------------------------------------------------------
// BlockPool (port of block_pool.py)
// ---------------------------------------------------------------------------
static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
std::vector<KVCacheBlock*> p;
p.reserve(v.size());
for (auto& b : v) p.push_back(&b);
return p;
}
static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
std::vector<KVCacheBlock> v;
v.reserve(num_blocks);
for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
return v;
}
BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
: enable_caching_(enable_caching),
blocks_(make_block_vec(num_blocks)),
ptrs_(make_ptrs(blocks_)),
free_queue_(ptrs_) {
// vLLM reserves block_id 0 as the null block (never cached).
null_block = free_queue_.popleft();
null_block->is_null = true;
}
bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
if (!block->has_hash) return false;
auto it = cached_block_hash_to_block_.find(block->block_hash);
if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
cached_block_hash_to_block_.erase(it);
block->reset_hash();
return true;
}
std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
if (n > get_num_free_blocks())
throw std::runtime_error("Cannot get free blocks from pool");
auto ret = free_queue_.popleft_n(n);
for (KVCacheBlock* b : ret) {
if (enable_caching_) maybe_evict_cached_block(b);
assert(b->ref_cnt == 0);
b->ref_cnt += 1;
}
return ret;
}
KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
auto it = cached_block_hash_to_block_.find(block_hash);
return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
}
void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
for (KVCacheBlock* b : blocks) {
// ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
b->ref_cnt += 1;
}
}
void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
std::vector<KVCacheBlock*> without_hash, with_hash;
for (KVCacheBlock* b : ordered_blocks) {
if (b->is_null) continue;
b->ref_cnt -= 1;
if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
}
free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
free_queue_.append_n(with_hash); // hashed: kept warm (tail)
}
void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
size_t num_cached_blocks, size_t num_full_blocks,
const std::vector<uint64_t>& block_hashes) {
for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
KVCacheBlock* blk = req_blocks[i];
if (blk->has_hash) continue;
blk->has_hash = true;
blk->block_hash = block_hashes[i];
cached_block_hash_to_block_[blk->block_hash] = blk;
}
}
// ---------------------------------------------------------------------------
// PagedKVManager (port of SingleTypeKVCacheManager / FullAttentionManager)
// ---------------------------------------------------------------------------
static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
: block_size_(block_size), pool_(num_blocks, enable_caching) {}
bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
auto& req = req_to_blocks_[seq_id];
size_t need = cdiv(total_tokens, block_size_);
if (need <= req.size()) return true;
size_t add = need - req.size();
if (add > pool_.get_num_free_blocks()) return false; // OOM
auto nb = pool_.get_new_blocks(add);
req.insert(req.end(), nb.begin(), nb.end());
return true;
}
std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
std::vector<int32_t> bt;
auto it = req_to_blocks_.find(seq_id);
if (it == req_to_blocks_.end()) return bt;
bt.reserve(it->second.size());
for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
return bt;
}
int64_t PagedKVManager::slot(int seq_id, int pos) const {
const auto& req = req_to_blocks_.at(seq_id);
int32_t phys = req[pos / block_size_]->block_id;
return (int64_t)phys * block_size_ + (pos % block_size_);
}
std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
std::vector<int64_t> sm;
sm.reserve(positions.size());
for (int p : positions) sm.push_back(slot(seq_id, p));
return sm;
}
void PagedKVManager::free(int seq_id) {
auto it = req_to_blocks_.find(seq_id);
if (it == req_to_blocks_.end()) return;
// Free in reverse so the tail of the block chain is evicted first (vLLM order).
std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
pool_.free_blocks(ordered);
req_to_blocks_.erase(it);
}
// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
// hash into the seed so each block hash transitively encodes its whole prefix
// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
uint64_t h = 1469598103934665603ull ^ parent_hash;
for (int t : token_ids) {
h ^= (uint64_t)(uint32_t)t;
h *= 1099511628211ull;
}
if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
return h;
}
std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
std::vector<uint64_t> hashes;
uint64_t parent = 0; // NONE_HASH analogue
size_t n_full = token_ids.size() / block_size_;
for (size_t i = 0; i < n_full; ++i) {
std::vector<int> blk(token_ids.begin() + i * block_size_,
token_ids.begin() + (i + 1) * block_size_);
parent = hash_block(parent, blk);
hashes.push_back(parent);
}
return hashes;
}
size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
std::vector<KVCacheBlock*> hits;
for (uint64_t bh : block_hashes) { // stop at first miss (prefix property)
KVCacheBlock* cb = pool_.get_cached_block(bh);
if (!cb) break;
hits.push_back(cb);
}
pool_.touch(hits); // ++ref_cnt, pull from free list
return hits.size() * (size_t)block_size_;
}
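void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
auto it = req_to_blocks_.find(seq_id);
if (it == req_to_blocks_.end()) return; // unknown sequence: nothing to cache, and do not create an entry
// Clamp to what is actually allocated and hashed so cache_full_blocks never indexes out of range.
size_t n_full = num_tokens / block_size_;
if (n_full > it->second.size()) n_full = it->second.size();
if (n_full > block_hashes.size()) n_full = block_hashes.size();
pool_.cache_full_blocks(it->second, /*num_cached=*/0, n_full, block_hashes);
}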
} // namespace paged
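
The slot computation in `PagedKVManager::slot` is easiest to see with concrete numbers. The sketch below is a standalone restatement, assuming a hypothetical free function `slot_for` that takes the block table directly instead of looking it up by `seq_id`; it adds bounds checks that the prototype does not perform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Flat pool row for logical position `pos` of a sequence whose block table is `bt`.
// Same arithmetic as slot(): physical block id * block_size + offset inside the block.
static int64_t slot_for(const std::vector<int32_t> & bt, int block_size, int pos) {
    assert(block_size > 0 && pos >= 0);
    const size_t logical_block = (size_t) pos / (size_t) block_size;
    assert(logical_block < bt.size());
    return (int64_t) bt[logical_block] * block_size + pos % block_size;
}
```

For the block table `[2, 1, 5]` with 16 tokens per block, positions 0..15 land in rows 32..47, positions 16..31 in rows 16..31, and positions 32..47 in rows 80..95: logically consecutive tokens, physically out of order.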


@@ -1,108 +0,0 @@
#pragma once
// Paged KV cache block manager for llama.cpp (CPU-first prototype).
//
// Host-side block management is a faithful port of vLLM V1:
// vllm/v1/core/kv_cache_utils.py (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
// vllm/v1/core/block_pool.py (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
// vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
//
// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
// dependency so it can be unit-tested in isolation.
#include <cstdint>
#include <vector>
#include <unordered_map>
#include <map>
namespace paged {
// vLLM KVCacheBlock (kv_cache_utils.py).
struct KVCacheBlock {
int32_t block_id = 0;
int ref_cnt = 0;
bool has_hash = false; // vLLM: _block_hash is set only when full+cached
uint64_t block_hash = 0;
bool is_null = false;
KVCacheBlock* prev_free = nullptr;
KVCacheBlock* next_free = nullptr;
explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
void reset_hash() { has_hash = false; block_hash = 0; }
};
// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
// O(1) middle removal is required so touch() can pull a warm cached block out of the
// free list when a later request hits its prefix.
class FreeBlockQueue {
public:
size_t num_free_blocks = 0;
explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
KVCacheBlock* popleft();
std::vector<KVCacheBlock*> popleft_n(size_t n);
void remove(KVCacheBlock* block);
void append(KVCacheBlock* block);
void append_n(const std::vector<KVCacheBlock*>& blocks);
void prepend_n(const std::vector<KVCacheBlock*>& blocks);
std::vector<KVCacheBlock*> get_all_free_blocks() const;
private:
KVCacheBlock fake_head{-1};
KVCacheBlock fake_tail{-1};
};
// vLLM BlockPool (block_pool.py).
class BlockPool {
public:
KVCacheBlock* null_block = nullptr;
BlockPool(int32_t num_blocks, bool enable_caching);
std::vector<KVCacheBlock*> get_new_blocks(size_t n);
KVCacheBlock* get_cached_block(uint64_t block_hash);
void touch(const std::vector<KVCacheBlock*>& blocks);
void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
size_t num_cached_blocks, size_t num_full_blocks,
const std::vector<uint64_t>& block_hashes);
size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
private:
bool maybe_evict_cached_block(KVCacheBlock* block);
bool enable_caching_;
std::vector<KVCacheBlock> blocks_; // owns all block descriptors
std::vector<KVCacheBlock*> ptrs_;
FreeBlockQueue free_queue_;
// vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
// prototype keeps the last writer (single KV-cache group is sufficient for the wins).
std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
};
// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
class PagedKVManager {
public:
PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
// Grow seq_id to cover total_tokens slots. Returns false on OOM (fewer free blocks
// than the sequence still needs); no blocks are taken from the pool in that case.
bool allocate(int seq_id, size_t total_tokens);
std::vector<int32_t> block_table(int seq_id) const;
int64_t slot(int seq_id, int pos) const;
std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
void free(int seq_id);
int block_size() const { return block_size_; }
// Prefix caching (win 3).
static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
protected:
int block_size_;
BlockPool pool_;
std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
};
} // namespace paged
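
The prefix property that `get_computed_blocks` relies on (stop at the first miss) follows from the chained hash. The sketch below is a standalone illustration, assuming hypothetical helpers `chain_hash` and `block_hashes` that reuse the FNV-1a constants and parent folding from `hash_block`. Two requests that share their first full block get the same first hash, and a single differing token in the second block changes that block's hash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same constants and chaining as hash_block: FNV-1a with the parent hash folded
// into the seed, and 0 remapped because 0 reads as "no hash".
static uint64_t chain_hash(uint64_t parent, const std::vector<int> & tokens) {
    uint64_t h = 1469598103934665603ull ^ parent;
    for (int t : tokens) {
        h ^= (uint64_t)(uint32_t) t;
        h *= 1099511628211ull;
    }
    return h ? h : 0x9e3779b97f4a7c15ull;
}

// One hash per FULL block; a trailing partial block is not hashed.
static std::vector<uint64_t> block_hashes(const std::vector<int> & tokens, size_t block_size) {
    assert(block_size > 0);
    std::vector<uint64_t> out;
    uint64_t parent = 0;
    for (size_t i = 0; i + block_size <= tokens.size(); i += block_size) {
        std::vector<int> blk(tokens.begin() + i, tokens.begin() + i + block_size);
        parent = chain_hash(parent, blk);
        out.push_back(parent);
    }
    return out;
}
```

Because every later hash is seeded with its parent, a lookup that misses at block `i` cannot hit at block `i + 1` for the same request, which is why the scan can stop at the first miss.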


@@ -1,59 +0,0 @@
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index a49a055a6..d95102bbd 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -11,6 +11,8 @@
#include <cstring>
#include <limits>
#include <map>
+#include <numeric>
+#include <cstdlib>
#include <stdexcept>
static bool ggml_is_power_of_2(int n) {
@@ -931,6 +933,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
return { };
}
+ // [paged, experimental] Place this sequence's tokens at permuted,
+ // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
+ // This validates that attention is invariant to physical KV placement -
+ // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
+ // Single-sequence scope (uses get_used() as the logical base); falls back
+ // to the normal allocator if the permuted cells aren't available.
+ static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+ if (paged_mode) {
+ const uint32_t bs = 16; // block size (tokens/block)
+ const uint32_t nblk = cells.size() / bs; // blocks in this stream's pool
+ if (nblk >= 2) {
+ // stride coprime to nblk => block-index permutation is a bijection
+ uint32_t k = 1;
+ for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
+ if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
+ }
+ const uint32_t base = cells.get_used();
+ bool ok = true;
+ for (uint32_t i = 0; i < n_tokens; ++i) {
+ const uint32_t L = base + i;
+ const uint32_t b = L / bs;
+ const uint32_t off = L % bs;
+ if (b >= nblk) { ok = false; break; }
+ const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
+ if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
+ res.idxs[s].push_back(phys);
+ }
+ if (ok && res.idxs[s].size() == n_tokens) {
+ if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+ fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
+ for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+ fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+ }
+ continue; // paged placement succeeded for this sequence
+ }
+ res.idxs[s].clear(); // fall back to the normal allocator
+ }
+ }
+
uint32_t n_tested = 0;
// for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
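
The block permutation in the patch above depends on the stride being coprime to the block count. The sketch below is a standalone check, assuming hypothetical helpers `pick_stride` and `is_bijection`; the stride search mirrors the loop in the patch, with a hand-written gcd so it does not need C++17 `std::gcd`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

static uint32_t gcd_u32(uint32_t a, uint32_t b) {
    while (b != 0) {
        const uint32_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// First odd stride >= nblk/2 that is coprime to nblk; 1 (identity) if none is found.
static uint32_t pick_stride(uint32_t nblk) {
    for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
        if (gcd_u32(cand, nblk) == 1u) return cand;
    }
    return 1;
}

// True when b -> (b * k) % nblk visits every block index exactly once.
static bool is_bijection(uint32_t k, uint32_t nblk) {
    assert(nblk > 0);
    std::vector<bool> seen(nblk, false);
    for (uint32_t b = 0; b < nblk; ++b) {
        const uint32_t phys = (b * k) % nblk;
        if (seen[phys]) return false;
        seen[phys] = true;
    }
    return true;
}
```

For 8 blocks the search picks stride 5, giving the order 0, 5, 2, 7, 4, 1, 6, 3. A stride that shares a factor with the block count, such as 2, revisits blocks and is rejected. Note that `b * k` is computed in 32 bits here, as in the patch, so the bijection argument only holds while the product does not wrap.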


@@ -1,12 +0,0 @@
diff --git a/tests/test-paged-kv-e2e.cpp b/tests/test-paged-kv-e2e.cpp
index 5a352e3..06ead50 100644
--- a/tests/test-paged-kv-e2e.cpp
+++ b/tests/test-paged-kv-e2e.cpp
@@ -115,6 +115,7 @@ static path_result run_paged(const std::string & model_path) {
params.sampling.temp = 0.0f; // greedy
params.warmup = false;
params.kv_paged = true;
+ params.fit_params = false; // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
params.n_gpu_blocks = 64;
params.n_cpu_blocks = 16;
params.n_sequences = 1;


@@ -1,42 +0,0 @@
#include "../paged_kv_manager.h"
#undef NDEBUG // the test is assert-driven; keep assert() active in Release builds
#include <cassert>
#include <cstdio>
using namespace paged;
int main() {
BlockPool pool(/*num_blocks=*/8, /*enable_caching=*/true);
// block 0 is reserved as null_block (vLLM pops one at init)
assert(pool.null_block != nullptr && pool.null_block->block_id == 0);
assert(pool.get_num_free_blocks() == 7);
// get_new_blocks sets ref_cnt=1 and removes from free list
auto b = pool.get_new_blocks(2);
assert(b.size() == 2 && b[0]->ref_cnt == 1 && b[1]->ref_cnt == 1);
assert(pool.get_num_free_blocks() == 5);
// cache two full blocks with chained hashes, then look them up
std::vector<uint64_t> hashes = {1111, 2222};
pool.cache_full_blocks(b, /*num_cached=*/0, /*num_full=*/2, hashes);
assert(b[0]->has_hash && b[0]->block_hash == 1111);
assert(pool.get_cached_block(1111) == b[0]);
assert(pool.get_cached_block(2222) == b[1]);
assert(pool.get_cached_block(9999) == nullptr);
// free: hashed blocks go to tail (kept warm), so they remain queryable.
pool.free_blocks(b);
assert(b[0]->ref_cnt == 0);
assert(pool.get_num_free_blocks() == 7);
assert(pool.get_cached_block(1111) == b[0]); // still cached/warm
// touch a warm cached block: pulls it out of free list, ++ref_cnt
pool.touch({b[0]});
assert(b[0]->ref_cnt == 1);
assert(pool.get_num_free_blocks() == 6);
// exhausting the pool then allocating evicts a warm cached hash
auto rest = pool.get_new_blocks(pool.get_num_free_blocks());
(void) rest;
assert(pool.get_cached_block(2222) == nullptr); // evicted on reuse
printf("test_block_pool: OK\n");
return 0;
}


@@ -1,44 +0,0 @@
#include "../paged_kv_manager.h"
#undef NDEBUG // the test is assert-driven; keep assert() active in Release builds
#include <cassert>
#include <cstdio>
#include <vector>
using namespace paged;
static std::vector<KVCacheBlock> make_blocks(int n) {
std::vector<KVCacheBlock> v;
v.reserve(n);
for (int i = 0; i < n; ++i) v.push_back(KVCacheBlock{i});
return v;
}
int main() {
// ordered 0..9 at init; popleft yields ascending block_ids
auto blocks = make_blocks(10);
std::vector<KVCacheBlock*> ptrs;
for (auto& b : blocks) ptrs.push_back(&b);
FreeBlockQueue q(ptrs);
assert(q.num_free_blocks == 10);
KVCacheBlock* b0 = q.popleft();
assert(b0->block_id == 0);
assert(q.num_free_blocks == 9);
auto two = q.popleft_n(2); // {1,2}
assert(two.size() == 2 && two[0]->block_id == 1 && two[1]->block_id == 2);
assert(q.num_free_blocks == 7);
// O(1) middle removal: remove block 5 (currently free), count drops
q.remove(ptrs[5]);
assert(q.num_free_blocks == 6); // free: 3,4,6,7,8,9
// append puts a block at the tail; it comes back out only after the rest
q.append(b0); // free order now: 3,4,6,7,8,9,0
assert(q.num_free_blocks == 7);
auto all = q.get_all_free_blocks();
assert(all.front()->block_id == 3);
assert(all.back()->block_id == 0);
printf("test_free_block_queue: OK\n");
return 0;
}


@@ -1,133 +0,0 @@
// Phase 2 (core numeric de-risk): attention over GATHERED paged KV must equal
// an independent host-computed reference.
//
// This answers the central risk in the design: feeding gather-to-scratch KV
// (a sequence whose blocks are non-contiguous in the shared pool) into ggml's
// standard attention ops (mul_mat -> soft_max_ext -> mul_mat) produces correct
// attention. If this holds, the paged read path is numerically sound; the
// remaining work is wiring it into llama-graph.cpp (Gate 0 in a real model).
#include "../paged_kv_manager.h"
#include "ggml.h"
#include "ggml-cpu.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
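#undef NDEBUG // allocate() and graph compute run inside assert(); with NDEBUG they would be compiled out
#include <cassert>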
#include <cstdio>
#include <cmath>
#include <vector>
using namespace paged;
int main() {
const int d = 8; // head dim
const int n_kv = 48; // 3 blocks worth of KV tokens
const int n_q = 4; // query tokens
const int block_size = 16;
const int num_blocks = 8;
const int total_slots = block_size * num_blocks;
const float scale = 1.0f / std::sqrt((float) d);
// Non-contiguous physical layout for the KV sequence (blocks [2,1,5]).
PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
assert(m.allocate(0, 2 * block_size));
assert(m.allocate(1, 2 * block_size));
m.free(0);
assert(m.allocate(2, n_kv));
std::vector<int> positions(n_kv);
for (int i = 0; i < n_kv; ++i) positions[i] = i;
auto slots64 = m.slot_mapping(2, positions);
std::vector<int32_t> slots32(slots64.begin(), slots64.end());
// Deterministic K, V, Q in logical [d, n] layout (column-major: col = token).
std::vector<float> K(d * n_kv), V(d * n_kv), Q(d * n_q);
for (int t = 0; t < n_kv; ++t)
for (int e = 0; e < d; ++e) {
K[t * d + e] = std::sin(0.1f * t + 0.3f * e);
V[t * d + e] = std::cos(0.2f * t - 0.1f * e);
}
for (int q = 0; q < n_q; ++q)
for (int e = 0; e < d; ++e) Q[q * d + e] = std::sin(0.05f * q + 0.7f * e);
// ---- Independent host reference attention -------------------------------
std::vector<float> ref(d * n_q, 0.0f);
for (int q = 0; q < n_q; ++q) {
std::vector<float> score(n_kv);
float mx = -1e30f;
for (int t = 0; t < n_kv; ++t) {
float dot = 0.0f;
for (int e = 0; e < d; ++e) dot += K[t * d + e] * Q[q * d + e];
score[t] = dot * scale;
mx = std::fmax(mx, score[t]);
}
float sum = 0.0f;
for (int t = 0; t < n_kv; ++t) { score[t] = std::exp(score[t] - mx); sum += score[t]; }
for (int t = 0; t < n_kv; ++t) {
float p = score[t] / sum;
for (int e = 0; e < d; ++e) ref[q * d + e] += p * V[t * d + e];
}
}
// ---- ggml paged path ----------------------------------------------------
ggml_backend_t backend = ggml_backend_cpu_init();
struct ggml_init_params dp = { ggml_tensor_overhead() * 16, NULL, true };
struct ggml_context * ctx_data = ggml_init(dp);
struct ggml_tensor * poolK = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, total_slots);
struct ggml_tensor * poolV = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, total_slots);
struct ggml_tensor * kSrc = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_kv);
struct ggml_tensor * vSrc = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_kv);
struct ggml_tensor * qT = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_q);
struct ggml_tensor * wIdx = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, n_kv);
struct ggml_tensor * gIdx = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I32, n_kv);
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_data, backend);
std::vector<float> zeros(d * total_slots, 0.0f);
ggml_backend_tensor_set(poolK, zeros.data(), 0, ggml_nbytes(poolK));
ggml_backend_tensor_set(poolV, zeros.data(), 0, ggml_nbytes(poolV));
ggml_backend_tensor_set(kSrc, K.data(), 0, ggml_nbytes(kSrc));
ggml_backend_tensor_set(vSrc, V.data(), 0, ggml_nbytes(vSrc));
ggml_backend_tensor_set(qT, Q.data(), 0, ggml_nbytes(qT));
ggml_backend_tensor_set(wIdx, slots64.data(), 0, ggml_nbytes(wIdx));
ggml_backend_tensor_set(gIdx, slots32.data(), 0, ggml_nbytes(gIdx));
struct ggml_init_params cp = { ggml_tensor_overhead() * 64 + ggml_graph_overhead(), NULL, true };
struct ggml_context * ctx = ggml_init(cp);
struct ggml_tensor * wroteK = ggml_set_rows(ctx, poolK, kSrc, wIdx);
struct ggml_tensor * wroteV = ggml_set_rows(ctx, poolV, vSrc, wIdx);
struct ggml_tensor * gK = ggml_get_rows(ctx, wroteK, gIdx); // [d, n_kv]
struct ggml_tensor * gV = ggml_get_rows(ctx, wroteV, gIdx); // [d, n_kv]
struct ggml_tensor * kq = ggml_mul_mat(ctx, gK, qT); // [n_kv, n_q]
struct ggml_tensor * probs = ggml_soft_max_ext(ctx, kq, NULL, scale, 0.0f);
struct ggml_tensor * vT = ggml_cont(ctx, ggml_transpose(ctx, gV)); // [n_kv, d]
struct ggml_tensor * out = ggml_mul_mat(ctx, vT, probs); // [d, n_q]
ggml_set_output(out);
struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, out);
ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
assert(ggml_gallocr_alloc_graph(galloc, gf));
assert(ggml_backend_graph_compute(backend, gf) == GGML_STATUS_SUCCESS);
std::vector<float> got(d * n_q);
ggml_backend_tensor_get(out, got.data(), 0, ggml_nbytes(out));
// ---- compare ------------------------------------------------------------
double max_err = 0.0;
for (int i = 0; i < d * n_q; ++i) max_err = std::fmax(max_err, std::fabs(got[i] - ref[i]));
printf("paged attention max abs err vs host reference: %.3e\n", max_err);
assert(max_err < 1e-4 && "paged-gathered attention must match host reference");
ggml_gallocr_free(galloc);
ggml_free(ctx);
ggml_free(ctx_data);
ggml_backend_buffer_free(buf);
ggml_backend_free(backend);
printf("test_ggml_paged_attn: OK (attention over non-contiguous paged KV matches reference)\n");
return 0;
}
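
The host reference above subtracts the row maximum before exponentiating. The sketch below isolates that step, assuming a hypothetical helper `stable_softmax`; it is the same max-subtraction technique, shown on its own so the reason for it is visible: without the shift, `exp` of a large score overflows to infinity and the division yields NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax with the maximum subtracted first, so the largest exponent is exp(0) = 1.
static std::vector<float> stable_softmax(const std::vector<float> & scores) {
    assert(!scores.empty());
    float mx = scores[0];
    for (float s : scores) mx = std::fmax(mx, s);
    std::vector<float> p(scores.size());
    float sum = 0.0f;
    for (size_t i = 0; i < scores.size(); ++i) {
        p[i] = std::exp(scores[i] - mx);
        sum += p[i];
    }
    for (float & v : p) v /= sum;
    return p;
}
```

Two equal scores of 1000 give probabilities of exactly 0.5 each, whereas a naive `exp(1000.0f)` would already be infinite in single precision.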


@@ -1,142 +0,0 @@
// Phase 1 integration test: prove the paged KV write+read MECHANISM at the
// ggml-op level, driven by PagedKVManager.
//
// write: ggml_set_rows(pool, k_src, slot_mapping) // scatter by slot
// read: ggml_get_rows(pool, gather_idx) // gather seq's slots
//
// The decisive property: a sequence's physical blocks are NON-CONTIGUOUS and
// OUT-OF-ORDER (forced via allocate/free/reallocate), yet gather(write(x)) == x,
// and a second sequence written into disjoint blocks does not contaminate it.
// This is exactly how a paged read path feeds contiguous scratch to attention.
#include "../paged_kv_manager.h"
#include "ggml.h"
#include "ggml-cpu.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include <cassert>
#include <cstdio>
#include <cmath>
#include <vector>
using namespace paged;
int main() {
const int n_embd = 8;
const int block_size = 16;
const int num_blocks = 8; // block 0 reserved as null
const int total_slots = block_size * num_blocks; // 128
// --- Force a non-contiguous, out-of-order block layout for seqC ----------
PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
assert(m.allocate(/*seqA=*/0, 2 * block_size)); // blocks {1,2}
assert(m.allocate(/*seqB=*/1, 2 * block_size)); // blocks {3,4}
m.free(0); // returns {1,2} to free list
assert(m.allocate(/*seqC=*/2, 3 * block_size)); // reuses freed blocks, reordered
auto btC = m.block_table(2);
auto btB = m.block_table(1);
printf("seqC block_table = [");
for (size_t i = 0; i < btC.size(); ++i) printf("%s%d", i ? "," : "", btC[i]);
printf("]\n");
assert(btC.size() == 3);
// sanity: seqC and seqB occupy disjoint physical blocks
for (int cb : btC) for (int bb : btB) assert(cb != bb);
const int n_tokens = 3 * block_size; // 48 tokens for seqC
// slot_mapping for seqC positions 0..n_tokens-1
std::vector<int> positions(n_tokens);
for (int i = 0; i < n_tokens; ++i) positions[i] = i;
std::vector<int64_t> slots64 = m.slot_mapping(2, positions); // I64 for set_rows
std::vector<int32_t> slots32(slots64.begin(), slots64.end()); // I32 for get_rows
// seqB occupies different blocks; write a sentinel there to prove isolation.
std::vector<int> posB(2 * block_size);
for (size_t i = 0; i < posB.size(); ++i) posB[i] = (int) i;
std::vector<int64_t> slotsB64 = m.slot_mapping(1, posB);
// --- ggml backend + persistent (statically allocated) tensors ------------
ggml_backend_t backend = ggml_backend_cpu_init();
assert(backend);
struct ggml_init_params dp = { /*mem_size=*/ ggml_tensor_overhead() * 16,
/*mem_buffer=*/ NULL, /*no_alloc=*/ true };
struct ggml_context * ctx_data = ggml_init(dp);
// The shared paged KV pool: one flat block pool, exactly like a paged layer.
struct ggml_tensor * pool = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, total_slots);
struct ggml_tensor * k_src = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, n_tokens);
struct ggml_tensor * w_idx = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, n_tokens);
struct ggml_tensor * g_idx = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I32, n_tokens);
struct ggml_tensor * kB_src = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, (int) posB.size());
struct ggml_tensor * wB_idx = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, (int) posB.size());
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_data, backend);
assert(buf);
// pool starts zeroed
std::vector<float> zeros(n_embd * total_slots, 0.0f);
ggml_backend_tensor_set(pool, zeros.data(), 0, ggml_nbytes(pool));
// token t carries the value (float) t in every embedding lane -> easy to verify
std::vector<float> ksrc(n_embd * n_tokens);
for (int t = 0; t < n_tokens; ++t)
for (int e = 0; e < n_embd; ++e) ksrc[t * n_embd + e] = (float) t;
ggml_backend_tensor_set(k_src, ksrc.data(), 0, ggml_nbytes(k_src));
ggml_backend_tensor_set(w_idx, slots64.data(), 0, ggml_nbytes(w_idx));
ggml_backend_tensor_set(g_idx, slots32.data(), 0, ggml_nbytes(g_idx));
// seqB sentinel = 999 everywhere
std::vector<float> kBsrc(n_embd * posB.size(), 999.0f);
ggml_backend_tensor_set(kB_src, kBsrc.data(), 0, ggml_nbytes(kB_src));
ggml_backend_tensor_set(wB_idx, slotsB64.data(), 0, ggml_nbytes(wB_idx));
// --- compute graph: write seqB, write seqC, then gather seqC -------------
struct ggml_init_params cp = { /*mem_size=*/ ggml_tensor_overhead() * 32 + ggml_graph_overhead(),
/*mem_buffer=*/ NULL, /*no_alloc=*/ true };
struct ggml_context * ctx = ggml_init(cp);
struct ggml_tensor * wroteB = ggml_set_rows(ctx, pool, kB_src, wB_idx); // view(pool)
struct ggml_tensor * wroteC = ggml_set_rows(ctx, wroteB, k_src, w_idx); // chain so order is fixed
struct ggml_tensor * gathered = ggml_get_rows(ctx, wroteC, g_idx);
ggml_set_output(gathered);
struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, gathered);
ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
assert(ggml_gallocr_alloc_graph(galloc, gf));
assert(ggml_backend_graph_compute(backend, gf) == GGML_STATUS_SUCCESS);
// --- verify gather(write(x)) == x for the non-contiguous sequence --------
std::vector<float> out(n_embd * n_tokens);
ggml_backend_tensor_get(gathered, out.data(), 0, ggml_nbytes(gathered));
int mism = 0;
for (int t = 0; t < n_tokens; ++t)
for (int e = 0; e < n_embd; ++e)
if (std::fabs(out[t * n_embd + e] - (float) t) > 1e-6f) mism++;
assert(mism == 0 && "gathered paged KV must equal source (round-trip)");
// --- verify isolation: read seqC slots directly from pool, unaffected by seqB
std::vector<float> pool_host(n_embd * total_slots);
ggml_backend_tensor_get(pool, pool_host.data(), 0, ggml_nbytes(pool));
for (int t = 0; t < n_tokens; ++t) {
int slot = (int) slots64[t];
for (int e = 0; e < n_embd; ++e)
assert(std::fabs(pool_host[slot * n_embd + e] - (float) t) < 1e-6f);
}
ggml_gallocr_free(galloc);
ggml_free(ctx);
ggml_free(ctx_data);
ggml_backend_buffer_free(buf);
ggml_backend_free(backend);
printf("test_ggml_paged_rw: OK (non-contiguous paged write/gather round-trip)\n");
return 0;
}


@@ -1,32 +0,0 @@
#include "../paged_kv_manager.h"
#include <cassert>
#include <cstdio>
using namespace paged;
int main() {
PagedKVManager m(/*num_blocks=*/8, /*block_size=*/16, /*enable_caching=*/false);
// 20 tokens -> ceil(20/16)=2 blocks
assert(m.allocate(/*seq=*/0, 20));
auto bt = m.block_table(0);
assert(bt.size() == 2);
// slot arithmetic: pos 0 -> block bt[0]*16 + 0 ; pos 17 -> bt[1]*16 + 1
assert(m.slot(0, 0) == (int64_t)bt[0] * 16 + 0);
assert(m.slot(0, 17) == (int64_t)bt[1] * 16 + 1);
auto sm = m.slot_mapping(0, {0, 16, 17});
assert(sm.size() == 3 && sm[1] == (int64_t)bt[1] * 16 + 0);
// growing the same seq reuses existing blocks, adds only new ones
assert(m.allocate(0, 40)); // ceil(40/16)=3 -> +1 block
assert(m.block_table(0).size() == 3);
// OOM: blocks left = 8 - 1(null) - 3 = 4 blocks; ask for 5 blocks
assert(m.allocate(1, 5 * 16) == false);
// free returns blocks to the pool for reuse
m.free(0);
assert(m.allocate(1, 5 * 16)); // now fits
printf("test_paged_kv_manager: OK\n");
return 0;
}


@@ -1,35 +0,0 @@
#include "../paged_kv_manager.h"
#include <cassert>
#include <cstdio>
#include <vector>
using namespace paged;
int main() {
PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*enable_caching=*/true);
// shared prefix of 32 tokens (2 full blocks) + distinct suffix
std::vector<int> shared(32);
for (int i = 0; i < 32; ++i) shared[i] = 100 + i;
// chained hashing is deterministic and prefix-sensitive
auto h = m.compute_block_hashes(shared);
assert(h.size() == 2);
auto h2 = m.compute_block_hashes(shared);
assert(h == h2); // deterministic
std::vector<int> other = shared; other[0] = 999;
assert(m.compute_block_hashes(other)[0] != h[0]); // sensitive to content
// seq 0: cold, no cache hit yet
assert(m.get_computed_blocks(h) == 0);
assert(m.allocate(0, 32));
m.cache_blocks(0, h, 32);
// seq 1: warm — the 2 shared blocks are a cache hit (32 tokens)
assert(m.get_computed_blocks(h) == 32);
// first-miss stop: a chain that diverges after block 1 hits only 1 block
auto hmix = h; hmix[1] = 0xDEADBEEF;
assert(m.get_computed_blocks(hmix) == 16);
printf("test_prefix_cache: OK\n");
return 0;
}


@@ -1,106 +0,0 @@
# Paged-attention / parity benchmarks (GB10 / DGX Spark)
Goal of the series: vLLM parity. This records the measured gap so the parity claim is data-backed, not asserted.
**Setup:** GB10 (sm_121, 119 GiB unified). Model Qwen3-Coder-30B-A3B. llama.cpp = pinned base + this series
(MXFP4_MOE, `-fa 1 -b 2048 -ub 2048`, `llama-batched-bench`, PP=512 TG=128). vLLM = 0.23.0 FP8 (recorded
prior run, same box/model). S_PP / S_TG are aggregate prefill / decode tok/s across B streams.
## Fresh llama.cpp (this series, MXFP4) vs vLLM (FP8)
| B | llama S_PP | vLLM S_PP | PP gap | llama S_TG | vLLM S_TG | TG gap |
|---|-----------|-----------|--------|-----------|-----------|--------|
| 1 | 1565 | 9644 | 6.2× | **83** | 48 | **llama wins** |
| 8 | 3648 | 33373 | 9.1× | 126 | 312 | 2.5× |
| 32 | 2074 | 99398 | 48× | 319 | 1171 | 3.7× |
| 64 | 3643 | 151990 | 42× | 771 | 2064 | 2.7× |
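
The gap columns are just the vLLM figure divided by the llama.cpp figure for the same row. A minimal sketch that recomputes them from the table values follows; the `Row` struct and the helper names are illustrative and not part of the benchmark tooling.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// One row of the table above: aggregate tok/s for each engine at batch size B.
struct Row {
    int    batch;     // B, number of concurrent streams
    double llama_pp;  // llama.cpp S_PP
    double vllm_pp;   // vLLM S_PP
    double llama_tg;  // llama.cpp S_TG
    double vllm_tg;   // vLLM S_TG
};

// Values copied from the table above.
inline std::vector<Row> table_rows() {
    return {
        { 1, 1565.0,   9644.0,  83.0,   48.0},
        { 8, 3648.0,  33373.0, 126.0,  312.0},
        {32, 2074.0,  99398.0, 319.0, 1171.0},
        {64, 3643.0, 151990.0, 771.0, 2064.0},
    };
}

// Gap = how many times faster vLLM is; a value below 1 means llama.cpp is ahead.
inline double gap(double llama, double vllm) {
    assert(llama > 0.0);
    return vllm / llama;
}

// Print both gap columns for every row.
inline void print_gaps() {
    for (const Row & r : table_rows()) {
        std::printf("B=%-2d  PP gap %.1fx  TG gap %.1fx\n",
                    r.batch, gap(r.llama_pp, r.vllm_pp), gap(r.llama_tg, r.vllm_tg));
    }
}
```

Calling `print_gaps()` from a `main` reproduces the rounded gaps in the table. The B=1 decode row comes out below 1, which is the "llama wins" cell.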
## Verdict — two distinct gaps, only one is the engine's
1. **Prefill (S_PP): 6-48× behind, and it does NOT scale with B** (plateaus ~3.6k). This is the **FP4 MoE
GEMM kernel** (`mul_mat_q<MXFP4>` ~22 TFLOP/s), confirmed earlier. **Paged attention cannot close this**:
it's per-token compute. Needs the tcgen05/CUTLASS grouped-GEMM (Lever 3, multi-week, no upstream base).
2. **Decode at concurrency (S_TG): 2.5-3.7× behind for B≥8** (we *win* at B=1). This gap IS partly the
engine's domain — vLLM's block-paged KV + continuous batching pack more concurrent decode work per step.
**This is what patches 0003-0006 target.** The win here is realistic; the prefill win is not (kernel).
## CORRECTION — decode-phase profile (B=64, decode-dominated nsys)
The "decode gap is engine-addressable" read above was **wrong**. Profiling a decode-dominated B=64 run:
| kernel | % GPU time |
|---|---|
| `mul_mat_q<MXFP4>` (MoE GEMM) | **54.6** |
| `flash_attn_ext` (attention) | 19.8 |
| `mul_mat_q<Q8>` (dense) | 10.9 |
| KV writes / quant / norms / rest | ~15 |
**Decode at concurrency is ALSO dominated by the FP4 MoE GEMM (54.6%)** — the same Lever-3 kernel as prefill.
Attention (the only thing paging optimizes) is ~20%, and the gather-read reclaims only the *masked-cell*
fraction of that. So **the paged series (0003-0006) cannot close the vLLM gap in either phase**: both are
MoE-kernel-bound. vLLM's concurrency advantage is its MoE/attention *kernels*, not (mainly) its KV management.
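
The profile puts a rough ceiling on what any attention-only optimization can buy. A small sketch using Amdahl's law follows, taking the 19.8% attention share from the table above as the optimizable fraction; treating that share as representative of the whole run is an assumption, and the function names are illustrative.

```cpp
#include <cassert>

// Amdahl's law: if a fraction f of total time is sped up by a factor s,
// the overall speedup is 1 / ((1 - f) + f / s).
inline double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

// Upper bound as s grows without limit, i.e. the optimized part becomes free.
inline double amdahl_limit(double f) {
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

With `f = 0.198`, `amdahl_limit` gives about 1.25×. So even if attention were free, decode would stay well short of the 2.7× gap measured at B=64, and the gather-read only recovers the masked-cell part of that fraction.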
### What the paged series IS still good for (just not throughput parity)
- **Capacity**: block-granular + on-demand allocation → fit more/longer concurrent sequences in fixed VRAM.
- **Prefix sharing**: cross-request block dedup → lower TTFT + memory on shared system prompts / RAG.
These are real wins on *memory-pressured* and *shared-prefix* workloads — but they are not tok/s parity, and
batched-bench (fresh, non-fragmented, no shared prefix) won't show them.
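
The capacity point can be made concrete with the same ceiling-division arithmetic the block manager uses. The sketch below compares the KV cells held by an up-front, full-window reservation per sequence with block-granular allocation. The helper names and the example lengths are illustrative assumptions, not measurements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ceiling division: number of fixed-size blocks needed to hold `tokens` tokens.
inline std::size_t blocks_needed(std::size_t tokens, std::size_t block_size) {
    assert(block_size > 0);
    return (tokens + block_size - 1) / block_size;
}

// KV cells reserved when every sequence gets a full max-context window up front.
inline std::size_t cells_contiguous(std::size_t n_seqs, std::size_t max_ctx) {
    return n_seqs * max_ctx;
}

// KV cells reserved when each sequence holds only the blocks covering its current length.
inline std::size_t cells_paged(const std::vector<std::size_t> & seq_lens, std::size_t block_size) {
    std::size_t blocks = 0;
    for (std::size_t len : seq_lens) blocks += blocks_needed(len, block_size);
    return blocks * block_size;
}
```

For three sequences of 20, 40 and 100 tokens with 16-token blocks, `cells_paged` reserves (2 + 3 + 7) × 16 = 192 cells, while `cells_contiguous(3, 4096)` reserves 12288. That difference is the headroom that lets more or longer sequences fit. It says nothing about tok/s.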
## DENSE model parity (Qwen3-32B) — does the kernel gap exist for dense too? YES.
The MoE work above is about the grouped MoE GEMM. Dense models use a different (non-grouped) matmul path,
so we benchmarked a dense 32B head-to-head.
**Headline comparison — vLLM NVFP4 W4A16 vs llama.cpp Q4_K_M.** This is the *correct apples-to-apples on
DGX Spark*: both are **4-bit weights / 16-bit activations** (same quant class). vLLM = `Qwen3-32B-NVFP4A16`
(FlashInfer Marlin W4A16 kernel); llama.cpp = `Qwen3-32B-Q4_K_M` (int8-MMQ compute). The only difference is
the compute kernel — which is exactly what we're measuring. (Full **W4A4** NVFP4 does not run on GB10 today;
root cause below — and it would *not* be a fair comparison even if it did, since Q4_K_M is also weight-only-4-bit.)
| B | llama Q4_K_M PP | vLLM W4A16 PP | PP gap | llama decode | vLLM decode | TG gap |
|---|---|---|---|---|---|---|
| 1 | 708 | 5367 | 7.6× | 10.2 | 11.7 | ~parity |
| 8 | 761 | 14941 | 20× | 58 | 92 | 1.6× |
| 32 | 763 | 21952 | 29× | 205 | 330 | 1.6× |
| 64 | 765 | 24444 | 32× | 253 | 569 | 2.2× |
**Findings:**
1. **Dense prefill has the SAME (larger) kernel gap.** llama dense prefill plateaus at ~765 t/s regardless of
B; vLLM scales to 24.4k (32×). Both read 4-bit weights — the gap is the compute kernel: vLLM's FP4 Marlin
tensor-core GEMM vs llama's int8-MMQ. (Note: on consumer Blackwell, W4A16 Marlin is also reported *faster*
than the experimental W4A4 path, so W4A16 isn't a handicapped stand-in — it's the fast path.)
2. **Decode is ~parity at B=1** (10.2 vs 11.7 — both weight-bandwidth-bound reading 4-bit weights), and the
gap grows with batch (compute starts to matter → the kernel gap reappears: 2.2× at B=64).
3. **Scope decision (the reason for this benchmark): the Lever-3 kernel track must also deliver a NON-grouped
block-scaled FP4 GEMM for dense**, not only the MoE grouped GEMM. The dense GEMM is the simpler of the two
(a plain CUTLASS dense GEMM), so it's a good first kernel to land — and it benefits every dense model.
- **No cheap lever:** `GGML_CUDA_FORCE_CUBLAS` is a **no-op for dense too** (Q4_K pp512: 720.8 vs 721.8) —
dequant→cuBLAS-BF16 doesn't engage / isn't faster than int8-MMQ on GB10. With ubatch (saturates) and
nwarps (static_assert) already ruled out for MoE, **every config/flag lever is now exhausted** for both
model classes. Parity is strictly the FP4 tensor-core kernel.
4. **Why full W4A4 NVFP4 hangs on GB10 (root cause, researched).** This is a *known consumer-Blackwell
limitation, not a misconfiguration*. **FlashInfer ships no FP4 cubins for sm_120/sm_121** — its precompiled
kernels are all datacenter `Sm100a/Sm103a` (B200/B300). So on GB10 the dense `mm_fp4` W4A4 GEMM has no
working kernel: the optimized path is gated off for sm_121 (heuristic checks `minor==0`; 12.1 fails), the
CUTLASS dense FP4 fallback is documented to silently return **all-zeros**, and TRT-LLM errors at capability
120. Our exact symptom (loads weights, then stalls at the first profiling forward pass with
`enable_flashinfer_autotune=True` at 0-3% GPU) is the **FlashInfer FP4 autotuner/JIT spinning on an arch
with no FP4 cubins** (matches vllm #30163/#26381, flashinfer #2577/#3294). The "NVFP4 on DGX Spark" story
everyone cites is about *quantization + memory footprint + W4A16/MoE*, **not dense W4A4 inference**, which
isn't validated on sm_121 yet (where people patched it working, it was slower than W4A16 anyway).
**Therefore W4A16 vs Q4_K_M above is the right, reproducible apples-to-apples** for DGX Spark today.
Optional W4A4 retry (verify output isn't zeros first): `VLLM_SKIP_FLASHINFER_AUTOTUNE=1` +
`VLLM_NVFP4_GEMM_BACKEND=cutlass` + `--enforce-eager`, or NVIDIA's `vllm/vllm-openai:cu130-nightly` container.
## So, honestly, where parity stands
- **Decode single-stream: already at/above parity** (B=1: 83 vs 48).
- **Decode concurrency: a real, engine-addressable gap** the paged series can narrow (0004 on-demand pool +
0005 continuous batching). Target: close the 2.5-3.7× at B≥8.
- **Prefill: kernel-bound, not engine-bound.** No amount of paging reaches vLLM here; that's a separate track.
**Series status when measured:** 0001 (vendor) + 0002 (placement, token-identical) done; 0003 (gather-read)
turn-key-planned, not yet implemented. These numbers are the *baseline* the engine patches must improve on at
B≥8 decode — re-run this table after 0004/0005 to show the concurrency gap closing.


@@ -1,82 +0,0 @@
# llama.cpp patch series — paged attention (vLLM-parity engine)
A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us
**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
patch needs fixing, and the failure points at exactly which step the upstream change touched.
## Base
- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
the pin = re-run the regen workflow below and fix only the patches that no longer apply.
## The series (phases → patches)
| # | Patch | What | Verifies |
|---|-------|------|----------|
| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately under `../paged/` |
| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off
(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so partial series never changes stock behavior.
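
A default-off gate of this kind is typically a one-time environment lookup. The sketch below shows one way to write it. The helper names are hypothetical, and treating an empty value or `0` as "off" is an assumption, not necessarily what the patches do.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true only when the variable is set to a non-empty value other than "0".
inline bool env_flag_enabled(const char * name) {
    assert(name != nullptr);
    const char * v = std::getenv(name);
    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
}

// Hypothetical gate for the paged path. The result is cached on first call, so changing
// the environment later in the same process has no effect.
inline bool kv_paged_enabled() {
    static const bool enabled = env_flag_enabled("LLAMA_KV_PAGED");
    return enabled;
}
```

With a gate like this, a build that carries only part of the series behaves like stock unless `LLAMA_KV_PAGED=1` is exported before launch.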
## Regen workflow (the anti-drift recipe)
```sh
# 1. check out the exact pin into a dev tree
git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
git checkout <LLAMA_VERSION from ../Makefile>
git checkout -b paged
# 2. apply the current series (each becomes a commit), or develop the next patch
git am /path/to/backend/cpp/llama-cpp/patches/00*.patch # or `git apply` + commit per patch
# 3. iterate a phase as ONE commit, then export the whole series 1:1
git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp/patches/ --zero-commit -N
# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
```
## Build integration
`../Makefile`'s `llama.cpp:` target runs, after `git checkout -b build $(LLAMA_VERSION)`:
```
for p in $(CURRENT_MAKEFILE_DIR)/patches/0*.patch; do git apply --verbose "$p"; done
```
All variants (avx/avx2/avx512/cuda/…) copy the patched `llama.cpp/` tree, so the series ships everywhere.
## Status
- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`.
- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
**token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
(`ADDITIVE_DESIGN.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
`llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
`llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
**9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
- **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit-
identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what
makes paged placement token-identical under flash-attn.**
- 0004-0006 follow.
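
The order sensitivity behind the key correctness finding is easy to reproduce outside ggml. The sketch below is a simplified stand-in: a plain left-to-right `float` sum instead of the flash-attn online softmax, plus a position-sorted gather index in the spirit of `get_gather_idxs`. The `Cell` struct and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view of one occupied KV cell: token position and physical slot in the pool.
struct Cell {
    int32_t pos;
    int32_t slot;
};

// Return physical slots ordered by token position, so a reader visits cells in the
// same logical order a contiguous cache would, regardless of where blocks landed.
inline std::vector<int32_t> gather_idxs_sorted(std::vector<Cell> cells) {
    std::stable_sort(cells.begin(), cells.end(),
                     [](const Cell & a, const Cell & b) { return a.pos < b.pos; });
    std::vector<int32_t> idx;
    idx.reserve(cells.size());
    for (const Cell & c : cells) idx.push_back(c.slot);
    return idx;
}

// Accumulate pool[idx[i]] left to right in single precision.
// Float addition is not associative, so the result depends on the visiting order.
inline float sum_in_order(const std::vector<float> & pool, const std::vector<int32_t> & idx) {
    float acc = 0.0f;
    for (int32_t i : idx) {
        assert(i >= 0 && static_cast<std::size_t>(i) < pool.size());
        acc += pool[static_cast<std::size_t>(i)];
    }
    return acc;
}
```

Take `pool = {1e8f, 1.0f, -1e8f, 1.0f}` and assume IEEE-754 single-precision evaluation without fast-math reordering. Visiting slots in order `0,1,2,3` yields 1, while `0,2,1,3` yields 2: same values, different order, different result. Sorting the gather by position pins the order to the one stock uses, which is why the paged read can be bit-identical and not just close.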
### Honest parity note (important)
This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
`../paged/UPSTREAM_GGML_ISSUE.md` and `DGX_BLACKWELL_PLAN.md`). So full vLLM parity = this series **AND** the
kernel; neither alone suffices.


@@ -1,91 +0,0 @@
diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cu b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
new file mode 100644
index 0000000..5f5a782
--- /dev/null
+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
@@ -0,0 +1,46 @@
+#include "fp4-grouped-moe.cuh"
+
+#include <cstdlib>
+#include <cstdio>
+
+// SCAFFOLD for the FP4 grouped-GEMM MoE kernel (Lever 3).
+//
+// Why: on GB10 (sm_121) the MoE matmul runs mul_mat_q<MXFP4> - a warp-level mma.sync grouped MMQ -
+// at ~22 effective TFLOP/s, ~27x behind vLLM prefill, and it also dominates decode at concurrency
+// (54.6% of GPU time at B=64). It is the single bottleneck to vLLM parity in BOTH phases; paged
+// attention cannot touch it (proven by profiling). The fix is a CUTLASS-3.x collective-mainloop
+// grouped GEMM over all experts, block-scaled e2m1 operands via tcgen05 tensor-memory MMA.
+//
+// This file is the integration seam. It is currently a no-op that always falls back to MMQ, so the
+// default build is byte-identical. The kernel is filled in over the phases in the design doc.
+
+static bool fp4_grouped_enabled() {
+ static const bool en = (std::getenv("GGML_CUDA_FP4_GROUPED") != nullptr);
+ return en;
+}
+
+bool ggml_cuda_fp4_grouped_moe(
+ ggml_backend_cuda_context & ctx,
+ const ggml_tensor * src0,
+ const ggml_tensor * src1,
+ const ggml_tensor * ids,
+ ggml_tensor * dst) {
+ GGML_UNUSED(ctx); GGML_UNUSED(src1); GGML_UNUSED(ids); GGML_UNUSED(dst);
+
+ if (!fp4_grouped_enabled()) {
+ return false; // default: existing MMQ path
+ }
+ if (src0->type != GGML_TYPE_MXFP4 && src0->type != GGML_TYPE_NVFP4) {
+ return false;
+ }
+
+ // TODO(kernel - see kernel design doc): CUTLASS 3.x GemmGrouped, sm_120a, block-scaled e2m1,
+ // tcgen05 MMA; per-expert problem offsets from `ids`; fused activation quant; numerical parity
+ // vs mul_mat_q<MXFP4> before enabling by default.
+ static bool warned = false;
+ if (!warned) {
+ warned = true;
+ fprintf(stderr, "[fp4-grouped] GGML_CUDA_FP4_GROUPED set, kernel not yet implemented - using MMQ\n");
+ }
+ return false; // scaffold: fall back until the kernel lands
+}
diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cuh b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
new file mode 100644
index 0000000..29e1b5a
--- /dev/null
+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
@@ -0,0 +1,13 @@
+#pragma once
+
+#include "common.cuh"
+
+// Entry point for the tcgen05/CUTLASS block-scaled FP4 (MXFP4/NVFP4) grouped-GEMM MoE kernel for
+// Blackwell consumer GPUs (sm_120/121). Returns true if it handled the op; false to fall back to
+// the existing warp-mma MMQ path. Gated behind GGML_CUDA_FP4_GROUPED until correct + faster.
+bool ggml_cuda_fp4_grouped_moe(
+ ggml_backend_cuda_context & ctx,
+ const ggml_tensor * src0, // expert weights, MXFP4/NVFP4 [n_embd, n_ff, n_expert]
+ const ggml_tensor * src1, // activations, F32 [n_embd, n_tokens, ...]
+ const ggml_tensor * ids, // expert routing, I32
+ ggml_tensor * dst); // F32 output
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 8ea462a..104d131 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -30,6 +30,7 @@
#include "ggml-cuda/im2col.cuh"
#include "ggml-cuda/mmf.cuh"
#include "ggml-cuda/mmq.cuh"
+#include "ggml-cuda/fp4-grouped-moe.cuh"
#include "ggml-cuda/mmvf.cuh"
#include "ggml-cuda/mmvq.cuh"
#include "ggml-cuda/norm.cuh"
@@ -2701,6 +2702,7 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
}
if (ggml_cuda_should_use_mmq(src0->type, cc, ne12, /*n_experts=*/ne02)) {
+ if (ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst)) { return; }
ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
return;
}


@@ -1,447 +0,0 @@
From bef64835d444a44ed8391bc395cdab38164229d5 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 19 Jun 2026 22:54:49 +0000
Subject: [PATCH] vendor paged kv manager
vLLM-parity host-side KV block manager (FreeBlockQueue, BlockPool,
PagedKVManager, chained-hash prefix cache). Pure C++17, no behavior change -
nothing uses it yet; wired in by later patches in the series.
---
src/CMakeLists.txt | 1 +
src/paged-kv-manager.cpp | 296 +++++++++++++++++++++++++++++++++++++++
src/paged-kv-manager.h | 108 ++++++++++++++
3 files changed, 405 insertions(+)
create mode 100644 src/paged-kv-manager.cpp
create mode 100644 src/paged-kv-manager.h
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index d15ccfd99..a030940b8 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -24,6 +24,7 @@ add_library(llama
llama-io.cpp
llama-kv-cache.cpp
llama-kv-cache-iswa.cpp
+ paged-kv-manager.cpp
llama-kv-cache-dsa.cpp
llama-memory.cpp
llama-memory-hybrid.cpp
diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
new file mode 100644
index 000000000..ca0dcd83a
--- /dev/null
+++ b/src/paged-kv-manager.cpp
@@ -0,0 +1,296 @@
+#include "paged-kv-manager.h"
+#include <cassert>
+#include <stdexcept>
+
+namespace paged {
+
+// ---------------------------------------------------------------------------
+// FreeBlockQueue (port of kv_cache_utils.py FreeKVCacheBlockQueue)
+// ---------------------------------------------------------------------------
+
+FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
+ num_free_blocks = blocks.size();
+ for (size_t i = 0; i < blocks.size(); ++i) {
+ if (i > 0) blocks[i]->prev_free = blocks[i - 1];
+ if (i + 1 < blocks.size()) blocks[i]->next_free = blocks[i + 1];
+ }
+ if (!blocks.empty()) {
+ fake_head.next_free = blocks.front();
+ blocks.front()->prev_free = &fake_head;
+ fake_tail.prev_free = blocks.back();
+ blocks.back()->next_free = &fake_tail;
+ } else {
+ fake_head.next_free = &fake_tail;
+ fake_tail.prev_free = &fake_head;
+ }
+}
+
+KVCacheBlock* FreeBlockQueue::popleft() {
+ KVCacheBlock* first = fake_head.next_free;
+ if (first == &fake_tail || first == nullptr) {
+ assert(num_free_blocks == 0);
+ throw std::runtime_error("No free blocks available");
+ }
+ fake_head.next_free = first->next_free;
+ first->next_free->prev_free = &fake_head;
+ first->prev_free = first->next_free = nullptr;
+ num_free_blocks--;
+ return first;
+}
+
+std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
+ std::vector<KVCacheBlock*> ret;
+ if (n == 0) return ret;
+ assert(num_free_blocks >= n);
+ num_free_blocks -= n;
+ KVCacheBlock* curr = fake_head.next_free;
+ ret.reserve(n);
+ for (size_t i = 0; i < n; ++i) {
+ assert(curr != nullptr);
+ ret.push_back(curr);
+ KVCacheBlock* last = curr;
+ curr = curr->next_free;
+ last->prev_free = last->next_free = nullptr;
+ }
+ if (curr != nullptr) {
+ fake_head.next_free = curr;
+ curr->prev_free = &fake_head;
+ }
+ return ret;
+}
+
+void FreeBlockQueue::remove(KVCacheBlock* block) {
+ if (!block->prev_free || !block->next_free)
+ throw std::runtime_error("remove() called on an invalid block");
+ block->prev_free->next_free = block->next_free;
+ block->next_free->prev_free = block->prev_free;
+ block->prev_free = block->next_free = nullptr;
+ num_free_blocks--;
+}
+
+void FreeBlockQueue::append(KVCacheBlock* block) {
+ KVCacheBlock* last = fake_tail.prev_free;
+ last->next_free = block;
+ block->prev_free = last;
+ block->next_free = &fake_tail;
+ fake_tail.prev_free = block;
+ num_free_blocks++;
+}
+
+void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
+ if (blocks.empty()) return;
+ KVCacheBlock* last = fake_tail.prev_free;
+ for (KVCacheBlock* b : blocks) {
+ b->prev_free = last;
+ last->next_free = b;
+ last = b;
+ }
+ last->next_free = &fake_tail;
+ fake_tail.prev_free = last;
+ num_free_blocks += blocks.size();
+}
+
+void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
+ if (blocks.empty()) return;
+ KVCacheBlock* first = fake_head.next_free;
+ KVCacheBlock* prev = &fake_head;
+ for (KVCacheBlock* b : blocks) {
+ b->prev_free = prev;
+ prev->next_free = b;
+ prev = b;
+ }
+ prev->next_free = first;
+ first->prev_free = prev;
+ num_free_blocks += blocks.size();
+}
+
+std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
+ std::vector<KVCacheBlock*> ret;
+ const KVCacheBlock* curr = fake_head.next_free;
+ while (curr && curr->next_free != nullptr) {
+ ret.push_back(const_cast<KVCacheBlock*>(curr));
+ curr = curr->next_free;
+ }
+ return ret;
+}
+
+// ---------------------------------------------------------------------------
+// BlockPool (port of block_pool.py)
+// ---------------------------------------------------------------------------
+
+static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
+ std::vector<KVCacheBlock*> p;
+ p.reserve(v.size());
+ for (auto& b : v) p.push_back(&b);
+ return p;
+}
+
+static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
+ std::vector<KVCacheBlock> v;
+ v.reserve(num_blocks);
+ for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
+ return v;
+}
+
+BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
+ : enable_caching_(enable_caching),
+ blocks_(make_block_vec(num_blocks)),
+ ptrs_(make_ptrs(blocks_)),
+ free_queue_(ptrs_) {
+ // vLLM reserves block_id 0 as the null block (never cached).
+ null_block = free_queue_.popleft();
+ null_block->is_null = true;
+}
+
+bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
+ if (!block->has_hash) return false;
+ auto it = cached_block_hash_to_block_.find(block->block_hash);
+ if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
+ cached_block_hash_to_block_.erase(it);
+ block->reset_hash();
+ return true;
+}
+
+std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
+ if (n > get_num_free_blocks())
+ throw std::runtime_error("Cannot get free blocks from pool");
+ auto ret = free_queue_.popleft_n(n);
+ for (KVCacheBlock* b : ret) {
+ if (enable_caching_) maybe_evict_cached_block(b);
+ assert(b->ref_cnt == 0);
+ b->ref_cnt += 1;
+ }
+ return ret;
+}
+
+KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
+ auto it = cached_block_hash_to_block_.find(block_hash);
+ return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
+}
+
+void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
+ for (KVCacheBlock* b : blocks) {
+ // ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
+ if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
+ b->ref_cnt += 1;
+ }
+}
+
+void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
+ std::vector<KVCacheBlock*> without_hash, with_hash;
+ for (KVCacheBlock* b : ordered_blocks) {
+ if (b->is_null) continue;
+ b->ref_cnt -= 1;
+ if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
+ }
+ free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
+ free_queue_.append_n(with_hash); // hashed: kept warm (tail)
+}
+
+void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+ size_t num_cached_blocks, size_t num_full_blocks,
+ const std::vector<uint64_t>& block_hashes) {
+ for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
+ KVCacheBlock* blk = req_blocks[i];
+ if (blk->has_hash) continue;
+ blk->has_hash = true;
+ blk->block_hash = block_hashes[i];
+ cached_block_hash_to_block_[blk->block_hash] = blk;
+ }
+}
+
+// ---------------------------------------------------------------------------
+// PagedKVManager (port of SingleTypeKVCacheManager / FullAttentionManager)
+// ---------------------------------------------------------------------------
+
+static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
+
+PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
+ : block_size_(block_size), pool_(num_blocks, enable_caching) {}
+
+bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
+ auto& req = req_to_blocks_[seq_id];
+ size_t need = cdiv(total_tokens, block_size_);
+ if (need <= req.size()) return true;
+ size_t add = need - req.size();
+ if (add > pool_.get_num_free_blocks()) return false; // OOM
+ auto nb = pool_.get_new_blocks(add);
+ req.insert(req.end(), nb.begin(), nb.end());
+ return true;
+}
+
+std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
+ std::vector<int32_t> bt;
+ auto it = req_to_blocks_.find(seq_id);
+ if (it == req_to_blocks_.end()) return bt;
+ bt.reserve(it->second.size());
+ for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
+ return bt;
+}
+
+int64_t PagedKVManager::slot(int seq_id, int pos) const {
+ const auto& req = req_to_blocks_.at(seq_id);
+ int32_t phys = req[pos / block_size_]->block_id;
+ return (int64_t)phys * block_size_ + (pos % block_size_);
+}
+
+std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
+ std::vector<int64_t> sm;
+ sm.reserve(positions.size());
+ for (int p : positions) sm.push_back(slot(seq_id, p));
+ return sm;
+}
+
+void PagedKVManager::free(int seq_id) {
+ auto it = req_to_blocks_.find(seq_id);
+ if (it == req_to_blocks_.end()) return;
+ // Free in reverse so the tail of the block chain is evicted first (vLLM order).
+ std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
+ pool_.free_blocks(ordered);
+ req_to_blocks_.erase(it);
+}
+
+// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
+// hash into the seed so each block hash transitively encodes its whole prefix
+// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
+uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
+ uint64_t h = 1469598103934665603ull ^ parent_hash;
+ for (int t : token_ids) {
+ h ^= (uint64_t)(uint32_t)t;
+ h *= 1099511628211ull;
+ }
+ if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
+ return h;
+}
+
+std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
+ std::vector<uint64_t> hashes;
+ uint64_t parent = 0; // NONE_HASH analogue
+ size_t n_full = token_ids.size() / block_size_;
+ for (size_t i = 0; i < n_full; ++i) {
+ std::vector<int> blk(token_ids.begin() + i * block_size_,
+ token_ids.begin() + (i + 1) * block_size_);
+ parent = hash_block(parent, blk);
+ hashes.push_back(parent);
+ }
+ return hashes;
+}
+
+size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
+ std::vector<KVCacheBlock*> hits;
+ for (uint64_t bh : block_hashes) { // stop at first miss (prefix property)
+ KVCacheBlock* cb = pool_.get_cached_block(bh);
+ if (!cb) break;
+ hits.push_back(cb);
+ }
+ pool_.touch(hits); // ++ref_cnt, pull from free list
+ return hits.size() * (size_t)block_size_;
+}
+
+void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
+ auto& req = req_to_blocks_[seq_id];
+ size_t n_full = num_tokens / block_size_;
+ pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
+}
+
+} // namespace paged
diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
new file mode 100644
index 000000000..740280a7f
--- /dev/null
+++ b/src/paged-kv-manager.h
@@ -0,0 +1,108 @@
+#pragma once
+// Paged KV cache block manager for llama.cpp (CPU-first prototype).
+//
+// Host-side block management is a faithful port of vLLM V1:
+// vllm/v1/core/kv_cache_utils.py (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
+// vllm/v1/core/block_pool.py (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
+// vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
+//
+// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
+// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
+// dependency so it can be unit-tested in isolation.
+
+#include <cstdint>
+#include <vector>
+#include <unordered_map>
+#include <map>
+
+namespace paged {
+
+// vLLM KVCacheBlock (kv_cache_utils.py).
+struct KVCacheBlock {
+ int32_t block_id = 0;
+ int ref_cnt = 0;
+ bool has_hash = false; // vLLM: _block_hash is set only when full+cached
+ uint64_t block_hash = 0;
+ bool is_null = false;
+ KVCacheBlock* prev_free = nullptr;
+ KVCacheBlock* next_free = nullptr;
+
+ explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
+ void reset_hash() { has_hash = false; block_hash = 0; }
+};
+
+// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
+// O(1) middle removal is required so touch() can pull a warm cached block out of the
+// free list when a later request hits its prefix.
+class FreeBlockQueue {
+public:
+ size_t num_free_blocks = 0;
+
+ explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
+ KVCacheBlock* popleft();
+ std::vector<KVCacheBlock*> popleft_n(size_t n);
+ void remove(KVCacheBlock* block);
+ void append(KVCacheBlock* block);
+ void append_n(const std::vector<KVCacheBlock*>& blocks);
+ void prepend_n(const std::vector<KVCacheBlock*>& blocks);
+ std::vector<KVCacheBlock*> get_all_free_blocks() const;
+
+private:
+ KVCacheBlock fake_head{-1};
+ KVCacheBlock fake_tail{-1};
+};
+
+// vLLM BlockPool (block_pool.py).
+class BlockPool {
+public:
+ KVCacheBlock* null_block = nullptr;
+
+ BlockPool(int32_t num_blocks, bool enable_caching);
+ std::vector<KVCacheBlock*> get_new_blocks(size_t n);
+ KVCacheBlock* get_cached_block(uint64_t block_hash);
+ void touch(const std::vector<KVCacheBlock*>& blocks);
+ void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
+ void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+ size_t num_cached_blocks, size_t num_full_blocks,
+ const std::vector<uint64_t>& block_hashes);
+ size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
+
+private:
+ bool maybe_evict_cached_block(KVCacheBlock* block);
+
+ bool enable_caching_;
+ std::vector<KVCacheBlock> blocks_; // owns all block descriptors
+ std::vector<KVCacheBlock*> ptrs_;
+ FreeBlockQueue free_queue_;
+ // vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
+ // prototype keeps the last writer (single KV-cache group is sufficient for the wins).
+ std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
+};
+
+// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
+// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
+class PagedKVManager {
+public:
+ PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
+
+ // Grow seq_id to cover total_tokens slots. Returns false on OOM (fewer free blocks than needed).
+ bool allocate(int seq_id, size_t total_tokens);
+ std::vector<int32_t> block_table(int seq_id) const;
+ int64_t slot(int seq_id, int pos) const;
+ std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
+ void free(int seq_id);
+ int block_size() const { return block_size_; }
+
+ // Prefix caching (win 3).
+ static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
+ std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
+ size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
+ void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
+
+protected:
+ int block_size_;
+ BlockPool pool_;
+ std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
+};
+
+} // namespace paged
--
2.43.0


@@ -1,75 +0,0 @@
From 5c9c709e6c6b07e0399b75fd4e46e752d418a9a8 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 19 Jun 2026 23:04:17 +0000
Subject: [PATCH] paged kv block placement (env LLAMA_KV_PAGED)
Place each sequence's tokens at permuted, non-contiguous fixed-size block
positions in find_slot, proving attention is invariant to physical KV placement
(token-identical greedy generation). Default off; single-sequence scope; falls
back to the normal allocator. The paged-placement substrate for the gather-read.
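
The block permutation described above can be sketched on its own. This is a minimal standalone sketch (C++17), assuming the same stride rule as the patch: take the first odd candidate at or above `nblk / 2` that is coprime to `nblk`, then map logical block `b` to physical block `(b * k) % nblk`. The helper names `pick_stride` and `permute_cell` are illustrative only and do not exist in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>   // std::gcd (C++17)

// Pick a stride coprime to nblk, so that b -> (b * k) % nblk is a bijection
// on [0, nblk). Falls back to 1 (identity) when no candidate qualifies.
// Hypothetical helper; the patch inlines this search in find_slot.
static uint32_t pick_stride(uint32_t nblk) {
    assert(nblk >= 2);
    for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
        if (std::gcd(cand, nblk) == 1u) return cand;
    }
    return 1;
}

// Map a logical cell index to its permuted physical cell.
// bs is the block size in tokens, k the stride returned by pick_stride.
static uint32_t permute_cell(uint32_t logical, uint32_t bs, uint32_t nblk, uint32_t k) {
    const uint32_t b   = logical / bs;   // logical block
    const uint32_t off = logical % bs;   // offset inside the block is preserved
    assert(b < nblk);
    return ((b * k) % nblk) * bs + off;
}
```

Because `k` is coprime to `nblk`, no two logical blocks land on the same physical block, which is why the placement can only fail on occupancy and never on a self-collision.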
---
src/llama-kv-cache.cpp | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 2802103bd..999e2ae61 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -11,6 +11,8 @@
#include <cstring>
#include <limits>
#include <map>
+#include <numeric>
+#include <cstdlib>
#include <stdexcept>
static bool ggml_is_power_of_2(int n) {
@@ -1020,6 +1022,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
return { };
}
+ // [paged, experimental] Place this sequence's tokens at permuted,
+ // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
+ // This validates that attention is invariant to physical KV placement -
+ // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
+ // Single-sequence scope (uses get_used() as the logical base); falls back
+ // to the normal allocator if the permuted cells aren't available.
+ static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+ if (paged_mode) {
+ const uint32_t bs = 16; // block size (tokens/block)
+ const uint32_t nblk = cells.size() / bs; // blocks in this stream's pool
+ if (nblk >= 2) {
+ // stride coprime to nblk => block-index permutation is a bijection
+ uint32_t k = 1;
+ for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
+ if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
+ }
+ const uint32_t base = cells.get_used();
+ bool ok = true;
+ for (uint32_t i = 0; i < n_tokens; ++i) {
+ const uint32_t L = base + i;
+ const uint32_t b = L / bs;
+ const uint32_t off = L % bs;
+ if (b >= nblk) { ok = false; break; }
+ const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
+ if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
+ res.idxs[s].push_back(phys);
+ }
+ if (ok && res.idxs[s].size() == n_tokens) {
+ if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+ fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
+ for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+ fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+ }
+ continue; // paged placement succeeded for this sequence
+ }
+ res.idxs[s].clear(); // fall back to the normal allocator
+ }
+ }
+
uint32_t n_tested = 0;
// for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
--
2.43.0


@@ -1,102 +0,0 @@
# Patch 0003 — paged gather-read: exact implementation plan
**Goal:** a sequence attends only to its own (compacted) cells via `ggml_get_rows`, instead of the scattered
`[0,n_kv)` window. Output stays token-identical (attention is permutation-invariant over the KV set). **Gated**: stock
path stays byte-identical (no new ops unless `LLAMA_KV_PAGED` is set).
**Base:** applies on top of 0001+0002 at the pin. Dev tree: `backend/cpp/llama-cpp-paged-dev` (branch `paged`).
## Design
The gather is keyed off one runtime index list (the sequence's used cells, in a fixed order), exposed as a
graph input (mirroring `k_idxs`). In `build_attn`, gather K, V **and the kq_mask** by that same index, so all
three stay aligned. `n_gathered` replaces `n_kv` for the attention. Only active when the cache is in paged
mode (a new `is_paged()` flag set when `LLAMA_KV_PAGED`/find_slot used permuted placement).
ggml note: `ggml_get_rows(a,b)` gathers `a`'s **ne1** by `b` (I32). Raw K is `[n_embd_k_gqa, kv_size, n_stream]`
→ ne1 = cells → direct. The mask is `[n_kv, n_tokens, 1, n_stream]` → n_kv is **ne0**, so gather as
`transpose → get_rows → transpose`.
### KEY CORRECTIONS (found while implementing — these change the edits)
1. **Gather index = ALL used (non-empty) cells in `[0,n_kv)`, NOT `sinfo.idxs`.** `sinfo.idxs` is only the
*current ubatch's write slots*; attention reads the *full history*. Which cells each token attends to is already
decided by `kq_mask`, so gathering the union of all used cells and gathering the mask by the same index is
token-identical and drops exactly the empty (already-masked) cells. So: `gather = { i in [0,n_kv) : !cells.is_empty(i) }`.
2. **Static-graph size is fine because llama.cpp rebuilds the graph every ubatch.** `n_gather` (used-cell
count) is therefore a build-time constant for that ubatch — `build_input_gather_idxs` sizes the I32
tensor to `get_n_gather()` computed at build, `set_input_gather_idxs` fills the identical cell list. They
MUST use the same loop (`for i in [0,n_kv): if !is_empty(i) push i`) so build-order == fill-order.
3. **K/V gather can live entirely in `build_attn`, no cache get_k change.** The `get_k` 4d view is contiguous
in `[ne0,ne1,ne2]` from cell 0 (nb2 == n_embd_head*n_head_kv*elemsz), so for **single stream (ns==1)**:
`reshape_3d(k, n_embd_head*n_head_kv, n_kv, 1) → get_rows(., gi) → reshape_4d(., n_embd_head, n_head_kv, n_gather, 1)`.
Multi-stream (ns>1) breaks contiguity (nb3 uses kv_size) → gate to ns==1 first, multi-stream follow-up.
4. So the ONLY cache additions are `is_paged()`, `get_n_gather(n_kv)`, `build/set_input_gather_idxs(n_kv)`;
everything else (K/V/mask gather) is in `build_attn`. `set_input_kq_mask` is **unchanged** (built over
n_kv, then gathered). Smaller than the 7-edit estimate above.
## Edits
### 1. `src/llama-kv-cache.h` — declare gather infra (in `llama_kv_cache`)
```cpp
bool is_paged() const { return paged_active; } // near get_size()
ggml_tensor * build_input_gather_idxs(ggml_context * ctx, const slot_info & sinfo) const;
void set_input_gather_idxs (ggml_tensor * dst, const slot_info & sinfo) const;
uint32_t get_n_gather(const slot_info & sinfo) const; // == sum of used cells gathered
```
Add member `mutable bool paged_active = false;` and in `llama_kv_cache_context` forward the three (like
`build_input_k_idxs`/`get_n_kv`).
### 2. `src/llama-kv-cache.cpp`
- In `find_slot`, in the paged branch (0002), set `paged_active = true;` on success.
- `get_n_gather(sinfo)` = `sinfo.idxs[0].size()` summed over streams (the count actually placed).
- `build_input_gather_idxs`: `ggml_new_tensor_1d(ctx, GGML_TYPE_I32, get_n_gather(sinfo)); ggml_set_input(...)`.
- `set_input_gather_idxs`: fill `data[k++] = strm_off + sinfo.idxs[s][i]` for every placed cell (same order
the mask/k/v will see). This is the canonical gather order.
### 3. `src/llama-graph.h` — `llm_graph_input_attn_kv`
Add `ggml_tensor * gather_idxs = nullptr;` + `ggml_tensor * get_gather_idxs() const { return gather_idxs; }`.
### 4. `src/llama-graph.cpp`
- `llm_graph_input_attn_kv::set_input`: if `mctx->is_paged()` → `mctx->set_input_gather_idxs(gather_idxs, ...)`.
- `build_attn_inp_kv` (creates the input): if `mctx_cur->is_paged()` → `inp->gather_idxs =
mctx_cur->build_input_gather_idxs(ctx0, ...)`.
- `build_attn` (the kv overload, ~2356): after `k`,`v`,`kq_mask`:
```cpp
if (ggml_tensor * gi = inp->get_gather_idxs()) {
k = ggml_get_rows(ctx0, k, gi); // [d, n_gather, ...] (reshape view ok)
v = v_trans ? /* gather columns */ : ggml_get_rows(ctx0, v, gi);
ggml_tensor * m = ggml_cont(ctx0, ggml_transpose(ctx0, kq_mask)); // [n_tokens, n_kv]
m = ggml_get_rows(ctx0, m, gi); // [n_tokens, n_gather]
kq_mask = ggml_cont(ctx0, ggml_transpose(ctx0, m)); // [n_gather, n_tokens]
}
ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
```
Note: `get_k` returns the reshaped 4d view; gather must run on a cell-major shape. Simplest: add a paged
variant `get_k(ctx,il)` that returns `ggml_get_rows` of the **raw** `layers[ikv].k` then reshapes to
`[n_embd_head, n_head_kv, n_gather, ns]`. Do the gather in the cache, not the graph, for K/V; keep only the
mask gather in the graph. (Cleaner — revisit during impl.)
### 5. V-transposed path
When `!flash_attn`, V is stored transposed `[kv_size, n_embd_v_gqa]`, so gathering its **rows** (ne1 = n_embd)
won't work. Either gather columns via the same idx on the non-transposed store, or force `is_paged()` to require
flash-attn for the first cut (`GGML_ASSERT`) and handle v_trans in a follow-up.
## Verification (the gate)
```sh
cmake --build build-cpu --target llama-simple -j
M=Qwen3-0.6B.Q4_K_M.gguf ; P="<the 0002 prompt>"
build-cpu/bin/llama-simple -m $M -n 64 "$P" > a.txt # stock
LLAMA_KV_PAGED=1 build-cpu/bin/llama-simple -m $M -n 64 "$P" > b.txt # paged gather-read
diff a.txt b.txt # MUST be identical
```
Also assert (debug) that `n_gather < n_kv` on a multi-chunk sequence (proves compaction, not identity).
Export only when identical: `git format-patch HEAD~1 -o patches/ --start-number 3 -N`.
## Risks
- Mask transpose/layout: if `b.txt` diverges, dump the gathered mask vs expected for token 0; off-by-order
means the `set_input_gather_idxs` order ≠ the get_k gather order — they MUST use the identical loop.
- flash-attn vs not: do flash-attn first (simpler mask), then v_trans.


@@ -1,369 +0,0 @@
From c1de00f4cc1eb0dd25993880bb4c8562be1937d4 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 10:24:22 +0200
Subject: [PATCH] paged gather-read (env LLAMA_KV_PAGED) - patch 0003
Gather K, V and the kq_mask down to each sequence stream's non-empty cells
before build_attn_mha. Position-sorted per stream so the flash-attn online
softmax reduction order matches stock byte-for-byte. Multi-stream: one index
column per stream over k->ne[3], padded to the max non-empty count with a
masked (empty) cell. Gated behind LLAMA_KV_PAGED; no-op when unset.
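
The per-stream index column described above (position-sorted, then padded with a masked cell) can be sketched in isolation. This is an illustration only: `Cell` and `gather_column` are invented for the example, and a negative `pos` is assumed to mark an empty cell.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-cell record: pos < 0 marks an empty cell.
struct Cell { int32_t pos; };

// Build one stream's gather column: non-empty cells ordered by token position,
// then padded up to n_gather with the first empty (masked) cell index.
static std::vector<int32_t> gather_column(const std::vector<Cell> & cells, size_t n_gather) {
    std::vector<std::pair<int32_t, int32_t>> pc; // (position, cell index)
    int32_t pad = -1;
    for (size_t i = 0; i < cells.size(); ++i) {
        if (cells[i].pos >= 0) {
            pc.emplace_back(cells[i].pos, (int32_t) i);
        } else if (pad < 0) {
            pad = (int32_t) i; // first empty cell: masked, so safe to repeat
        }
    }
    assert(pc.size() <= n_gather);
    std::sort(pc.begin(), pc.end()); // sort by position, ties broken by cell index

    std::vector<int32_t> col;
    col.reserve(n_gather);
    for (const auto & p : pc) col.push_back(p.second);

    // Same fallback as the patch when the stream has no empty cell at all.
    const int32_t padv = pad >= 0 ? pad : (pc.empty() ? 0 : pc.back().second);
    col.resize(n_gather, padv);
    return col;
}
```

Padding with an empty cell is harmless because its mask entry is -inf; the fallback that repeats a used cell would not be, so it is only safe when no padding is actually needed for that stream.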
---
src/CMakeLists.txt | 1 +
src/llama-graph.cpp | 9 ++-
src/llama-kv-cache.cpp | 74 ++++++++++++++++++++++++
src/llama-kv-cache.h | 11 ++++
src/paged-attn.cpp | 128 +++++++++++++++++++++++++++++++++++++++++
src/paged-attn.h | 40 +++++++++++++
6 files changed, 262 insertions(+), 1 deletion(-)
create mode 100644 src/paged-attn.cpp
create mode 100644 src/paged-attn.h
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a030940..58083b3 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -25,6 +25,7 @@ add_library(llama
llama-kv-cache.cpp
llama-kv-cache-iswa.cpp
paged-kv-manager.cpp
+ paged-attn.cpp
llama-kv-cache-dsa.cpp
llama-memory.cpp
llama-memory-hybrid.cpp
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 68c9e60..b59d2a5 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -6,6 +6,8 @@
#include "llama-cparams.h"
#include "llama-kv-cache.h"
+
+#include "paged-attn.h"
#include "llama-kv-cache-iswa.h"
#include "llama-kv-cache-dsa.h"
#include "llama-memory-hybrid.h"
@@ -2356,7 +2358,12 @@ ggml_tensor * llm_graph_context::build_attn(
ggml_tensor * k = mctx_cur->get_k(ctx0, il);
ggml_tensor * v = mctx_cur->get_v(ctx0, il);
- ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
+ // [paged 0003] gather K, V and the mask to the sequence's used cells only
+ // (no-op unless env LLAMA_KV_PAGED is set).
+ ggml_tensor * kq_mask_g = kq_mask;
+ paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+
+ ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
cb(cur, "kqv_out", il);
if (inp->self_v_rot) {
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 999e2ae..30d02d7 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -1,4 +1,6 @@
#include "llama-kv-cache.h"
+#include <vector>
+#include <utility>
#include "llama-impl.h"
#include "llama-io.h"
@@ -1329,6 +1331,70 @@ ggml_tensor * llama_kv_cache::get_v(ggml_context * ctx, int32_t il, uint32_t n_k
ggml_row_size(v->type, kv_size*n_embd_v_gqa)*sinfo.s0);
}
+// [paged 0003] gather-read: enumerate the non-empty cells in [0, n_kv) for the
+// single stream addressed by sinfo. With paged placement (patch 0002) these are
+// the sequence's scattered block cells; gathering K/V/mask by this index list
+// compacts the attention read while preserving every unmasked (token,cell) pair.
+uint32_t llama_kv_cache::get_n_gather(uint32_t n_kv, const slot_info & sinfo) const {
+ // Multi-stream: the gathered K/V/mask tensors are rectangular [.., n_gather,
+ // n_stream], so n_gather is the MAX non-empty count across the batch streams.
+ // Streams with fewer cells are padded (see get_gather_idxs) with a masked
+ // (empty) cell index, which contributes exp(-inf)=0 and is thus a no-op.
+ // K is laid out over physical streams [s0, s1]; index v_cells the same way.
+ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+ uint32_t mx = 0;
+ for (uint32_t j = 0; j < ns; ++j) {
+ const auto & cells = v_cells[sinfo.s0 + j];
+ const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+ uint32_t cnt = 0;
+ for (uint32_t i = 0; i < n; ++i) {
+ if (!cells.is_empty(i)) {
+ ++cnt;
+ }
+ }
+ mx = std::max(mx, cnt);
+ }
+ return mx;
+}
+
+void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const {
+ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+ const uint32_t n_gather = get_n_gather(n_kv, sinfo);
+ // dst is [n_gather, n_stream] (ne0 = n_gather): column s at dst[s*n_gather..].
+ for (uint32_t j = 0; j < ns; ++j) {
+ const auto & cells = v_cells[sinfo.s0 + j];
+ const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+ // Collect the non-empty cells, then order them by token POSITION (not by
+ // physical cell index). The attention reduction (flash-attn online
+ // softmax, and the non-flash soft_max) runs over cells in array order and
+ // is order-sensitive in floating point. Stock (contiguous) placement
+ // happens to store cells in position order, so emitting the gathered
+ // indices in position order reproduces stock's exact reduction order -
+ // making the paged read bit-identical, not merely math-equivalent.
+ std::vector<std::pair<llama_pos, int32_t>> pc;
+ pc.reserve(n);
+ int32_t pad = -1;
+ for (uint32_t i = 0; i < n; ++i) {
+ if (!cells.is_empty(i)) {
+ pc.emplace_back(cells.pos_get(i), (int32_t) i);
+ } else if (pad < 0) {
+ pad = (int32_t) i; // first empty cell: its mask is -inf -> safe pad
+ }
+ }
+ std::sort(pc.begin(), pc.end());
+ int32_t * col = dst + (size_t) j * n_gather;
+ for (size_t k = 0; k < pc.size(); ++k) {
+ col[k] = pc[k].second;
+ }
+ // Pad the tail to n_gather with a masked (empty) cell so the rectangular
+ // gather drops to zero contribution for streams shorter than the max.
+ const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
+ for (uint32_t k = (uint32_t) pc.size(); k < n_gather; ++k) {
+ col[k] = padv;
+ }
+ }
+}
+
ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
GGML_UNUSED(sinfo);
@@ -2620,6 +2686,14 @@ ggml_tensor * llama_kv_cache_context::get_v(ggml_context * ctx, int32_t il) cons
return kv->get_v(ctx, il, n_kv, sinfos[i_cur]);
}
+uint32_t llama_kv_cache_context::get_n_gather() const {
+ return kv->get_n_gather(n_kv, sinfos[i_cur]);
+}
+
+void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
+ kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
+}
+
ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
}
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index 3d68f98..494c0fb 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -171,6 +171,12 @@ public:
ggml_tensor * get_k(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
ggml_tensor * get_v(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
+ // [paged 0003] count / list the non-empty cells in [0, n_kv) per stream of
+ // sinfo (position-sorted, padded across streams). Used by paged-attn
+ // gather-read. get_n_gather returns the max count across streams.
+ uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
+ void get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
+
// store k_cur and v_cur in the cache based on the provided head location
ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
ggml_tensor * cpy_v(ggml_context * ctx, ggml_tensor * v_cur, ggml_tensor * v_idxs, int32_t il, const slot_info & sinfo) const;
@@ -368,6 +374,11 @@ public:
ggml_tensor * get_k(ggml_context * ctx, int32_t il) const;
ggml_tensor * get_v(ggml_context * ctx, int32_t il) const;
+ // [paged 0003] gather-read helpers (delegate to the kv cache for the
+ // current ubatch's stream).
+ uint32_t get_n_gather() const;
+ void get_gather_idxs(int32_t * dst) const;
+
// store k_cur and v_cur in the cache based on the provided head location
// note: the heads in k_cur and v_cur should be laid out contiguously in memory
// - k_cur [n_embd_head_k, n_head_k, n_tokens]
diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
new file mode 100644
index 0000000..ade75e8
--- /dev/null
+++ b/src/paged-attn.cpp
@@ -0,0 +1,128 @@
+#include "paged-attn.h"
+
+#include "llama-graph.h"
+#include "llama-kv-cache.h"
+
+#include "ggml.h"
+#include "ggml-backend.h"
+
+#include <cstdlib>
+#include <cstdio>
+
+namespace paged_attn {
+
+bool active() {
+ static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+ return a;
+}
+
+static bool debug() {
+ static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
+ return d;
+}
+
+namespace {
+
+// Graph input that, at set_input time, fills an I32 [n_gather, n_stream] tensor
+// with each stream's non-empty cell indices (position-sorted, padded with a
+// masked/empty cell) by delegating to the kv-cache context. Private to this
+// unit; default can_reuse()==false keeps the graph from being reused across
+// decodes (n_gather grows every step).
+class input_gather_idxs : public llm_graph_input_i {
+public:
+ input_gather_idxs(const llama_kv_cache_context * mctx, ggml_tensor * idxs)
+ : mctx(mctx), idxs(idxs) {}
+
+ void set_input(const llama_ubatch * ubatch) override {
+ GGML_UNUSED(ubatch);
+ GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
+ mctx->get_gather_idxs((int32_t *) idxs->data);
+ }
+
+ const llama_kv_cache_context * mctx;
+ ggml_tensor * idxs;
+};
+
+} // namespace
+
+void gather(ggml_context * ctx0,
+ llm_graph_result * res,
+ const llama_kv_cache_context * mctx,
+ ggml_tensor ** k,
+ ggml_tensor ** v,
+ ggml_tensor ** kq_mask) {
+ if (!active()) {
+ return;
+ }
+
+ ggml_tensor * K = *k;
+ ggml_tensor * V = *v;
+ ggml_tensor * M = *kq_mask;
+
+ // Number of streams (sequences) in the unified batch. K is laid out
+ // [d, h, n_kv, n_stream] and the mask is [n_kv, n_tps, 1, n_stream]; the
+ // gather is per-stream (one index column per stream), so a single
+ // ggml_get_rows over the stream axis handles 1..N streams uniformly.
+ const int64_t n_stream = K->ne[3];
+ GGML_ASSERT(M->ne[3] == n_stream);
+
+ const int64_t n_gather = (int64_t) mctx->get_n_gather();
+ if (n_gather <= 0) {
+ // Worst-case graph reserve (empty cache) or nothing placed yet: leave
+ // the full [0, n_kv) read untouched so buffer sizing stays worst-case.
+ return;
+ }
+
+ if (debug()) {
+ static int64_t once = 0;
+ if (once++ < 2) {
+ fprintf(stderr, "[paged-attn] gather n_stream=%lld n_kv=%lld n_gather=%lld\n",
+ (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
+ }
+ }
+
+ // Per-stream index tensor [n_gather, n_stream], filled at set_input from
+ // each stream's non-empty cells. ggml_get_rows broadcasts along ne[1]==
+ // n_stream, so column s gathers from stream s of the source.
+ ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_gather, n_stream);
+ ggml_set_input(idx);
+ res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, idx)));
+
+ // --- gather K: collapse (head_dim, n_head) so cells become the row axis ---
+ {
+ ggml_tensor * t = ggml_cont(ctx0, K); // [d, h, n_kv, ns]
+ t = ggml_reshape_3d(ctx0, t, K->ne[0]*K->ne[1], K->ne[2], n_stream); // [d*h, n_kv, ns]
+ t = ggml_get_rows(ctx0, t, idx); // [d*h, n_gather, ns]
+ *k = ggml_reshape_4d(ctx0, t, K->ne[0], K->ne[1], n_gather, n_stream); // [d, h, n_gather, ns]
+ }
+
+ // --- gather V ---
+ // Normalize to a non-transposed [d, h, n_kv, ns] view first, so the gathered
+ // result is contiguous and build_attn_mha sees a consistent v_trans==false.
+ {
+ const bool v_trans = V->nb[1] > V->nb[2];
+ ggml_tensor * vsrc = v_trans
+ ? ggml_permute(ctx0, V, 2, 1, 0, 3) // [n_kv, h, d, ns] -> [d, h, n_kv, ns]
+ : V; // already [d, h, n_kv, ns]
+ ggml_tensor * t = ggml_cont(ctx0, vsrc); // [d, h, n_kv, ns]
+ t = ggml_reshape_3d(ctx0, t, vsrc->ne[0]*vsrc->ne[1], vsrc->ne[2], n_stream); // [d*h, n_kv, ns]
+ t = ggml_get_rows(ctx0, t, idx); // [d*h, n_gather, ns]
+ *v = ggml_reshape_4d(ctx0, t, vsrc->ne[0], vsrc->ne[1], n_gather, n_stream); // [d, h, n_gather, ns]
+ }
+
+ // --- gather mask (cells are ne0): transpose so cells become the row axis,
+ // gather per stream, transpose back ---
+ {
+ ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream); // [n_kv, n_tps, ns]
+ m = ggml_cont(ctx0, ggml_transpose(ctx0, m)); // [n_tps, n_kv, ns]
+ m = ggml_get_rows(ctx0, m, idx); // [n_tps, n_gather, ns] (F32)
+ m = ggml_cont(ctx0, ggml_transpose(ctx0, m)); // [n_gather, n_tps, ns]
+ m = ggml_reshape_4d(ctx0, m, n_gather, M->ne[1], 1, n_stream);
+ if (M->type != m->type) {
+ m = ggml_cast(ctx0, m, M->type); // flash-attn requires an F16 mask
+ }
+ *kq_mask = m;
+ }
+}
+
+} // namespace paged_attn
diff --git a/src/paged-attn.h b/src/paged-attn.h
new file mode 100644
index 0000000..c5b7bd7
--- /dev/null
+++ b/src/paged-attn.h
@@ -0,0 +1,40 @@
+#pragma once
+// Paged attention gather-read (patch 0003, experimental).
+//
+// Companion to the paged block placement in llama_kv_cache::find_slot (patch
+// 0002). Patch 0002 places a sequence's tokens at permuted, non-contiguous
+// fixed-size block cells, but attention still reads the whole [0, n_kv) window
+// (empty cells masked to -inf). This unit compacts that read: it gathers K, V
+// and the kq_mask down to ONLY the sequence's used (non-empty) cells before
+// build_attn_mha.
+//
+// Correctness: attention is permutation-invariant over the KV set, and dropping
+// already-masked empty cells removes only exp(-inf)=0 terms - so greedy output
+// is identical to stock. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
+//
+// All logic lives here to keep the core files additive: build_attn gets one
+// call, llama_kv_cache_context gets two thin accessors, CMake gets one line.
+
+#include <cstdint>
+
+struct ggml_context;
+struct ggml_tensor;
+class llm_graph_result;
+class llama_kv_cache_context;
+
+namespace paged_attn {
+
+// true iff env LLAMA_KV_PAGED is set (evaluated once).
+bool active();
+
+// Gather K, V and the kq_mask down to the current sequence's non-empty cells.
+// No-op (returns immediately) unless active(). On return *k, *v and *kq_mask
+// point at the compacted tensors; pass them straight to build_attn_mha.
+void gather(ggml_context * ctx0,
+ llm_graph_result * res,
+ const llama_kv_cache_context * mctx,
+ ggml_tensor ** k,
+ ggml_tensor ** v,
+ ggml_tensor ** kq_mask);
+
+} // namespace paged_attn
--
2.43.0


@@ -1,298 +0,0 @@
From 7c294973de28d1ac991505638d726acfb371d541 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 10:50:35 +0200
Subject: [PATCH] paged on-demand block allocation (env LLAMA_KV_PAGED) - patch
0004
Drive the paged placement in find_slot through the vendored PagedKVManager
(patch 0001) instead of a fixed full-pool permutation. Blocks are popped from a
free pool on demand as the sequence crosses block boundaries (peak << full
reservation) and returned on sequence end (seq_rm full removal / clear). One
manager per (kv-cache, stream); all state lives in the new src/paged-alloc unit,
so the core kv-cache struct is untouched - find_slot/clear/seq_rm gain only a
gated call. Default off; stock path byte-identical.
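
The on-demand behaviour described above (blocks taken only when a sequence crosses a block boundary, returned at sequence end) can be sketched with a toy pool. `ToyBlockPool` is an assumption made for illustration; it is not the vendored `PagedKVManager` and omits ref-counting and prefix caching.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy on-demand block pool: blocks leave the free list only when a sequence
// crosses a block boundary, and come back when the sequence ends.
class ToyBlockPool {
public:
    ToyBlockPool(int32_t num_blocks, size_t block_size) : block_size_(block_size) {
        assert(num_blocks > 0 && block_size > 0);
        // Push in reverse so blocks are handed out as 0, 1, 2, ...
        for (int32_t id = num_blocks - 1; id >= 0; --id) free_.push_back(id);
    }

    // Grow seq so it covers total_tokens. Returns false, leaving seq unchanged,
    // when the pool cannot supply enough blocks.
    bool grow(std::vector<int32_t> & seq, size_t total_tokens) {
        const size_t need = (total_tokens + block_size_ - 1) / block_size_; // ceil division
        if (need <= seq.size()) return true;
        const size_t add = need - seq.size();
        if (add > free_.size()) return false;
        for (size_t i = 0; i < add; ++i) {
            seq.push_back(free_.back());
            free_.pop_back();
        }
        return true;
    }

    // Return every block of seq to the pool (sequence end).
    void release(std::vector<int32_t> & seq) {
        for (int32_t id : seq) free_.push_back(id);
        seq.clear();
    }

    size_t num_free() const { return free_.size(); }

private:
    size_t block_size_;
    std::vector<int32_t> free_;
};
```

With a block size of 16, a 17-token sequence holds two blocks rather than the whole pool, which is the "peak << full reservation" effect the message refers to.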
---
src/CMakeLists.txt | 1 +
src/llama-kv-cache.cpp | 69 +++++++++++++++++----------
src/paged-alloc.cpp | 106 +++++++++++++++++++++++++++++++++++++++++
src/paged-alloc.h | 39 +++++++++++++++
4 files changed, 190 insertions(+), 25 deletions(-)
create mode 100644 src/paged-alloc.cpp
create mode 100644 src/paged-alloc.h
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 58083b3..4d9d7d1 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -26,6 +26,7 @@ add_library(llama
llama-kv-cache-iswa.cpp
paged-kv-manager.cpp
paged-attn.cpp
+ paged-alloc.cpp
llama-kv-cache-dsa.cpp
llama-memory.cpp
llama-memory-hybrid.cpp
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 30d02d7..1125d9a 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -1,4 +1,5 @@
#include "llama-kv-cache.h"
+#include "paged-alloc.h"
#include <vector>
#include <utility>
@@ -381,6 +382,11 @@ llama_kv_cache::llama_kv_cache(
}
void llama_kv_cache::clear(bool data) {
+ // [paged 0004] return all on-demand blocks to the pool on cache clear.
+ if (paged_alloc::active()) {
+ paged_alloc::release_all(this);
+ }
+
for (uint32_t s = 0; s < n_stream; ++s) {
v_cells[s].reset();
v_heads[s] = 0;
@@ -409,6 +415,16 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
p1 = std::numeric_limits<llama_pos>::max();
}
+ // [paged 0004] free a stream's on-demand blocks when its whole sequence is
+ // removed (sequence end), so they return to the pool for reuse.
+ if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
+ if (seq_id >= 0) {
+ paged_alloc::release(this, (int) seq_to_stream[seq_id]);
+ } else {
+ paged_alloc::release_all(this);
+ }
+ }
+
if (seq_id >= 0) {
auto & cells = v_cells[seq_to_stream[seq_id]];
auto & head = v_heads[seq_to_stream[seq_id]];
@@ -1030,36 +1046,39 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
// the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
// Single-sequence scope (uses get_used() as the logical base); falls back
// to the normal allocator if the permuted cells aren't available.
- static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
- if (paged_mode) {
+ // [paged 0004] On-demand block allocation. Patch 0002 proved attention is
+ // invariant to physical KV placement; here that placement is driven by
+ // the vendored PagedKVManager (patch 0001): blocks are popped from a free
+ // pool only as the sequence crosses block boundaries (peak << full
+ // reservation) and returned on sequence end. Enabled via LLAMA_KV_PAGED;
+ // falls back to the normal allocator on pool exhaustion or any conflict.
+ if (paged_alloc::active()) {
const uint32_t bs = 16; // block size (tokens/block)
- const uint32_t nblk = cells.size() / bs; // blocks in this stream's pool
+ const uint32_t nblk = cells.size() / bs; // this stream's block budget
if (nblk >= 2) {
- // stride coprime to nblk => block-index permutation is a bijection
- uint32_t k = 1;
- for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
- if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
- }
const uint32_t base = cells.get_used();
- bool ok = true;
- for (uint32_t i = 0; i < n_tokens; ++i) {
- const uint32_t L = base + i;
- const uint32_t b = L / bs;
- const uint32_t off = L % bs;
- if (b >= nblk) { ok = false; break; }
- const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
- if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
- res.idxs[s].push_back(phys);
- }
- if (ok && res.idxs[s].size() == n_tokens) {
- if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
- fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
- for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
- fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+ const int strm = (int) seq_to_stream[seq_id];
+ std::vector<uint32_t> placed;
+ if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
+ bool ok = (placed.size() == n_tokens);
+ for (uint32_t i = 0; ok && i < n_tokens; ++i) {
+ if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
+ ok = false;
+ }
+ }
+ if (ok) {
+ for (uint32_t phys : placed) {
+ res.idxs[s].push_back(phys);
+ }
+ if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+ fprintf(stderr, "[paged] stream %d placed %u tok at cells:", strm, n_tokens);
+ for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+ fprintf(stderr, " (nblk=%u base=%u)\n", nblk, base);
+ }
+ continue; // on-demand paged placement succeeded
}
- continue; // paged placement succeeded for this sequence
+ res.idxs[s].clear(); // fall back to the normal allocator
}
- res.idxs[s].clear(); // fall back to the normal allocator
}
}
diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
new file mode 100644
index 0000000..1d13f9c
--- /dev/null
+++ b/src/paged-alloc.cpp
@@ -0,0 +1,106 @@
+#include "paged-alloc.h"
+#include "paged-kv-manager.h"
+
+#include <cstdlib>
+#include <cstdio>
+#include <map>
+#include <memory>
+#include <utility>
+
+namespace paged_alloc {
+
+bool active() {
+ static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+ return a;
+}
+
+static bool debug() {
+ static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
+ return d;
+}
+
+namespace {
+
+using key_t = std::pair<const void *, int>;
+
+// One PagedKVManager per (kv-cache, stream): each stream owns a separate
+// physical pool of cells.size() cells, so a manager's block ids map directly to
+// cell ranges within that stream's pool. The internal request id is always 0.
+std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
+
+paged::PagedKVManager * get_mgr(const void * cache, int stream,
+ uint32_t pool_blocks, uint32_t block_size) {
+ const key_t k{cache, stream};
+ auto it = g_managers.find(k);
+ if (it == g_managers.end()) {
+ // enable_caching=false: prefix caching is a later patch; 0004 exercises
+ // only on-demand allocate / free.
+ auto mgr = std::make_unique<paged::PagedKVManager>(
+ (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
+ it = g_managers.emplace(k, std::move(mgr)).first;
+ }
+ return it->second.get();
+}
+
+} // namespace
+
+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+ uint32_t block_size, uint32_t pool_blocks,
+ std::vector<uint32_t> & out) {
+ if (n_tokens == 0) {
+ return true;
+ }
+
+ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+
+ const size_t before = mgr->block_table(0).size();
+
+ // Grow the request to cover the highest logical position. The manager pops
+ // free blocks only for the boundaries actually crossed - that is the on-
+ // demand behavior; an already-covered range adds nothing.
+ if (!mgr->allocate(0, (size_t) base + n_tokens)) {
+ return false; // pool exhausted -> caller falls back to the stock path
+ }
+
+ out.reserve(out.size() + n_tokens);
+ for (uint32_t i = 0; i < n_tokens; ++i) {
+ const int64_t s = mgr->slot(0, (int) (base + i));
+ out.push_back((uint32_t) s);
+ }
+
+ if (debug()) {
+ const size_t after = mgr->block_table(0).size();
+ if (after != before) {
+ fprintf(stderr,
+ "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
+ "(budget=%u; base=%u +%u tok)\n",
+ cache, stream, before, after, pool_blocks, base, n_tokens);
+ }
+ }
+
+ return true;
+}
+
+void release(const void * cache, int stream) {
+ auto it = g_managers.find({cache, stream});
+ if (it == g_managers.end()) {
+ return;
+ }
+ it->second->free(0);
+ g_managers.erase(it);
+ if (debug()) {
+ fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
+ }
+}
+
+void release_all(const void * cache) {
+ for (auto it = g_managers.begin(); it != g_managers.end(); ) {
+ if (it->first.first == cache) {
+ it = g_managers.erase(it);
+ } else {
+ ++it;
+ }
+ }
+}
+
+} // namespace paged_alloc
diff --git a/src/paged-alloc.h b/src/paged-alloc.h
new file mode 100644
index 0000000..bf66665
--- /dev/null
+++ b/src/paged-alloc.h
@@ -0,0 +1,39 @@
+#pragma once
+// On-demand paged KV block allocation (patch 0004, experimental).
+//
+// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
+// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
+// sequence's logical positions onto a fixed full-pool permutation, blocks are
+// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
+// and returned to the pool on sequence end. This is where the paged memory-
+// capacity benefit begins: a short sequence holds only a few blocks, not the
+// whole reserved window.
+//
+// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
+// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
+// struct stays untouched - find_slot only gains a gated call.
+
+#include <cstdint>
+#include <vector>
+
+namespace paged_alloc {
+
+// true iff env LLAMA_KV_PAGED is set (evaluated once).
+bool active();
+
+// Place n_tokens logical positions [base, base+n_tokens) of one stream on
+// demand, appending their physical cell indices to `out`. pool_blocks =
+// cells.size()/block_size is this stream's block budget. Returns false (leaving
+// `out` unchanged) on pool exhaustion, so the caller falls back to the stock
+// allocator. The caller still validates each returned cell is empty.
+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+ uint32_t block_size, uint32_t pool_blocks,
+ std::vector<uint32_t> & out);
+
+// Return a stream's blocks to the pool (sequence end).
+void release(const void * cache, int stream);
+
+// Return every stream's blocks for a kv-cache (clear() / teardown).
+void release_all(const void * cache);
+
+} // namespace paged_alloc
--
2.43.0
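
Note on patch 0004: the on-demand behaviour described above (pop a block only when the sequence crosses a block boundary, return all blocks on sequence end) can be illustrated with a toy single-sequence pool. This is a sketch under stated assumptions, not the vendored PagedKVManager: `ToyBlockPool` and all of its members are hypothetical, and the slot formula `block_id * block_size + pos % block_size` is assumed from the comment that block ids map directly to cell ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy pool for ONE sequence; illustrative only.
class ToyBlockPool {
public:
    ToyBlockPool(uint32_t n_blocks, uint32_t block_size) : block_size_(block_size) {
        assert(block_size > 0);
        // Push ids in reverse so that block 0 is popped first.
        for (uint32_t b = n_blocks; b > 0; --b) free_.push_back(b - 1);
    }

    // Grow the block table to cover n_tokens logical positions. Blocks are
    // popped only for boundaries actually crossed. Returns false and changes
    // nothing when the pool cannot cover the request.
    bool allocate(size_t n_tokens) {
        const size_t need = (n_tokens + block_size_ - 1) / block_size_;
        if (need <= table_.size()) return true;              // already covered
        if (need - table_.size() > free_.size()) return false; // pool exhausted
        while (table_.size() < need) {
            table_.push_back(free_.back());
            free_.pop_back();
        }
        return true;
    }

    // Physical cell index for a logical position, or -1 if it is not covered.
    int64_t slot(uint32_t pos) const {
        const size_t bi = pos / block_size_;
        if (bi >= table_.size()) return -1;
        return (int64_t) table_[bi] * block_size_ + pos % block_size_;
    }

    // Return every held block to the free list (sequence end).
    void release() {
        free_.insert(free_.end(), table_.rbegin(), table_.rend());
        table_.clear();
    }

    size_t held() const { return table_.size(); }
    size_t free_blocks() const { return free_.size(); }

private:
    uint32_t block_size_;
    std::vector<uint32_t> free_;  // free block ids, popped from the back
    std::vector<uint32_t> table_; // logical block index -> physical block id
};
```

With a 4-block pool and a block size of 16, a 5-token sequence holds one block, and a second block is taken only once position 16 is requested. That is the capacity benefit the header comment describes: a short sequence does not reserve the whole window.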

@@ -1,143 +0,0 @@
From 141029beec609e87f24f6f6bba3ec842d7037862 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 12:13:44 +0200
Subject: [PATCH] paged cross-request prefix caching (env LLAMA_KV_PAGED) -
patch 0006
Add host-side cross-request prefix sharing to the vendored PagedKVManager
(patches 0001-0004): on placement, hash a new sequence's prefix blocks, reuse the
matching cached physical blocks (ref_cnt++) for the shared prefix and allocate
fresh blocks only for the divergent suffix. A shared block is freed only at
ref 0; copy-on-write privatises a still-shared (ref>1) block before a divergent
write so co-owners stay byte-correct. All logic lives in the vendored
src/paged-kv-manager unit (place_with_prefix / cow_block / ref-counting); the
core kv-cache files are untouched. Default off; gated behind LLAMA_KV_PAGED.
Wiring the physical-cell reuse into find_slot so the engine itself skips
recompute needs core seq-membership changes and is left to a later patch.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/paged-kv-manager.cpp | 65 ++++++++++++++++++++++++++++++++++++++++
src/paged-kv-manager.h | 23 ++++++++++++++
2 files changed, 88 insertions(+)
diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
index ca0dcd8..4c6ee4c 100644
--- a/src/paged-kv-manager.cpp
+++ b/src/paged-kv-manager.cpp
@@ -293,4 +293,69 @@ void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block
pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
}
+// ---------------------------------------------------------------------------
+// Cross-request prefix caching + copy-on-write (patch 0006)
+// ---------------------------------------------------------------------------
+
+size_t PagedKVManager::place_with_prefix(int seq_id, const std::vector<int>& token_ids) {
+ auto& req = req_to_blocks_[seq_id];
+
+ // Longest cached prefix: hash the full blocks and stop at the first miss.
+ // A block hash transitively encodes its whole prefix (FNV chaining), so the
+ // first miss bounds the reusable prefix (vLLM find_longest_cache_hit).
+ const std::vector<uint64_t> hashes = compute_block_hashes(token_ids);
+ std::vector<KVCacheBlock*> hits;
+ for (uint64_t bh : hashes) {
+ KVCacheBlock* cb = pool_.get_cached_block(bh);
+ if (!cb) break;
+ hits.push_back(cb);
+ }
+
+ // Reuse: ++ref_cnt (pulling warm blocks back out of the free list) then
+ // splice the shared physical blocks into this sequence's block table.
+ pool_.touch(hits);
+ req.insert(req.end(), hits.begin(), hits.end());
+
+ // Allocate fresh blocks only for the divergent suffix.
+ const size_t need = cdiv(token_ids.size(), block_size_);
+ if (need > req.size()) {
+ const size_t add = need - req.size();
+ if (add > pool_.get_num_free_blocks()) {
+ // OOM: roll the sequence back (un-touch the shared prefix so no ref
+ // leaks) and report no placement; the caller falls back to stock.
+ std::vector<KVCacheBlock*> ordered(req.rbegin(), req.rend());
+ pool_.free_blocks(ordered);
+ req.clear();
+ return 0;
+ }
+ auto nb = pool_.get_new_blocks(add);
+ req.insert(req.end(), nb.begin(), nb.end());
+ }
+ return hits.size();
+}
+
+std::pair<int32_t, int32_t> PagedKVManager::cow_block(int seq_id, size_t bi) {
+ auto& req = req_to_blocks_.at(seq_id);
+ KVCacheBlock* old = req.at(bi);
+ if (old->ref_cnt <= 1) {
+ return { old->block_id, old->block_id }; // already private - no copy
+ }
+ // Private copy for this sequence. get_new_blocks sets the fresh block's
+ // ref_cnt to 1; free_blocks decrements the shared block, which stays >0 so
+ // it is NOT returned to the pool and the other owners are left untouched.
+ KVCacheBlock* fresh = pool_.get_new_blocks(1).front();
+ pool_.free_blocks({ old });
+ req[bi] = fresh;
+ return { old->block_id, fresh->block_id };
+}
+
+int PagedKVManager::block_ref_cnt_at(int seq_id, size_t bi) const {
+ return req_to_blocks_.at(seq_id).at(bi)->ref_cnt;
+}
+
+size_t PagedKVManager::num_blocks(int seq_id) const {
+ auto it = req_to_blocks_.find(seq_id);
+ return it == req_to_blocks_.end() ? 0 : it->second.size();
+}
+
} // namespace paged
diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
index 740280a..34decbc 100644
--- a/src/paged-kv-manager.h
+++ b/src/paged-kv-manager.h
@@ -14,6 +14,7 @@
#include <vector>
#include <unordered_map>
#include <map>
+#include <utility>
namespace paged {
@@ -99,6 +100,28 @@ public:
size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
+ // Cross-request prefix caching + copy-on-write (patch 0006).
+ //
+ // Splice the longest cached prefix of token_ids into seq_id (reuse the
+ // shared physical blocks, ref_cnt++ so a block frees only at ref 0) and
+ // allocate fresh blocks only for the divergent suffix. Returns the number of
+ // shared (reused) blocks; the caller skips recomputing those tokens. On pool
+ // exhaustion the sequence is rolled back (no ref leak) and 0 is returned.
+ size_t place_with_prefix(int seq_id, const std::vector<int>& token_ids);
+
+ // Copy-on-write the block at logical index bi of seq_id. If that block is
+ // shared (ref_cnt>1), allocate a fresh private block, drop this seq's ref on
+ // the shared one (other owners keep it, content untouched) and install the
+ // fresh block at bi. Returns {old_block_id, new_block_id}; new==old when the
+ // block was already private (ref_cnt<=1) and no copy is needed. The caller
+ // copies the physical cell contents old_block_id -> new_block_id.
+ std::pair<int32_t, int32_t> cow_block(int seq_id, size_t bi);
+
+ // Introspection for the prefix-share gate (debug/tests).
+ int block_ref_cnt_at(int seq_id, size_t bi) const;
+ size_t num_blocks(int seq_id) const;
+ size_t num_free_blocks() const { return pool_.get_num_free_blocks(); }
+
protected:
int block_size_;
BlockPool pool_;
--
2.43.0
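
Note on patch 0006: the prefix-sharing scheme above (chained block hashes, longest-hit lookup that stops at the first miss, ref-counted reuse, copy-on-write for a still-shared block) can be sketched in a few dozen lines. Everything here is hypothetical and simplified: `chain_hashes` and `ToyPrefixCache` are not the vendored API, the FNV-1a chaining is an assumption based on the "FNV chaining" comment, `place` assumes the sequence is new, and `cow` returns `{old, -1}` when no free block is available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// One chained FNV-1a hash per FULL block: the running state carries over, so
// a match at block i implies (modulo collisions) a match for blocks 0..i.
std::vector<uint64_t> chain_hashes(const std::vector<int> & toks, size_t bs) {
    assert(bs > 0);
    std::vector<uint64_t> out;
    uint64_t h = 0xcbf29ce484222325ull; // FNV-1a 64-bit offset basis
    for (size_t b = 0; (b + 1) * bs <= toks.size(); ++b) {
        for (size_t i = b * bs; i < (b + 1) * bs; ++i) {
            const uint32_t t = (uint32_t) toks[i];
            for (int k = 0; k < 4; ++k) {
                h ^= (t >> (8 * k)) & 0xffu;
                h *= 0x100000001b3ull; // FNV 64-bit prime
            }
        }
        out.push_back(h);
    }
    return out;
}

struct ToyPrefixCache {
    size_t bs;
    std::vector<int> ref;                            // ref count per physical block
    std::unordered_map<uint64_t, int> by_hash;       // chained hash -> physical block
    std::unordered_map<int, std::vector<int>> table; // seq id -> block table

    ToyPrefixCache(size_t n_blocks, size_t block_size) : bs(block_size), ref(n_blocks, 0) {}

    // Take a block with ref 0, evicting any stale hash entry that points at it.
    int pop_free() {
        for (size_t b = 0; b < ref.size(); ++b) {
            if (ref[b] != 0) continue;
            for (auto it = by_hash.begin(); it != by_hash.end(); ) {
                if (it->second == (int) b) it = by_hash.erase(it); else ++it;
            }
            ref[b] = 1;
            return (int) b;
        }
        return -1;
    }

    // Drop this sequence's reference on each of its blocks.
    void release(int seq) {
        for (int b : table[seq]) --ref[b];
        table.erase(seq);
    }

    // Reuse the longest cached prefix, then allocate the divergent suffix.
    // Returns the number of shared blocks; rolls back and returns 0 on OOM.
    size_t place(int seq, const std::vector<int> & toks) {
        std::vector<int> & t = table[seq];
        size_t shared = 0;
        for (uint64_t h : chain_hashes(toks, bs)) {
            auto it = by_hash.find(h);
            if (it == by_hash.end()) break; // first miss bounds the prefix
            ++ref[it->second];
            t.push_back(it->second);
            ++shared;
        }
        const size_t need = (toks.size() + bs - 1) / bs;
        while (t.size() < need) {
            const int b = pop_free();
            if (b < 0) { release(seq); return 0; }
            t.push_back(b);
        }
        return shared;
    }

    // Publish the sequence's full blocks so later sequences can hit them.
    void commit(int seq, const std::vector<int> & toks) {
        const std::vector<uint64_t> hs = chain_hashes(toks, bs);
        const std::vector<int> & t = table.at(seq);
        for (size_t i = 0; i < hs.size() && i < t.size(); ++i) by_hash.emplace(hs[i], t[i]);
    }

    // Privatise block bi if it is shared. Returns {old, new}; new == old when
    // the block was already private. The caller copies the cell contents.
    std::pair<int, int> cow(int seq, size_t bi) {
        int & cur = table.at(seq).at(bi);
        const int old = cur;
        if (ref[old] <= 1) return { old, old };
        const int fresh = pop_free();
        if (fresh < 0) return { old, -1 };
        --ref[old];
        cur = fresh;
        return { old, fresh };
    }
};
```

The sketch keeps a released block's hash entry until the block is actually handed out again, which loosely mirrors the "pulling warm blocks back out of the free list" remark in the patch, though the real BlockPool bookkeeping is not shown in this diff and may differ.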

@@ -1,531 +0,0 @@
From da20c1c0571e84bc76202d915d4bb82892a3392b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 12:46:28 +0200
Subject: [PATCH] paged engine prefix recompute-skip (env LLAMA_KV_PAGED) -
patch 0007
Wire the host-side cross-request prefix cache (patch 0006) into the engine so a
new sequence physically SHARES the cached prefix blocks and skips recomputing the
shared prefix - the actual compute win that 0006 (which only proved the host-side
machinery + realised reuse via the stock seq_cp) did not yet deliver from the
paged path itself.
Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
* paged-alloc reworked from a per-stream, request-0, destroyed-on-free manager
into ONE persistent caching PagedKVManager per (kv-cache, stream) whose
requests are keyed by the real llama_seq_id. free(seq) now releases exactly
one sequence, so ref-counted shared blocks survive while another sharer holds
them. New seams: share_prefix (place_with_prefix -> shared prefix tokens),
slot, commit (publish a sequence into the content cache), ref-counted release,
plus ref/num-free introspection.
* Two gated llama_kv_cache methods (the core seq-membership handling 0007 needs):
paged_prefix_share() reuses the longest cached content prefix for a sequence
and marks the shared physical cells as belonging to it (cells.seq_add) so the
engine's attention mask includes the already-computed prefix KV; the caller
then decodes ONLY the divergent suffix. paged_prefix_commit() publishes a
sequence's full blocks for later reuse.
* find_slot's paged branch anchors placement on each sequence's own logical base
(ubatch.pos) and keys the manager request by seq_id, so an independently-freed
sequence and a shared prefix coexist in one unified pool. seq_rm/clear free
per-sequence (ref-counted) instead of nuking the whole stream.
* paged-prefix-api: a thin gated shim so a caller holding only the public
llama.h can reach the seam and the introspection without the internal headers.
Core existing-file touch: src/llama-kv-cache.{cpp,h}, +71 -3. Everything else is
additive vendored units. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): a
sequence B sharing A's prefix decodes greedy tokens byte-identical to B from
scratch with the prefill computing ONLY the suffix (32 prefix tokens skipped) at
a block boundary AND mid-block; the shared block carries ref_cnt 2 while both
hold it, drops to 1 when one sharer is removed (survivor intact, re-shareable, no
use-after-free) and returns to the pool only when all sharers are freed. The
0004 serving gate (unified and non-unified) stays byte-identical stock vs paged.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/CMakeLists.txt | 1 +
src/llama-kv-cache.cpp | 66 +++++++++++++++++++++++--
src/llama-kv-cache.h | 8 +++
src/paged-alloc.cpp | 104 ++++++++++++++++++++++++++++++---------
src/paged-alloc.h | 69 +++++++++++++++++++-------
src/paged-prefix-api.cpp | 48 ++++++++++++++++++
src/paged-prefix-api.h | 27 ++++++++++
7 files changed, 280 insertions(+), 43 deletions(-)
create mode 100644 src/paged-prefix-api.cpp
create mode 100644 src/paged-prefix-api.h
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 4d9d7d1..432f42d 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -27,6 +27,7 @@ add_library(llama
paged-kv-manager.cpp
paged-attn.cpp
paged-alloc.cpp
+ paged-prefix-api.cpp
llama-kv-cache-dsa.cpp
llama-memory.cpp
llama-memory-hybrid.cpp
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 1125d9a..7510ff9 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -419,7 +419,7 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
// removed (sequence end), so they return to the pool for reuse.
if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
if (seq_id >= 0) {
- paged_alloc::release(this, (int) seq_to_stream[seq_id]);
+ paged_alloc::release(this, (int) seq_to_stream[seq_id], (int) seq_id);
} else {
paged_alloc::release_all(this);
}
@@ -1056,10 +1056,15 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
const uint32_t bs = 16; // block size (tokens/block)
const uint32_t nblk = cells.size() / bs; // this stream's block budget
if (nblk >= 2) {
- const uint32_t base = cells.get_used();
+ // [paged 0007] Anchor placement on this sequence's own logical
+ // base position (ubatch.pos), not the shared used-count, and key
+ // the manager request by the real seq_id. slot(seq,pos) is then
+ // stable per sequence, so an independently-freed (ref-counted)
+ // sequence and a shared prefix can coexist in one unified pool.
+ const uint32_t base = (uint32_t) ubatch.pos[s*n_tokens];
const int strm = (int) seq_to_stream[seq_id];
std::vector<uint32_t> placed;
- if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
+ if (paged_alloc::place(this, strm, (int) seq_id, base, n_tokens, bs, nblk, placed)) {
bool ok = (placed.size() == n_tokens);
for (uint32_t i = 0; ok && i < n_tokens; ++i) {
if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
@@ -1165,6 +1170,61 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
return res;
}
+// [paged 0007] Cross-request prefix recompute-skip.
+//
+// Reuse a cached content prefix for seq_id: share_prefix() splices the longest
+// matching cached physical blocks into seq_id (ref_cnt++) and reserves fresh
+// blocks for the divergent suffix. We then mark the shared physical cells as
+// belonging to seq_id - those cells already hold the owner's computed KV at the
+// matching logical positions, so the caller decodes ONLY the suffix and the
+// prefix is never recomputed. Returns the number of shared prefix tokens.
+// Gated behind LLAMA_KV_PAGED; a no-op (returns 0) otherwise.
+int32_t llama_kv_cache::paged_prefix_share(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
+ if (!paged_alloc::active() || tokens.empty()) {
+ return 0;
+ }
+ const uint32_t bs = 16;
+ const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
+ auto & cells = v_cells[strm];
+ const uint32_t nblk = cells.size() / bs;
+ if (nblk < 2) {
+ return 0;
+ }
+
+ std::vector<int> toks(tokens.begin(), tokens.end());
+ const size_t kshare = paged_alloc::share_prefix(this, (int) strm, (int) seq_id, toks, bs, nblk);
+
+ for (size_t p = 0; p < kshare; ++p) {
+ const int64_t cell = paged_alloc::slot(this, (int) strm, (int) seq_id, (int) p);
+ if (cell < 0 || (uint32_t) cell >= cells.size() ||
+ cells.is_empty((uint32_t) cell) ||
+ cells.pos_get((uint32_t) cell) != (llama_pos) p) {
+ // Owner cell missing / repurposed: cannot safely share. Roll the
+ // sequence back so the caller recomputes the whole prompt.
+ paged_alloc::release(this, (int) strm, (int) seq_id);
+ return 0;
+ }
+ if (!cells.seq_has((uint32_t) cell, seq_id)) {
+ cells.seq_add((uint32_t) cell, seq_id);
+ }
+ }
+ return (int32_t) kshare;
+}
+
+// [paged 0007] Publish a sequence's full blocks into the content cache so a
+// later paged_prefix_share() can reuse them. Call after the sequence KV is
+// computed (its prefill decode has run).
+void llama_kv_cache::paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
+ if (!paged_alloc::active() || tokens.empty()) {
+ return;
+ }
+ const uint32_t bs = 16;
+ const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
+ const uint32_t nblk = v_cells[strm].size() / bs;
+ std::vector<int> toks(tokens.begin(), tokens.end());
+ paged_alloc::commit(this, (int) strm, (int) seq_id, toks, bs, nblk);
+}
+
void llama_kv_cache::apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch) {
// TODO: refactor [TAG_KV_CACHE_SHARE_CELLS]
if (other) {
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index 494c0fb..f374ac6 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -199,6 +199,14 @@ public:
// emplace the ubatch context into slot: [sinfo.idxs[0...ubatch.n_tokens - 1]]
void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
+ // [paged 0007] Cross-request prefix recompute-skip (experimental, gated by
+ // env LLAMA_KV_PAGED). paged_prefix_share() reuses a cached content prefix
+ // for seq_id and returns the number of shared prefix tokens (the caller
+ // decodes only the suffix); paged_prefix_commit() publishes a sequence into
+ // the content cache for later reuse. No-ops when LLAMA_KV_PAGED is unset.
+ int32_t paged_prefix_share (llama_seq_id seq_id, const std::vector<llama_token> & tokens);
+ void paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens);
+
//
// input API
//
diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
index 1d13f9c..c1027fb 100644
--- a/src/paged-alloc.cpp
+++ b/src/paged-alloc.cpp
@@ -23,9 +23,13 @@ namespace {
using key_t = std::pair<const void *, int>;
-// One PagedKVManager per (kv-cache, stream): each stream owns a separate
-// physical pool of cells.size() cells, so a manager's block ids map directly to
-// cell ranges within that stream's pool. The internal request id is always 0.
+// One persistent PagedKVManager per (kv-cache, stream): each stream owns a
+// separate physical pool of cells.size() cells, so a manager's block ids map
+// directly to cell ranges within that stream's pool. Requests inside a manager
+// are keyed by the real llama_seq_id (NOT a fixed 0), so free(seq) releases one
+// sequence and shared blocks survive at ref>0 - this is what makes ref-counted
+// cross-request prefix sharing (0007) possible. Caching is enabled so commit()
+// can publish blocks and share_prefix() can hit them.
std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
paged::PagedKVManager * get_mgr(const void * cache, int stream,
@@ -33,18 +37,21 @@ paged::PagedKVManager * get_mgr(const void * cache, int stream,
const key_t k{cache, stream};
auto it = g_managers.find(k);
if (it == g_managers.end()) {
- // enable_caching=false: prefix caching is a later patch; 0004 exercises
- // only on-demand allocate / free.
auto mgr = std::make_unique<paged::PagedKVManager>(
- (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
+ (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/true);
it = g_managers.emplace(k, std::move(mgr)).first;
}
return it->second.get();
}
+paged::PagedKVManager * find_mgr(const void * cache, int stream) {
+ auto it = g_managers.find({cache, stream});
+ return it == g_managers.end() ? nullptr : it->second.get();
+}
+
} // namespace
-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
uint32_t block_size, uint32_t pool_blocks,
std::vector<uint32_t> & out) {
if (n_tokens == 0) {
@@ -53,43 +60,79 @@ bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
- const size_t before = mgr->block_table(0).size();
+ const size_t before = mgr->block_table(seq).size();
- // Grow the request to cover the highest logical position. The manager pops
- // free blocks only for the boundaries actually crossed - that is the on-
- // demand behavior; an already-covered range adds nothing.
- if (!mgr->allocate(0, (size_t) base + n_tokens)) {
+ // Grow this sequence's request to cover its highest logical position. The
+ // manager pops free blocks only for boundaries actually crossed; if
+ // share_prefix() already reserved these blocks, this is a no-op.
+ if (!mgr->allocate(seq, (size_t) base + n_tokens)) {
return false; // pool exhausted -> caller falls back to the stock path
}
out.reserve(out.size() + n_tokens);
for (uint32_t i = 0; i < n_tokens; ++i) {
- const int64_t s = mgr->slot(0, (int) (base + i));
+ const int64_t s = mgr->slot(seq, (int) (base + i));
out.push_back((uint32_t) s);
}
if (debug()) {
- const size_t after = mgr->block_table(0).size();
+ const size_t after = mgr->block_table(seq).size();
if (after != before) {
fprintf(stderr,
- "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
+ "[paged-alloc] cache=%p stream=%d seq=%d grew %zu->%zu blocks "
"(budget=%u; base=%u +%u tok)\n",
- cache, stream, before, after, pool_blocks, base, n_tokens);
+ cache, stream, seq, before, after, pool_blocks, base, n_tokens);
}
}
return true;
}
-void release(const void * cache, int stream) {
- auto it = g_managers.find({cache, stream});
- if (it == g_managers.end()) {
+size_t share_prefix(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens,
+ uint32_t block_size, uint32_t pool_blocks) {
+ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+ const size_t shared_blocks = mgr->place_with_prefix(seq, tokens);
+ const size_t shared_tokens = shared_blocks * (size_t) block_size;
+ if (debug() && shared_blocks > 0) {
+ fprintf(stderr,
+ "[paged-alloc] cache=%p stream=%d seq=%d shares %zu prefix blocks "
+ "(%zu tokens) - prefix NOT recomputed\n",
+ cache, stream, seq, shared_blocks, shared_tokens);
+ }
+ return shared_tokens;
+}
+
+int64_t slot(const void * cache, int stream, int seq, int pos) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ if (!mgr) {
+ return -1;
+ }
+ if ((size_t) (pos / mgr->block_size()) >= mgr->num_blocks(seq)) {
+ return -1;
+ }
+ return mgr->slot(seq, pos);
+}
+
+void commit(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks) {
+ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+ mgr->cache_blocks(seq, mgr->compute_block_hashes(tokens), tokens.size());
+ if (debug()) {
+ fprintf(stderr, "[paged-alloc] cache=%p stream=%d seq=%d committed %zu tokens\n",
+ cache, stream, seq, tokens.size());
+ }
+}
+
+void release(const void * cache, int stream, int seq) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ if (!mgr) {
return;
}
- it->second->free(0);
- g_managers.erase(it);
+ mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
if (debug()) {
- fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
+ fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
+ cache, stream, seq, mgr->num_free_blocks());
}
}
@@ -103,4 +146,21 @@ void release_all(const void * cache) {
}
}
+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ if (!mgr) {
+ return -1;
+ }
+ const size_t bi = (size_t) pos / block_size;
+ if (bi >= mgr->num_blocks(seq)) {
+ return -1;
+ }
+ return mgr->block_ref_cnt_at(seq, bi);
+}
+
+size_t num_free(const void * cache, int stream) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ return mgr ? mgr->num_free_blocks() : 0;
+}
+
} // namespace paged_alloc
diff --git a/src/paged-alloc.h b/src/paged-alloc.h
index bf66665..88dedef 100644
--- a/src/paged-alloc.h
+++ b/src/paged-alloc.h
@@ -1,17 +1,27 @@
#pragma once
-// On-demand paged KV block allocation (patch 0004, experimental).
+// On-demand paged KV block allocation + cross-request prefix reuse
+// (patches 0004 + 0007, experimental).
//
-// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
-// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
-// sequence's logical positions onto a fixed full-pool permutation, blocks are
-// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
-// and returned to the pool on sequence end. This is where the paged memory-
-// capacity benefit begins: a short sequence holds only a few blocks, not the
-// whole reserved window.
+// Backs the paged placement in llama_kv_cache::find_slot with the vendored
+// host-side PagedKVManager (patch 0001). Two responsibilities:
//
-// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
-// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
-// struct stays untouched - find_slot only gains a gated call.
+// * On-demand allocation (0004): a sequence's logical positions are mapped to
+// physical cells block-by-block, popped from a free pool only as the
+// sequence grows and returned on sequence end.
+//
+// * Cross-request prefix reuse (0007): before a new sequence's suffix is
+// decoded, share_prefix() reuses the cached physical blocks of a matching
+// content prefix (ref_cnt++), so the engine shares the already-computed KV
+// cells and the caller decodes ONLY the divergent suffix - the prefix is not
+// recomputed. commit() publishes a sequence's full blocks into the content
+// cache so later sequences can hit them. Freeing is ref-counted: a shared
+// block returns to the pool only when every sharer has been released.
+//
+// One persistent PagedKVManager per (kv-cache, stream); requests inside it are
+// keyed by the real llama_seq_id, so free(seq) releases exactly one sequence and
+// shared blocks survive at ref>0. All state lives in this unit (a static
+// registry), so the core kv-cache struct stays untouched - find_slot gains only
+// gated calls. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
#include <cstdint>
#include <vector>
@@ -21,19 +31,42 @@ namespace paged_alloc {
// true iff env LLAMA_KV_PAGED is set (evaluated once).
bool active();
-// Place n_tokens logical positions [base, base+n_tokens) of one stream on
-// demand, appending their physical cell indices to `out`. pool_blocks =
-// cells.size()/block_size is this stream's block budget. Returns false (leaving
+// Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
+// on demand, appending their physical cell indices to `out`. pool_blocks =
+// cells.size()/block_size is the stream's block budget. Returns false (leaving
// `out` unchanged) on pool exhaustion, so the caller falls back to the stock
// allocator. The caller still validates each returned cell is empty.
-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
uint32_t block_size, uint32_t pool_blocks,
std::vector<uint32_t> & out);
-// Return a stream's blocks to the pool (sequence end).
-void release(const void * cache, int stream);
+// [0007] Reuse the longest cached content prefix of `tokens` for (cache,stream,
+// seq): splice the shared physical blocks into seq (ref_cnt++) and reserve fresh
+// blocks for the divergent suffix. Returns the number of shared PREFIX TOKENS
+// (block-aligned); the caller marks those cells for seq and decodes only the
+// suffix. 0 if nothing matched or on pool exhaustion (sequence rolled back).
+size_t share_prefix(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens,
+ uint32_t block_size, uint32_t pool_blocks);
+
+// [0007] Physical cell backing logical position `pos` of (cache,stream,seq), or
+// -1 if seq is unknown. Used to map a shared prefix position to its cell.
+int64_t slot(const void * cache, int stream, int seq, int pos);
-// Return every stream's blocks for a kv-cache (clear() / teardown).
+// [0007] Publish seq's full (block-aligned) blocks into the content cache so a
+// later share_prefix() can reuse them. Call after the sequence's KV is computed.
+void commit(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
+
+// Return one sequence's blocks to the pool (ref-counted; sequence end).
+void release(const void * cache, int stream, int seq);
+
+// Drop every manager for a kv-cache (clear() / teardown).
void release_all(const void * cache);
+// Introspection for the prefix-share gate (debug/tests). ref_cnt_at returns the
+// ref count of the block backing logical position `pos`, or -1 if unknown.
+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
+size_t num_free(const void * cache, int stream);
+
} // namespace paged_alloc
diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
new file mode 100644
index 0000000..8573cd2
--- /dev/null
+++ b/src/paged-prefix-api.cpp
@@ -0,0 +1,48 @@
+#include "paged-prefix-api.h"
+#include "paged-alloc.h"
+#include "llama-kv-cache.h"
+
+#include <vector>
+
+namespace paged_prefix_api {
+
+static llama_kv_cache * kv_of(llama_context * ctx) {
+ // The driver targets a plain unified KV-cache model; dynamic_cast yields null
+ // for wrapped caches (iSWA / hybrid), where cross-request cell sharing does
+ // not apply, so the shim degrades to a safe no-op.
+ return dynamic_cast<llama_kv_cache *>(llama_get_memory(ctx));
+}
+
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv || n <= 0) {
+ return 0;
+ }
+ return kv->paged_prefix_share(seq, std::vector<llama_token>(tokens, tokens + n));
+}
+
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv || n <= 0) {
+ return;
+ }
+ kv->paged_prefix_commit(seq, std::vector<llama_token>(tokens, tokens + n));
+}
+
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv) {
+ return -1;
+ }
+ return paged_alloc::ref_cnt_at((const void *) kv, /*stream=*/0, (int) seq, pos, /*block_size=*/16);
+}
+
+long num_free(llama_context * ctx) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv) {
+ return 0;
+ }
+ return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
+}
+
+} // namespace paged_prefix_api
diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
new file mode 100644
index 0000000..78a3864
--- /dev/null
+++ b/src/paged-prefix-api.h
@@ -0,0 +1,27 @@
+#pragma once
+// Thin test/diagnostic shim over the paged cross-request prefix engine seam
+// (patch 0007). Lets a driver that only includes the public llama.h reach the
+// gated llama_kv_cache::paged_prefix_* methods and the paged-alloc introspection
+// without pulling in the internal kv-cache headers. Unless env LLAMA_KV_PAGED is set,
+// share()/commit() are no-ops (share() returns 0). Experimental; not a stable API.
+
+#include "llama.h"
+
+namespace paged_prefix_api {
+
+// Reuse the longest cached content prefix of [tokens, tokens+n) for `seq` and
+// return the number of shared prefix tokens (the caller decodes only the
+// suffix). 0 if nothing was shared.
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+
+// Publish `seq`'s full blocks into the content cache (call after its KV is computed).
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+
+// Ref count of the paged block backing logical position `pos` of `seq` (unified
+// stream 0), or -1 if unknown.
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
+
+// Number of free blocks in the unified stream-0 pool, or 0 if no manager.
+long num_free(llama_context * ctx);
+
+} // namespace paged_prefix_api
--
2.43.0

@@ -1,130 +0,0 @@
From 088d58f3a0160cbc706226ac2e77ecfeae4c164a Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 17:02:22 +0200
Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
- patch 0008
Wire the paged cross-request prefix recompute-skip (patch 0007's engine seam,
paged_prefix_api::share/commit) into the llama-server continuous-batching loop
(update_slots) so CONCURRENT requests that share a long prefix physically reuse
one committed copy of the prefix blocks and prefill only their divergent suffix.
Patch 0007 proved the engine seam correct via a standalone driver, but the server
never called it: two concurrent shared-prefix requests each recomputed the full
prefix. The server's native prompt cache only reuses a slot's OWN prior prompt
(longest-common-prefix vs slot.prompt.tokens) - it does not share across distinct
concurrent slots. 0008 adds that cross-slot share.
Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
* In update_slots prompt-processing, after the native n_past is computed and
only for a FRESH slot (n_past < one block, i.e. the native cache did not
already cover the prefix), call paged_prefix_api::share() to splice the
longest committed cross-request prefix into this sequence (ref_cnt++ on the
shared physical blocks) and advance n_past past it, so the batch fill computes
ONLY the suffix. The slot's own divergent tail cells are removed first so the
shared cells own [n_past, kshare) without colliding (the native path removes
these later anyway). The n_past < block gate guarantees any block-aligned
share the engine returns is strictly larger than n_past and therefore always
adopted, so the engine's reservation always matches the suffix-only batch and
never leaves stale blocks (which otherwise fragment the paged pool).
* When a slot finishes prefill (SLOT_STATE_DONE_PROMPT -> GENERATING, the prefix
KV just computed), call paged_prefix_api::commit() to publish its prefix so
concurrent/later sharers can reuse it.
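
The fresh-slot gate described above can be sketched as follows. This is a minimal
illustration, not the patch's code: shared_prefix_tokens() and adopt_share() are
hypothetical helpers, and the real engine matches committed blocks through its
content cache rather than by comparing two token vectors directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the engine's lookup: the longest common prefix of a
// committed sequence and a new prompt, rounded DOWN to a whole number of blocks
// (only full blocks can be shared).
static size_t shared_prefix_tokens(const std::vector<int32_t> & cached,
                                   const std::vector<int32_t> & prompt,
                                   uint32_t block_size) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        ++n;
    }
    return (n / block_size) * block_size;
}

// Hypothetical gate: attempt the share only on a fresh slot (n_past < one block).
// Any non-zero block-aligned share is then >= block_size > n_past, so it is
// always adopted; a zero share leaves n_past unchanged.
static size_t adopt_share(size_t n_past, size_t kshare, uint32_t block_size) {
    if (n_past >= block_size) {
        return n_past; // native cache already covers a block: skip the share
    }
    return kshare > n_past ? kshare : n_past;
}
```

With block_size = 16, a prompt that matches 35 committed tokens shares 32 of them
(two full blocks), and only the remaining tokens are submitted for prefill.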
The share() / commit() entry points are forward-declared (defined in libllama,
src/paged-prefix-api.cpp) to avoid pulling internal kv-cache headers into the
server translation unit.
Verified in the server (32B NVFP4, CUDA, --kv-unified): with a live sequence
holding the prefix, K=16/32 concurrent shared-prefix requests prefill only their
~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens;
K=16 23.9s -> 1.5s, K=32 57.9s -> 2.3s), the engine logs "shares ... prefix
blocks - NOT recomputed" with ref_cnt>1, and greedy output stays within the
documented CUDA batch-shape non-determinism band (stock native prompt-caching
shows the same magnitude). Cross-request sharing requires the unified KV cache.
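
The ref_cnt>1 observation follows from ref-counted block ownership. A minimal
sketch of that idea, assuming a hypothetical BlockPool type (the vendored
PagedKVManager is not shown here and its real interface may differ):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical ref-counted block pool; names are illustrative only.
struct BlockPool {
    std::vector<int>      ref_cnt;    // per-block reference count
    std::vector<uint32_t> free_list;  // blocks currently unowned
    std::unordered_map<int, std::vector<uint32_t>> seq_blocks; // seq -> its blocks

    explicit BlockPool(uint32_t n_blocks) : ref_cnt(n_blocks, 0) {
        for (uint32_t b = n_blocks; b > 0; --b) {
            free_list.push_back(b - 1); // block 0 ends up on top
        }
    }

    // Pop one fresh block for seq; false on pool exhaustion.
    bool alloc(int seq) {
        if (free_list.empty()) {
            return false;
        }
        const uint32_t b = free_list.back();
        free_list.pop_back();
        ref_cnt[b] = 1;
        seq_blocks[seq].push_back(b);
        return true;
    }

    // Splice the first n blocks of src into dst (ref_cnt++); returns blocks shared.
    size_t share(int src, int dst, size_t n) {
        auto it = seq_blocks.find(src);
        if (it == seq_blocks.end()) {
            return 0;
        }
        const size_t m = n < it->second.size() ? n : it->second.size();
        const std::vector<uint32_t> prefix(it->second.begin(), it->second.begin() + m);
        std::vector<uint32_t> & d = seq_blocks[dst];
        for (uint32_t b : prefix) {
            ++ref_cnt[b];
            d.push_back(b);
        }
        return m;
    }

    // Release one sequence; a block returns to the pool only at ref_cnt == 0.
    void free_seq(int seq) {
        auto it = seq_blocks.find(seq);
        if (it == seq_blocks.end()) {
            return;
        }
        for (uint32_t b : it->second) {
            if (--ref_cnt[b] == 0) {
                free_list.push_back(b);
            }
        }
        seq_blocks.erase(it);
    }

    size_t num_free() const { return free_list.size(); }
};
```

Freeing the original owner while a sharer is still live leaves the shared blocks
allocated; they return to the pool only when the last sharer is released.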
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
tools/server/server-context.cpp | 50 +++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index da6a475..04c6361 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -15,6 +15,16 @@
#include "mtmd.h"
#include "mtmd-helper.h"
+// [paged 0008] Cross-request prefix recompute-skip shim. share()/commit() are
+// defined in libllama (src/paged-prefix-api.cpp, patch 0007) and are no-ops
+// unless env LLAMA_KV_PAGED is set. Declared here so the paged cross-slot prefix
+// cache wires into update_slots() without pulling in internal kv-cache headers.
+// Fully gated; stock (paged off) is byte-identical.
+namespace paged_prefix_api {
+ int32_t share (llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+ void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+}
+
#include <algorithm>
#include <cstddef>
#include <cinttypes>
@@ -3007,6 +3017,37 @@ private:
}
}
+ // [paged 0008] Cross-request prefix recompute-skip. The native prompt cache
+ // above only reuses THIS slot's own prior prompt; when the paged KV
+ // engine is active, also reuse a committed CROSS-slot prefix so
+ // concurrent requests sharing a long prefix skip recompute. Gated on
+ // LLAMA_KV_PAGED (paged_kv_share static); stock stays byte-identical.
+ static const bool paged_kv_share = getenv("LLAMA_KV_PAGED") != nullptr;
+ // Only attempt the cross-request share on a FRESH slot (the native
+ // cache above did not already cover the prefix). With n_past < a
+ // block, any block-aligned share the engine returns is strictly
+ // larger than n_past and is therefore always adopted below - so the
+ // engine's full-prompt reservation always matches the suffix-only
+ // submission and never leaves stale blocks (which fragmented the
+ // paged pool and crashed the server under high fan-out otherwise).
+ if (paged_kv_share && n_past < 16 && slot.task->params.cache_prompt && !input_tokens.has_mtmd) {
+ const llama_tokens ptoks = input_tokens.get_text_tokens();
+ // Drop this slot's own cells beyond the natively-cached prefix before
+ // splicing the shared physical prefix in, so the shared cells can own
+ // [n_past, kshare) without colliding (the native path removes exactly
+ // these later; a no-op for a fresh slot).
+ common_context_seq_rm(ctx_tgt, slot.id, n_past, -1);
+ const int32_t kshare = paged_prefix_api::share(ctx_tgt, slot.id, ptoks.data(), (int) ptoks.size());
+ if (kshare > n_past) {
+ slot.prompt.tokens.keep_first(n_past);
+ for (int i = n_past; i < kshare; ++i) {
+ slot.prompt.tokens.push_back(ptoks[i]);
+ }
+ n_past = kshare;
+ SLT_INF(slot, "paged: reusing %d cross-request shared prefix tokens - not recomputed\n", n_past);
+ }
+ }
+
// [TAG_PROMPT_LOGITS]
if (n_past == slot.task->n_tokens() && n_past > 0) {
SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
@@ -3427,6 +3468,15 @@ private:
// prompt evaluated for next-token prediction
slot.state = SLOT_STATE_GENERATING;
+ // [paged 0008] Publish this slot's computed prefix so concurrent/later
+ // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
+ // for [0, n_tokens) has just run, so the prefix KV is computed.
+ static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
+ if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
+ const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
+ paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
+ }
+
if (slot.can_speculate()) {
common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
}
--
2.43.0

@@ -1,609 +0,0 @@
From 59490d82e4d0d4ad05ffb5ca3cccc668f4a75281 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 20:03:17 +0200
Subject: [PATCH] paged in-kernel decode read (env LLAMA_KV_PAGED) - patch 0009
Replace the per-layer per-step gather (patch 0003: ggml_get_rows of K/V into a
contiguous buffer) with an in-kernel paged read on the decode step. build_attn
passes the UNMODIFIED physical K/V views plus a block table (src[5] of
ggml_flash_attn_ext: an I32 [n_view, n_stream] position-ordered physical-cell
index, padded to FATTN_KQ_STRIDE). The CUDA fattn vec kernel and the CPU
reference map logical KV index j -> physical cell block_table[seq*ne11+j] and
read K_base+cell*nb11 / V_base+cell*nb21 in place, so the get_rows of K and V
(the bulk of the gather) is gone. The mask stays a small compacted [n_view]
causal mask in the same position order; KV_max / parallel_blocks / stream_k
split-K are unchanged. The decode shape is forced onto the vec kernel (the only
one wired for the block table); a nullptr block table => the stock contiguous
read, byte-identical.
Token-POSITION ordering keeps the flash-attn reduction order identical to stock,
so CPU-paged logits == CPU-stock bit-for-bit (verified: 4-stream FA greedy, 64
tokens). On GPU paged(vec) == stock(vec) at batch 1; at batch>1 it stays within
the documented vec-vs-mma non-determinism band. Decode step at batch 32 / 1024
ctx on GB10 (Qwen3-32B NVFP4): paged-gather 1279 ms -> in-kernel 696 ms (-46%),
recovering most of the gather regression (stock: 647 ms). Gated behind
LLAMA_KV_PAGED; no-op (stock byte-identical) when unset.
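
The logical-to-physical mapping can be sketched on the host side as below. This
is a simplified single-stream illustration, assuming a hypothetical cell_pos
array (negative = empty cell) and scalar K rows; it mirrors the position-ordered,
padded table described above but is not the kernel code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Build a position-ordered block table of length n_blk for one stream.
// cell_pos[i] is the token position stored in physical cell i, or < 0 if empty
// (an assumption of this sketch). Unused entries are padded with the first empty
// cell, or with the last used cell if none is empty; the mask hides them.
static std::vector<int32_t> build_block_table(const std::vector<int32_t> & cell_pos,
                                              size_t n_blk) {
    std::vector<std::pair<int32_t, int32_t>> pc; // (position, physical cell)
    int32_t pad = -1;
    for (size_t i = 0; i < cell_pos.size(); ++i) {
        if (cell_pos[i] >= 0) {
            pc.emplace_back(cell_pos[i], (int32_t) i);
        } else if (pad < 0) {
            pad = (int32_t) i;
        }
    }
    std::sort(pc.begin(), pc.end());
    const int32_t padv = pad >= 0 ? pad : (pc.empty() ? 0 : pc.back().second);
    std::vector<int32_t> table(n_blk, padv);
    for (size_t k = 0; k < pc.size() && k < n_blk; ++k) {
        table[k] = pc[k].second;
    }
    return table;
}

// Read the K value for logical index j: through the table when present,
// otherwise the stock contiguous read (table == nullptr).
static float read_k(const std::vector<float> & k_cells,
                    const std::vector<int32_t> * table, size_t j) {
    const size_t cell = table ? (size_t) (*table)[j] : j;
    return k_cells[cell];
}
```

Because the table is sorted by token position, iterating logical indices 0..n
visits cells in the same order as a contiguous cache would, which is what keeps
the reduction order unchanged.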
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/include/ggml.h | 6 ++
ggml/src/ggml-cpu/ops.cpp | 10 ++-
ggml/src/ggml-cuda/fattn-common.cuh | 8 +-
ggml/src/ggml-cuda/fattn-mma-f16.cuh | 4 +-
ggml/src/ggml-cuda/fattn-tile.cuh | 4 +-
ggml/src/ggml-cuda/fattn-vec.cuh | 25 +++++--
ggml/src/ggml-cuda/fattn-wmma-f16.cu | 4 +-
ggml/src/ggml-cuda/fattn.cu | 9 +++
ggml/src/ggml.c | 14 ++++
src/llama-graph.cpp | 23 ++++--
src/llama-graph.h | 3 +-
src/llama-kv-cache.cpp | 31 ++++++++
src/llama-kv-cache.h | 4 +
src/paged-attn.cpp | 107 +++++++++++++++++++++++++++
src/paged-attn.h | 18 +++++
15 files changed, 248 insertions(+), 22 deletions(-)
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index d6807b6..823f5a9 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2427,6 +2427,12 @@ extern "C" {
struct ggml_tensor * a,
struct ggml_tensor * sinks);
+ // [paged] optional block table in src[5]: I32 [n_kv_logical, n_stream]; maps each
+ // logical KV index to the physical cell within K/V. nullptr => stock contiguous read.
+ GGML_API void ggml_flash_attn_ext_set_block_table(
+ struct ggml_tensor * a,
+ struct ggml_tensor * block_table);
+
// TODO: needs to be adapted to ggml_flash_attn_ext
GGML_API struct ggml_tensor * ggml_flash_attn_back(
struct ggml_context * ctx,
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 74611dc..63c07a2 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -8330,6 +8330,8 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
const ggml_tensor * v = dst->src[2];
const ggml_tensor * mask = dst->src[3];
const ggml_tensor * sinks = dst->src[4];
+ const ggml_tensor * block_table = dst->src[5]; // [paged] logical->physical cell map (src[5])
+ const int32_t * bt = block_table ? (const int32_t *) block_table->data : nullptr;
GGML_TENSOR_LOCALS(int64_t, neq, q, ne)
GGML_TENSOR_LOCALS(size_t, nbq, q, nb)
@@ -8449,7 +8451,9 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
float s; // KQ value
- const char * k_data = (const char *) k->data + ( ic*nbk1 + ik2*nbk2 + ik3*nbk3);
+ // [paged] map the logical KV index ic to its physical cell via the block table.
+ const int64_t ic_phys = bt ? (int64_t) bt[ik3*nek1 + ic] : ic;
+ const char * k_data = (const char *) k->data + ( ic_phys*nbk1 + ik2*nbk2 + ik3*nbk3);
kq_vec_dot(DK, &s, 0, k_data, 0, Q_q, 0, 1);
s = s*scale; // scale KQ value
@@ -8465,7 +8469,7 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
float ms = 1.0f; // upon new higher max val, scale VKQ and KQ sum with this value
float vs = 1.0f; // post-softmax KQ value, expf(s - M)
- const char * v_data = ((const char *) v->data + (ic*nbv1 + iv2*nbv2 + iv3*nbv3));
+ const char * v_data = ((const char *) v->data + (ic_phys*nbv1 + iv2*nbv2 + iv3*nbv3));
if (v->type == GGML_TYPE_F16) {
if (s > M) {
@@ -9021,7 +9025,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
const int64_t dr = (nr + nchunk - 1) / nchunk;
static constexpr int64_t Q_TILE_SZ = ggml_fa_tile_config::Q;
- bool use_tiled = !use_ref &&
+ bool use_tiled = !use_ref && dst->src[5] == nullptr && // [paged] one_chunk honors the block table
(q->type == GGML_TYPE_F32 &&
kv_is_f32_or_f16 &&
k->type == v->type &&
diff --git a/ggml/src/ggml-cuda/fattn-common.cuh b/ggml/src/ggml-cuda/fattn-common.cuh
index 8dfa51a..3c6ddd5 100644
--- a/ggml/src/ggml-cuda/fattn-common.cuh
+++ b/ggml/src/ggml-cuda/fattn-common.cuh
@@ -39,7 +39,8 @@ typedef void (* fattn_kernel_t)(
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
- const int32_t nb31, const int32_t nb32, const int64_t nb33);
+ const int32_t nb31, const int32_t nb32, const int64_t nb33,
+ const int * __restrict__ block_table);
typedef float (*vec_dot_KQ_t)(
const char * __restrict__ K_c, const void * __restrict__ Q_v, const int * __restrict__ Q_q8 , const void * __restrict__ Q_ds);
@@ -981,6 +982,8 @@ void launch_fattn(
const ggml_tensor * mask = dst->src[3];
const ggml_tensor * sinks = dst->src[4];
+ const ggml_tensor * block_table = dst->src[5]; // [paged] optional logical->physical map
+ const int * bt_ptr = block_table ? (const int *) block_table->data : nullptr;
ggml_tensor * KQV = dst;
@@ -1217,7 +1220,8 @@ void launch_fattn(
K->ne[0], K->ne[1], K->ne[2], K->ne[3], nb11, nb12, nb13,
nb21, nb22, nb23,
mask ? mask->ne[1] : 0, mask ? mask->ne[2] : 0, mask ? mask->ne[3] : 0,
- mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0
+ mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0,
+ bt_ptr
);
CUDA_CHECK(cudaGetLastError());
diff --git a/ggml/src/ggml-cuda/fattn-mma-f16.cuh b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
index 83478a0..0a92cd6 100644
--- a/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+++ b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
@@ -1723,7 +1723,9 @@ static __global__ void flash_attn_ext_f16(
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
- const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+ const int32_t nb31, const int32_t nb32, const int64_t nb33,
+ const int * __restrict__ block_table) {
+ GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
ggml_cuda_pdl_sync(); // TODO optimize placement
#if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE))
const char * GGML_CUDA_RESTRICT Q = Q_ptr;
diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
index 0a09981..0ff14e6 100644
--- a/ggml/src/ggml-cuda/fattn-tile.cuh
+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
@@ -808,7 +808,9 @@ static __global__ void flash_attn_tile(
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
- const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+ const int32_t nb31, const int32_t nb32, const int64_t nb33,
+ const int * __restrict__ block_table) {
+ GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
#ifdef FLASH_ATTN_AVAILABLE
const char * GGML_CUDA_RESTRICT Q = Q_ptr;
const char * GGML_CUDA_RESTRICT K = K_ptr;
diff --git a/ggml/src/ggml-cuda/fattn-vec.cuh b/ggml/src/ggml-cuda/fattn-vec.cuh
index 69dd936..a09e2fb 100644
--- a/ggml/src/ggml-cuda/fattn-vec.cuh
+++ b/ggml/src/ggml-cuda/fattn-vec.cuh
@@ -39,7 +39,8 @@ static __global__ void flash_attn_ext_vec(
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
- const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+ const int32_t nb31, const int32_t nb32, const int64_t nb33,
+ const int * __restrict__ block_table) {
ggml_cuda_pdl_lc();
#ifdef FLASH_ATTN_AVAILABLE
const char * GGML_CUDA_RESTRICT Q = Q_ptr;
@@ -61,7 +62,7 @@ static __global__ void flash_attn_ext_vec(
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
- nb31, nb32, nb33);
+ nb31, nb32, nb33, block_table);
NO_DEVICE_CODE;
return;
}
@@ -110,6 +111,14 @@ static __global__ void flash_attn_ext_vec(
K += nb13*sequence + nb12*(head / gqa_ratio);
V += nb23*sequence + nb22*(head / gqa_ratio);
+ // [paged] in-kernel block-table read: logical KV index j -> physical cell
+ // block_table[sequence*ne11 + j]; read K0 + cell*nb11 / V0 + cell*nb21. The
+ // mask/KV_max stay logical (the table is in token-position order). nullptr =>
+ // the stock contiguous read below.
+ const char * GGML_CUDA_RESTRICT K0 = K;
+ const char * GGML_CUDA_RESTRICT V0 = V;
+ const int * GGML_CUDA_RESTRICT bt = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
+
const half * maskh = (const half *) (mask + nb33*(sequence % ne33) + nb31*ic0);
const float slope = get_alibi_slope(max_bias, head, n_head_log2, m0, m1);
@@ -267,10 +276,11 @@ static __global__ void flash_attn_ext_vec(
#pragma unroll
for (int i_KQ_0 = 0; i_KQ_0 < nthreads_KQ; ++i_KQ_0) {
const int i_KQ = threadIdx.y*WARP_SIZE + (nthreads_KQ == WARP_SIZE ? 0 : (threadIdx.x & ~(nthreads_KQ-1))) + i_KQ_0;
+ const char * GGML_CUDA_RESTRICT K_blk = bt ? (K0 + (int64_t) bt[k_VKQ_0 + i_KQ]*nb11) : (K + i_KQ*nb11);
#pragma unroll
for (int j = 0; j < ncols; ++j) {
- float sum = vec_dot_KQ(K + i_KQ*nb11, Q_reg[j], Q_i32[j], Q_ds[j]);
+ float sum = vec_dot_KQ(K_blk, Q_reg[j], Q_i32[j], Q_ds[j]);
sum = warp_reduce_sum<nthreads_KQ>(sum);
if (use_logit_softcap) {
@@ -324,6 +334,7 @@ static __global__ void flash_attn_ext_vec(
#pragma unroll
for (int k0 = 0; k0 < WARP_SIZE; k0 += V_cols_per_iter) {
const int k = threadIdx.y*WARP_SIZE + k0 + (nthreads_V == WARP_SIZE ? 0 : threadIdx.x / nthreads_V);
+ const char * GGML_CUDA_RESTRICT V_blk = bt ? (V0 + (int64_t) bt[k_VKQ_0 + k]*nb21) : (V + k*nb21);
#ifdef V_DOT2_F32_F16_AVAILABLE
half2 KQ_k[ncols];
@@ -336,14 +347,14 @@ static __global__ void flash_attn_ext_vec(
half2 tmp[V_rows_per_thread/2];
if constexpr (type_V == GGML_TYPE_BF16) {
float2 tmp_f[V_rows_per_thread/2];
- dequantize_V(V + k*nb21, tmp_f,
+ dequantize_V(V_blk, tmp_f,
2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
#pragma unroll
for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
tmp[i_VKQ_1] = __float22half2_rn(tmp_f[i_VKQ_1]);
}
} else {
- dequantize_V(V + k*nb21, tmp,
+ dequantize_V(V_blk, tmp,
2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
}
#pragma unroll
@@ -363,7 +374,7 @@ static __global__ void flash_attn_ext_vec(
#pragma unroll
for (int i_VKQ_0 = 0; i_VKQ_0 < D/2; i_VKQ_0 += nthreads_V*V_rows_per_thread/2) {
float2 tmp[V_rows_per_thread/2];
- dequantize_V(V + k*nb21, tmp,
+ dequantize_V(V_blk, tmp,
2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
#pragma unroll
for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
@@ -522,7 +533,7 @@ static __global__ void flash_attn_ext_vec(
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
- nb31, nb32, nb33);
+ nb31, nb32, nb33, block_table);
NO_DEVICE_CODE;
#endif // FLASH_ATTN_AVAILABLE
}
diff --git a/ggml/src/ggml-cuda/fattn-wmma-f16.cu b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
index 6850716..5357849 100644
--- a/ggml/src/ggml-cuda/fattn-wmma-f16.cu
+++ b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
@@ -44,7 +44,9 @@ static __global__ void flash_attn_ext_f16(
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
- const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+ const int32_t nb31, const int32_t nb32, const int64_t nb33,
+ const int * __restrict__ block_table) {
+ GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
#if defined(FLASH_ATTN_AVAILABLE) && (defined(GGML_HIP_ROCWMMA_FATTN) && defined(GGML_USE_WMMA_FATTN))
const char * GGML_CUDA_RESTRICT Q = Q_ptr;
const char * GGML_CUDA_RESTRICT K = K_ptr;
diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
index d6c501b..e3771ee 100644
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -574,6 +574,15 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
ggml_cuda_set_device(ctx.device);
+
+ // [paged] the block table (src[5]) is only honored by the vec kernel's
+ // in-kernel read; force it. build_attn only sets it for a vec-supported
+ // 1-token-per-stream decode shape.
+ if (dst->src[5] != nullptr) {
+ ggml_cuda_flash_attn_ext_vec(ctx, dst);
+ return;
+ }
+
switch (ggml_cuda_get_best_fattn_kernel(ggml_cuda_get_device(), dst)) {
case BEST_FATTN_KERNEL_NONE:
GGML_ABORT("fatal error");
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index b43016c..adbe52b 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -5442,6 +5442,20 @@ void ggml_flash_attn_ext_add_sinks(
a->src[4] = sinks;
}
+void ggml_flash_attn_ext_set_block_table(
+ struct ggml_tensor * a,
+ struct ggml_tensor * block_table) {
+ if (!block_table) {
+ a->src[5] = NULL;
+ return;
+ }
+
+ GGML_ASSERT(a->op == GGML_OP_FLASH_ATTN_EXT);
+ GGML_ASSERT(block_table->type == GGML_TYPE_I32);
+
+ a->src[5] = block_table;
+}
+
// ggml_flash_attn_back
struct ggml_tensor * ggml_flash_attn_back(
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index b59d2a5..abdb48d 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -2074,7 +2074,8 @@ ggml_tensor * llm_graph_context::build_attn_mha(
ggml_tensor * sinks,
ggml_tensor * v_mla,
float kq_scale,
- int il) const {
+ int il,
+ ggml_tensor * block_table) const {
const bool v_trans = v->nb[1] > v->nb[2];
// split the batch into streams if needed
@@ -2109,6 +2110,9 @@ ggml_tensor * llm_graph_context::build_attn_mha(
hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
cb(cur, LLAMA_TENSOR_NAME_FATTN, il);
+ if (block_table) {
+ ggml_flash_attn_ext_set_block_table(cur, block_table);
+ }
ggml_flash_attn_ext_add_sinks(cur, sinks);
ggml_flash_attn_ext_set_prec (cur, GGML_PREC_F32);
@@ -2358,12 +2362,19 @@ ggml_tensor * llm_graph_context::build_attn(
ggml_tensor * k = mctx_cur->get_k(ctx0, il);
ggml_tensor * v = mctx_cur->get_v(ctx0, il);
- // [paged 0003] gather K, V and the mask to the sequence's used cells only
- // (no-op unless env LLAMA_KV_PAGED is set).
- ggml_tensor * kq_mask_g = kq_mask;
- paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+ // [paged] decode read: when paging is active and this is a 1-token-per-stream
+ // decode step, present K/V as n_gather views + a block table so the fattn
+ // kernel reads the sequence's cells in-kernel (no get_rows of K/V). Else
+ // fall back to the gather-read (prefill, transposed V, or env off). All a
+ // no-op unless env LLAMA_KV_PAGED is set => stock byte-identical.
+ ggml_tensor * kq_mask_g = kq_mask;
+ ggml_tensor * block_table = nullptr;
+ const bool is_decode = (q_cur->ne[2] == k->ne[3]); // 1 query token per stream
+ if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, &k, &v, &kq_mask_g, &block_table))) {
+ paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+ }
- ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
+ ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il, block_table);
cb(cur, "kqv_out", il);
if (inp->self_v_rot) {
diff --git a/src/llama-graph.h b/src/llama-graph.h
index 5e8a658..c95ae49 100644
--- a/src/llama-graph.h
+++ b/src/llama-graph.h
@@ -969,7 +969,8 @@ struct llm_graph_context {
ggml_tensor * sinks, // [n_head_q]
ggml_tensor * v_mla, // [n_embd_head_v_mla, n_embd_head_v, n_head_v]
float kq_scale,
- int il) const;
+ int il,
+ ggml_tensor * block_table = nullptr) const; // [paged] optional src[5] block table
llm_graph_input_attn_no_cache * build_attn_inp_no_cache() const;
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 7510ff9..0351f86 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -1474,6 +1474,33 @@ void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_in
}
}
+void llama_kv_cache::get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const {
+ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+ for (uint32_t j = 0; j < ns; ++j) {
+ const auto & cells = v_cells[sinfo.s0 + j];
+ const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+ std::vector<std::pair<llama_pos, int32_t>> pc;
+ pc.reserve(n);
+ int32_t pad = -1;
+ for (uint32_t i = 0; i < n; ++i) {
+ if (!cells.is_empty(i)) {
+ pc.emplace_back(cells.pos_get(i), (int32_t) i);
+ } else if (pad < 0) {
+ pad = (int32_t) i;
+ }
+ }
+ std::sort(pc.begin(), pc.end());
+ int32_t * col = dst + (size_t) j * n_blk;
+ for (size_t k = 0; k < pc.size(); ++k) {
+ col[k] = pc[k].second;
+ }
+ const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
+ for (uint32_t k = (uint32_t) pc.size(); k < n_blk; ++k) {
+ col[k] = padv;
+ }
+ }
+}
+
ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
GGML_UNUSED(sinfo);
@@ -2773,6 +2800,10 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
}
+void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const {
+ kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]);
+}
+
ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
}
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index f374ac6..e9980b6 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -176,6 +176,9 @@ public:
// gather-read. get_n_gather returns the max count across streams.
uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
void get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
+ // [paged inc1] block table [n_blk, n_stream] (position order, padded to n_blk
+ // per column with a masked empty cell) for the in-kernel paged read.
+ void get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const;
// store k_cur and v_cur in the cache based on the provided head location
ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
@@ -386,6 +389,7 @@ public:
// current ubatch's stream).
uint32_t get_n_gather() const;
void get_gather_idxs(int32_t * dst) const;
+ void get_block_table(int32_t * dst, uint32_t n_blk) const;
// store k_cur and v_cur in the cache based on the provided head location
// note: the heads in k_cur and v_cur should be laid out contiguously in memory
diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
index ade75e8..8eebeaa 100644
--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
@@ -43,6 +43,25 @@ public:
ggml_tensor * idxs;
};
+// Block table filler for the in-kernel paged read: fills an I32 [n_blk, n_stream]
+// tensor with each stream's position-ordered cells, padded to n_blk (per column)
+// with a masked empty cell, by delegating to the kv-cache context.
+class input_block_table : public llm_graph_input_i {
+public:
+ input_block_table(const llama_kv_cache_context * mctx, ggml_tensor * idxs, uint32_t n_blk)
+ : mctx(mctx), idxs(idxs), n_blk(n_blk) {}
+
+ void set_input(const llama_ubatch * ubatch) override {
+ GGML_UNUSED(ubatch);
+ GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
+ mctx->get_block_table((int32_t *) idxs->data, n_blk);
+ }
+
+ const llama_kv_cache_context * mctx;
+ ggml_tensor * idxs;
+ uint32_t n_blk;
+};
+
} // namespace
void gather(ggml_context * ctx0,
@@ -125,4 +144,92 @@ void gather(ggml_context * ctx0,
}
}
+bool in_kernel_decode(ggml_context * ctx0,
+ llm_graph_result * res,
+ const llama_kv_cache_context * mctx,
+ ggml_tensor ** k,
+ ggml_tensor ** v,
+ ggml_tensor ** kq_mask,
+ ggml_tensor ** block_table) {
+ if (!active()) {
+ return false;
+ }
+ // Bench escape hatch: LLAMA_KV_PAGED_GATHER=1 forces the old gather-read decode
+ // path (for a same-build BEFORE/AFTER decode-step comparison). Dev-only.
+ static const bool force_gather = (std::getenv("LLAMA_KV_PAGED_GATHER") != nullptr);
+ if (force_gather) {
+ return false;
+ }
+
+ ggml_tensor * K = *k;
+ ggml_tensor * V = *v;
+ ggml_tensor * M = *kq_mask;
+
+ const int64_t n_stream = K->ne[3];
+ GGML_ASSERT(M->ne[3] == n_stream);
+
+ const int64_t n_gather = (int64_t) mctx->get_n_gather();
+ if (n_gather <= 0) {
+ // Worst-case reserve / nothing placed yet: keep the dense [0,n_kv) read.
+ return false;
+ }
+
+ // The in-kernel read addresses V along its d-major (non-transposed) axis. If
+ // the cache stores V transposed, fall back to gather() (which normalizes it).
+ if (V->nb[1] > V->nb[2]) {
+ return false;
+ }
+
+ if (debug()) {
+ static int64_t once = 0;
+ if (once++ < 2) {
+ fprintf(stderr, "[paged-attn] in-kernel decode n_stream=%lld n_kv=%lld n_gather=%lld\n",
+ (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
+ }
+ }
+
+ // Block table [n_gather, n_stream]: column s holds stream s's non-empty cells
+ // in token-POSITION order (identical to the gather index, so the reduction
+ // order matches stock bit-for-bit), padded with a masked empty cell. Filled
+ // at set_input from the kv-cache (get_block_table), exactly like the gather.
+ // Pad the logical length to FATTN_KQ_STRIDE (256) so the CUDA fattn vec kernel
+ // reads fixed 128-wide KV blocks without overrun and the KV_max mask scan
+ // engages; padded entries point at a masked empty cell (0 contribution). Stays
+ // <= n_kv since n_kv is itself padded to 256 and n_gather <= n_kv.
+ int64_t n_view = GGML_PAD(n_gather, 256);
+ if (n_view > K->ne[2]) {
+ n_view = K->ne[2];
+ }
+
+ ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
+ ggml_set_input(idx);
+ res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
+
+ // Present K and V as [d, h, n_view, ns] VIEWS of the full physical window:
+ // identical per-cell (nb1,nb2) and per-stream (nb3) strides, only the cell
+ // dim shrinks to n_view. NOT materialized - the kernel reads in place.
+ *k = ggml_view_4d(ctx0, K, K->ne[0], K->ne[1], n_view, n_stream,
+ K->nb[1], K->nb[2], K->nb[3], 0);
+ *v = ggml_view_4d(ctx0, V, V->ne[0], V->ne[1], n_view, n_stream,
+ V->nb[1], V->nb[2], V->nb[3], 0);
+
+ // Compact the mask to [n_view, n_tps, 1, ns] in the same position order so
+ // the kernel's logical mask index aligns with the block table. Cheap: the
+ // mask is ~(d*h) smaller than K/V, which is why only its get_rows remains.
+ {
+ ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream);
+ m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
+ m = ggml_get_rows(ctx0, m, idx);
+ m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
+ m = ggml_reshape_4d(ctx0, m, n_view, M->ne[1], 1, n_stream);
+ if (M->type != m->type) {
+ m = ggml_cast(ctx0, m, M->type);
+ }
+ *kq_mask = m;
+ }
+
+ *block_table = idx;
+ return true;
+}
+
} // namespace paged_attn
diff --git a/src/paged-attn.h b/src/paged-attn.h
index c5b7bd7..23e2184 100644
--- a/src/paged-attn.h
+++ b/src/paged-attn.h
@@ -37,4 +37,22 @@ void gather(ggml_context * ctx0,
ggml_tensor ** v,
ggml_tensor ** kq_mask);
+// [paged inc1] In-kernel paged decode read. Instead of materializing the
+// sequence's cells (gather()), present K and V as n_gather-length VIEWS of the
+// full physical window and return the position-ordered physical-cell index list
+// as a block table (src[5] of ggml_flash_attn_ext). The fattn kernel/op then
+// reads K_base + block_table[j]*nb in-kernel, removing the get_rows of K and V
+// (the bulk of the gather). On return (true): *k,*v point at the views, *kq_mask
+// at the compacted mask, *block_table at the I32 [n_gather, n_stream] index.
+// Returns false (leaving *k,*v,*kq_mask untouched) when the in-kernel path does
+// not apply - env off, nothing placed, or a transposed V cache - so the caller
+// keeps the dense gather()/contiguous read.
+bool in_kernel_decode(ggml_context * ctx0,
+ llm_graph_result * res,
+ const llama_kv_cache_context * mctx,
+ ggml_tensor ** k,
+ ggml_tensor ** v,
+ ggml_tensor ** kq_mask,
+ ggml_tensor ** block_table);
+
} // namespace paged_attn
--
2.43.0
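
The block-table fill in `get_block_table` above can be sketched for a single stream as follows. This is a minimal illustration, not the patch's code: the function name `build_block_table_column` and the convention that a negative position marks an empty cell are assumptions made here (the patch queries `cells.is_empty(i)` and `cells.pos_get(i)` instead).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one stream's cells: pos[i] < 0 means cell i is empty.
// Returns one block-table column of length n_blk: physical cell indices in
// ascending token-position order, then padding.
std::vector<int32_t> build_block_table_column(const std::vector<int32_t> & pos,
                                              uint32_t n_kv, uint32_t n_blk) {
    assert(n_blk > 0);
    const uint32_t n = std::min<uint32_t>(n_kv, (uint32_t) pos.size());

    std::vector<std::pair<int32_t, int32_t>> pc; // (token position, physical cell)
    pc.reserve(n);
    int32_t pad = -1; // first empty cell seen, used as the padding target
    for (uint32_t i = 0; i < n; ++i) {
        if (pos[i] >= 0) {
            pc.emplace_back(pos[i], (int32_t) i);
        } else if (pad < 0) {
            pad = (int32_t) i;
        }
    }
    std::sort(pc.begin(), pc.end()); // ascending token position

    std::vector<int32_t> col(n_blk);
    // Guard added in this sketch only: never write more than n_blk entries.
    const size_t n_used = std::min<size_t>(pc.size(), n_blk);
    for (size_t k = 0; k < n_used; ++k) {
        col[k] = pc[k].second;
    }
    // Pad with an empty cell if one exists, else repeat the last used cell.
    const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
    for (size_t k = n_used; k < n_blk; ++k) {
        col[k] = padv;
    }
    return col;
}
```

For positions `{5, empty, 2, 9}` and `n_blk = 6`, the column holds the cells ordered by position (2, 0, 3) followed by the empty cell 1 repeated as padding.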

@@ -1,269 +0,0 @@
From 9ac56933abd5de4a1f349c811c2d74aab09f7ab1 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 22:36:09 +0200
Subject: [PATCH] paged tile in-kernel decode read + dispatch guard (env
LLAMA_KV_PAGED) - patch 0010
Increment 2 (robustness, ~0 headline ms): make the paged in-kernel decode read
safe against silent mis-routing, and plumb the same read into the tile kernel
for the increment-3 GQA head-group work.
fattn-tile.cuh: graft the patch-0009 phys(j) block-table read (mirror of
fattn-vec.cuh). Both flash_attn_tile_load_tile overloads, flash_attn_tile_iter_KQ
(K) and flash_attn_tile_iter (V) take an optional per-sequence block table; a row
i is read from base + block_table[row_base + i]*stride instead of base + i*stride.
The table defaults to nullptr (default args + a null bt_seq when src[5] is unset),
so every existing non-paged caller is byte-identical to stock. The mask / KV_max
stay logical (token-position order), as in vec.
fattn.cu: DISPATCH GUARD. When the block table (src[5]) is present, route ONLY to
the vec or tile kernel and never fall through to the best-kernel switch. The
mma/wmma kernels GGML_UNUSED the table and would silently read the wrong
(contiguous physical) cells; the guard makes that unreachable. The vec dispatcher
GGML_ABORTs for an unsupported D/type rather than mis-reading. Default route is vec
(the inc-1 byte-validated path). LLAMA_KV_PAGED_DISPATCH_LOG=1 prints the routed
kernel once.
Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B, build-cpu) PASS. GPU
vec-paged == stock at -s 1 PASS. Dispatch confirmed VEC for the real decode shape:
Qwen3-0.6B Q ne=[128,1,16,1] and Qwen3-32B NVFP4 Q ne=[128,1,64,N] both route to
vec, matching the nsys profile (flash_attn_ext_vec).
The tile graft is plumbed for increment-3 GQA head-group reuse but is EXPERIMENTAL
and NOT yet byte-validated (LLAMA_KV_PAGED_TILE=1). A tile-vs-tile gate shows
tile-paged diverging from tile-stock at the first cross-tile KV depth: the
GQA-grouped (ncols2>1) tile path reads a full nbatch_fa-row tile with
oob_check=false while the compacted paged mask is not padded to cover the tile, so
past-end rows leak. vec bounds its KV walk by KV_max and is unaffected. Bounding
the tile path is increment-3 work; the default vec route and all stock paths are
untouched.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/fattn-tile.cuh | 45 ++++++++++++++++++++-----------
ggml/src/ggml-cuda/fattn.cu | 38 +++++++++++++++++++++++---
2 files changed, 64 insertions(+), 19 deletions(-)
diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
index 0ff14e6..bb84d61 100644
--- a/ggml/src/ggml-cuda/fattn-tile.cuh
+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
@@ -373,7 +373,8 @@ static constexpr __device__ int ggml_cuda_fattn_tile_get_nbatch_K(const int DKQ,
// TODO: deduplicate with mma-f16
template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
static __device__ __forceinline__ void flash_attn_tile_load_tile(
- const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
+ const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
+ const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
@@ -402,9 +403,11 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
const int j = j0*cpy_ne + (stride_j == warp_size ? threadIdx.x : threadIdx.x % stride_j)*cpy_ne;
const __align__(16) half2 zero[cpy_ne] = {{0.0f, 0.0f}};
+ // [paged] remap the row through the block table (nullptr => stock contiguous read).
+ const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
ggml_cuda_memcpy_1<cpy_nb>(
tile_KV + i*(J/2 + J_padding) + j,
- !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
+ !oob_check || i < i_sup ? KV_row + j : zero);
}
}
}
@@ -423,7 +426,8 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
static __device__ __forceinline__ void flash_attn_tile_load_tile(
- const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
+ const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
+ const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
@@ -453,8 +457,10 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
const half2 zero[cpy_ne/2] = {{0.0f, 0.0f}};
__align__(16) half2 tmp_h2[cpy_ne/2];
+ // [paged] remap the row through the block table (nullptr => stock contiguous read).
+ const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
ggml_cuda_memcpy_1<sizeof(tmp_h2)>(
- tmp_h2, !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
+ tmp_h2, !oob_check || i < i_sup ? KV_row + j : zero);
__align__(16) float2 tmp_f2[cpy_ne/2];
#pragma unroll
@@ -487,6 +493,7 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
const int k_VKQ_0,
const int k_VKQ_sup,
const int k_KQ_0,
+ const int * const __restrict__ block_table,
float * KQ_acc) {
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
@@ -495,8 +502,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
constexpr int cpw = ncols > nwarps ? ncols/nwarps : 1; // Q columns per warp
constexpr int np = nwarps > ncols ? nwarps/ncols : 1; // number of parallel warps per Q column
+ // [paged] when block_table is set K_h2 is the un-offset base; the table supplies the row.
+ const half2 * const K_base = block_table ? (K_h2 + k_KQ_0/2) : (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2);
flash_attn_tile_load_tile<warp_size, nwarps, nbatch_fa, nbatch_K, cpy_ne, oob_check>
- (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2, KV_tmp, stride_K2, k_VKQ_sup);
+ (K_base, KV_tmp, stride_K2, k_VKQ_sup, block_table, k_VKQ_0);
__syncthreads();
#ifdef FAST_FP16_AVAILABLE
@@ -572,7 +581,8 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
T_acc * const VKQ,
const int k_VKQ_0,
const int k_VKQ_max,
- const int col_Q_0) {
+ const int col_Q_0,
+ const int * const __restrict__ block_table) {
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
@@ -605,12 +615,12 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
#pragma unroll
for (int k_KQ_0 = 0; k_KQ_0 < DKQ - nbatch_K_last; k_KQ_0 += nbatch_K) {
flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>(
- Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
+ Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
}
if (nbatch_K_last > 0) {
constexpr int k_KQ_0 = DKQ - nbatch_K_last;
flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K_last, use_logit_softcap, oob_check>(
- Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
+ Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
}
// Apply logit softcap + mask, update KQ_max:
@@ -715,8 +725,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
static_assert(nbatch_V % np == 0, "bad nbatch_V");
#pragma unroll
for (int k0 = 0; k0 < nbatch_fa; k0 += nbatch_V) {
+ // [paged] when block_table is set V_h2 is the un-offset base; the table supplies the row.
+ const half2 * const V_base = block_table ? V_h2 : (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2);
flash_attn_tile_load_tile<warp_size, nwarps, nbatch_V, DV, 0, oob_check>
- (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2, KV_tmp, stride_V2, k_VKQ_sup - k0);
+ (V_base, KV_tmp, stride_V2, k_VKQ_sup - k0, block_table, k_VKQ_0 + k0);
__syncthreads();
#ifdef FAST_FP16_AVAILABLE
@@ -810,7 +822,6 @@ static __global__ void flash_attn_tile(
const int32_t ne31, const int32_t ne32, const int32_t ne33,
const int32_t nb31, const int32_t nb32, const int64_t nb33,
const int * __restrict__ block_table) {
- GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
#ifdef FLASH_ATTN_AVAILABLE
const char * GGML_CUDA_RESTRICT Q = Q_ptr;
const char * GGML_CUDA_RESTRICT K = K_ptr;
@@ -837,7 +848,7 @@ static __global__ void flash_attn_tile(
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
- nb31, nb32, nb33);
+ nb31, nb32, nb33, block_table);
NO_DEVICE_CODE;
return;
}
@@ -861,6 +872,10 @@ static __global__ void flash_attn_tile(
const half2 * K_h2 = (const half2 *) (K + nb13*sequence + nb12*(head0 / gqa_ratio));
const half2 * V_h2 = (const half2 *) (V + nb23*sequence + nb22*(head0 / gqa_ratio)); // K and V have same shape
+ // [paged] per-sequence logical->physical block table in token-position order
+ // (mask/KV_max stay logical); nullptr => the stock contiguous read.
+ const int * const __restrict__ bt_seq = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
+
const half * maskh = mask ? (const half *) (mask + nb33*(sequence % ne33)) : nullptr;
const int stride_K2 = nb11 / sizeof(half2);
@@ -963,14 +978,14 @@ static __global__ void flash_attn_tile(
constexpr bool oob_check = false;
flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
(Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
- stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+ stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
k_VKQ_0 += gridDim.y*nbatch_fa;
}
if (k_VKQ_0 < k_VKQ_max) {
constexpr bool oob_check = true;
flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
(Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
- stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+ stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
}
} else {
// Branch without out-of-bounds checks.
@@ -978,7 +993,7 @@ static __global__ void flash_attn_tile(
constexpr bool oob_check = false;
flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
(Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
- stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+ stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
}
}
@@ -1144,7 +1159,7 @@ static __global__ void flash_attn_tile(
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
- nb31, nb32, nb33);
+ nb31, nb32, nb33, block_table);
NO_DEVICE_CODE;
#endif // FLASH_ATTN_AVAILABLE
}
diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
index e3771ee..afcafa2 100644
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -575,11 +575,41 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
ggml_cuda_set_device(ctx.device);
- // [paged] the block table (src[5]) is only honored by the vec kernel's
- // in-kernel read; force it. build_attn only sets it for a vec-supported
- // 1-token-per-stream decode shape.
+ // [paged] DISPATCH GUARD. The block table (src[5]) is read in-kernel ONLY by
+ // the vec and tile kernels; the mma/wmma kernels GGML_UNUSED it and would
+ // silently read the wrong (contiguous physical) cells. So when a block table
+ // is present we route here and NEVER fall through to the best-kernel switch
+ // below - no decode shape can silently reach an mma/wmma misread. build_attn
+ // only sets src[5] for the 1-token-per-stream decode shape; the vec
+ // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
+ // and any shape that should not be paged must take the host-side gather path
+ // (LLAMA_KV_PAGED_GATHER=1) instead.
+ //
+ // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
+ // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
+ // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
+ // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
+ // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
+ // with oob_check=false while the compacted paged mask is not padded to cover
+ // it, so it diverges from stock. Not for production paged decode until
+ // increment-3 bounds that path; the default vec route is unaffected.
if (dst->src[5] != nullptr) {
- ggml_cuda_flash_attn_ext_vec(ctx, dst);
+ static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
+ if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
+ static bool logged = false;
+ if (!logged) {
+ logged = true;
+ fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
+ paged_tile ? "TILE(experimental)" : "VEC",
+ (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
+ (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
+ }
+ }
+ if (paged_tile) {
+ ggml_cuda_flash_attn_ext_tile(ctx, dst);
+ } else {
+ ggml_cuda_flash_attn_ext_vec(ctx, dst);
+ }
return;
}
--
2.43.0
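
The row remap that patch 0010 grafts into the tile loader (`base + block_table[row_base + i]*stride` instead of `base + i*stride`) can be sketched on the host as below. This is an illustration under stated assumptions: `kv_row` is a hypothetical helper, and unlike the kernel it takes the un-offset base pointer in both modes, so the contiguous case adds `row_base` itself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side helper (not part of the patch): address of logical row
// row_base + i, given the un-offset base pointer of one sequence's K or V.
// block_table == nullptr reproduces the contiguous (non-paged) read.
template <typename T>
const T * kv_row(const T * base, const int32_t * block_table, int row_base, int i, int64_t stride) {
    assert(stride > 0 && row_base >= 0 && i >= 0);
    const int64_t logical  = (int64_t) row_base + i;
    // Paged: the table maps the logical (token-position) row to a physical cell.
    const int64_t physical = block_table ? (int64_t) block_table[logical] : logical;
    return base + physical * stride;
}
```

With a table `{3, 0, 2, 1}` and a stride of 2 elements, logical row 3 resolves to physical cell 1, while the same call with a null table resolves to physical row 3.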

@@ -1,147 +0,0 @@
From d5ca5cd756e42214d0003bca815ca91943679b0d Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 00:18:35 +0200
Subject: [PATCH] paged decode: route GQA-grouped tile kernel by default (F16,
gqa>=2) - patch 0011
Increment 3 (the attention lever). In fattn.cu's paged dispatch guard, route the
in-kernel decode to the tile kernel for the common grouped-query F16 case, and
keep the inc-1 vec kernel for everything else.
The tile kernel carries native GQA head-group reuse: its ncols2 axis groups the
q-heads that share one kv-head, so each K/V row is loaded once for the whole
group instead of once per q-head. vec re-streams each kv-head's K/V once per
q-head (8x for Qwen3-32B's n_head 64 / n_head_kv 8) and runs at 168 regs ->
3 blocks/SM = 25% occupancy on GB10; tile is 108-128 regs with native grouping.
The inc-2 phys(j) block-table read was already plumbed into tile (patch 0010);
this patch makes it the default for {F16 K and V, gqa_ratio >= 2}.
Routing guard (why conditional): the tile kernel has no K/V type template - it
loads half2 - so a non-F16 cache (BF16 / quantized) would be converted by
launch_fattn to a contiguous F16 copy, which breaks the in-kernel block-table
read (the table indexes the original paged layout, not the copy). So tile is
correct only for an F16 cache; non-F16 caches and the non-grouped gqa==1 shape
fall back to the inc-1 vec path, exactly as before this change. The head-group
reuse also only helps at gqa_ratio >= 2. LLAMA_KV_PAGED_VEC=1 forces vec for A/B.
Note: paged decode is currently exercised with an F16 cache only; quantized +
paged is a separate pre-existing limitation, independent of this change
(verified: stock + q8_0 cache works, but paged + q8_0 aborts both before and
after this patch, since both route the non-F16 cache to vec).
Measured GB10 (sm_121, 48 SM), Qwen3-32B NVFP4 dense, F16 cache, gqa 8, batch 32,
1024 ctx, llama-batched-bench npp=1024 ntg=128 npl=32, GGML_CUDA_DISABLE_GRAPHS=1,
same build, env-toggled:
  STOCK (mma)          174.8 ms/step   183.1 t/s
  PAGED-VEC (inc-1)    186.3 ms/step   171.8 t/s   (+6.6% vs stock)
  PAGED-TILE (inc-3)   177.9 ms/step   179.8 t/s   (+1.8% vs stock)
GQA grouping recovers 8.4 ms/step (-4.5%) over the inc-1 vec default and brings
paged decode to within 1.8% of stock. The win grows with context (npl=8, tile vs
vec decode step): 1024 -2.3%, 4096 -3.3%, 8192 and 16384 wider, as attention
takes a larger share of the step.
Why not the split-K tune: the vec decode grid is already block-saturated
(1 x parallel_blocks 3 x 2048 = 6144 blocks ~ 43 waves over 144 resident on 48
SM), so raising parallel_blocks / KV_max adds no SM fill. The under-saturation is
intra-SM (occupancy + the 8x KV re-streaming), which GQA grouping attacks
directly; more split-K does not.
Correctness (greedy, GGML_CUDA_DISABLE_GRAPHS=1):
- CPU plumbing gate (Qwen3-0.6B, build-cpu, paged-on vs off): BYTE-IDENTICAL.
- GPU 0.6B gqa=2, 8 seq x 48 tok: tile is token-identical to the inc-1 vec path
in 7/8 sequences; the 8th diverges at token 5, within the same kernel-noise
band where vec also drifts from stock. Stock uses the mma kernel for this
multi-stream GQA shape, so a different kernel = different rounding =
autoregressive token drift; vec and tile agree with each other while both
differ from stock (both pick 15678 where stock picks 38835), confirming the
drift is kernel choice, not a paging error.
- GPU 32B gqa=8, 4 seq x 40 tok: tile tracks stock at least as well as vec
(seq3: tile == stock == 624 at the token where vec picked 13).
Stock is byte-identical: the dispatch guard only diverts when the block table
(src[5]) is set; the non-paged best-kernel switch is untouched. The ncols2>1 tile
path reads the last nbatch_fa tile with oob_check=false and relies on the mask
-inf padding - the same pattern stock uses for ncols2>1 - and the compacted paged
mask is gathered to the n_view (GGML_PAD 256) width so it carries that padding.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---
ggml/src/ggml-cuda/fattn.cu | 51 ++++++++++++++++++++++++++-----------
1 file changed, 36 insertions(+), 15 deletions(-)
diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
index afcafa2..6b15810 100644
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -580,32 +580,53 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
// silently read the wrong (contiguous physical) cells. So when a block table
// is present we route here and NEVER fall through to the best-kernel switch
// below - no decode shape can silently reach an mma/wmma misread. build_attn
- // only sets src[5] for the 1-token-per-stream decode shape; the vec
+ // only sets src[5] for the 1-token-per-stream decode shape; the vec/tile
// dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
// and any shape that should not be paged must take the host-side gather path
// (LLAMA_KV_PAGED_GATHER=1) instead.
//
- // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
- // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
- // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
- // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
- // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
- // with oob_check=false while the compacted paged mask is not padded to cover
- // it, so it diverges from stock. Not for production paged decode until
- // increment-3 bounds that path; the default vec route is unaffected.
+ // Default route = the GQA-grouped TILE kernel (inc-3) WHEN it is both correct
+ // and a win, else the inc-1 vec path. Tile groups the q-heads that share one
+ // kv-head (ncols2), loading each K/V row once for the whole group instead of
+ // once per q-head, and runs at higher occupancy than vec (108-128 regs vs 168).
+ // Two constraints make this conditional: (1) the tile kernel has no K/V type
+ // template - it loads half2 - so a non-F16 cache (BF16/quantized) would be
+ // converted by launch_fattn to a contiguous F16 copy, which breaks the
+ // in-kernel block-table read (the table indexes the original paged layout, not
+ // the copy); vec instead reads the original cache with in-kernel dequant, so it
+ // is the only correct paged path for non-F16 caches. (2) the head-group reuse
+ // only helps when gqa_ratio>=2. So route to tile only for {F16 K and V,
+ // gqa_ratio>=2}; everything else stays on vec, matching stock (which also sends
+ // quantized-cache decode to the vector kernel). Measured on GB10 (Qwen3-32B
+ // nvfp4, F16 cache, gqa 8, batch 32, 1024 ctx): tile 177.9 ms/step vs vec 186.3
+ // vs stock 174.8 - GQA grouping recovers ~4.5% over the inc-1 vec default and
+ // brings paged decode to within ~1.8% of stock. Validated token-coherent with vec:
+ // 0.6B 8-seq 7/8 identical (8th within the kernel-noise band where vec also
+ // drifts from stock), 32B gqa=8 tile tracks stock at least as well as vec, CPU
+ // plumbing gate byte-identical. The ncols2>1 tile path reads the last nbatch_fa
+ // tile with oob_check=false relying on mask -inf padding (the SAME pattern stock
+ // uses for ncols2>1); the compacted paged mask is gathered to the n_view
+ // (GGML_PAD 256) width so it carries that padding. LLAMA_KV_PAGED_VEC=1 forces
+ // the inc-1 vec path for A/B.
if (dst->src[5] != nullptr) {
- static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
+ const ggml_tensor * Qp = dst->src[0];
+ const ggml_tensor * Kp = dst->src[1];
+ const ggml_tensor * Vp = dst->src[2];
+ const bool kv_f16 = Kp->type == GGML_TYPE_F16 && Vp->type == GGML_TYPE_F16;
+ const int64_t gqa_ratio = Kp->ne[2] > 0 ? Qp->ne[2] / Kp->ne[2] : 1;
+ const bool force_vec = getenv("LLAMA_KV_PAGED_VEC") != nullptr;
+ const bool use_tile = !force_vec && kv_f16 && gqa_ratio >= 2;
if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
static bool logged = false;
if (!logged) {
logged = true;
- fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
- paged_tile ? "TILE(experimental)" : "VEC",
- (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
- (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
+ fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld] gqa=%ld kv_f16=%d)\n",
+ use_tile ? "TILE(gqa)" : "VEC",
+ (long) Qp->ne[0], (long) Qp->ne[1], (long) Qp->ne[2], (long) Qp->ne[3],
+ (long) gqa_ratio, (int) kv_f16);
}
}
- if (paged_tile) {
+ if (use_tile) {
ggml_cuda_flash_attn_ext_tile(ctx, dst);
} else {
ggml_cuda_flash_attn_ext_vec(ctx, dst);
--
2.43.0
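
The routing rule of patch 0011 reduces to a small predicate, sketched here outside of ggml. The enum `paged_route` and the function `choose_paged_route` are hypothetical names for illustration; the patch computes the same condition inline from `dst->src[0..2]` and the `LLAMA_KV_PAGED_VEC` environment variable.

```cpp
#include <cassert>
#include <cstdint>

enum class paged_route { vec, tile };

// Hypothetical helper: tile only for an F16 K and V cache with grouped-query
// attention (gqa_ratio >= 2), unless vec is forced; everything else stays on vec.
paged_route choose_paged_route(bool k_is_f16, bool v_is_f16,
                               int64_t n_head_q, int64_t n_head_kv, bool force_vec) {
    assert(n_head_q > 0 && n_head_kv >= 0);
    // Same guard as the patch: avoid dividing by zero, treat it as non-grouped.
    const int64_t gqa_ratio = n_head_kv > 0 ? n_head_q / n_head_kv : 1;
    const bool use_tile = !force_vec && k_is_f16 && v_is_f16 && gqa_ratio >= 2;
    return use_tile ? paged_route::tile : paged_route::vec;
}
```

For the shapes quoted in the commit message, 64 query heads over 8 KV heads with an F16 cache selects tile; a non-F16 V cache, a gqa ratio of 1, or the force flag each select vec.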

@@ -1,50 +0,0 @@
From 6e3e976e2b11adb05519f31dd5aad0c204678f5c Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 11:12:05 +0200
Subject: [PATCH] feat(paged): assert mask-pad invariant for the paged tile
route (patch 0012)
The now-default paged decode route (GQA-grouped fattn-tile kernel) does not
leak past-end KV rows only because the compacted mask/block-table length is
padded to a whole number of flash-attn KV tiles: n_view = GGML_PAD(n_gather,
256), and the tile (nbatch_fa = 64 for head_dim 128) divides 256, so the last
tile sits entirely inside the -inf pad window. That invariant was implicit.
Add a defensive GGML_ASSERT(n_view % 64 == 0) right after the pad/clamp so a
future change to the pad (e.g. < 256) or the tile (> 256) that broke the
whole-tile property cannot silently reintroduce the leak. Additive only, no
behaviour change.
Verified: build-cpu compiles, and the paged CPU byte gate (LLAMA_KV_PAGED off
vs on, Qwen3-0.6B-Q8_0, greedy, -ngl 0) stays byte-identical while the assert
stays silent (n_view remains a whole number of tiles across all decode steps).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/paged-attn.cpp | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
index 8eebeaa..fed8ca9 100644
--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
@@ -201,6 +201,15 @@ bool in_kernel_decode(ggml_context * ctx0,
n_view = K->ne[2];
}
+ // The flash-attn KV tile is 64 rows wide (nbatch_fa for head_dim 128). n_view must be
+ // a whole number of such tiles so the in-kernel decode never reads past the gathered
+ // rows: the trailing pad cells [n_gather, n_view) are all -inf, so any tile straddling
+ // the boundary still contributes zero. This holds today only because the pad (256) is a
+ // multiple of the tile; a future pad < 256 (or nbatch_fa > 256) that broke it would
+ // silently reintroduce a past-end KV leak, so assert it rather than trust it.
+ // pad must be a multiple of the flash-attn KV tile so the last tile is fully inside the -inf pad
+ GGML_ASSERT(n_view % 64 == 0);
+
ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
ggml_set_input(idx);
res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
--
2.43.0

@@ -1,137 +0,0 @@
From 17d97cb74e3e8c93751afd33f5c183e57056fde9 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 11:52:45 +0200
Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
0013)
llama-server already co-batches decode with chunked prefill: update_slots()
appends every generating slot's sampled token first, then fills the rest of the
n_batch budget with prompt tokens, deferring the overflow to the next step. But
the prefill chunk size is hard-wired to n_batch (default 2048): one slot's
~2048-token prefill chunk lands in a single compute-heavy step, and every decode
co-batched into that step sees a multi-second inter-token-latency (ITL) spike.
Lowering n_batch shrinks the chunk but also caps decode-concurrency width and
prefill throughput, because they are coupled.
Add LLAMA_PREFILL_BUDGET: a per-step prefill-token budget decoupled from n_batch
(the analogue of vLLM's --max-num-batched-tokens / long_prefill_token_threshold).
The prompt-fill loop and the outer slot loop now also stop once this many prompt
tokens have been added in the current update_slots() step, so a long prefill is
split across more steps that each still advance in-flight decode. Default (env
unset or <= 0) = disabled, so stock behaviour is byte-identical. Orthogonal to
LLAMA_KV_PAGED: this is a pure scheduler knob and works with paged off.
Measured on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8 steady decode
streams with one 6000-token prefill injected mid-stream; same binary, only
LLAMA_PREFILL_BUDGET differs:
metric                        stock(off)  budget=256     budget=512
worst decode freeze (ms)      3380        482 (7.0x)     778 (4.3x)
median decode ITL in window   2264        411 (5.5x)     689
decode_stall (ms)             3285        387 (8.5x)     684 (4.8x)
decode steps during prefill   38          201 (5.3x)     108
injected-req TTFT (ms)        8493        10172 (+20%)   8432 (~0%)
steady-state baseline ITL     94          95             94
This is a LATENCY/fairness lever, not an aggregate-throughput lever: it flattens
the decode ITL spike a long prefill inflicts on co-batched decoders (at budget=256:
8.5x smaller decode stall, 7.0x smaller worst freeze, and 5.3x more decode steps
during the prefill), in exchange for a modest TTFT rise on the long request (the
classic chunked-prefill trade-off; budget=512 buys a 4.8x smaller decode stall
with ~no TTFT cost). Steady aggregate decode is
unchanged: it is bandwidth/weight-capped on GB10 (the NVFP4 weight-read floor),
which the scheduler cannot lift.
Correctness (same model, greedy temp 0, fa on):
- budget unset or >= n_batch: byte-identical to stock (the added break never
fires before the existing n_batch break; the off-path is a no-op by
construction).
- short prompt (<= budget): byte-identical to stock.
- the knob is exactly equivalent to stock's native -b chunking: budget=512 ==
stock -b512 and budget=256 == stock -b256, both BYTE-IDENTICAL, while keeping
n_batch=2048 for decode width.
- on a prompt larger than the budget the chunked greedy output diverges from the
single n_batch chunk only by intrinsic flash-attn chunk-size FP grouping: PURE
stock -b256 diverges from stock -b2048 the same way with the patch inactive,
and the output stays coherent and answers correctly.
Productisation (LocalAI): surface as a model options knob (max_prefill_tokens /
mpt) parsed in grpc-server.cpp, default 0 = disabled, per CHUNKED_PREFILL_PLAN
Phase B; the vendored update_slots() hunk here is that plan's scheduler patch and
stays disjoint from the paged allocation hunks.
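The chunking behaviour this message describes can be illustrated with a toy model. This is a sketch under stated assumptions, not the server code: `prefill_chunks` is a hypothetical function that tracks a single prompt, a fixed number of decode tokens per step, and the two loop conditions from the patch (the n_batch ceiling and the per-step prefill budget). It counts scheduler steps only and does not reproduce the measured latencies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model: returns the number of prompt tokens ingested in each step.
//   n_prompt - prompt length in tokens
//   n_batch  - hard per-step batch ceiling
//   n_decode - decode tokens appended first in every step
//   budget   - per-step prefill cap; <= 0 means disabled (stock behaviour)
inline std::vector<int32_t> prefill_chunks(int32_t n_prompt, int32_t n_batch,
                                           int32_t n_decode, int32_t budget) {
    assert(n_batch > 0);
    // Same clamp as the patch: an enabled budget never exceeds n_batch.
    const int32_t cap = budget > 0 ? std::min(n_batch, budget) : 0;

    std::vector<int32_t> chunks;
    int32_t done = 0;
    while (done < n_prompt) {
        int32_t batch_tokens = n_decode; // decode tokens go in first
        int32_t added = 0;               // prompt tokens added this step
        while (done < n_prompt && batch_tokens < n_batch &&
               (cap == 0 || added < cap)) {
            ++done;
            ++batch_tokens;
            ++added;
        }
        if (added == 0) {
            break; // decode alone fills the batch; avoid looping forever
        }
        chunks.push_back(added);
    }
    return chunks;
}
```

With a 6000-token prompt, n_batch 2048 and 8 decode tokens per step, the model needs 3 steps with the budget disabled, 12 steps at budget 512, and 24 steps at budget 256, which is the trade the message describes: more, smaller prefill steps that each still carry the decode tokens.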
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 04c6361..5d83b30 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2723,6 +2723,29 @@ private:
int32_t n_batch = llama_n_batch(ctx_tgt);
int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
+ // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
+ // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
+ // tokens ingested per update_slots() step at n_batch only; with cont_batching the
+ // sampled decode tokens of every generating slot are appended FIRST, then prompt
+ // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
+ // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
+ // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
+ // tokens added per step independently of n_batch, splitting a long prefill across
+ // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
+ // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
+ // (this is a pure scheduler knob; works with paged off).
+ int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+ {
+ const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
+ if (env_pb) {
+ const int v = atoi(env_pb);
+ if (v > 0) {
+ n_prefill_budget = std::min(n_batch, std::max(1, v));
+ }
+ }
+ }
+ int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
+
float alora_scale = -1.0f;
size_t alora_disabled_id = 0;
@@ -3159,7 +3182,10 @@ private:
const bool n_before_user_known = n_before_user > 0;
// add prompt tokens for processing in the current batch
- while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) {
+ // (patch 0013) also stop once the per-step prefill budget is spent, so a long
+ // prompt is split across more steps and leaves batch room for co-batched decode
+ while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
+ (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
// get next token to process
llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
if (cur_tok == LLAMA_TOKEN_NULL) {
@@ -3185,6 +3211,7 @@ private:
slot.prompt.tokens.push_back(cur_tok);
slot.n_prompt_tokens_processed++;
+ n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
// stop the prompt batch exactly before the latest user input, so a checkpoint
// can be created after the previous messages
@@ -3293,6 +3320,12 @@ private:
if (batch.n_tokens >= n_batch) {
break;
}
+
+ // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
+ // leaving the remaining batch capacity for co-batched decode of other slots
+ if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+ break;
+ }
}
}
--
2.43.0

@@ -1,140 +0,0 @@
From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 15:47:06 +0200
Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)
On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
originally reported npl128 throughput cliff does NOT reproduce on this build.
llama-batched-bench decode (S_TG t/s) is monotonic across batch:
npl    1    8   32   64   128   256
S_TG  85  282  629  935  1295  1779   (stock, mxfp4 MoE, -fa on)
There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.
What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
column upper bound = token count, up to 128) in one column-tile. At MoE decode
the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
time and burns throughput on the padding columns while the larger y-tile lowers
occupancy. Stock picks the LARGEST tile (128) where the SMALLEST tile that still
covers the density would raise fill + occupancy at no extra weight read (at
tokens/expert <= mmq_x there is exactly one non-empty col-tile per expert; the
emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k
kernel) - the inverse of vLLM's small per-expert BLOCK_SIZE_M.
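The fill arithmetic in the paragraph above can be reproduced directly. This is a minimal sketch that assumes uniform routing across experts; `tokens_per_expert` and `tile_fill` are hypothetical helper names and are not part of ggml.

```cpp
#include <cassert>
#include <cstdint>

// Expected tokens routed to each expert when every one of n_tokens tokens
// selects top_k of n_experts (uniform routing assumed).
inline double tokens_per_expert(int64_t n_tokens, int64_t top_k, int64_t n_experts) {
    assert(n_experts > 0);
    return static_cast<double>(n_tokens * top_k) / static_cast<double>(n_experts);
}

// Fraction of an mmq_x-wide column tile that holds real tokens, capped at 1.
inline double tile_fill(double tokens, int mmq_x) {
    assert(mmq_x > 0);
    const double fill = tokens / mmq_x;
    return fill > 1.0 ? 1.0 : fill;
}
```

For npl128 with top-8 of 128 experts this gives 8 tokens per expert, so a 128-wide tile is 6.25% filled and a 64-wide tile is 12.5% filled, matching the "~6%" figure quoted above.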
Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
selection, and therefore every kernel launched, is byte-identical to stock. The
cap only ever lowers the loop's upper bound and still selects from the same
granularity- and shared-memory-validated mmq_x set stock already uses for
smaller batches, so no new kernel configuration is exercised.
Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):
npl  stock S_TG  cap64 S_TG  d%    stock S_PP  cap64 S_PP
64   936         938         +0.1  2924        2883
128  1295        1357        +4.8  3075        3038
256  1784        1825        +2.3  3085        3046
(reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)
cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
re-reads), so 64 is the recommended value and the only one that helps net.
Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
throughput unlock (llama-server continuous batching already scales). It is a
modest high-effective-batch DECODE micro-optimization that matches vLLM's
smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
patches/paged/MOE_GROUPED_GEMM_SCOPE.md.
Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
1 file changed, 36 insertions(+), 1 deletion(-)
diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
index edf546d..cff608e 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
@@ -6,6 +6,7 @@
#include <climits>
#include <cstdint>
+#include <cstdlib>
using namespace ggml_cuda_mma;
@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
}
}
+// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+static inline int ggml_cuda_moe_mmq_x_cap() {
+ static const int cap = []() -> int {
+ const char * s = getenv("LLAMA_MOE_MMQ_X");
+ return s ? atoi(s) : 0;
+ }();
+ return cap;
+}
+
template <ggml_type type>
void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
const int id = ggml_cuda_get_device();
@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
const int mmq_x_max = get_mmq_x_max_host(cc);
const int mmq_y = get_mmq_y_host(cc);
+ // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+ // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+ // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+ // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+ // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+ // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+ // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+ // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+ // per-expert density raises tile fill + occupancy with no extra weight reads (at
+ // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+ // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+ // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+ // off the ids path the cap never applies.
+ int mmq_x_lim = mmq_x_max;
+ if (args.expert_bounds != nullptr) {
+ const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+ if (moe_cap > 0) {
+ const int cap = moe_cap < 8 ? 8 : moe_cap;
+ mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+ }
+ }
+
int mmq_x_best = 0;
int ntiles_x_best = INT_MAX;
- for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
+ for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
const int granularity = mmq_get_granularity_host(mmq_x, cc);
if (mmq_x % granularity != 0 || mmq_get_nbytes_shared<type>(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
--
2.43.0

@@ -1,238 +0,0 @@
From 151343bc8c7b956c99eafc855704b70d44637a3b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 21:03:00 +0200
Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
(patch 0015)
The durable follow-up to patch 0014's blunt LLAMA_MOE_MMQ_X global cap (which the
0014 doc itself scoped): replace the manual env cap with a host-side, default-on
auto-select inside mul_mat_q_case that picks a small token-tile (mmq_x) for the
MUL_MAT_ID grouped FP4-MMA GEMM only when the per-expert token density is low
(decode), and keeps the large 128-wide tile when density is high (prefill). No new
kernel: the selection only lowers the loop's upper bound to an already-compiled,
granularity- and shared-memory-validated mmq_x.
Density is estimated host-side from the args the ids path already passes:
  ne_get_rows = ncols_dst   = ne12 * n_expert_used   (token-expert assignments)
  n_experts   = nchannels_x = ne02
  density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
Cap to the small tile (default 64) only when density <= density_max. Unlike 0014's
global cap, the high-density prefill ubatch stays on the big tile, so S_PP does not
regress by construction.
density_max default = 8 (not tile/4 = 16). The cap must fire for decode but not for
a prefill ubatch, and each has per-expert density n_tokens*n_used/n_experts. At the
standard n_ubatch=512, n_used=8: prefill density = 4096/n_experts (32 at 128 experts,
16 at 256), decode at npl<=128 is <= 1024/n_experts (8 at 128, 4 at 256). Default 8
sits strictly between for every n_experts in [128,511], so it caps decode and leaves
prefill on the big tile. tile/4 (=16) equalled the 256-expert prefill density and
cratered its S_PP by ~2%, the regression this threshold exists to avoid.
Measured on GB10 (sm_121), Qwen3.6-35B-A3B NVFP4 (256 experts, top-8, GDN linear
attention), llama-batched-bench -fa on -npp 128 -ntg 128, default-on vs stock
(LLAMA_MOE_AUTO_TILE=0), median of 5 reps:
npl  S_TG stock  S_TG 0015  dTG%    S_PP stock  S_PP 0015  dPP%
8    183.59      183.18     -0.22%  1489.2      1500.1     +0.73%
32   264.02      263.44     -0.22%  2034.5      2033.5     -0.05%
64   311.76      310.41     -0.43%  2028.3      2027.6     -0.03%
128  336.10      337.32     +0.36%  2025.0      2027.7     +0.13%
Honest read: on THIS model the decode effect is within run-to-run noise (neutral)
and prefill is neutral. Qwen3.6-35B-A3B decode is bound by the GDN/SSM recurrence and
256 tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
lever (worth +4.8% @npl128 on Qwen3-Coder-30B, 128 larger experts, patch 0014
cap64) does not move it. A npl128 tile sweep on this model confirms 64 is the only
useful width (TILE8 -6.3%, TILE16 -3.2%, TILE32 -0.2%, TILE64 +0.7%, TILE96 -0.8%):
smaller tiles lose to grid/scheduling overhead and the FP4-MMA minimum width.
Value banked default-on: (1) removes 0014's ~1.3% prefill cost by construction
(density-gated, not global); (2) auto-selects the small tile for col-tile-bound MoE
decode, reproducing 0014 cap64's tile=64 at npl128 by construction, so it preserves
the +4.8% on Qwen3-Coder-30B without the prefill cost; (3) prefill-safe and decode-
neutral on the SSM model, harmless where it does not help. Conservative by design:
at npl256 the qwen3coder decode density (16) equals the 256-expert prefill density
(16), indistinguishable to a pure-density gate, so density_max=8 forgoes 0014's
+2.3% @npl256 to keep 256-expert prefill safe; an ne12-aware refinement is future
work.
LLAMA_MOE_MMQ_X (patch 0014) is KEPT as a manual override that, when > 0, forces the
old blunt global cap and bypasses the auto-select (explicit A/B knob). The auto-
select is the default; LLAMA_MOE_AUTO_TILE=0 restores exact stock mmq_x selection.
LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX tune the small tile / threshold.
Correctness: extends tests/test-backend-ops test_mul_mat_id with a ragged small-M
NVFP4/MXFP4 MoE decode-density gate (128 experts, top-8, m=768, k=2048, n in
{16,33,64,128,130,200,256,512} spanning the cap boundary and ragged token counts).
All 16 shapes pass CUDA-vs-CPU oracle on GB10 both default-on and with
LLAMA_MOE_AUTO_TILE=0; full MUL_MAT_ID suite 2/2 backends OK. Off the ids path
nothing changes (non-MoE mul_mat byte-identical to stock).
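The density gate described in this message can be sketched as a standalone function. This follows the arithmetic of the hunk below but is not the patch itself: `TileParams`, `expert_density`, and `tile_limit` are hypothetical names, and the defaults (128, 64, 8) restate the values quoted here rather than reading any environment variable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the knobs the patch reads from env vars / device caps.
struct TileParams {
    int mmq_x_max   = 128; // large tile quoted in the message
    int decode_tile = 64;  // LLAMA_MOE_DECODE_TILE default
    int density_max = 8;   // LLAMA_MOE_DENSITY_MAX default
};

// Average token-expert assignments per active expert, rounded up.
inline int64_t expert_density(int64_t n_tokens, int64_t n_used, int64_t n_experts) {
    const int64_t rows = n_tokens * n_used; // ne_get_rows
    assert(rows > 0 && n_experts > 0);
    const int64_t n_active = rows < n_experts ? rows : n_experts;
    return (rows + n_active - 1) / n_active;
}

// Upper bound for the mmq_x search: the small tile at low density, else the large one.
inline int tile_limit(int64_t n_tokens, int64_t n_used, int64_t n_experts,
                      const TileParams & p) {
    if (n_tokens <= 0 || n_used <= 0 || n_experts <= 0) {
        return p.mmq_x_max; // nothing to estimate; keep the stock limit
    }
    const int64_t density = expert_density(n_tokens, n_used, n_experts);
    if (density <= p.density_max && p.decode_tile < p.mmq_x_max) {
        return p.decode_tile;
    }
    return p.mmq_x_max;
}
```

With the defaults, decode at npl128 (density 8 at 128 experts, 4 at 256) gets the 64 limit, while a 512-token prefill ubatch (density 32 at 128 experts, 16 at 256) and npl256 decode on 128 experts (density 16) keep the 128 limit, which is the conservative behaviour the message describes.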
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/mmq.cuh | 100 ++++++++++++++++++++++++++++++-------
tests/test-backend-ops.cpp | 16 ++++++
2 files changed, 99 insertions(+), 17 deletions(-)
diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
index cff608e..9718b12 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
@@ -4053,10 +4053,11 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
}
}
-// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
-// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
-// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
-// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+// [paged patch 0014] MoE token-tile (mmq_x) MANUAL cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (fall through to the patch-0015 auto-select).
+// When > 0 it forces a blunt GLOBAL cap on the per-expert column-tile width for the MUL_MAT_ID
+// grouped-GEMM path (decode AND prefill), overriding the density-aware auto-select below. Kept
+// as an explicit override / A-B knob; the default path is now the auto-select.
static inline int ggml_cuda_moe_mmq_x_cap() {
static const int cap = []() -> int {
const char * s = getenv("LLAMA_MOE_MMQ_X");
@@ -4065,6 +4066,43 @@ static inline int ggml_cuda_moe_mmq_x_cap() {
return cap;
}
+// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select knobs (DEFAULT-ON).
+// LLAMA_MOE_AUTO_TILE=0 disables the auto-select => exact stock mmq_x selection.
+static inline bool ggml_cuda_moe_auto_tile_enabled() {
+ static const bool en = []() -> bool {
+ const char * s = getenv("LLAMA_MOE_AUTO_TILE");
+ return !(s && atoi(s) == 0);
+ }();
+ return en;
+}
+// The small high-occupancy token-tile chosen for low-density (decode) MoE matmuls. Default 64:
+// the measured GB10 sweet spot (full per-expert fill with >=4x routing-imbalance headroom).
+static inline int ggml_cuda_moe_decode_tile() {
+ static const int t = []() -> int {
+ const char * s = getenv("LLAMA_MOE_DECODE_TILE");
+ const int v = s ? atoi(s) : 0;
+ return v >= 8 ? v : 64;
+ }();
+ return t;
+}
+// Per-expert token-density ceiling under which the small tile is selected. Default 8: the cap must
+// fire for decode but NOT for a prefill ubatch, and the per-expert density of each is
+// n_tokens*n_used/n_experts. For the standard n_ubatch=512, n_used=8 the prefill density is
+// 4096/n_experts (= 32 at 128 experts, 16 at 256 experts); decode at npl<=128 is <=1024/n_experts
+// (= 8 at 128 experts, 4 at 256). Default 8 sits strictly between the two for every n_experts in
+// [128,511], so it caps decode and leaves the prefill ubatch on the big 128 tile - whereas the old
+// tile/4 (=16) equalled the 256-expert prefill density and cratered its S_PP by ~2% (measured on
+// Qwen3.6-35B-A3B NVFP4). 8 also keeps >=8x fill headroom at tile 64 so an imbalanced expert
+// segment never splits into an extra col-tile.
+static inline int ggml_cuda_moe_density_max() {
+ static const int d = []() -> int {
+ const char * s = getenv("LLAMA_MOE_DENSITY_MAX");
+ const int v = s ? atoi(s) : 0;
+ return v > 0 ? v : 8;
+ }();
+ return d;
+}
+
template <ggml_type type>
void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
const int id = ggml_cuda_get_device();
@@ -4076,25 +4114,53 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
const int mmq_x_max = get_mmq_x_max_host(cc);
const int mmq_y = get_mmq_y_host(cc);
- // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
- // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
- // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
- // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
- // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
- // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
- // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
- // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
- // per-expert density raises tile fill + occupancy with no extra weight reads (at
- // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
- // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
- // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
- // off the ids path the cap never applies.
+ // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
+ // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
+ // sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, up to 128)
+ // in a single column-tile, i.e. it MAXIMIZES the tile (128 on Blackwell) for the aggregate
+ // batch. But the tile is then applied PER EXPERT, and at MoE decode the per-expert token
+ // density is tiny (top-k of many experts), so each expert's single 128-wide col-tile is mostly
+ // empty: the MMA accumulator tile is mmq_x-wide at compile time and burns throughput on the
+ // padding columns while the larger y-tile lowers occupancy. vLLM's fused-MoE does the opposite
+ // (a small per-expert BLOCK_SIZE_M). We reproduce that here, host-side only, by picking a
+ // SMALLER mmq_x when - and only when - the per-expert density is low:
+ //
+ // ne_get_rows = args.ncols_dst = ne12 * n_expert_used (total token-expert assignments)
+ // n_experts = args.nchannels_x = ne02
+ // n_active_est = min(n_experts, ne_get_rows) (upper bound on active experts)
+ // density = ceil(ne_get_rows / n_active_est) (avg tokens per active expert)
+ //
+ // Cap to the small tile (default 64) only when density <= density_max (default 8). 8 sits below
+ // every prefill-ubatch density and above every decode density for n_experts in [128,511] at the
+ // standard n_ubatch=512 (prefill 4096/n_experts, decode <=1024/n_experts), with >=8x fill headroom
+ // so a capped expert segment never splits a col-tile. Decode (per-expert density 4 at 256 experts,
+ // 8 at 128 experts @npl128) gets the fuller high-occupancy tile; the prefill ubatch (density 16 at
+ // 256 / 32 at 128 experts) stays ABOVE the threshold and keeps the big
+ // 128 compute tile - so unlike the blunt global cap (LLAMA_MOE_MMQ_X / patch 0014) this is
+ // prefill-safe by construction. The selection only ever picks an already-compiled, granularity-
+ // and shared-memory-validated mmq_x that the loop below would consider for a smaller batch; no
+ // new kernel. Off the ids path (expert_bounds == nullptr) nothing changes => non-MoE mul_mat
+ // and the gated f16/bf16 host-loop fallback stay byte-identical to stock.
+ // - LLAMA_MOE_MMQ_X=<n> : manual blunt global cap, overrides the auto-select (patch 0014).
+ // - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
+ // - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
int mmq_x_lim = mmq_x_max;
if (args.expert_bounds != nullptr) {
const int moe_cap = ggml_cuda_moe_mmq_x_cap();
if (moe_cap > 0) {
const int cap = moe_cap < 8 ? 8 : moe_cap;
mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+ } else if (ggml_cuda_moe_auto_tile_enabled()) {
+ const int64_t ne_get_rows = args.ncols_dst;
+ const int64_t n_experts = args.nchannels_x;
+ if (ne_get_rows > 0 && n_experts > 0) {
+ const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
+ const int64_t density = (ne_get_rows + n_active - 1) / n_active;
+ const int tile = ggml_cuda_moe_decode_tile();
+ if (density <= (int64_t) ggml_cuda_moe_density_max() && tile < mmq_x_max) {
+ mmq_x_lim = tile;
+ }
+ }
}
}
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 15ae389..f219309 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -8575,6 +8575,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
// gpt-oss issue with Vulkan mmq_id
test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
+ // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
+ // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
+ // m=768, k=2048) so this exercises the exact grouped FP4-MMA mmq kernel the model runs.
+ // Per-expert token density = n*n_used/n_mats = n/16; cover the decode band (density 1/4/8/16
+ // at n 16/64/128/256), ragged token counts (n 33/130/200: experts with 0/1/2 tokens, n not a
+ // multiple of the tile) where the tiny-M col-tiles change geometry and any masking can leak,
+ // and a prefill-density shape (n 512 => density 32) the auto-select must leave on the large
+ // 128 tile. n=128 (density 8, the default threshold) is where stock picks mmq_x=128 and the auto-select picks 64, so the
+ // op-test (CPU oracle vs CUDA, deterministic) is the bit-exact regression gate for P1: it must
+ // pass with the auto-select on (default) and with LLAMA_MOE_AUTO_TILE=0 (stock selection).
+ for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+ for (int n : {16, 33, 64, 128, 130, 200, 256, 512}) {
+ test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 128, 8, false, 768, n, 2048));
+ }
+ }
+
for (ggml_type type_a : all_types) {
test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
}
--
2.43.0

@@ -1,205 +0,0 @@
From 0a2677c6e6c608f9c0ec657faa0ff04a03370aa6 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 24 Jun 2026 07:44:25 +0000
Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
0016, continuous-batch P1)
Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
decode-first token budget: the P1 of the token-granular continuous-batch
scheduler scoped in CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. This is a POLICY
change only inside update_slots(): no new slot states, no batch-formation
rewrite, zero libllama changes. llama-server already emits one unified
mixed prefill+decode batch per step (Phase 1 appends every ready decode
token unconditionally; Phase 2 fills prefill into the same batch); 0013
already ships that mixed ubatch. 0016 only changes the COUNT of prefill
tokens admitted per step.
The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
== D (the live decode load) is known there. Instead of 0013's constant
LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets
one long prompt monopolise the step), compute a dynamic budget:
  T = min(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_batch), floored at
      n_ubatch (the vLLM max_num_batched_tokens analogue / ITL trade knob)
  prefill_budget_step = max(n_ubatch, T - D) (leftover after decode,
      auto-shrinks as decode load rises so the step never inflates past T)
  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch
      (the long_prefill_token_threshold analogue: one long prompt cannot
      eat the whole leftover; LLAMA_PREFILL_CAP overrides)
Phase 2's inner prompt-fill loop and outer admission break are bounded by
prefill_budget_step (across slots) and a new per-slot slot_prompt_added
counter (per-slot cap), instead of the static 0013 cap; the n_batch hard
ceiling stays as the compute bound. Decode is structurally claimed first
and never capped (Phase 1), so the decode-first guarantee is free.
Why it supersedes 0013: 0013 needs a hand-picked constant (256 for dense)
that is net-negative at low npl and costs MoE TTFT; the T - D budget is
self-tuning across npl 8..128 and across dense vs MoE, holding the GB10
decode ceiling (~161 dense / ~333 MoE tok/s @npl128) WITHOUT per-workload
tuning while collapsing burst TTFT. Steady-state decode throughput is NOT
lifted (that is the decode-kernel ceiling, scoped as P3); the P1 win is
TTFT + tuning-free robustness + clean supersession of 0013.
DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
to stock. The degenerate T == n_batch case is byte-identical to stock/0013
(the determinism oracle): the leftover max(n_ubatch, n_batch - D) and the
n_batch per-slot cap both reach the existing `batch.n_tokens < n_batch`
ceiling at the same point, so no new bound fires. The legacy
LLAMA_PREFILL_BUDGET path is preserved exactly (honoured only when
LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly subsumed. Orthogonal
to LLAMA_KV_PAGED: pure scheduler policy, identical decisions paged on/off.
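The three budget formulas in this message can be written out as a small function. This is a sketch of the arithmetic only, assuming LLAMA_MAX_BATCH_TOKENS is set to a positive value and LLAMA_PREFILL_CAP is unset; `StepBudget` and `dynamic_budget` are hypothetical names, and the case where all knobs are unset (stock behaviour) is deliberately not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result bundle for one update_slots() step.
struct StepBudget {
    int32_t total;        // T: total tokens allowed in the step
    int32_t prefill_step; // prompt tokens allowed across all slots
    int32_t per_slot_cap; // prompt tokens allowed for a single slot
};

// n_decode is D, the decode tokens already in the batch after Phase 1.
inline StepBudget dynamic_budget(int32_t n_batch, int32_t n_ubatch, int32_t n_ctx,
                                 int32_t n_decode, int32_t max_batch_tokens) {
    assert(max_batch_tokens > 0);   // models the "knob set" case only
    assert(n_ubatch <= n_batch);

    // T = min(max_batch_tokens, n_batch), floored at n_ubatch.
    const int32_t T = std::max(n_ubatch, std::min(max_batch_tokens, n_batch));
    // Leftover after decode, never below one ubatch of progress.
    const int32_t step = std::max(n_ubatch, T - n_decode);
    // ceil(0.04 * n_ctx) in integer arithmetic.
    const int32_t ctx_cap =
        static_cast<int32_t>((static_cast<int64_t>(n_ctx) * 4 + 99) / 100);
    const int32_t per_slot = std::max(n_ubatch, std::min(T, ctx_cap));
    return StepBudget{T, step, per_slot};
}
```

For n_batch 2048, n_ubatch 512, n_ctx 32768 and T requested as 1024, a decode load of 8 leaves 1016 prompt tokens for the step, and a decode load of 900 hits the n_ubatch floor of 512, which shows the leftover shrinking as D rises.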
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
tools/server/server-context.cpp | 107 +++++++++++++++++++++++++-------
1 file changed, 85 insertions(+), 22 deletions(-)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 5d83b30..f7a114c 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2723,24 +2723,78 @@ private:
int32_t n_batch = llama_n_batch(ctx_tgt);
int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
- // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
- // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
- // tokens ingested per update_slots() step at n_batch only; with cont_batching the
- // sampled decode tokens of every generating slot are appended FIRST, then prompt
- // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
- // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
- // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
- // tokens added per step independently of n_batch, splitting a long prefill across
- // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
- // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
- // (this is a pure scheduler knob; works with paged off).
- int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+ // PAGED serving lever (patch 0016, supersedes 0013): dynamic decode-first
+ // per-step prefill-token budget (continuous-batch scheduler P1). llama-server
+ // already builds ONE mixed batch per update_slots() step: Phase 1 (just above)
+ // appended every generating slot's sampled token UNCONDITIONALLY, so at this point
+ // batch.n_tokens == D is the live decode load; Phase 2 (below) fills the remaining
+ // batch capacity with prompt tokens. Patch 0013 capped Phase 2 with a STATIC
+ // constant (LLAMA_PREFILL_BUDGET) that ignores D, needs per-workload tuning, and
+ // lets one long prompt monopolise the step.
+ //
+ // This computes a DYNAMIC budget instead, the vLLM v1 token-budget analogue:
+ // a single total per-step token budget T, decode claims its D tokens first
+ // (already in the batch), and prefill gets the leftover T - D distributed across
+ // waiting prompts with a per-slot chunk cap. As decode load D rises the prefill
+ // leftover auto-shrinks, so the step never inflates past T at any concurrency:
+ // the budget self-tunes across the npl range and across dense vs MoE without a
+ // hand-picked constant (the 161/333 tok/s GB10 decode ceiling is held tuning-free
+ // instead of via 0013's hand-tuned 256). Decode is structurally claimed first and
+ // never capped (Phase 1), so the decode-first guarantee is free here.
+ //
+ // LLAMA_MAX_BATCH_TOKENS (T) total per-step token budget (decode + prefill),
+ // default n_batch, clamped to [n_ubatch, n_batch] so
+ // the compute loop stays a single llama_decode and
+ // prefill keeps an n_ubatch floor of progress.
+ // LLAMA_PREFILL_CAP per-slot max prompt tokens per step (the
+ // long_prefill_token_threshold analogue), default
+ // min(T, ceil(0.04*n_ctx)) floored at n_ubatch, so
+ // one long prompt cannot eat the whole leftover.
+ // LLAMA_PREFILL_BUDGET legacy static cap (patch 0013); honoured ONLY when
+ // LLAMA_MAX_BATCH_TOKENS is unset, for back-compat.
+ //
+ // DEFAULT-OFF BYTE-IDENTICAL: with all three knobs unset, and in the degenerate
+ // T == n_batch case, behaviour is byte-identical to stock. At T == n_batch the
+ // dynamic leftover max(n_ubatch, n_batch - D) and the n_batch per-slot cap both
+ // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
+ // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
+ // scheduler policy, identical decisions with paged on or off.
+ const int32_t n_decode_in_batch = batch.n_tokens; // D: Phase 1 appended D decode tokens above
+ int32_t prefill_budget_step = 0; // 0 = disabled (stock n_batch-only chunking)
+ int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
{
- const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
- if (env_pb) {
+ int32_t mbt = 0;
+ if (const char * env_mbt = getenv("LLAMA_MAX_BATCH_TOKENS")) {
+ mbt = atoi(env_mbt);
+ }
+ if (mbt > 0) {
+ // dynamic decode-first budget (P1): T clamped to [n_ubatch, n_batch]
+ int32_t T = std::min(n_batch, mbt);
+ T = std::max(T, n_ubatch);
+ // leftover after decode, floored at n_ubatch so prefill never fully starves
+ prefill_budget_step = std::max(n_ubatch, T - n_decode_in_batch);
+ // per-slot prompt-chunk cap (long_prefill_token_threshold analogue)
+ int32_t cap = 0;
+ if (const char * env_cap = getenv("LLAMA_PREFILL_CAP")) {
+ cap = atoi(env_cap);
+ }
+ if (cap <= 0) {
+ const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
+ cap = std::min(T, std::max(n_ubatch, pct4));
+ }
+ cap = std::min(n_batch, std::max(n_ubatch, cap));
+ // at T == n_batch the leftover and cap both reach the n_batch ceiling
+ // together; pin the cap to n_batch so this case stays byte-identical
+ if (T >= n_batch) {
+ cap = n_batch;
+ }
+ prefill_cap_per_slot = cap;
+ } else if (const char * env_pb = getenv("LLAMA_PREFILL_BUDGET")) {
+ // legacy static budget (patch 0013), kept for back-compat when the
+ // dynamic knob is unset: a constant per-step prefill cap, no per-slot cap
const int v = atoi(env_pb);
if (v > 0) {
- n_prefill_budget = std::min(n_batch, std::max(1, v));
+ prefill_budget_step = std::min(n_batch, std::max(1, v));
}
}
}
@@ -3181,11 +3235,18 @@ private:
const int32_t n_before_user = slot.task->params.n_before_user;
const bool n_before_user_known = n_before_user > 0;
+ // (patch 0016) per-slot prompt tokens added this step, for the per-slot
+ // chunk cap (resets each slot); n_batch stays the hard compute ceiling
+ int32_t slot_prompt_added = 0;
+
// add prompt tokens for processing in the current batch
- // (patch 0013) also stop once the per-step prefill budget is spent, so a long
- // prompt is split across more steps and leaves batch room for co-batched decode
+ // (patch 0016) also stop once (a) the dynamic per-step prefill budget
+ // (the T - D leftover) is spent across all slots, or (b) this slot's
+ // per-slot chunk cap is hit, so a long prompt is split across more steps
+ // and leaves batch room for co-batched decode of the other slots
while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
- (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
+ (prefill_budget_step == 0 || n_prompt_budgeted < prefill_budget_step) &&
+ (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
// get next token to process
llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
if (cur_tok == LLAMA_TOKEN_NULL) {
@@ -3211,7 +3272,8 @@ private:
slot.prompt.tokens.push_back(cur_tok);
slot.n_prompt_tokens_processed++;
- n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
+ n_prompt_budgeted++; // (patch 0016) toward the dynamic per-step prefill budget
+ slot_prompt_added++; // (patch 0016) toward this slot's per-step chunk cap
// stop the prompt batch exactly before the latest user input, so a checkpoint
// can be created after the previous messages
@@ -3321,9 +3383,10 @@ private:
break;
}
- // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
- // leaving the remaining batch capacity for co-batched decode of other slots
- if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+ // (patch 0016) stop admitting prompts once the dynamic per-step prefill
+ // budget (the T - D leftover) is spent, leaving the remaining batch
+ // capacity for co-batched decode of the other slots
+ if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
break;
}
}
--
2.43.0

@@ -1,245 +0,0 @@
From 089f78d2a2c04465a566d499dbe0a67c008435a8 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 24 Jun 2026 19:56:05 +0200
Subject: [PATCH] feat(paged): FP4 decode GEMM track-B P0 gate + default-off
occupancy instrumentation (patch 0017)
Track B targets the dense NVFP4 weight GEMM (~59% of the GB10 decode step). This lands the P0
bit-exact parity gate and the P1 occupancy levers (default-off / byte-identical) and records the
honest P1 result: the cheap host/occupancy tuning does NOT lift decode_agg on GB10 (sm_121) - the
kill-gate tripped - so nothing is enabled by default.
P0 gate (tests/test-backend-ops.cpp): NVFP4/MXFP4 dense decode-shape MUL_MAT cases at the weight-
row tiling boundary (m in {2048,1600,2050} = exact + ragged vs mmq_y 64/128, n in {32,128} = decode
M, k=2048), so the bit-exact CPU-vs-CUDA oracle covers the mmq_y / min-blocks paths. Green at
default and with every lever on: MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 fail.
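
Why those three m values: a minimal sketch of the row-tile arithmetic, using the
same ceil-divide as the kernel's nty computation. RowTiling and row_tiling are
hypothetical names for this example only; a ragged last tile is what exercises
the need_check path.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: how m weight rows split into mmq_y-row tiles.
struct RowTiling {
    int64_t n_tiles;  // ceil(m / mmq_y)
    bool    ragged;   // true when the last tile is only partially filled
};

RowTiling row_tiling(int64_t m, int64_t mmq_y) {
    assert(m > 0 && mmq_y > 0);
    return { (m + mmq_y - 1) / mmq_y, m % mmq_y != 0 };
}
```

m=2048 is exact at both 64 and 128 (32 and 16 tiles), m=1600 is exact at 64
(25 tiles) but ragged at 128 (13 tiles), and m=2050 is ragged at both (33 and
17 tiles).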
P1 levers (ggml/src/ggml-cuda/mmq.cuh), all default-off => default build byte-identical to stock:
- GGML_CUDA_FP4_MMQ_Y (default 128): type-aware get_mmq_y_host/device plumbing for an NVFP4
weight-row tile override. mmq_y is rigidly nwarps*tile_C::I (=8*16=128, the mmq.cuh static_
assert), so mmq_y<128 also needs nwarps-down (a warp-remap through the shared vec_dot/loader),
left as the P2 kernel change; the host/device plumbing is in place and inert.
- GGML_CUDA_FP4_MINBLOCKS (default 1): NVFP4-only __launch_bounds__ min-resident-CTAs lever
(register-cap the FP4-MMA kernel so >1 CTA co-resides) - the bounded occupancy probe.
- GGML_CUDA_FP4_DENSE_MMQ_X (env, default off): dense col-tile re-read occupancy diagnostic.
Measured GB10 (llama-batched-bench -fa on -npp 128 -ntg 128 -npl 32,128), decode_agg (S_TG):
DENSE q36-27b-nvfp4 @npl128: P0 149.5 -> MINBLOCKS=2 147.9 (-1.1%) -> DENSE_MMQ_X=64 144.3
(-3.5%) -> =32 141.7 (-5.2%). Every occupancy probe regresses.
MoE q36-35b-a3b-nvfp4 @npl128: stock 336.3, MINBLOCKS=2 337.7 (+0.4%, noise), TILE16 324.0
(-3.7%), TILE8 316.6 (-5.9%). mmq_x-down regresses (reproduces patch 0015; GDN/BW-bound).
nsys (kill-gate evidence): the decode FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s -> 3.025s
(avg 608us -> 661us, +8.7% slower) under MINBLOCKS=2 - register-capping spills, so occupancy did
not usefully rise. Verdict: the dense M=128 tile is already weight-read/one-read-optimal at
mmq_x=128, NOT occupancy-starved via the cheap levers; the only untested lever is the structural
mmq_y-down (nwarps=4 warp-remap), deferred to P2. Bit-exact gate holds throughout.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/mmq.cuh | 85 ++++++++++++++++++++++++++++++++++----
tests/test-backend-ops.cpp | 16 +++++++
2 files changed, 92 insertions(+), 9 deletions(-)
diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
index 9718b12..b53e38a 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
@@ -140,7 +140,24 @@ static constexpr __device__ int get_mmq_x_max_device() {
#endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
}
-static int get_mmq_y_host(const int cc) {
+// [paged patch 0017 / track B] Dense NVFP4 decode mmq_y (weight-row tile) override.
+// mmq_y tiles the N (weight-row) dimension of the FP4-MMA weight GEMM. Lowering it raises the
+// number of resident CTAs (smaller per-CTA shared footprint + smaller per-thread accumulator) to
+// hide LPDDR5x weight-load latency at the M=128 decode tile, WITHOUT re-reading weights: every
+// weight row lives in exactly one row-tile, so total weight traffic is unchanged (bandwidth-
+// neutral) - the dense-decode occupancy lever from FP4_GEMM_SCOPE_B.md s3/s4.1. mmq_y is a PURE
+// N-row tiling knob: the per-output reduction over K is identical for any mmq_y, so the result
+// stays BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4 decode shapes). Default 128 == exact
+// stock behaviour (a default build is byte-identical to stock); build -DGGML_CUDA_FP4_MMQ_Y=64
+// (or 96) to enable the tune. Applies ONLY to NVFP4 on Blackwell; every other type/arch untouched.
+#ifndef GGML_CUDA_FP4_MMQ_Y
+#define GGML_CUDA_FP4_MMQ_Y 128
+#endif
+
+static int get_mmq_y_host(const int cc, const ggml_type type = GGML_TYPE_COUNT) {
+ if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)) {
+ return GGML_CUDA_FP4_MMQ_Y;
+ }
return GGML_CUDA_CC_IS_AMD(cc) ? (GGML_CUDA_CC_IS_RDNA1(cc) ? 64 : 128) :
((GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ? 128 : 64);
}
@@ -154,7 +171,13 @@ if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) {
return MMQ_ITER_K;
}
+template <ggml_type type = GGML_TYPE_COUNT>
static constexpr __device__ int get_mmq_y_device() {
+#if defined(BLACKWELL_MMA_AVAILABLE)
+ if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MMQ_Y != 128) {
+ return GGML_CUDA_FP4_MMQ_Y;
+ }
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
#if defined(GGML_USE_HIP)
#if defined(RDNA1)
return 64;
@@ -170,6 +193,28 @@ static constexpr __device__ int get_mmq_y_device() {
#endif // defined(GGML_USE_HIP)
}
+// [paged patch 0017 / track B] Dense NVFP4 decode occupancy lever: min resident CTAs per SM.
+// The FP4-MMA mul_mat_q is REGISTER-bound to 1 CTA/SM (__launch_bounds__(256,1) => ~255 regs/thread
+// => one resident block, the under-occupancy that strands the kernel at ~3% of FP4 peak at M=128).
+// Raising the __launch_bounds__ min-blocks operand register-caps the compiler so N CTAs co-reside,
+// hiding LPDDR5x weight-load latency by CTA-parallelism (the scope s4.1 occupancy goal) WITHOUT a
+// structural mmq_y/nwarps change and WITHOUT extra weight reads (each weight tile still read once).
+// Register allocation cannot change results => BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4).
+// Default 1 == exact stock behaviour (byte-identical); build -DGGML_CUDA_FP4_MINBLOCKS=2 to enable.
+// Applies ONLY to NVFP4 on Blackwell; every other type/arch keeps the stock min-blocks.
+#ifndef GGML_CUDA_FP4_MINBLOCKS
+#define GGML_CUDA_FP4_MINBLOCKS 1
+#endif
+template <ggml_type type = GGML_TYPE_COUNT>
+static constexpr __device__ int mmq_get_min_blocks_device(const int stock) {
+#if defined(BLACKWELL_MMA_AVAILABLE)
+ if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MINBLOCKS != 1) {
+ return GGML_CUDA_FP4_MINBLOCKS;
+ }
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
+ return stock;
+}
+
// Decouple shared memory tile sizes from WARP_SIZE to allow for different warp sizes.
// The K dimension of the tiles has either,
// 1*MMQ_TILE_NE_K==32 (always for TILE_Y_K) or 2*MMQ_TILE_NE_K==64 (typically for TILE_X_K),
@@ -3454,7 +3499,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
constexpr int nwarps = mmq_get_nwarps_device();
constexpr int qk = ggml_cuda_type_traits<type>::qk;
- constexpr int mmq_y = get_mmq_y_device();
+ constexpr int mmq_y = get_mmq_y_device<type>();
constexpr load_tiles_mmq_t load_tiles = mmq_type_traits<mmq_x, mmq_y, need_check, type>::load_tiles;
extern __shared__ int data_mul_mat_q[];
@@ -3531,13 +3576,13 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
template <ggml_type type, int mmq_x, bool need_check>
#if defined(GGML_USE_HIP)
#if defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
+ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
#endif // defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
#else
#if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 1)
+ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(1))
#else
- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
+ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
#endif // defined(GGML_USE_HIP)
static __global__ void mul_mat_q(
@@ -3558,7 +3603,7 @@ static __global__ void mul_mat_q(
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
constexpr int qk = ggml_cuda_type_traits<type>::qk;
- constexpr int mmq_y = get_mmq_y_device();
+ constexpr int mmq_y = get_mmq_y_device<type>();
const uint32_t nty = (nrows_x + mmq_y - 1) / mmq_y; // Number of tiles y
@@ -3790,7 +3835,7 @@ static __global__ void mul_mat_q_stream_k_fixup(
float * __restrict__ tmp_last_tile, const uint3 blocks_per_ne00, const int nrows_x, const int ncols_dst,
const int stride_col_dst, const uint3 nchannels_y, const int stride_channel_dst, const uint3 nsamples_y,
const int stride_sample_dst, const uint3 ntx) {
- constexpr int mmq_y = get_mmq_y_device();
+ constexpr int mmq_y = get_mmq_y_device<type>();
constexpr int qk = ggml_cuda_type_traits<type>::qk;
constexpr int ITER_K = get_iter_k(type);
constexpr int blocks_per_iter = ITER_K / qk;
@@ -3947,7 +3992,7 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
const int nsm = ggml_cuda_info().devices[id].nsm;
const int warp_size = ggml_cuda_info().devices[id].warp_size;
const int nwarps = mmq_get_nwarps_host(cc, warp_size);
- const int mmq_y = get_mmq_y_host(cc);
+ const int mmq_y = get_mmq_y_host(cc, type);
const dim3 block_dims(warp_size, nwarps, 1);
@@ -4103,6 +4148,21 @@ static inline int ggml_cuda_moe_density_max() {
return d;
}
+// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off).
+// GGML_CUDA_FP4_DENSE_MMQ_X=<n> caps the dense (non-MoE) NVFP4 col-tile to <n>, splitting the M=128
+// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost
+// in the BW-bound regime) but multiplies resident CTAs. This is the scope s4.1 A/B probe: if
+// decode_agg RISES with cap=64 despite the 2x weight read, occupancy is badly broken (the kernel is
+// compute/occupancy-bound, so mmq_y-down / min-blocks has large upside); if it FALLS, the tile is
+// already bandwidth-saturated and the occupancy ceiling is lower. Unset/<=0 => stock selection.
+static inline int ggml_cuda_fp4_dense_mmq_x_cap() {
+ static const int c = []() -> int {
+ const char * s = getenv("GGML_CUDA_FP4_DENSE_MMQ_X");
+ return s ? atoi(s) : 0;
+ }();
+ return c;
+}
+
template <ggml_type type>
void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
const int id = ggml_cuda_get_device();
@@ -4112,7 +4172,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
const int nwarps = mmq_get_nwarps_host(cc, warp_size);
const int mmq_x_max = get_mmq_x_max_host(cc);
- const int mmq_y = get_mmq_y_host(cc);
+ const int mmq_y = get_mmq_y_host(cc, type);
// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
// On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
@@ -4145,6 +4205,13 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
// - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
// - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
int mmq_x_lim = mmq_x_max;
+ if (args.expert_bounds == nullptr && type == GGML_TYPE_NVFP4) {
+ // dense NVFP4 decode mmq_x re-read occupancy diagnostic (see ggml_cuda_fp4_dense_mmq_x_cap).
+ const int cap = ggml_cuda_fp4_dense_mmq_x_cap();
+ if (cap > 0 && cap < mmq_x_max) {
+ mmq_x_lim = cap < 8 ? 8 : cap;
+ }
+ }
if (args.expert_bounds != nullptr) {
const int moe_cap = ggml_cuda_moe_mmq_x_cap();
if (moe_cap > 0) {
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index f219309..291c275 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -8591,6 +8591,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
}
}
+ // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate.
+ // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the
+ // NVFP4 decode path to raise resident-CTA occupancy. mmq_y is a pure N-row tiling knob, so a
+ // smaller mmq_y must stay BIT-EXACT (identical per-output reduction over K) - this gate proves
+ // it. m = weight rows (N, tiled by mmq_y): 2048 (exact at mmq_y 64 & 128), 1600 (ragged vs 128),
+ // 2050 (ragged vs both 64 & 128 -> exercises the need_check last-row-tile at both). n = decode
+ // token count M = 32 and 128 (the scope decode shapes, tiled by mmq_x). k = 2048 hidden. Must
+ // pass with the default build (mmq_y=128) AND a mmq_y=64 build, CUDA-vs-CPU oracle, bit-exact.
+ for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+ for (int64_t m : {2048, 1600, 2050}) {
+ for (int64_t n : {32, 128}) {
+ test_cases.emplace_back(new test_mul_mat(type_a, GGML_TYPE_F32, m, n, 2048, {1, 1}, {1, 1}));
+ }
+ }
+ }
+
for (ggml_type type_a : all_types) {
test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
}
--
2.43.0

View File

@@ -1,349 +0,0 @@
From 17f16e8f6d8dbc689d5151c44759792d683c957b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 00:44:13 +0200
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state
write-back (patch 0018)
Decode on the Qwen3.6 hybrid-SSM models (arch qwen35, 48 gated-DeltaNet :
16 full-attention layers) was dominated by recurrent-state plumbing, not the
FP4 GEMM. Per SSM layer per step the fused gated_delta_net op wrote its new
recurrent state into graph scratch, then a separate ggml_cpy persisted it into
the recurrent-state cache. nsys attributed 18.9% of decode GPU time to that
~225 MB/copy D2D memcpy (1584 ops, 356 GB over the A2 decompose window).
This mirrors vLLM fused_recurrent_gated_delta_rule (state kept in place):
ggml_gated_delta_net_inplace writes the final recurrent state directly into the
active sequences' contiguous cache slot (at kv_head), removing the copy-back. The
op output then carries only the attention scores; the SSM arithmetic is
unchanged (bit-identical greedy output vs the copy-back baseline).
- new op builder ggml_gated_delta_net_inplace (src[6] = state_dst cache view)
- CUDA + CPU honor src[6]; final-state (K==1, keep_rs off) write redirected there
- delta-net-base build_recurrent_attn uses it on the fused decode/prefill path,
dropping the ggml_cpy; rollback (n_rs_seq>0) path unchanged
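
Sketch of the K == 1 output layout and the write-target choice the bullets
above describe. GdnLayout, gdn_output_layout, state_offset and pick_state_base
are hypothetical names for illustration; the element counts follow the ne[] and
offset expressions in this patch, and the dimensions in the note below are
assumed values, not measured model shapes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of the op output buffer: attention scores first, then the
// (K == 1) final-state region.
struct GdnLayout {
    int64_t attn_elems;   // S_v * H * n_tokens * n_seqs
    int64_t state_elems;  // S_v * S_v * H * n_seqs
    int64_t total_elems;  // matches ne = { S_v*H, n_tokens*n_seqs + S_v*n_seqs }
};

GdnLayout gdn_output_layout(int64_t S_v, int64_t H, int64_t n_tokens, int64_t n_seqs) {
    assert(S_v > 0 && H > 0 && n_tokens > 0 && n_seqs > 0);
    const int64_t attn  = S_v * H * n_tokens * n_seqs;
    const int64_t state = S_v * S_v * H * n_seqs;
    return { attn, state, (S_v * H) * (n_tokens * n_seqs + S_v * n_seqs) };
}

// Per-(sequence, head) element offset inside a dense state block.
int64_t state_offset(int64_t seq, int64_t head, int64_t S_v, int64_t H) {
    return (seq * H + head) * S_v * S_v;
}

// With a cache view supplied, the final state goes straight into it; otherwise
// it lands after the attention scores in the op output (copy-back baseline).
float * pick_state_base(float * dst, int64_t attn_elems, float * state_dst) {
    return state_dst != nullptr ? state_dst : dst + attn_elems;
}
```

For assumed dims S_v=128, H=32, n_tokens=1, n_seqs=4 the output holds 16384
attention elements followed by a 2097152-element state region; in the in-place
case that region is still allocated but unused.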
Measured (q36-27b-nvfp4, decode_agg S_TG, npp128 ntg128, -fa on, paged on):
npl 32 : 113.74 -> 136.39 t/s (+19.9 percent)
npl 128: 146.23 -> 180.53 t/s (+23.5 percent, = predicted copy-removal ceiling)
MoE q36-35b-a3b-nvfp4: npl128 313.36 -> 372.62 t/s (+18.9 percent).
nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share
(391 @128) 37.4 -> 46.2 percent. get_rows state gather (now 18.8 percent) is the
next lever.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/include/ggml.h | 14 ++++++
ggml/src/ggml-cpu/ops.cpp | 13 ++++-
ggml/src/ggml-cuda/gated_delta_net.cu | 39 ++++++++++-----
ggml/src/ggml.c | 68 +++++++++++++++++++++++++++
src/models/delta-net-base.cpp | 30 ++++++++++++
5 files changed, 152 insertions(+), 12 deletions(-)
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 823f5a9..4e7ab32 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2579,6 +2579,20 @@ extern "C" {
struct ggml_tensor * state,
int64_t K);
+ // same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+ // in place into state_dst (a view into the recurrent-state cache) instead of being appended to
+ // the op output, eliminating the per-step state copy-back during decode. state_dst must be a
+ // contiguous [S_v*S_v*H, n_seqs] view (per-seq stride == dense state size).
+ GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace(
+ struct ggml_context * ctx,
+ struct ggml_tensor * q,
+ struct ggml_tensor * k,
+ struct ggml_tensor * v,
+ struct ggml_tensor * g,
+ struct ggml_tensor * beta,
+ struct ggml_tensor * state,
+ struct ggml_tensor * state_dst);
+
// custom operators
typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 63c07a2..9457add 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -10600,6 +10600,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
ggml_tensor * src_g = dst->src[3];
ggml_tensor * src_beta = dst->src[4];
ggml_tensor * src_state = dst->src[5];
+ ggml_tensor * src_state_dst = dst->src[6]; // optional in-place final-state write-back target
const int64_t S_v = src_v->ne[0];
const int64_t H = src_v->ne[1];
@@ -10660,6 +10661,16 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
const float scale = 1.0f / sqrtf((float) S_v);
+ // when src_state_dst is provided (in-place decode write-back) the final state is written
+ // directly into the persistent cache view, removing the separate state copy-back node.
+ float * inplace_state_base = nullptr;
+ if (src_state_dst != nullptr) {
+ GGML_ASSERT(K == 1);
+ GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+ GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+ inplace_state_base = (float *) src_state_dst->data;
+ }
+
for (int64_t ir = ir0; ir < ir1; ++ir) {
const int64_t iv1 = ir % H; // head_index
const int64_t iv3 = ir / H; // sequence
@@ -10674,7 +10685,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
// For K>1, work in scratch and copy out per-token when the slot is in range.
float * s_out = (K > 1)
? state_work
- : state_out_base + (iv3 * H + iv1) * S_v * S_v;
+ : (inplace_state_base ? inplace_state_base : state_out_base) + (iv3 * H + iv1) * S_v * S_v;
// copy input state into the working buffer and operate in-place
// state layout [S_v, S_v, H, n_seqs]: seq iv3 starts at iv3 * state_seq_stride.
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index a547360..61a2b91 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -25,7 +25,8 @@ gated_delta_net_cuda(const float * q,
const uint3 neqk1_magic,
const uint3 rq3_magic,
float scale,
- int K) {
+ int K,
+ float * state_dst) {
const uint32_t h_idx = blockIdx.x;
const uint32_t sequence = blockIdx.y;
// each warp owns one column, using warp-level primitives to reduce across rows
@@ -37,7 +38,10 @@ gated_delta_net_cuda(const float * q,
const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
float * attn_data = dst;
- float * state = dst + attn_score_elems;
+ // when state_dst is provided (in-place decode write-back) the final recurrent state is written
+ // directly into the persistent cache view instead of being appended to the op output; this
+ // eliminates the per-layer per-step D2D state copy-back. Only used when keep_rs_t == false.
+ float * state = (state_dst != nullptr) ? state_dst : (dst + attn_score_elems);
// input state holds s0 only: [S_v, S_v, H, n_seqs] — seq stride is D = H * S_v * S_v.
// output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before.
@@ -171,7 +175,7 @@ template <bool KDA, bool keep_rs_t>
static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
const float * g_d, const float * b_d, const float * s_d,
- float * dst_d,
+ float * dst_d, float * state_dst_d,
int64_t S_v, int64_t H, int64_t n_tokens, int64_t n_seqs,
int64_t sq1, int64_t sq2, int64_t sq3,
int64_t sv1, int64_t sv2, int64_t sv3,
@@ -195,26 +199,26 @@ static void launch_gated_delta_net(
ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
case 32:
ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
case 64: {
ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
}
case 128: {
ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
}
default:
@@ -230,6 +234,7 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
ggml_tensor * src_g = dst->src[3];
ggml_tensor * src_beta = dst->src[4];
ggml_tensor * src_state = dst->src[5];
+ ggml_tensor * src_state_dst = dst->src[6]; // optional in-place state write-back target
GGML_TENSOR_LOCALS(int64_t, neq, src_q, ne);
GGML_TENSOR_LOCALS(size_t , nbq, src_q, nb);
@@ -260,6 +265,15 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
const float * s_d = (const float *) src_state->data;
float * dst_d = (float *) dst->data;
+ float * state_dst_d = nullptr;
+ if (src_state_dst != nullptr) {
+ // in-place final-state cache view: per-seq stride must be the dense state size D = S_v*S_v*H
+ GGML_ASSERT(src_state_dst->type == GGML_TYPE_F32);
+ GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+ GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+ state_dst_d = (float *) src_state_dst->data;
+ }
+
GGML_ASSERT(ggml_is_contiguous_rows(src_q));
GGML_ASSERT(ggml_is_contiguous_rows(src_k));
GGML_ASSERT(ggml_is_contiguous_rows(src_v));
@@ -288,23 +302,26 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
const int K = ggml_get_op_params_i32(dst, 0);
const bool keep_rs = K > 1;
+ // in-place write-back is only valid for the single-snapshot (final-state) case
+ GGML_ASSERT(state_dst_d == nullptr || !keep_rs);
+
if (kda) {
if (keep_rs) {
- launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
- launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
} else {
if (keep_rs) {
- launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
- launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index adbe52b..b8d34bf 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -6285,6 +6285,74 @@ struct ggml_tensor * ggml_gated_delta_net(
return result;
}
+// ggml_gated_delta_net_inplace
+//
+// Same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+// in place into `state_dst` (a view into the persistent recurrent-state cache) instead of being
+// appended to the op output. This removes the per-layer per-step D2D state copy-back during decode.
+// The op output holds ONLY the attention scores; the state region is still allocated (unused) so
+// the attention-output view layout is identical to ggml_gated_delta_net.
+struct ggml_tensor * ggml_gated_delta_net_inplace(
+ struct ggml_context * ctx,
+ struct ggml_tensor * q,
+ struct ggml_tensor * k,
+ struct ggml_tensor * v,
+ struct ggml_tensor * g,
+ struct ggml_tensor * beta,
+ struct ggml_tensor * state,
+ struct ggml_tensor * state_dst) {
+ GGML_ASSERT(ggml_is_contiguous_rows(q));
+ GGML_ASSERT(ggml_is_contiguous_rows(k));
+ GGML_ASSERT(ggml_is_contiguous_rows(v));
+ GGML_ASSERT(ggml_is_contiguous(g));
+ GGML_ASSERT(ggml_is_contiguous(beta));
+ GGML_ASSERT(ggml_is_contiguous(state));
+
+ GGML_ASSERT(q->type == GGML_TYPE_F32);
+ GGML_ASSERT(k->type == GGML_TYPE_F32);
+ GGML_ASSERT(v->type == GGML_TYPE_F32);
+ GGML_ASSERT(g->type == GGML_TYPE_F32);
+ GGML_ASSERT(beta->type == GGML_TYPE_F32);
+ GGML_ASSERT(state->type == GGML_TYPE_F32);
+ GGML_ASSERT(state_dst != NULL);
+ GGML_ASSERT(state_dst->type == GGML_TYPE_F32);
+
+ const int64_t S_v = v->ne[0];
+ const int64_t H = v->ne[1];
+ const int64_t n_tokens = v->ne[2];
+ const int64_t n_seqs = v->ne[3];
+
+ GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
+ GGML_ASSERT(beta->ne[0] == 1);
+
+ GGML_ASSERT(state->ne[0] == S_v);
+ GGML_ASSERT(state->ne[1] == S_v);
+ GGML_ASSERT(state->ne[2] == H);
+ GGML_ASSERT(state->ne[3] == n_seqs);
+
+ // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
+ GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
+ GGML_ASSERT(state_dst->ne[1] >= n_seqs);
+ GGML_ASSERT(state_dst->nb[0] == sizeof(float));
+
+ const int64_t state_rows = S_v * n_seqs; // K == 1
+ const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
+ struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+ ggml_set_op_params_i32(result, 0, 1); // K == 1
+
+ result->op = GGML_OP_GATED_DELTA_NET;
+ result->src[0] = q;
+ result->src[1] = k;
+ result->src[2] = v;
+ result->src[3] = g;
+ result->src[4] = beta;
+ result->src[5] = state;
+ result->src[6] = state_dst;
+
+ return result;
+}
+
////////////////////////////////////////////////////////////////////////////////
struct ggml_hash_set ggml_hash_set_new(size_t size) {
diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
index ad9ce77..26a718b 100644
--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
@@ -546,6 +546,36 @@ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
const bool keep = cparams.n_rs_seq > 0;
if (!keep) {
+ const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
+
+ if (fused) {
+ // In-place state write-back: the fused gated-DeltaNet op writes the new recurrent state
+ // directly into the persistent cache slot for the active sequences (a contiguous block
+ // at kv_head), eliminating the per-layer per-step ~full-state D2D copy-back that
+ // dominated decode. The op output then carries only the attention scores.
+ ggml_tensor * state_dst = ggml_view_2d(ctx0, ssm_states_all, hparams.n_embd_s(), n_seqs,
+ ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
+
+ ggml_tensor * result = ggml_gated_delta_net_inplace(ctx0, q, k, v, g, b, s, state_dst);
+ if (n_seq_tokens == 1) {
+ cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
+ } else {
+ cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
+ }
+
+ ggml_tensor * output = ggml_view_4d(ctx0, result,
+ S_v, H_v, n_seq_tokens, n_seqs,
+ ggml_row_size(result->type, S_v),
+ ggml_row_size(result->type, S_v * H_v),
+ ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
+ cb(output, "attn_output", il);
+
+ // the state write is a side effect of the op; pull the op into the graph via the output
+ ggml_build_forward_expand(gf, output);
+
+ return output;
+ }
+
auto attn_out = build_delta_net(q, k, v, g, b, s, il);
ggml_tensor * output = attn_out.first;
ggml_tensor * new_state = attn_out.second;
--
2.43.0
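
The output layout that `ggml_gated_delta_net_inplace` builds in the patch above (attention rows followed by a still-allocated, unused state region) and the byte offset of the `state_dst` view can be sketched as plain arithmetic. This is a minimal illustration only: `GdnOutLayout`, `gdn_out_layout` and `state_dst_offset_bytes` are hypothetical names that do not exist in ggml, and the sketch assumes `n_embd_s == S_v * S_v * H`, which is what the `state_dst->ne[0]` assertion implies.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of ggml): element counts of the K == 1 output
// tensor, following the shape { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 }.
struct GdnOutLayout {
    int64_t ne0;         // row width in floats: S_v * H
    int64_t ne1;         // attention rows plus the (unused) state rows
    int64_t attn_elems;  // floats holding the attention output
    int64_t state_elems; // floats reserved for the state region after it
};

GdnOutLayout gdn_out_layout(int64_t S_v, int64_t H, int64_t n_tokens, int64_t n_seqs) {
    const int64_t state_rows = S_v * n_seqs; // K == 1
    GdnOutLayout l;
    l.ne0         = S_v * H;
    l.ne1         = n_tokens * n_seqs + state_rows;
    l.attn_elems  = l.ne0 * n_tokens * n_seqs;
    l.state_elems = l.ne0 * state_rows; // equals S_v * S_v * H * n_seqs
    return l;
}

// Hypothetical helper: byte offset of the state_dst view into the persistent
// cache, i.e. kv_head slots of n_embd_s elements each.
int64_t state_dst_offset_bytes(int64_t kv_head, int64_t n_embd_s, int64_t elem_size) {
    return kv_head * n_embd_s * elem_size;
}
```

Because the state region stays allocated, the attention-output view starts at offset 0 with the same strides as in `ggml_gated_delta_net`, so the caller's `ggml_view_4d` does not change.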

@@ -1,678 +0,0 @@
From 46d7dd80bbce7f3c1dbf9363d6527c8c9b687a6b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 01:45:02 +0200
Subject: [PATCH] feat(paged): qwen35 SSM decode fused recurrent-state gather
(patch 0019)
Step 2 of the SSM decode-throughput work. After Step 1 (in-place state
write-back, patch 0018) the largest non-GEMM decode bucket was the recurrent-
state get_rows gather (18.8% of decode GPU time): build_rs materialized each
sequence's prior state into a contiguous scratch via ggml_get_rows before the
gated-DeltaNet op read it.

This eliminates that materialization, mirroring ggml_ssm_scan's ids source.
ggml_gated_delta_net_inplace_ids takes the FULL recurrent-state cache plus the
s_copy ids (src[5] = full cache, src[7] = ids, op_param[1] = rs_head) and reads
each sequence's prior state directly from cache[ids[seq]]. Combined with Step 1's
in-place write the op now reads AND writes the cache directly: no recurrent-state
materialization at all. build_recurrent_attn feeds the full cache + ids through
the build_rs get_state_rows lambda exactly like mamba-base, keeping the rs_zero
clear and the extra-states copy around the op.

Race-free by construction on CUDA. In-place write plus an ids read of the same
cache is only safe when read slot == write slot; s_copy is identity
(rs_head + s) for stable continuing sequences (the whole AR decode path) but can
remap on reorder or rs_zero (e.g. multiple new sequences in one prefill ubatch).
The recurrence kernel handles both per (seq, head) block on device: identity
sequences read s0 in place from the destination slot (the kernel loads all of s0
into registers before writing, so reading and writing the same slot is safe),
and non-identity sequences read from a disjoint scratch that a small gather
kernel copies from cache[ids[seq]] first, so the recurrence never reads a slot
another block writes. The CPU op mirrors this (host identity check + a serial
gather in the dispatcher). ids stays a device pointer (read only in-kernel; it is
device-resident at op-execute time). Bit-identical to the get_rows path in every
case.
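
The identity check and serial gather described above can be sketched on the host as follows. This is a simplified illustration modeled on the CPU dispatcher in this patch, not the patch's actual code: `ids_are_identity` and `select_prior_state` are hypothetical names, and the scratch is a `std::vector` rather than the unused state region of the op output.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// True when every sequence reads the slot it will also write: ids[s] == rs_head + s.
bool ids_are_identity(const int32_t * ids, int64_t n_seqs, int32_t rs_head) {
    for (int64_t s = 0; s < n_seqs; ++s) {
        if (ids[s] != rs_head + (int32_t) s) {
            return false;
        }
    }
    return true;
}

// Returns the base pointer the recurrence reads prior states from, with a
// stride of D floats per sequence. D is the per-slot state size S_v * S_v * H.
const float * select_prior_state(const float * cache, const int32_t * ids, int64_t n_seqs,
                                 int32_t rs_head, int64_t D, std::vector<float> & scratch) {
    if (ids_are_identity(ids, n_seqs, rs_head)) {
        // identity: the prior state already lives in the in-place destination block
        return cache + (int64_t) rs_head * D;
    }
    // non-identity: gather cache[ids[s]] into a disjoint scratch first, so the
    // in-place write of one sequence never aliases another sequence's read
    scratch.resize((size_t) (D * n_seqs));
    for (int64_t s = 0; s < n_seqs; ++s) {
        std::memcpy(scratch.data() + s * D, cache + (int64_t) ids[s] * D, (size_t) D * sizeof(float));
    }
    return scratch.data();
}
```

On the AR decode path the identity branch is always taken, so nothing is copied; only a reorder or `rs_zero` remap pays for the gather.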

- new builder ggml_gated_delta_net_inplace_ids; CUDA gather kernel
  (gdn_gather_nonident) + per-block read-base select in gated_delta_net_cuda;
  CPU identity guard + serial gather fallback in the dispatcher
- delta-net-base build_recurrent_attn gains a gather-free overload; qwen35 and
  qwen35moe drop the pre-gather. qwen3next, kimi-linear, the non-fused path and
  the rollback (n_rs_seq > 0) path are unchanged.

Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off):
  dense q36-27b-nvfp4  : npl 32  137.64 -> 170.68 (+24.0 percent)
                         npl 128 186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391)
  MoE q36-35b-a3b-nvfp4: npl 32  299.68 -> 366.69 (+22.4 percent)
                         npl 128 409.30 -> 553.63 (+35.3 percent)

Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build
(dense model text md5 match, MoE byte-identical, step2 run1 == run2). nsys
k_get_rows_float bucket 18.8 -> 0.7 percent; the new gdn_gather_nonident kernel
is 1.7 percent (no-op at decode, median 1.2 us). The residual decode gap to vLLM
is now the FP4 GEMM (~48 percent of decode), a separate kernel track.
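
The quoted percentages follow directly from the throughput pairs above. A minimal sketch of that arithmetic, assuming one-decimal rounding (`pct_gain` and `pct_of` are illustrative helper names, not part of the patch or of any benchmark tool):

```cpp
#include <cassert>
#include <cmath>

// Percentage change from `before` to `after`, rounded to one decimal place.
double pct_gain(double before, double after) {
    return std::round((after / before - 1.0) * 1000.0) / 10.0;
}

// `value` as a percentage of `reference`, rounded to one decimal place.
double pct_of(double value, double reference) {
    return std::round(value / reference * 1000.0) / 10.0;
}
```

For example, `pct_gain(186.25, 256.57)` reproduces the dense npl-128 gain, and `pct_of(256.57, 391.0)` reproduces the share of the vLLM reference figure.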
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
SSM_DECODE_FIX_RESULTS.md | 86 +++++++++++++++++++++++++++
ggml/include/ggml.h | 17 ++++++
ggml/src/ggml-cpu/ops.cpp | 49 ++++++++++++++-
ggml/src/ggml-cuda/gated_delta_net.cu | 85 ++++++++++++++++++++++----
ggml/src/ggml.c | 76 +++++++++++++++++++++++
src/models/delta-net-base.cpp | 63 ++++++++++++++++++++
src/models/models.h | 13 ++++
src/models/qwen35.cpp | 6 +-
src/models/qwen35moe.cpp | 6 +-
9 files changed, 378 insertions(+), 23 deletions(-)
diff --git a/SSM_DECODE_FIX_RESULTS.md b/SSM_DECODE_FIX_RESULTS.md
index 2e7c8c2..77879e4 100644
--- a/SSM_DECODE_FIX_RESULTS.md
+++ b/SSM_DECODE_FIX_RESULTS.md
@@ -96,3 +96,89 @@ precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLL
after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel
track. No paged/graph/block-table change can move decode on this model (full
attention is 0.4% of decode).
+
+## STEP 2 (patch 0019): fuse the recurrent-state gather into the op
+
+After Step 1 the largest non-GEMM decode bucket was the recurrent-state
+`get_rows` gather (18.8% of decode GPU time): `build_rs` materialized each
+sequence's prior state into a contiguous scratch via `ggml_get_rows` before the
+gated-DeltaNet op read it. Step 2 eliminates that materialization, mirroring
+`ggml_ssm_scan`'s `ids` source.
+
+`ggml_gated_delta_net_inplace_ids` takes the FULL recurrent-state cache plus the
+`s_copy` ids (`src[5]` = full cache `[S_v, S_v, H, n_rs_slots]`, `src[7]` = ids,
+`op_param[1]` = `rs_head`) and reads each sequence's prior state directly from
+`cache[ids[seq]]`. Combined with Step 1's in-place write the op now reads AND
+writes the cache directly: no recurrent-state materialization at all. The
+`build_recurrent_attn` fused path feeds the full cache and ids through the
+`build_rs` `get_state_rows` lambda exactly like `mamba-base.cpp`, keeping the
+`rs_zero` clear and the extra-states copy around the op.
+
+### Race-free by construction (CUDA)
+
+In-place write plus an ids read of the same cache is only safe when the read slot
+equals the write slot. `s_copy(s) = cells[s + head].src0`, which is identity
+(`rs_head + s`) for stable continuing sequences (the entire AR decode path) but
+can remap on sequence reorder or `rs_zero` (e.g. multiple new sequences in one
+prefill ubatch). The kernel handles both per (seq, head) block on device:
+
+- identity sequences read `s0` in place from the destination slot `state_dst`
+ (the kernel loads all of `s0` into registers before it writes the new state,
+ so reading and writing the same slot is race-free) -- no materialization;
+- non-identity sequences read from a disjoint scratch that a small
+ `gdn_gather_nonident_kernel` copies from `cache[ids[seq]]` first, so the
+ recurrence never reads a slot another block writes.
+
+`ids` stays a device pointer (dereferenced only in the kernels; the input is
+device-resident at op-execute time, so a host read segfaults). The CPU op
+mirrors the same logic (host identity check + a serial gather in the dispatcher
+for the non-identity case). The math is unchanged, so the result is bit-identical
+to the `get_rows` path in every case.
+
+Gated to the `qwen35` / `qwen35moe` fused decode/prefill path; `qwen3next`,
+`kimi-linear`, the non-fused path and the rollback (`n_rs_seq > 0`) path are
+untouched (they keep the materialized-state overload).
+
+### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off)
+
+Dense `q36-27b-nvfp4`:
+
+| npl | Step 1 (baseline) | Step 2 | delta | % of vLLM (391 @128) |
+|-----|-------------------|----------|---------|----------------------|
+| 32 | 137.64 | 170.68 | +24.0% | - |
+| 128 | 186.25 | 256.57 | +37.8% | 47.6% -> 65.6% |
+
+The npl-128 result (256.57 t/s) beats the predicted ~247 t/s Step-2 ceiling.
+
+MoE `q36-35b-a3b-nvfp4`:
+
+| npl | Step 1 (baseline) | Step 2 | delta |
+|-----|-------------------|----------|---------|
+| 32 | 299.68 | 366.69 | +22.4% |
+| 128 | 409.30 | 553.63 | +35.3% |
+
+(Step-1 baselines re-measured in the same session; the brief's reference figures
+were 136 / 180 dense and 279 / 373 MoE.)
+
+### Bit-exact gate
+
+Greedy (`--temp 0 --seed 1`) `llama-completion` output (256 tokens, paged on,
+fusion off) vs the Step-1 build:
+
+- dense `q36-27b-nvfp4`: model text byte-identical (md5 match);
+- MoE `q36-35b-a3b-nvfp4`: byte-identical;
+- Step-2 dense run1 == run2 (deterministic, no race).
+
+### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
+
+The recurrent-state gather bucket collapsed:
+
+| kernel | Step 1 | Step 2 |
+|----------------------------|----------|-----------------------------------------|
+| `k_get_rows_float` | 18.8% | 0.7% (residual: embeddings / conv-state)|
+| `gdn_gather_nonident` | - | 1.7% (no-op at decode, median ~1.2 us) |
+| `gated_delta_net_cuda` | 26.0% | 22.5% |
+| FP4 GEMM family | ~37.5% | ~48% (now the dominant residual) |
+
+The SSM state gather is effectively eliminated. The residual decode gap to vLLM
+is now the FP4 GEMM (~48% of decode), a separate kernel track.
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 4e7ab32..951dd21 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2593,6 +2593,23 @@ extern "C" {
struct ggml_tensor * state,
struct ggml_tensor * state_dst);
+ // Step 2: same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read
+ // directly from the full state cache via per-sequence indices (ids == s_copy), mirroring
+ // ggml_ssm_scan, instead of from a materialized ggml_get_rows gather. `state` is the FULL cache
+ // [S_v, S_v, H, n_rs_slots]; `ids` are the per-seq source slots; `rs_head` is the destination
+ // base slot. Eliminates the recurrent-state gather on the decode path.
+ GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
+ struct ggml_context * ctx,
+ struct ggml_tensor * q,
+ struct ggml_tensor * k,
+ struct ggml_tensor * v,
+ struct ggml_tensor * g,
+ struct ggml_tensor * beta,
+ struct ggml_tensor * state,
+ struct ggml_tensor * state_dst,
+ struct ggml_tensor * ids,
+ int rs_head);
+
// custom operators
typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 9457add..b6a1976 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -10633,7 +10633,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
const int64_t K = ggml_get_op_params_i32(dst, 0);
GGML_ASSERT(K >= 1);
// per-seq stride in floats (seq s starts at state + s * seq_stride)
- const int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
+ int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
const int64_t per_thread = S_v + (K > 1 ? S_v * S_v : 0);
const int ith = params->ith;
@@ -10654,6 +10654,26 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
const float * state_in_base = (const float *)src_state->data;
+ // Step 2: fused recurrent-state gather (ids == s_copy in src[7]). Read the prior state directly
+ // from the full cache at cache[ids[seq]] instead of from a materialized gather. For the identity
+ // decode case the prior state is the in-place destination block [rs_head, rs_head+n_seqs);
+ // otherwise the dispatcher has gathered cache[ids[seq]] into the (unused) output-state scratch
+ // region. Bit-identical to the get_rows path.
+ ggml_tensor * src_ids = dst->src[7];
+ if (src_ids != nullptr) {
+ const int64_t D = S_v * S_v * H;
+ const int32_t rs_head = ggml_get_op_params_i32(dst, 1);
+ const int32_t * ids = (const int32_t *) src_ids->data;
+ bool identity = true;
+ for (int64_t s = 0; s < n_seqs; ++s) {
+ if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
+ }
+ state_seq_stride = D;
+ state_in_base = identity
+ ? (const float *) src_state->data + (int64_t) rs_head * D
+ : (const float *) state_out_base; // gathered by the dispatcher (non-identity)
+ }
+
//const int64_t rq1 = nev1 / neq1;
//const int64_t rk1 = nev1 / nek1;
const int64_t rq3 = nev3 / neq3;
@@ -10777,6 +10797,33 @@ static void ggml_compute_forward_gated_delta_net_f32(
if (ith == 0) {
ggml_threadpool_chunk_set(params->threadpool, nth);
+
+ // Step 2: non-identity ids fallback -- serially gather each sequence's prior state from
+ // cache[ids[seq]] into the (otherwise unused) output-state scratch region before the parallel
+ // recurrence, so the in-place write never aliases another sequence's read.
+ ggml_tensor * src_ids = dst->src[7];
+ if (src_ids != nullptr) {
+ const ggml_tensor * src_state = dst->src[5];
+ const int64_t S_v = V->ne[0];
+ const int64_t H = V->ne[1];
+ const int64_t n_tokens = V->ne[2];
+ const int64_t n_seqs = V->ne[3];
+ const int64_t D = S_v * S_v * H;
+ const int32_t rs_head = ggml_get_op_params_i32(dst, 1);
+ const int32_t * ids = (const int32_t *) src_ids->data;
+ bool identity = true;
+ for (int64_t s = 0; s < n_seqs; ++s) {
+ if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
+ }
+ if (!identity) {
+ const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
+ const float * cache = (const float *) src_state->data;
+ float * scratch = (float *) dst->data + attn_score_elems;
+ for (int64_t s = 0; s < n_seqs; ++s) {
+ memcpy(scratch + s * D, cache + (int64_t) ids[s] * D, D * sizeof(float));
+ }
+ }
+ }
}
ggml_barrier(params->threadpool);
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index 61a2b91..86d5e2a 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -1,6 +1,34 @@
#include "gated_delta_net.cuh"
#include "ggml-cuda/common.cuh"
+// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
+// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+// destination slot by the recurrence kernel and are skipped here. One block per sequence.
+__global__ void gdn_gather_nonident_kernel(const float * cache, const int32_t * ids, int rs_head,
+ float * scratch, int64_t D, int n_seqs) {
+ const int s = blockIdx.x;
+ if (s >= n_seqs) {
+ return;
+ }
+ const int r = ids[s];
+ if (r == rs_head + s) {
+ return; // identity: prior state already lives in the in-place destination slot
+ }
+ const float * src = cache + (int64_t) r * D;
+ float * dst = scratch + (int64_t) s * D;
+ for (int64_t i = threadIdx.x; i < D; i += blockDim.x) {
+ dst[i] = src[i];
+ }
+}
+
+static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * ids, int rs_head,
+ float * scratch, int64_t D, int64_t n_seqs, cudaStream_t stream) {
+ if (n_seqs <= 0) {
+ return;
+ }
+ gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
+}
+
template <int S_v, bool KDA, bool keep_rs_t>
__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
gated_delta_net_cuda(const float * q,
@@ -26,7 +54,9 @@ gated_delta_net_cuda(const float * q,
const uint3 rq3_magic,
float scale,
int K,
- float * state_dst) {
+ float * state_dst,
+ const int32_t * ids,
+ int rs_head) {
const uint32_t h_idx = blockIdx.x;
const uint32_t sequence = blockIdx.y;
// each warp owns one column, using warp-level primitives to reduce across rows
@@ -48,7 +78,15 @@ gated_delta_net_cuda(const float * q,
const int64_t state_in_offset = sequence * H * S_v * S_v + h_idx * S_v * S_v;
const int64_t state_out_offset = (sequence * H + h_idx) * S_v * S_v;
state += state_out_offset;
- curr_state += state_in_offset + col * S_v;
+ // Step 2: select the prior-state read base per sequence. For the ids variant, identity
+ // sequences (ids[seq] == rs_head + seq) read s0 directly from the in-place destination slot
+ // state_dst (no materialization); non-identity sequences read from the pre-gathered scratch
+ // (curr_state). state_in_offset == state_out_offset, so both bases use the same per-(seq,head)
+ // offset. The whole s0 is loaded into registers before the new state is written, so reading and
+ // writing the same slot per block (identity) is race-free.
+ const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
+ ? state_dst : curr_state;
+ read_state += state_in_offset + col * S_v;
attn_data += (sequence * n_tokens * H + h_idx) * S_v;
constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
@@ -61,7 +99,7 @@ gated_delta_net_cuda(const float * q,
#pragma unroll
for (int r = 0; r < rows_per_lane; r++) {
const int i = r * warp_size + lane;
- s_shard[r] = curr_state[i];
+ s_shard[r] = read_state[i];
}
for (int t = 0; t < n_tokens; t++) {
@@ -176,6 +214,7 @@ static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
const float * g_d, const float * b_d, const float * s_d,
float * dst_d, float * state_dst_d,
+ const int32_t * ids_d, int rs_head,
int64_t S_v, int64_t H, int64_t n_tokens, int64_t n_seqs,
int64_t sq1, int64_t sq2, int64_t sq3,
int64_t sv1, int64_t sv2, int64_t sv3,
@@ -199,26 +238,26 @@ static void launch_gated_delta_net(
ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
break;
case 32:
ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
break;
case 64: {
ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
break;
}
case 128: {
ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
break;
}
default:
@@ -262,7 +301,6 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
const float * g_d = (const float *) src_g->data;
const float * b_d = (const float *) src_beta->data;
- const float * s_d = (const float *) src_state->data;
float * dst_d = (float *) dst->data;
float * state_dst_d = nullptr;
@@ -274,6 +312,29 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
state_dst_d = (float *) src_state_dst->data;
}
+ // Step 2: fused recurrent-state gather (src[7] = ids == s_copy). Read the prior state directly
+ // from the full cache via ids instead of from a materialized ggml_get_rows gather. The recurrence
+ // kernel reads identity sequences (ids[seq] == rs_head + seq) in place from state_dst (no
+ // materialization at all); any non-identity sequence (reorder / rs_zero remap) is gathered here
+ // into a disjoint scratch that the kernel reads instead. The gather writes a disjoint buffer and
+ // the recurrence never reads a slot another block writes, so it is race-free and bit-identical to
+ // the get_rows path. ids stays a DEVICE pointer (dereferenced only inside the kernels).
+ ggml_tensor * src_ids = dst->src[7];
+ const float * s_d = (const float *) src_state->data;
+ const int32_t * ids_d = nullptr;
+ int rs_head = 0;
+ ggml_cuda_pool_alloc<float> ids_state_scratch(ctx.pool());
+ if (src_ids != nullptr) {
+ GGML_ASSERT(state_dst_d != nullptr);
+ GGML_ASSERT(src_ids->type == GGML_TYPE_I32);
+ rs_head = ggml_get_op_params_i32(dst, 1);
+ ids_d = (const int32_t *) src_ids->data;
+ const int64_t D = S_v * S_v * H;
+ float * scratch = ids_state_scratch.alloc((size_t) D * n_seqs);
+ ggml_cuda_gdn_gather_nonident(s_d, ids_d, rs_head, scratch, D, n_seqs, ctx.stream());
+ s_d = scratch;
+ }
+
GGML_ASSERT(ggml_is_contiguous_rows(src_q));
GGML_ASSERT(ggml_is_contiguous_rows(src_k));
GGML_ASSERT(ggml_is_contiguous_rows(src_v));
@@ -307,21 +368,21 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
if (kda) {
if (keep_rs) {
- launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+ launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
- launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+ launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
} else {
if (keep_rs) {
- launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+ launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
- launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+ launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index b8d34bf..1762037 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -6353,6 +6353,82 @@ struct ggml_tensor * ggml_gated_delta_net_inplace(
return result;
}
+// ggml_gated_delta_net_inplace_ids
+//
+// Same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read directly
+// from the FULL state cache `state` ([S_v, S_v, H, n_rs_slots]) at cache[ids[seq]] (mirroring
+// ggml_ssm_scan's ids source) instead of from a materialized ggml_get_rows gather. `rs_head` is the
+// destination base slot, used by the backend to detect the common identity case (ids[s] == rs_head
+// + s), where the prior state already lives in the in-place destination slots.
+struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
+ struct ggml_context * ctx,
+ struct ggml_tensor * q,
+ struct ggml_tensor * k,
+ struct ggml_tensor * v,
+ struct ggml_tensor * g,
+ struct ggml_tensor * beta,
+ struct ggml_tensor * state,
+ struct ggml_tensor * state_dst,
+ struct ggml_tensor * ids,
+ int rs_head) {
+ GGML_ASSERT(ggml_is_contiguous_rows(q));
+ GGML_ASSERT(ggml_is_contiguous_rows(k));
+ GGML_ASSERT(ggml_is_contiguous_rows(v));
+ GGML_ASSERT(ggml_is_contiguous(g));
+ GGML_ASSERT(ggml_is_contiguous(beta));
+ GGML_ASSERT(ggml_is_contiguous(state));
+
+ GGML_ASSERT(q->type == GGML_TYPE_F32);
+ GGML_ASSERT(k->type == GGML_TYPE_F32);
+ GGML_ASSERT(v->type == GGML_TYPE_F32);
+ GGML_ASSERT(g->type == GGML_TYPE_F32);
+ GGML_ASSERT(beta->type == GGML_TYPE_F32);
+ GGML_ASSERT(state->type == GGML_TYPE_F32);
+ GGML_ASSERT(state_dst != NULL && state_dst->type == GGML_TYPE_F32);
+ GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32);
+
+ const int64_t S_v = v->ne[0];
+ const int64_t H = v->ne[1];
+ const int64_t n_tokens = v->ne[2];
+ const int64_t n_seqs = v->ne[3];
+
+ GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
+ GGML_ASSERT(beta->ne[0] == 1);
+
+ // state is the FULL recurrent-state cache: [S_v, S_v, H, n_rs_slots], n_rs_slots >= n_seqs
+ GGML_ASSERT(state->ne[0] == S_v);
+ GGML_ASSERT(state->ne[1] == S_v);
+ GGML_ASSERT(state->ne[2] == H);
+ GGML_ASSERT(state->ne[3] >= n_seqs);
+
+ // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
+ GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
+ GGML_ASSERT(state_dst->ne[1] >= n_seqs);
+ GGML_ASSERT(state_dst->nb[0] == sizeof(float));
+
+ // ids: per-seq source slot into the full cache (s_copy_main)
+ GGML_ASSERT(ids->ne[0] >= n_seqs);
+
+ const int64_t state_rows = S_v * n_seqs; // K == 1
+ const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
+ struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+ ggml_set_op_params_i32(result, 0, 1); // K == 1
+ ggml_set_op_params_i32(result, 1, rs_head); // destination base slot (for the ids identity check)
+
+ result->op = GGML_OP_GATED_DELTA_NET;
+ result->src[0] = q;
+ result->src[1] = k;
+ result->src[2] = v;
+ result->src[3] = g;
+ result->src[4] = beta;
+ result->src[5] = state; // FULL cache (read via ids)
+ result->src[6] = state_dst; // in-place final-state write-back target
+ result->src[7] = ids; // per-seq source slots (s_copy)
+
+ return result;
+}
+
////////////////////////////////////////////////////////////////////////////////
struct ggml_hash_set ggml_hash_set_new(size_t size) {
diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
index 26a718b..194e611 100644
--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
@@ -524,6 +524,69 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
return conv_input;
}
+// Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
+// gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
+// ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
+// and rollback paths fall back to materializing the prior state and delegating below.
+ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
+ llm_graph_input_rs * inp,
+ ggml_tensor * ssm_states_all,
+ ggml_tensor * q,
+ ggml_tensor * k,
+ ggml_tensor * v,
+ ggml_tensor * g,
+ ggml_tensor * b,
+ int il) {
+ const auto * mctx_cur = inp->mctx;
+ const auto kv_head = mctx_cur->get_head();
+
+ const int64_t S_v = v->ne[0];
+ const int64_t H_v = v->ne[1];
+ const int64_t n_seqs = v->ne[3];
+ const int64_t n_seq_tokens = q->ne[2];
+
+ const bool keep = cparams.n_rs_seq > 0;
+ const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
+
+ if (!keep && fused) {
+ // build_rs feeds the FULL state cache + the s_copy ids into the op (via the get_state_rows
+ // lambda, exactly like mamba-base's ggml_ssm_scan) and still performs the rs_zero clear and
+ // the extra-states copy around it. The op reads curr_state from cache[ids[seq]] and writes
+ // the final state in place at kv_head; no recurrent-state materialization at all.
+ auto get_state_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * {
+ ggml_tensor * cache4d = ggml_reshape_4d(ctx, states, S_v, S_v, H_v, states->ne[1]);
+ ggml_tensor * state_dst = ggml_view_2d(ctx, ssm_states_all, hparams.n_embd_s(), n_seqs,
+ ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
+ return ggml_gated_delta_net_inplace_ids(ctx, q, k, v, g, b, cache4d, state_dst, ids, (int) kv_head);
+ };
+
+ ggml_tensor * result = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs, get_state_op);
+ if (n_seq_tokens == 1) {
+ cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
+ } else {
+ cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
+ }
+
+ ggml_tensor * output = ggml_view_4d(ctx0, result,
+ S_v, H_v, n_seq_tokens, n_seqs,
+ ggml_row_size(result->type, S_v),
+ ggml_row_size(result->type, S_v * H_v),
+ ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
+ cb(output, "attn_output", il);
+
+ // the state write is a side effect of the op; pull the op into the graph via the output
+ ggml_build_forward_expand(gf, output);
+
+ return output;
+ }
+
+ // non-fused / rollback: materialize the prior state via gather and delegate to the
+ // state-taking overload (its fused !keep branch performs the Step-1 in-place write).
+ ggml_tensor * s = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+ s = ggml_reshape_4d(ctx0, s, S_v, S_v, H_v, n_seqs);
+ return build_recurrent_attn(inp, ssm_states_all, q, k, v, g, b, s, il);
+}
+
ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
llm_graph_input_rs * inp,
ggml_tensor * ssm_states_all,
diff --git a/src/models/models.h b/src/models/models.h
index 2ac8415..98b89e9 100644
--- a/src/models/models.h
+++ b/src/models/models.h
@@ -88,6 +88,19 @@ struct llm_build_delta_net_base : public llm_graph_context {
ggml_tensor * b,
ggml_tensor * s,
int il);
+
+ // Step 2: gather-free variant. Reads the prior recurrent state directly from the full cache via
+ // the s_copy ids (no ggml_get_rows materialization) on the fused decode/prefill path, and
+ // delegates to the state-taking overload for the non-fused and rollback paths.
+ ggml_tensor * build_recurrent_attn(
+ llm_graph_input_rs * inp,
+ ggml_tensor * ssm_states_all,
+ ggml_tensor * q,
+ ggml_tensor * k,
+ ggml_tensor * v,
+ ggml_tensor * g,
+ ggml_tensor * b,
+ int il);
};
struct llm_build_rwkv6_base : public llm_graph_context {
diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
index 6783d98..0be3247 100644
--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
@@ -385,10 +385,6 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
- state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
- cb(state, "state_predelta", il);
-
ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
cb(conv_output_proper, "conv_output_raw", il);
@@ -445,7 +441,7 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
cb(k_conv, "k_conv_predelta", il);
cb(v_conv, "v_conv_predelta", il);
- ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
+ ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
// z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
index eb5e9a4..2995f04 100644
--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
@@ -409,10 +409,6 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
- state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
- cb(state, "state_predelta", il);
-
ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
cb(conv_output_proper, "conv_output_raw", il);
@@ -469,7 +465,7 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
cb(k_conv, "k_conv_predelta", il);
cb(v_conv, "v_conv_predelta", il);
- ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
+ ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
// z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
--
2.43.0


@@ -1,225 +0,0 @@
From df1cc97b68df048834ab735c944b71c3a2e8737e Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 12:40:49 +0200
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape
(patch 0020)
Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
both engines pinned the largest llama-specific overage to the gated-DeltaNet
OUTPUT projection (ssm_out).
The GDN op left its output in SSM layout and the graph reshaped it to 3D
[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
ssm_out weight read across the 128 sequences (one 5120x128 grid, 48 calls/step,
the 40%-vs-62% GPU-utilization gap). vLLM packs the same projection into one
M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.
The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
routes to the MMQ M=128 tensor-core GEMM (which amortizes the weight read across
all 128 tokens). The result is then already 2D, so the redundant post-matmul
reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical.
Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs
untouched.
Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
test-backend-ops MUL_MAT and MUL_MAT_ID OK.
decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
dense q36-27b: 170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
MoE q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).
nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) with a LOWER
per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
vs 2.77 ms/call for the old GEMV.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
LEVER1_OPROJ_MMQ_RESULTS.md | 77 +++++++++++++++++++++++++++++++++++++
src/models/qwen35.cpp | 13 ++++---
src/models/qwen35moe.cpp | 13 ++++---
src/models/qwen3next.cpp | 13 ++++---
4 files changed, 98 insertions(+), 18 deletions(-)
create mode 100644 LEVER1_OPROJ_MMQ_RESULTS.md
diff --git a/LEVER1_OPROJ_MMQ_RESULTS.md b/LEVER1_OPROJ_MMQ_RESULTS.md
new file mode 100644
index 0000000..9a5721f
--- /dev/null
+++ b/LEVER1_OPROJ_MMQ_RESULTS.md
@@ -0,0 +1,77 @@
+# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
+
+The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
+(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
+bit-exact tensor reshape that re-routes the per-layer SSM output projection
+from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
+
+## The mechanism (profiled, both engines)
+
+Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
+The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
+(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
+to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
+`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
+128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
+the ssm_out weight read across the 128 sequences. vLLM packs the same projection
+into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
+only the output projection was in 3D SSM layout.
+
+## The fix
+
+In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
+the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
+decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
+MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
+so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
+2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
+proven by the in-projection.
+
+```
+- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+ ...
+ cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+```
+
+## Gates (all PASS)
+
+- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
+ post-SSM baseline build:
+ - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
+ - MoE q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
+- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
+- Coherent dense + MoE output (greedy text inspected).
+
+## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
+
+S_TG t/s (decode aggregate):
+
+| model           | npl | baseline | Lever 1 | delta  |
+|-----------------|-----|----------|---------|--------|
+| dense q36-27b   | 32  | 170.52   | 200.00  | +17.3% |
+| dense q36-27b   | 128 | 254.92   | 335.80  | +31.7% |
+| MoE q36-35b-a3b | 32  | 373.28   | 420.77  | +12.7% |
+| MoE q36-35b-a3b | 128 | 560.66   | 691.24  | +23.3% |
+
+Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
+up from 65% post-SSM).
+
+## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
+
+The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
+
+| kernel                             | baseline            | Lever 1              |
+|------------------------------------|---------------------|----------------------|
+| mul_mat_vec_q<NVFP4, m=1> (o_proj) | 132.8 ms / 48 inst  | 0 ms / 0 inst        |
+| mul_mat_q<NVFP4, m=128>            | 5463 ms / 8800 inst | 5827 ms / 10000 inst |
+
+The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
+(+1200 instances, +363 ms over the window), and its per-call average DROPS
+(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
+than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
+~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
+old GEMV: the amortized weight read is the win.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
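
The per-call figures in the nsys section of the results file above follow from the table by simple division. A minimal C++ sketch of that arithmetic, using the rounded table values; `KernelBucket`, `per_call_ms` and `marginal_ms_per_call` are hypothetical names introduced only for this illustration and do not exist in the patch:

```cpp
#include <cassert>
#include <cmath>

// One row of an nsys kernel summary (hypothetical struct, illustration only).
struct KernelBucket {
    double total_ms;   // summed GPU time over the profiling window
    double instances;  // number of kernel launches in the window
};

// Average cost of one launch in milliseconds.
static double per_call_ms(const KernelBucket & b) {
    assert(b.instances > 0.0);
    return b.total_ms / b.instances;
}

// Average cost of only the launches added between two snapshots of the same
// bucket: (extra time) / (extra launches).
static double marginal_ms_per_call(const KernelBucket & before, const KernelBucket & after) {
    assert(after.instances > before.instances);
    return (after.total_ms - before.total_ms) / (after.instances - before.instances);
}
```

With the table values, `per_call_ms({132.8, 48})` gives about 2.77 ms for the old GEMV, and `marginal_ms_per_call({5463.0, 8800}, {5827.0, 10000})` gives about 0.30 ms for each o_proj call that moved to MMQ. The rounded table yields 364 ms of added time rather than the 363.5 ms quoted in the text, which does not change the result at two decimal places.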
diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
index 0be3247..0874c43 100644
--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
@@ -449,17 +449,18 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
// Apply gated normalization: self.norm(core_attn_out, z)
ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+ // data, just a 2D vs 3D view, so the result is bit-identical.
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cb(final_output, "final_output", il);
- // Output projection
+ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
cb(cur, "linear_attn_out", il);
- // Reshape back to original dimensions
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
return cur;
}
diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
index 2995f04..1f6f643 100644
--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
@@ -473,17 +473,18 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
// Apply gated normalization: self.norm(core_attn_out, z)
ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+ // data, just a 2D vs 3D view, so the result is bit-identical.
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cb(final_output, "final_output", il);
- // Output projection
+ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
cb(cur, "linear_attn_out", il);
- // Reshape back to original dimensions
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
return cur;
}
diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
index 97200a4..bfdf026 100644
--- a/src/models/qwen3next.cpp
+++ b/src/models/qwen3next.cpp
@@ -519,17 +519,18 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
// Apply gated normalization: self.norm(core_attn_out, z)
ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+ // data, just a 2D vs 3D view, so the result is bit-identical.
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cb(final_output, "final_output", il);
- // Output projection
+ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
cur = build_lora_mm(model.layers[il].ssm_out, final_output);
cb(cur, "linear_attn_out", il);
- // Reshape back to original dimensions
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
return cur;
}
--
2.43.0
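
The reshape in the three hunks above rests on two points from the commit message: a contiguous `[value_dim, n_seq_tokens, n_seqs]` tensor and its `[value_dim, n_seq_tokens * n_seqs]` view address the same elements, and the CUDA matmul path is chosen from `src1->ne[1]`. A minimal C++ sketch of both, assuming a contiguous layout with `ne[0]` as the fastest-varying dimension. `flat_offset`, `views_alias` and `pick_kernel` are hypothetical helpers, and `pick_kernel` models only the `ne[1] <= 8` rule quoted in the commit message, not the actual ggml-cuda dispatch logic:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Element offset of (i0, i1, i2) in a contiguous tensor with extents ne,
// where ne[0] is the fastest-varying dimension.
static int64_t flat_offset(const std::array<int64_t, 3> & ne, int64_t i0, int64_t i1, int64_t i2) {
    assert(i0 < ne[0] && i1 < ne[1] && i2 < ne[2]);
    return i0 + ne[0] * (i1 + ne[1] * i2);
}

// True when every element has the same offset under the 3D view
// [value_dim, n_seq_tokens, n_seqs] and the 2D view [value_dim, n_seq_tokens * n_seqs].
static bool views_alias(int64_t value_dim, int64_t n_seq_tokens, int64_t n_seqs) {
    const std::array<int64_t, 3> ne3 = {value_dim, n_seq_tokens, n_seqs};
    const std::array<int64_t, 3> ne2 = {value_dim, n_seq_tokens * n_seqs, 1};
    for (int64_t s = 0; s < n_seqs; ++s) {
        for (int64_t t = 0; t < n_seq_tokens; ++t) {
            for (int64_t i = 0; i < value_dim; ++i) {
                if (flat_offset(ne3, i, t, s) != flat_offset(ne2, i, t + n_seq_tokens * s, 0)) {
                    return false;
                }
            }
        }
    }
    return true;
}

enum class MatmulKernel { MMVQ, MMQ };

// HYPOTHETICAL simplification: only the ne[1] <= 8 threshold from the commit
// message is modeled here.
static MatmulKernel pick_kernel(const std::array<int64_t, 3> & src1_ne) {
    return src1_ne[1] <= 8 ? MatmulKernel::MMVQ : MatmulKernel::MMQ;
}
```

Under this model, the decode shape `{6144, 1, 128}` selects MMVQ while the collapsed shape `{6144, 128, 1}` selects MMQ, and `views_alias(6144, 1, 128)` holds, which is why the change can be bit-identical: only the view metadata differs.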


@@ -1,769 +0,0 @@
From 58426b58aaf5431a59d499d513b2fe2d6ab990d8 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 18:55:54 +0200
Subject: [PATCH] feat(paged): qwen35 decode conv-state in-place fusion (patch
0021)
The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate
design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet
recurrence is already single-pass at the f32 byte floor), the decode conv path
was the only remaining bit-exact lever.
New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated
by a non-null src[3]). On the single-token decode path it replaces the four-op
conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu
+ ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per
(channel, sequence), assembles the width-K window in registers from the K-1 cached
taps plus the current qkv_mixed token, computes the depthwise conv with the SAME
ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv
output, and writes the 1-token-shifted ring state back IN PLACE into the conv
cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018
in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and
write target (the cache view) are disjoint buffers, so it is race-free by
construction with no ids/identity logic.
- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel,
src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring;
op_params[0]=fuse_silu)
- ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel +
ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv
- ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels)
+ branch in ggml_compute_forward_ssm_conv
- delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs
conv-tap gather; fuses conv+silu+shifted write-back)
- qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path
(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep
the original chain
- tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference
test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90.
Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1
(0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5
ac163882... both BYTE-IDENTICAL.
decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session:
dense q36-27b-nvfp4 : npl 32 199.76 -> 202.99 (+1.6%)
npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391)
MoE q36-35b-a3b : npl 32 421.72 -> 432.39 (+2.5%)
npl 128 689.74 -> 713.54 (+3.5%)
The lift holds in eager mode too (dense npl128 333.62 -> 342.97). Step time drops by 11.9 ms/step
(dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the
decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152);
conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state
conv-cache plumbing.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
CONV_STATE_FUSION_RESULTS.md | 106 +++++++++++++++++++++++++++++++
ggml/include/ggml.h | 16 +++++
ggml/src/ggml-cpu/ops.cpp | 73 ++++++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 112 +++++++++++++++++++++++++++++++++
ggml/src/ggml.c | 54 ++++++++++++++++
src/models/delta-net-base.cpp | 51 +++++++++++++++
src/models/models.h | 14 +++++
src/models/qwen35.cpp | 23 +++++--
src/models/qwen35moe.cpp | 23 +++++--
src/models/qwen3next.cpp | 29 ++++++---
tests/test-backend-ops.cpp | 47 ++++++++++++++
11 files changed, 526 insertions(+), 22 deletions(-)
create mode 100644 CONV_STATE_FUSION_RESULTS.md
diff --git a/CONV_STATE_FUSION_RESULTS.md b/CONV_STATE_FUSION_RESULTS.md
new file mode 100644
index 0000000..f59b6e5
--- /dev/null
+++ b/CONV_STATE_FUSION_RESULTS.md
@@ -0,0 +1,106 @@
+# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact)
+
+The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3).
+After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at
+the f32 byte floor), the conv path was the only remaining bit-exact decode lever.
+
+## What changed
+
+A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a
+non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain:
+
+ qkv_mixed transpose -> ggml_concat (build width-K window) [concat_cont 8.14 ms/step]
+ -> ggml_ssm_conv (depthwise conv) [ssm_conv_f32 ~8.6 ms/step]
+ -> ggml_silu [folded into ssm_conv on CUDA]
+ -> ggml_cpy of the shifted ring state into the conv cache [cpy_scalar 5.76 ms/step]
+
+with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from
+the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME
+ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes
+the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot
+the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is
+vLLM's `causal_conv1d_update`.
+
+Files:
+- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace`
+ (src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs],
+ src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu).
+- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32<apply_silu,d_conv>` (one thread per
+ (channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of
+ `ggml_cuda_op_ssm_conv`.
+- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over
+ channels) + branch in `ggml_compute_forward_ssm_conv`.
+- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs
+ conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write
+ target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed.
+- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path
+ (`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/
+ rollback keep the existing concat+ssm_conv+silu+cpy chain.
+- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output
+ vs the CPU reference across backends.
+
+## Gate: test-backend-ops (CUDA0 vs CPU reference)
+
+- SSM_CONV: 45/45 OK (unchanged path intact)
+- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
+- SSM_CONV_BIAS_SILU: 90/90 OK
+
+## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on)
+
+Byte-identical to the clean Lever-1 (0019/0020) baseline, both models:
+
+| model | baseline md5 | fused md5 | result |
+|--------------------|----------------------------------|----------------------------------|-----------------|
+| q36-27b-nvfp4 | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL |
+| q36-35b-a3b-nvfp4 | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL |
+
+## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after
+
+Dense q36-27b-nvfp4:
+
+| mode       | npl | baseline | fused  | delta |
+|------------|-----|----------|--------|-------|
+| CUDA-graph | 32  | 199.76   | 202.99 | +1.6% |
+| CUDA-graph | 128 | 336.35   | 347.14 | +3.2% |
+| eager      | 32  | 196.07   | 197.61 | +0.8% |
+| eager      | 128 | 333.62   | 342.97 | +2.8% |
+
+MoE q36-35b-a3b-nvfp4:
+
+| mode       | npl | baseline | fused  | delta |
+|------------|-----|----------|--------|-------|
+| CUDA-graph | 32  | 421.72   | 432.39 | +2.5% |
+| CUDA-graph | 128 | 689.74   | 713.54 | +3.5% |
+| eager      | 32  | 421.05   | 432.46 | +2.7% |
+| eager      | 128 | 689.15   | 713.87 | +3.6% |
+
+Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at
+**88.8% of vLLM 391** (up from 86.0%). The lift holds in BOTH graph and eager modes.
+
+## Step time + nsys kernel delta
+
+Per-step decode time (dense npl128, T_TG / ntg=128):
+- baseline 48.711 s / 128 = 380.6 ms/step
+- fused 47.197 s / 128 = 368.7 ms/step -> **-11.9 ms/step** (close to the predicted 12-14 ms saving)
+- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step)
+
+nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels:
+
+| kernel                | baseline calls | fused calls      | per-step (eager) |
+|-----------------------|----------------|------------------|------------------|
+| concat_cont (decode)  | 1152           | 0 (GONE)         | 7.95 -> 0 ms     |
+| cpy_scalar (decode)   | 1152 of 3648   | 0 (GONE)         | 4.29 -> 0 ms     |
+| ssm_conv_f32 (decode) | 1152 of 2736   | 0 (prefill-only) | 8.65 -> 0 ms     |
+| ssm_conv_update       | 0              | 1152             | 0 -> 7.56 ms     |
+
+Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont
+and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel.
+Prefill keeps the original chain (concat_non_cont 1584, ssm_conv_f32 1584 unchanged).
+
+## Verdict
+
+Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM
+391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free;
+de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
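
The results file above describes the fused update as equivalent to the concat + conv + silu + copy-back chain for a single decode token. A minimal C++ sketch of that equivalence for one (channel, sequence) pair, assuming f32 taps and no bias. `ConvStep`, `conv_step_unfused` and `conv_step_fused` are hypothetical names written for this illustration; they are not the ggml kernels from the patch:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

static float silu(float x) { return x / (1.0f + std::exp(-x)); }

struct ConvStep {
    float out;                 // conv output for the current token
    std::vector<float> state;  // K-1 taps kept for the next token
};

// Unfused chain: materialize the width-K window (the concat), convolve in
// ascending tap order, apply SiLU, then copy the 1-token-shifted window back.
static ConvStep conv_step_unfused(const std::vector<float> & taps, const std::vector<float> & w, float x_cur) {
    assert(w.size() == taps.size() + 1);
    std::vector<float> window(taps);
    window.push_back(x_cur);
    float sum = 0.0f;
    for (size_t j = 0; j < w.size(); ++j) {
        sum += window[j] * w[j];
    }
    return {silu(sum), std::vector<float>(window.begin() + 1, window.end())};
}

// Fused update: same ascending-tap accumulation order, but the window is never
// materialized and the shifted state is written directly.
static ConvStep conv_step_fused(const std::vector<float> & taps, const std::vector<float> & w, float x_cur) {
    const size_t k = w.size();
    assert(k >= 2 && taps.size() == k - 1);
    float sum = 0.0f;
    for (size_t j = 0; j + 1 < k; ++j) {
        sum += taps[j] * w[j];
    }
    sum += x_cur * w[k - 1];
    ConvStep r{silu(sum), std::vector<float>(k - 1)};
    for (size_t j = 0; j + 2 < k; ++j) {
        r.state[j] = taps[j + 1];  // drop the oldest tap
    }
    r.state[k - 2] = x_cur;        // append the current token
    return r;
}
```

Both functions accumulate the products in the same order, which is the property the patch relies on for bit-exactness. In this sketch, identical results additionally assume the compiler applies the same floating-point contraction to both loops, so treat it as an illustration of the ordering argument, not a proof for any particular backend.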
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 951dd21..76fa401 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2447,6 +2447,22 @@ extern "C" {
struct ggml_tensor * sx,
struct ggml_tensor * c);
+ // Fused decode-time depthwise causal conv1d update (mirrors vLLM causal_conv1d_update). Assembles
+ // the width-K conv window in registers from the cached K-1 taps (`conv_states`, [K-1, channels,
+ // n_seqs]) plus the single current token (`x_cur`, [channels, 1, n_seqs]), computes the depthwise
+ // conv with the SAME ascending-tap FMA order as ggml_ssm_conv, optionally folds SiLU, and writes
+ // the 1-token-shifted ring state back IN PLACE into `conv_state_dst` (a [(K-1)*channels, n_seqs]
+ // view into the conv-state cache). This eliminates the concat + transpose + scalar copy-back +
+ // separate silu of the decode conv path. Output: [channels, 1, n_seqs]. Reuses GGML_OP_SSM_CONV;
+ // detected by the backends via a non-null src[3]. n_seq_tokens must be 1 (single-token decode).
+ GGML_API struct ggml_tensor * ggml_ssm_conv_update_inplace(
+ struct ggml_context * ctx,
+ struct ggml_tensor * conv_states,
+ struct ggml_tensor * conv_kernel,
+ struct ggml_tensor * x_cur,
+ struct ggml_tensor * conv_state_dst,
+ bool fuse_silu);
+
GGML_API struct ggml_tensor * ggml_ssm_scan(
struct ggml_context * ctx,
struct ggml_tensor * s,
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index b6a1976..f9cd850 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -9463,13 +9463,84 @@ static void ggml_compute_forward_ssm_conv_f32(
}
}
+// Fused decode-time depthwise causal conv1d update (mirror of the CUDA ssm_conv_update_f32). Reads the
+// K-1 cached taps (src[0]) and the single new token (src[2]), computes the depthwise conv with the same
+// ascending-tap FMA order as ggml_compute_forward_ssm_conv_f32, optionally folds silu, writes the conv
+// output to dst, and writes the 1-token-shifted ring state back in place into src[3]. Threads split
+// over channels.
+static void ggml_compute_forward_ssm_conv_update_f32(
+ const ggml_compute_params * params,
+ ggml_tensor * dst) {
+ const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs]
+ const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
+ const ggml_tensor * x_cur = dst->src[2]; // [channels, 1, n_seqs]
+ ggml_tensor * cdst = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
+
+ const int ith = params->ith;
+ const int nth = params->nth;
+
+ const int64_t d_conv = conv_kernel->ne[0];
+ const int64_t channels = conv_kernel->ne[1];
+ const int64_t n_seqs = conv_states->ne[2];
+ const bool apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
+
+ GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+ GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
+
+ const int64_t states_seq_stride = conv_states->nb[2] / sizeof(float);
+ const int64_t states_ch_stride = conv_states->nb[1] / sizeof(float);
+ const int64_t w_stride = conv_kernel->nb[1] / sizeof(float);
+ const int64_t x_seq_stride = x_cur->nb[2] / sizeof(float);
+ const int64_t dst_seq_stride = dst->nb[2] / sizeof(float);
+ const int64_t cdst_seq_stride = cdst->nb[1] / sizeof(float);
+
+ const float * states_base = (const float *) conv_states->data;
+ const float * w_base = (const float *) conv_kernel->data;
+ const float * x_base = (const float *) x_cur->data;
+ float * cdst_base = (float *) cdst->data;
+ float * dst_base = (float *) dst->data;
+
+ const int64_t dc = (channels + nth - 1) / nth;
+ const int64_t c0 = dc * ith;
+ const int64_t c1 = MIN(c0 + dc, channels);
+
+ for (int64_t s = 0; s < n_seqs; ++s) {
+ for (int64_t c = c0; c < c1; ++c) {
+ const float * states_c = states_base + s * states_seq_stride + c * states_ch_stride;
+ const float * w_c = w_base + c * w_stride;
+ const float xc = x_base[s * x_seq_stride + c];
+
+ // ascending-tap FMA: tap0*w0 + ... + tap_{K-2}*w_{K-2} + xc*w_{K-1} (matches ssm_conv)
+ float sumf = 0.0f;
+ for (int64_t j = 0; j < d_conv - 1; ++j) {
+ sumf += states_c[j] * w_c[j];
+ }
+ sumf += xc * w_c[d_conv - 1];
+ sumf += 0.0f; // matches ssm_conv `sumf += b` with b == 0
+
+ dst_base[s * dst_seq_stride + c] = apply_silu ? (sumf / (1.0f + expf(-sumf))) : sumf;
+
+ // 1-token-shifted ring write-back: [tap1 .. tap_{K-2}, xc]
+ float * out_state = cdst_base + s * cdst_seq_stride + c * (d_conv - 1);
+ for (int64_t j = 0; j < d_conv - 2; ++j) {
+ out_state[j] = states_c[j + 1];
+ }
+ out_state[d_conv - 2] = xc;
+ }
+ }
+}
+
void ggml_compute_forward_ssm_conv(
const ggml_compute_params * params,
ggml_tensor * dst) {
switch (dst->src[0]->type) {
case GGML_TYPE_F32:
{
- ggml_compute_forward_ssm_conv_f32(params, dst);
+ if (dst->src[3] != nullptr) {
+ ggml_compute_forward_ssm_conv_update_f32(params, dst);
+ } else {
+ ggml_compute_forward_ssm_conv_f32(params, dst);
+ }
} break;
default:
{
diff --git a/ggml/src/ggml-cuda/ssm-conv.cu b/ggml/src/ggml-cuda/ssm-conv.cu
index 1463169..e1af1cd 100644
--- a/ggml/src/ggml-cuda/ssm-conv.cu
+++ b/ggml/src/ggml-cuda/ssm-conv.cu
@@ -123,6 +123,109 @@ static __global__ void ssm_conv_long_token_f32(const float * __restrict__ src0,
}
}
+// Fused decode-time depthwise causal conv1d update (one new token). Each thread owns one channel of
+// one sequence: it assembles the width-d_conv window from the K-1 cached taps (conv_states) plus the
+// current token (x_cur), computes the depthwise conv with the SAME ascending-tap FMA order as
+// ssm_conv_f32 at i==0, optionally folds silu, writes the conv output, and writes the 1-token-shifted
+// ring state back in place into conv_state_dst. Bit-identical to ssm_conv(concat) + silu + copy-back.
+template <bool apply_silu, int d_conv>
+static __global__ void ssm_conv_update_f32(const float * __restrict__ conv_states,
+ const float * __restrict__ conv_kernel,
+ const float * __restrict__ x_cur,
+ float * __restrict__ conv_state_dst,
+ float * __restrict__ dst,
+ const int channels,
+ const int states_seq_stride,
+ const int w_stride,
+ const int x_seq_stride,
+ const int dst_seq_stride,
+ const int cdst_seq_stride) {
+ const int c = blockIdx.x * blockDim.x + threadIdx.x; // channel
+ const int s = blockIdx.y; // sequence
+ if (c >= channels) {
+ return;
+ }
+
+ const float * states_c = conv_states + (int64_t) s * states_seq_stride + (int64_t) c * (d_conv - 1);
+ const float * w_c = conv_kernel + (int64_t) c * w_stride;
+ const float xc = x_cur[(int64_t) s * x_seq_stride + c];
+
+ // window = [tap0 .. tap_{K-2}, current-token], same ordering as the concat(conv_states, x) window
+ float window[d_conv];
+#pragma unroll
+ for (int j = 0; j < d_conv - 1; j++) {
+ window[j] = states_c[j];
+ }
+ window[d_conv - 1] = xc;
+
+ float sumf = 0.0f;
+#pragma unroll
+ for (int j = 0; j < d_conv; j++) {
+ sumf += window[j] * w_c[j];
+ }
+ sumf += 0.0f; // matches ssm_conv_f32 `sumf += b` with b == 0 (qwen35 conv1d has no bias)
+ dst[(int64_t) s * dst_seq_stride + c] = apply_silu ? ggml_cuda_op_silu_single(sumf) : sumf;
+
+ // 1-token-shifted ring write-back: drop the oldest tap, append the current token
+ float * out_state = conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1);
+#pragma unroll
+ for (int j = 0; j < d_conv - 1; j++) {
+ out_state[j] = window[j + 1];
+ }
+}
+
+static void ggml_cuda_op_ssm_conv_update(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+ const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs]
+ const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
+ const ggml_tensor * x_cur = dst->src[2]; // [channels, 1, n_seqs]
+ const ggml_tensor * cdst = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
+
+ const int64_t d_conv = conv_kernel->ne[0];
+ const int64_t channels = conv_kernel->ne[1];
+ const int64_t n_seqs = conv_states->ne[2];
+ const bool apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
+
+ GGML_ASSERT(conv_states->type == GGML_TYPE_F32 && conv_kernel->type == GGML_TYPE_F32);
+ GGML_ASSERT(x_cur->type == GGML_TYPE_F32 && cdst->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
+ GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+ GGML_ASSERT(conv_states->nb[1] == (size_t) (d_conv - 1) * sizeof(float));
+ GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
+ GGML_ASSERT(dst->ne[0] == channels && dst->ne[1] == 1 && dst->ne[2] == n_seqs);
+
+ const float * states_d = (const float *) conv_states->data;
+ const float * w_d = (const float *) conv_kernel->data;
+ const float * x_d = (const float *) x_cur->data;
+ float * cdst_d = (float *) cdst->data;
+ float * dst_d = (float *) dst->data;
+ cudaStream_t stream = ctx.stream();
+
+ const int states_seq_stride = (int) (conv_states->nb[2] / sizeof(float));
+ const int w_stride = (int) (conv_kernel->nb[1] / sizeof(float));
+ const int x_seq_stride = (int) (x_cur->nb[2] / sizeof(float));
+ const int dst_seq_stride = (int) (dst->nb[2] / sizeof(float));
+ const int cdst_seq_stride = (int) (cdst->nb[1] / sizeof(float));
+
+ const int threads = 128;
+ const dim3 blocks((channels + threads - 1) / threads, (unsigned) n_seqs, 1);
+
+ auto launch = [&](auto NC) {
+ constexpr int kNC = decltype(NC)::value;
+ if (apply_silu) {
+ ssm_conv_update_f32<true, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d,
+ (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
+ } else {
+ ssm_conv_update_f32<false, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d,
+ (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
+ }
+ };
+
+ switch (d_conv) {
+ case 3: launch(std::integral_constant<int, 3>{}); break;
+ case 4: launch(std::integral_constant<int, 4>{}); break;
+ default: GGML_ABORT("ssm_conv_update only supports d_conv 3 or 4");
+ }
+}
+
template <bool apply_silu>
static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1,
const int src0_nb2, const int src1_nb1, float * dst, const int dst_nb0, const int dst_nb1,
@@ -158,6 +261,15 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const floa
}
void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * bias_add_node, ggml_tensor * silu_dst) {
+ // Fused decode conv-update-in-place variant (ggml_ssm_conv_update_inplace): discriminated by a
+ // non-null src[3] (the in-place ring write-back target). It folds the concat/transpose/copy-back/
+ // silu of the decode conv path into a single kernel.
+ if (dst->src[3] != nullptr) {
+ GGML_ASSERT(bias_add_node == nullptr && silu_dst == nullptr);
+ ggml_cuda_op_ssm_conv_update(ctx, dst);
+ return;
+ }
+
const struct ggml_tensor * src0 = dst->src[0]; // conv_x
const struct ggml_tensor * src1 = dst->src[1]; // conv1d.weight
const bool fuse_bias = bias_add_node != nullptr;
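Reviewer note: the per-channel work of ssm_conv_update_f32 above can be sketched on the host as follows. This is a minimal reference under stated assumptions, not code from the patch: the name conv_update_ref is hypothetical, it handles a single (sequence, channel) pair, and the SiLU formula x / (1 + exp(-x)) is assumed to match ggml_cuda_op_silu_single. A host compiler may contract the multiply-adds differently from the GPU, so this sketch makes no bit-exactness claim.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of the patch) for one (sequence, channel) pair.
// state: the K-1 cached taps, oldest first; w: the K kernel taps; x: the current token.
// Returns the conv output and shifts `state` by one token in place.
float conv_update_ref(std::vector<float> & state, const std::vector<float> & w, float x, bool apply_silu) {
    const std::size_t K = w.size();
    assert(K >= 2 && state.size() == K - 1);

    // window = [tap0 .. tap_{K-2}, current token]
    std::vector<float> window(state);
    window.push_back(x);

    // ascending-tap accumulation, same visit order as the kernel loop
    float sumf = 0.0f;
    for (std::size_t j = 0; j < K; ++j) {
        sumf += window[j] * w[j];
    }

    // ring write-back: drop the oldest tap, append the current token
    for (std::size_t j = 0; j + 1 < K; ++j) {
        state[j] = window[j + 1];
    }

    // SiLU written as x * sigmoid(x); assumed to match ggml_cuda_op_silu_single
    return apply_silu ? sumf / (1.0f + std::exp(-sumf)) : sumf;
}
```

Calling it once per decode step with the same state vector reproduces the ring behaviour: after each call the state holds the K-1 most recent inputs, oldest first.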
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index 1762037..b777748 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -5555,6 +5555,60 @@ struct ggml_tensor * ggml_ssm_conv(
return result;
}
+// ggml_ssm_conv_update_inplace
+//
+// Fused decode-time depthwise causal conv1d update. Reuses GGML_OP_SSM_CONV but is discriminated by a
+// non-null src[3]. The op reads each channel's K-1 cached taps from `conv_states` and the single new
+// token from `x_cur`, computes the depthwise conv (ascending-tap FMA, bit-identical to ggml_ssm_conv),
+// optionally folds SiLU, writes the conv output to dst ([channels, 1, n_seqs]) and writes the
+// 1-token-shifted ring state back in place into `conv_state_dst` (the active sequences' conv-cache
+// slot). op_params[0] carries the fuse_silu flag. Mirrors the 0018/0019 in-place state pattern.
+struct ggml_tensor * ggml_ssm_conv_update_inplace(
+ struct ggml_context * ctx,
+ struct ggml_tensor * conv_states,
+ struct ggml_tensor * conv_kernel,
+ struct ggml_tensor * x_cur,
+ struct ggml_tensor * conv_state_dst,
+ bool fuse_silu) {
+ GGML_ASSERT(ggml_is_3d(conv_states));
+ GGML_ASSERT(ggml_is_matrix(conv_kernel));
+ GGML_ASSERT(ggml_is_3d(x_cur));
+
+ const int64_t d_conv = conv_kernel->ne[0];
+ const int64_t channels = conv_kernel->ne[1];
+ const int64_t n_seqs = conv_states->ne[2];
+
+ GGML_ASSERT(conv_states->type == GGML_TYPE_F32);
+ GGML_ASSERT(conv_kernel->type == GGML_TYPE_F32);
+ GGML_ASSERT(x_cur->type == GGML_TYPE_F32);
+ GGML_ASSERT(conv_state_dst != NULL && conv_state_dst->type == GGML_TYPE_F32);
+
+ // conv_states: [K-1, channels, n_seqs], contiguous taps per channel
+ GGML_ASSERT(conv_states->ne[0] == d_conv - 1);
+ GGML_ASSERT(conv_states->ne[1] == channels);
+ GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+ // x_cur: single decode token per sequence
+ GGML_ASSERT(x_cur->ne[0] == channels);
+ GGML_ASSERT(x_cur->ne[1] == 1);
+ GGML_ASSERT(x_cur->ne[2] == n_seqs);
+ // conv_state_dst: [(K-1)*channels, n_seqs] in-place ring write target
+ GGML_ASSERT(conv_state_dst->ne[0] == (d_conv - 1) * channels);
+ GGML_ASSERT(conv_state_dst->ne[1] >= n_seqs);
+ GGML_ASSERT(conv_state_dst->nb[0] == sizeof(float));
+
+ struct ggml_tensor * result = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
+
+ ggml_set_op_params_i32(result, 0, fuse_silu ? 1 : 0);
+
+ result->op = GGML_OP_SSM_CONV;
+ result->src[0] = conv_states;
+ result->src[1] = conv_kernel;
+ result->src[2] = x_cur;
+ result->src[3] = conv_state_dst;
+
+ return result;
+}
+
// ggml_ssm_scan
struct ggml_tensor * ggml_ssm_scan(
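Reviewer note: the shape asserts in ggml_ssm_conv_update_inplace imply a simple addressing rule for the in-place target, which the CUDA kernel uses when it computes out_state. The sketch below spells that rule out. The helper name conv_state_dst_index is hypothetical, and seq_stride is assumed to be the row stride in floats (nb[1] / sizeof(float)), which may be larger than the row length when the target is a view into a bigger cache.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the patch): flat float index of tap j for channel c of
// sequence s inside conv_state_dst. Each sequence row holds (d_conv - 1) * channels floats,
// with the d_conv - 1 taps of one channel stored contiguously.
// seq_stride is the row stride in floats; it may exceed the row length.
int64_t conv_state_dst_index(int64_t s, int64_t c, int64_t j,
                             int64_t d_conv, int64_t channels, int64_t seq_stride) {
    assert(d_conv >= 2);
    assert(s >= 0);
    assert(c >= 0 && c < channels);
    assert(j >= 0 && j < d_conv - 1);
    assert(seq_stride >= (d_conv - 1) * channels);
    return s * seq_stride + c * (d_conv - 1) + j;
}
```

With d_conv = 4 and channels = 256 a densely packed row is 768 floats, so the last tap of the last channel of sequence 0 sits at index 767 and sequence 1 starts at 768.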
diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
index 194e611..0eee804 100644
--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
@@ -524,6 +524,57 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
return conv_input;
}
+// Fused decode conv path (patch 0021). Reads the active sequences' prior conv-state taps (the same
+// cheap build_rs gather as build_conv_state), then fuses the depthwise conv + silu + the 1-token-
+// shifted ring write-back into a single ggml_ssm_conv_update_inplace op. This removes the concat
+// (concat_cont), the transpose materialization, the scalar copy-back (cpy_scalar) and the separate
+// silu of the decode conv path. The op reads from the (disjoint) materialized taps and writes the
+// new ring state in place into the cache slot at kv_head -- exactly the slot the baseline ggml_cpy
+// wrote -- so it is bit-identical to build_conv_state + ggml_ssm_conv + ggml_silu.
+ggml_tensor * llm_build_delta_net_base::build_conv_state_fused(
+ llm_graph_input_rs * inp,
+ ggml_tensor * conv_states_all,
+ ggml_tensor * qkv_mixed,
+ ggml_tensor * conv_kernel,
+ int64_t conv_kernel_size,
+ int64_t conv_channels,
+ int il) {
+ const auto * mctx_cur = inp->mctx;
+ const auto kv_head = mctx_cur->get_head();
+
+ const int64_t n_seqs = ubatch.n_seqs;
+ const int64_t n_seq_tokens = ubatch.n_seq_tokens;
+
+ GGML_ASSERT(n_seq_tokens == 1); // single-token decode only
+ GGML_ASSERT(cparams.n_rs_seq == 0); // no rollback splits on this path
+
+ // Prior conv-state taps for the active sequences: [K-1, conv_channels, n_seqs]. Same get_rows
+ // gather as the baseline build_conv_state read (tiny; not one of the eliminated buckets).
+ ggml_tensor * conv_states = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs);
+ conv_states = ggml_reshape_3d(ctx0, conv_states, conv_kernel_size - 1, conv_channels, n_seqs);
+ cb(conv_states, "conv_states_reshaped", il);
+
+ // Current token, native (non-transposed) qkv_mixed: [conv_channels, 1, n_seqs].
+ ggml_tensor * x_cur = ggml_reshape_3d(ctx0, qkv_mixed, conv_channels, n_seq_tokens, n_seqs);
+
+ // In-place ring write-back target = the active sequences' conv-cache slot at kv_head, exactly the
+ // destination the baseline ggml_cpy wrote to (s_slot == 0).
+ const int64_t row_count = (conv_kernel_size - 1) * conv_channels;
+ const size_t row_size = ggml_row_size(conv_states_all->type, row_count);
+ ggml_tensor * conv_state_dst =
+ ggml_view_2d(ctx0, conv_states_all, row_count, n_seqs, conv_states_all->nb[1], kv_head * row_size);
+ cb(conv_state_dst, "conv_state_update", il);
+
+ ggml_tensor * conv_output =
+ ggml_ssm_conv_update_inplace(ctx0, conv_states, conv_kernel, x_cur, conv_state_dst, /*fuse_silu=*/true);
+ cb(conv_output, "conv_output_silu", il);
+
+ // the ring write is a side effect of the op; pull the op into the graph via the output
+ ggml_build_forward_expand(gf, conv_output);
+
+ return conv_output; // [conv_channels, 1, n_seqs], already silu'd
+}
+
// Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
// gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
// ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
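Reviewer note: the ggml_view_2d call in build_conv_state_fused selects the cache rows starting at kv_head. The arithmetic behind row_count, row_size and the byte offset can be sketched as below. The struct and function names are hypothetical, and the sketch assumes an F32 cache (the op asserts F32 for conv_state_dst), for which a row of n elements occupies n * sizeof(float) bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of the in-place write target (not part of the patch).
struct ConvStateView {
    int64_t     row_count;    // floats per sequence row: (K - 1) * conv_channels
    std::size_t row_size;     // bytes per row, assuming an F32 cache
    std::size_t byte_offset;  // start of the kv_head slot inside conv_states_all
};

ConvStateView conv_state_view(int64_t conv_kernel_size, int64_t conv_channels, int64_t kv_head) {
    assert(conv_kernel_size >= 2 && conv_channels > 0 && kv_head >= 0);
    ConvStateView v;
    v.row_count   = (conv_kernel_size - 1) * conv_channels;
    v.row_size    = static_cast<std::size_t>(v.row_count) * sizeof(float);
    v.byte_offset = static_cast<std::size_t>(kv_head) * v.row_size;
    return v;
}
```

For example, a kernel size of 4 with 3328 channels (one of the channel counts used in the backend test) gives 9984 floats per row, and a slot at kv_head = 5 starts 199680 bytes into the cache.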
diff --git a/src/models/models.h b/src/models/models.h
index 98b89e9..da0dd86 100644
--- a/src/models/models.h
+++ b/src/models/models.h
@@ -76,6 +76,20 @@ struct llm_build_delta_net_base : public llm_graph_context {
int64_t conv_channels,
int il);
+ // Fused decode-time conv path (patch 0021). Replaces the concat + transpose + ssm_conv + silu +
+ // copy-back chain with a single ggml_ssm_conv_update_inplace op that reads the cached K-1 taps and
+ // the current token, computes the depthwise conv, folds silu, and writes the 1-token-shifted ring
+ // state back in place. Decode-only (n_seq_tokens == 1, n_rs_seq == 0). Returns the silu'd conv
+ // output: (conv_channels, 1, n_seqs). Bit-identical to the build_conv_state + ggml_ssm_conv chain.
+ ggml_tensor * build_conv_state_fused(
+ llm_graph_input_rs * inp,
+ ggml_tensor * conv_states_all,
+ ggml_tensor * qkv_mixed,
+ ggml_tensor * conv_kernel,
+ int64_t conv_kernel_size,
+ int64_t conv_channels,
+ int il);
+
// run delta-net attention and write the new recurrent state(s) back to ssm_states_all
// s: (head_v_dim, head_v_dim, num_v_heads, n_seqs); returns output: (head_v_dim, num_v_heads, n_seq_tokens, n_seqs)
ggml_tensor * build_recurrent_attn(
diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
index 0874c43..b6dcc5f 100644
--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
@@ -383,15 +383,26 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
const int64_t conv_kernel_size = conv_kernel->ne[0];
const int64_t conv_channels = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
+ // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
+ // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
+ const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
+
+ ggml_tensor * conv_qkv_mix;
+ if (conv_decode_fused) {
+ conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
+ conv_kernel_size, conv_channels, il);
+ } else {
+ ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
- cb(conv_output_proper, "conv_output_raw", il);
+ ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+ cb(conv_output_proper, "conv_output_raw", il);
- ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
- cb(conv_output_silu, "conv_output_silu", il);
+ ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+ cb(conv_output_silu, "conv_output_silu", il);
- ggml_tensor * conv_qkv_mix = conv_output_silu;
+ conv_qkv_mix = conv_output_silu;
+ }
// Calculate the total conv dimension
int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
index 1f6f643..c7c7c44 100644
--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
@@ -407,15 +407,26 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
const int64_t conv_kernel_size = conv_kernel->ne[0];
const int64_t conv_channels = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
+ // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
+ // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
+ const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
+
+ ggml_tensor * conv_qkv_mix;
+ if (conv_decode_fused) {
+ conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
+ conv_kernel_size, conv_channels, il);
+ } else {
+ ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
- cb(conv_output_proper, "conv_output_raw", il);
+ ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+ cb(conv_output_proper, "conv_output_raw", il);
- ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
- cb(conv_output_silu, "conv_output_silu", il);
+ ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+ cb(conv_output_silu, "conv_output_silu", il);
- ggml_tensor * conv_qkv_mix = conv_output_silu;
+ conv_qkv_mix = conv_output_silu;
+ }
// Calculate the total conv dimension
int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
index bfdf026..92749d1 100644
--- a/src/models/qwen3next.cpp
+++ b/src/models/qwen3next.cpp
@@ -434,19 +434,30 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
const int64_t conv_kernel_size = conv_kernel->ne[0];
const int64_t conv_channels = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
+ // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
+ // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
+ const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
+
+ ggml_tensor * conv_qkv_mix;
+ if (conv_decode_fused) {
+ conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
+ conv_kernel_size, conv_channels, il);
+ } else {
+ ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
- state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
- cb(state, "state_predelta", il);
+ ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+ cb(conv_output_proper, "conv_output_raw", il);
- ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
- cb(conv_output_proper, "conv_output_raw", il);
+ ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+ cb(conv_output_silu, "conv_output_silu", il);
- ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
- cb(conv_output_silu, "conv_output_silu", il);
+ conv_qkv_mix = conv_output_silu;
+ }
- ggml_tensor * conv_qkv_mix = conv_output_silu;
+ ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+ state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
+ cb(state, "state_predelta", il);
// Calculate the total conv dimension
int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 291c275..c7348d6 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -3748,6 +3748,43 @@ struct test_ssm_conv_bias_silu : public test_case {
}
};
+// GGML_OP_SSM_CONV fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021).
+// Validates the conv + silu output (dst) against the CPU reference across backends. The 1-token-
+// shifted ring write-back to conv_state_dst is a side effect (validated end-to-end by the greedy
+// md5 gate); here it just exercises the in-place write target as an op src.
+struct test_ssm_conv_update : public test_case {
+ const int64_t d_conv;
+ const int64_t channels;
+ const int64_t n_seqs;
+
+ std::string op_desc(ggml_tensor * t) override {
+ GGML_UNUSED(t);
+ return "SSM_CONV_UPDATE";
+ }
+
+ std::string vars() override {
+ return VARS_TO_STR3(d_conv, channels, n_seqs);
+ }
+
+ test_ssm_conv_update(int64_t d_conv = 4, int64_t channels = 256, int64_t n_seqs = 4)
+ : d_conv(d_conv), channels(channels), n_seqs(n_seqs) {}
+
+ ggml_tensor * build_graph(ggml_context * ctx) override {
+ ggml_tensor * conv_states = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_conv - 1, channels, n_seqs);
+ ggml_tensor * conv_kernel = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, channels);
+ ggml_tensor * x_cur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
+ ggml_tensor * conv_state_dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (d_conv - 1) * channels, n_seqs);
+ ggml_set_name(conv_states, "conv_states");
+ ggml_set_name(conv_kernel, "conv_kernel");
+ ggml_set_name(x_cur, "x_cur");
+ ggml_set_name(conv_state_dst, "conv_state_dst");
+
+ ggml_tensor * out = ggml_ssm_conv_update_inplace(ctx, conv_states, conv_kernel, x_cur, conv_state_dst, true);
+ ggml_set_name(out, "out");
+ return out;
+ }
+};
+
// GGML_OP_SSM_SCAN
struct test_ssm_scan : public test_case {
const ggml_type type;
@@ -8355,6 +8392,16 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
}
}
+ // fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021). channels must be
+ // a multiple of 128 for the CUDA SSM_CONV supports_op gate.
+ for (int64_t d_conv : {3, 4}) {
+ for (int64_t channels : {256, 3328}) {
+ for (int64_t n_seqs : {1, 4, 32, 128}) {
+ test_cases.emplace_back(new test_ssm_conv_update(d_conv, channels, n_seqs));
+ }
+ }
+ }
+
test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1, 1024, 1, 32, 4)); // Mamba-1
test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 128, 64, 16, 2, 32, 4)); // Mamba-2
test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 256, 64, 8, 2, 32, 4)); // Falcon-H1
--
2.43.0
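Reviewer note: the test grid added to make_test_cases_eval can be mirrored in a few lines to confirm how many cases it produces and that every channel count satisfies the multiple-of-128 requirement mentioned in the comment. The type and function names below are hypothetical and the sketch does not touch ggml; it only re-enumerates the same loop nest.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mirror of one test_ssm_conv_update parameter tuple (not part of the patch).
struct ConvUpdateCase {
    int64_t d_conv;
    int64_t channels;
    int64_t n_seqs;
};

// Re-enumerates the loop nest from make_test_cases_eval in the same order.
std::vector<ConvUpdateCase> conv_update_cases() {
    std::vector<ConvUpdateCase> cases;
    for (int64_t d_conv : {3, 4}) {
        for (int64_t channels : {256, 3328}) {
            for (int64_t n_seqs : {1, 4, 32, 128}) {
                cases.push_back({d_conv, channels, n_seqs});
            }
        }
    }
    return cases;
}

// True when every case has a channel count divisible by `multiple`.
bool all_channels_aligned(const std::vector<ConvUpdateCase> & cases, int64_t multiple) {
    assert(multiple > 0);
    for (const ConvUpdateCase & c : cases) {
        if (c.channels % multiple != 0) {
            return false;
        }
    }
    return true;
}
```

The grid yields 2 * 2 * 4 = 16 cases, and both 256 and 3328 (26 * 128) pass the alignment check.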


@@ -1,403 +0,0 @@
From 8a3229f41d5b712e87901796dfae3faee1f2f07d Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 20:32:55 +0200
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet decode
occupancy/coalescing retune (patch 0022)

Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
kernel. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in
float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth
efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak),
the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA
order byte-identical (md5-gateable). The bf16-state plan stays shelved.

Column folding: two new template params NUM_WARPS (default 4) and COLS_PER_WARP
(default 1). Each warp now owns COLS_PER_WARP columns of the 128x128 recurrent
state instead of 1, looping the existing per-column body over col, col+NUM_WARPS,
... within a per-block column tile of NUM_WARPS*COLS_PER_WARP columns;
grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The S_v rows of every column stay sharded
across the lanes by the same strided i = r*warp_size + lane mapping, and every
column's per-lane FMA accumulation and warp_reduce_sum butterfly are byte-for-byte
unchanged; only the (warp,block)->column assignment and visit order differ, which a
column's value provably does not depend on (columns are fully independent). This
raises per-warp memory-level parallelism ~COLS_PER_WARP-fold (independent
state-load bursts before any reduction + interleaved butterfly reductions hiding
each other's shfl latency), covering more DRAM latency on this bandwidth-bound
kernel. Every global access stays identically coalesced, so it is a scheduling /
latency-coverage win, not a coalescing change. The forbidden float4 state load
(which would repartition a lane to 4 contiguous rows and change the reduction
grouping) is NOT done, so the md5 stays invariant. The S_v=128 tile is
env-selectable (GDN_NW / GDN_CPW) for one-build re-tuning; default is the measured
GB10 winner (16, 8).

GB10 (CUDA 13, sm_121, nsys CUPTI timing - HW counters perm-blocked):
gated_delta_net B=128 decode call (805.3 MB f32 R+W) 4.02 -> 3.49 ms/call,
200.3 -> 230.9 GB/s = 73.4% -> 84.6% of 273 GB/s peak (now above vLLM's 82.4%;
102.6% of vLLM's recurrence bandwidth). decode S_TG t/s (npp128 ntg128, -fa on):
dense 27b npl128 335.9 -> 373.2 (+11.1%), npl32 199.2 -> 207.6 (+4.2%); MoE
35b-a3b npl128 688.4 -> 745.7 (+8.3%), npl32 420.6 -> 440.0 (+4.6%). Prefill
unchanged.

Bit-exact: greedy --temp 0 --seed 1 md5 byte-identical to the 0021 baseline on
both q36-27b-nvfp4 and q36-35b-a3b-nvfp4 (winner 16x8 and 4x1 control);
test-backend-ops -o GATED_DELTA_NET 36/36 PASS.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
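Reviewer note: the column-folding argument in the commit message relies on the (block, warp, cc) -> column mapping visiting every column of the S_v-wide state exactly once. The host-side sketch below checks that property by brute force. The function names are hypothetical and nothing here runs on the GPU; it only replays the index arithmetic col = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y + cc * NUM_WARPS described for the kernel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side check (not part of the patch): count how many times each column
// of an S_v-wide state is visited under the (block, warp, cc) -> column mapping.
std::vector<int> column_visit_counts(int S_v, int num_warps, int cols_per_warp) {
    const int tile = num_warps * cols_per_warp;
    assert(tile > 0 && S_v % tile == 0);  // same divisibility rule as the launcher's static_assert

    std::vector<int> visits(S_v, 0);
    for (int bz = 0; bz < S_v / tile; ++bz) {           // plays the role of blockIdx.z
        for (int warp = 0; warp < num_warps; ++warp) {  // plays the role of threadIdx.y
            const int col_base = bz * tile + warp;
            for (int cc = 0; cc < cols_per_warp; ++cc) {
                ++visits[col_base + cc * num_warps];
            }
        }
    }
    return visits;
}

// True when every column is owned by exactly one (block, warp, cc) triple.
bool covers_each_column_once(int S_v, int num_warps, int cols_per_warp) {
    for (int count : column_visit_counts(S_v, num_warps, cols_per_warp)) {
        if (count != 1) {
            return false;
        }
    }
    return true;
}
```

Both the 16x8 default and the 4x1 control named in the message pass this check for S_v = 128, which is consistent with the claim that only the assignment and visit order of columns change.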
ggml/src/ggml-cuda/gated_delta_net.cu | 236 +++++++++++++++++---------
1 file changed, 157 insertions(+), 79 deletions(-)
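Reviewer note: the bandwidth figures quoted in the commit message can be re-derived from the bytes moved and the per-call time. The sketch below is a worked version of that arithmetic with hypothetical helper names; it assumes MB means 10^6 bytes so that MB per millisecond equals GB per second. With the rounded 3.49 ms the result is about 230.7 GB/s, slightly under the reported 230.9 GB/s, which is presumably due to rounding of the published time.

```cpp
#include <cassert>

// Hypothetical helpers (not part of the patch).
// Effective bandwidth in GB/s for `megabytes` (10^6 bytes) moved in `milliseconds`.
double effective_gbps(double megabytes, double milliseconds) {
    assert(milliseconds > 0.0);
    return megabytes / milliseconds;  // MB per ms is numerically GB per s
}

// Share of a peak bandwidth, in percent.
double percent_of_peak(double gbps, double peak_gbps) {
    assert(peak_gbps > 0.0);
    return 100.0 * gbps / peak_gbps;
}
```

For instance, effective_gbps(805.3, 4.02) is roughly 200.3, and percent_of_peak(230.9, 273.0) is roughly 84.6, matching the before and after figures in the message.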
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index 86d5e2a..d071d5a 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -1,6 +1,8 @@
#include "gated_delta_net.cuh"
#include "ggml-cuda/common.cuh"
+#include <cstdlib>
+
// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
// destination slot by the recurrence kernel and are skipped here. One block per sequence.
@@ -29,8 +31,22 @@ static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * i
gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
}
-template <int S_v, bool KDA, bool keep_rs_t>
-__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
+// Occupancy/coalescing retune (patch 0022). Each warp owns COLS_PER_WARP columns of the recurrent
+// state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a
+// per-block column tile of size NUM_WARPS*COLS_PER_WARP. The S_v rows of every column stay sharded
+// across the lanes by the SAME strided mapping i = r*warp_size + lane, and every column's per-lane
+// FMA accumulation and warp_reduce_sum<warp_size> butterfly are byte-for-byte unchanged. Only the
+// (warp,block)->column assignment and the order a warp visits its columns differ, and a column's
+// f32 value provably does not depend on either (columns are fully independent: column c reads only
+// its own S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). So the result
+// and the stored final state are bit-identical to the COLS_PER_WARP==1 baseline (md5-gateable),
+// while per-warp memory-level parallelism rises ~COLS_PER_WARP-fold (COLS_PER_WARP independent
+// state-load bursts issued before any reduction, and the independent butterfly reductions interleave
+// to hide each other's shfl latency) which covers more DRAM latency on this bandwidth-bound kernel.
+// Every individual global access stays IDENTICALLY coalesced (32 consecutive lanes -> one 128B
+// sector), so this is a latency-coverage / scheduling win, not a coalescing change.
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS = 4, int COLS_PER_WARP = 1, int MIN_BLOCKS = 2>
+__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * NUM_WARPS, MIN_BLOCKS)
gated_delta_net_cuda(const float * q,
const float * k,
const float * v,
@@ -59,9 +75,9 @@ gated_delta_net_cuda(const float * q,
int rs_head) {
const uint32_t h_idx = blockIdx.x;
const uint32_t sequence = blockIdx.y;
- // each warp owns one column, using warp-level primitives to reduce across rows
+ // each warp owns COLS_PER_WARP columns, using warp-level primitives to reduce across rows.
const int lane = threadIdx.x;
- const int col = blockIdx.z * blockDim.y + threadIdx.y;
+ const int col_base = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y;
const uint32_t iq1 = fastmodulo(h_idx, neqk1_magic);
const uint32_t iq3 = fastdiv(sequence, rq3_magic);
@@ -86,20 +102,25 @@ gated_delta_net_cuda(const float * q,
// writing the same slot per block (identity) is race-free.
const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
? state_dst : curr_state;
- read_state += state_in_offset + col * S_v;
+ read_state += state_in_offset;
attn_data += (sequence * n_tokens * H + h_idx) * S_v;
constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
static_assert(S_v % warp_size == 0, "S_v must be a multiple of warp_size");
constexpr int rows_per_lane = (S_v + warp_size - 1) / warp_size;
- float s_shard[rows_per_lane];
- // state is stored transposed: M[col][i] = S[i][col], row col is contiguous
+ // per-column register shard of the recurrent state; state is stored transposed: M[col][i] = S[i][col].
+ float s_shard[COLS_PER_WARP][rows_per_lane];
ggml_cuda_pdl_sync();
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- s_shard[r] = read_state[i];
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+ const float * rs = read_state + col * S_v;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ s_shard[cc][r] = rs[i];
+ }
}
for (int t = 0; t < n_tokens; t++) {
@@ -113,7 +134,7 @@ gated_delta_net_cuda(const float * q,
const float beta_val = *beta_t;
- // Cache k and q in registers
+ // Cache k and q in registers (shared across the COLS_PER_WARP columns of this warp).
float k_reg[rows_per_lane];
float q_reg[rows_per_lane];
#pragma unroll
@@ -126,59 +147,69 @@ gated_delta_net_cuda(const float * q,
if constexpr (!KDA) {
const float g_val = expf(*g_t);
- // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
- float kv_shard = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- kv_shard += s_shard[r] * k_reg[r];
- }
- float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
- // delta[col] = (v[col] - g * kv[col]) * beta
- float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
+ // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
+ float kv_shard = 0.0f;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ kv_shard += s_shard[cc][r] * k_reg[r];
+ }
+ float kv_col = warp_reduce_sum<warp_size>(kv_shard);
- // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
- float attn_partial = 0.0f;
+ // delta[col] = (v[col] - g * kv[col]) * beta
+ float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
+
+ // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
+ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+ float attn_partial = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- s_shard[r] = g_val * s_shard[r] + k_reg[r] * delta_col;
- attn_partial += s_shard[r] * q_reg[r];
- }
+ for (int r = 0; r < rows_per_lane; r++) {
+ s_shard[cc][r] = g_val * s_shard[cc][r] + k_reg[r] * delta_col;
+ attn_partial += s_shard[cc][r] * q_reg[r];
+ }
- float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+ float attn_col = warp_reduce_sum<warp_size>(attn_partial);
- if (lane == 0) {
- attn_data[col] = attn_col * scale;
+ if (lane == 0) {
+ attn_data[col] = attn_col * scale;
+ }
}
} else {
- // kv[col] = sum_i g[i] * S[i][col] * k[i]
- float kv_shard = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- kv_shard += expf(g_t[i]) * s_shard[r] * k_reg[r];
- }
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+
+ // kv[col] = sum_i g[i] * S[i][col] * k[i]
+ float kv_shard = 0.0f;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ kv_shard += expf(g_t[i]) * s_shard[cc][r] * k_reg[r];
+ }
- float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+ float kv_col = warp_reduce_sum<warp_size>(kv_shard);
- // delta[col] = (v[col] - kv[col]) * beta
- float delta_col = (v_t[col] - kv_col) * beta_val;
+ // delta[col] = (v[col] - kv[col]) * beta
+ float delta_col = (v_t[col] - kv_col) * beta_val;
- // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
- float attn_partial = 0.0f;
+ // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
+ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+ float attn_partial = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- s_shard[r] = expf(g_t[i]) * s_shard[r] + k_reg[r] * delta_col;
- attn_partial += s_shard[r] * q_reg[r];
- }
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ s_shard[cc][r] = expf(g_t[i]) * s_shard[cc][r] + k_reg[r] * delta_col;
+ attn_partial += s_shard[cc][r] * q_reg[r];
+ }
- float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+ float attn_col = warp_reduce_sum<warp_size>(attn_partial);
- if (lane == 0) {
- attn_data[col] = attn_col * scale;
+ if (lane == 0) {
+ attn_data[col] = attn_col * scale;
+ }
}
}
@@ -190,11 +221,15 @@ gated_delta_net_cuda(const float * q,
const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
const int target_slot = (int) n_tokens - 1 - t;
if (target_slot >= 0 && target_slot < K) {
- float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- curr_state[col * S_v + i] = s_shard[r];
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+ float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ curr_state[col * S_v + i] = s_shard[cc][r];
+ }
}
}
}
@@ -202,13 +237,48 @@ gated_delta_net_cuda(const float * q,
if constexpr (!keep_rs_t) {
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- state[col * S_v + i] = s_shard[r];
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ state[col * S_v + i] = s_shard[cc][r];
+ }
}
}
}
+// Default column-folding tile for the S_v==128 decode/prefill path (the GDN head dim of this model).
+// Measured winner of the bit-exact occupancy sweep (patch 0022). Override at runtime for the sweep
+// via GDN_NW / GDN_CPW; all selectable variants are bit-identical, only %peak differs.
+#ifndef GDN_DEFAULT_NW
+#define GDN_DEFAULT_NW 16
+#endif
+#ifndef GDN_DEFAULT_CPW
+#define GDN_DEFAULT_CPW 8
+#endif
+
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS, int COLS_PER_WARP, int MIN_BLOCKS>
+static void launch_gdn_variant(
+ const float * q_d, const float * k_d, const float * v_d,
+ const float * g_d, const float * b_d, const float * s_d,
+ float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head,
+ int64_t H, int64_t n_tokens, int64_t n_seqs,
+ int64_t sq1, int64_t sq2, int64_t sq3,
+ int64_t sv1, int64_t sv2, int64_t sv3,
+ int64_t sb1, int64_t sb2, int64_t sb3,
+ const uint3 neqk1_magic, const uint3 rq3_magic,
+ float scale, int K, int warp_size, cudaStream_t stream) {
+ static_assert(S_v % (NUM_WARPS * COLS_PER_WARP) == 0, "NUM_WARPS*COLS_PER_WARP must divide S_v");
+ dim3 grid_dims(H, n_seqs, S_v / (NUM_WARPS * COLS_PER_WARP));
+ dim3 block_dims(warp_size <= S_v ? warp_size : S_v, NUM_WARPS, 1);
+ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
+ ggml_cuda_kernel_launch(gated_delta_net_cuda<S_v, KDA, keep_rs_t, NUM_WARPS, COLS_PER_WARP, MIN_BLOCKS>, launch_params,
+ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+ n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+}
+
template <bool KDA, bool keep_rs_t>
static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
@@ -223,47 +293,55 @@ static void launch_gated_delta_net(
float scale, int K, cudaStream_t stream) {
//TODO: Add chunked kernel for even faster pre-fill
const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
- const int num_warps = 4;
- dim3 grid_dims(H, n_seqs, (S_v + num_warps - 1) / num_warps);
- dim3 block_dims(warp_size <= S_v ? warp_size : S_v, num_warps, 1);
const uint3 neqk1_magic = init_fastdiv_values(neqk1);
const uint3 rq3_magic = init_fastdiv_values(rq3);
- int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+#define GDN_LAUNCH_ARGS \
+ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
+ H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
+ neqk1_magic, rq3_magic, scale, K, warp_size, stream
- const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
switch (S_v) {
case 16:
- ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ launch_gdn_variant<16, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
break;
case 32:
- ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ launch_gdn_variant<32, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
break;
- case 64: {
- ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ case 64:
+ launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
break;
- }
case 128: {
- ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP columns per warp
+ // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is
+ // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every
+ // selectable {num_warps, cols} is bit-identical, so the sweep cannot change the md5).
+ static const int gdn_nw = []{ const char * e = getenv("GDN_NW"); return e ? atoi(e) : GDN_DEFAULT_NW; }();
+ static const int gdn_cpw = []{ const char * e = getenv("GDN_CPW"); return e ? atoi(e) : GDN_DEFAULT_CPW; }();
+ // NUM_WARPS in {4,8,16} x COLS_PER_WARP ladder (all <=512 threads/block, no 1024-thread
+ // .minnctapersm warnings). Measured GB10 %peak: (4,1)=73 baseline ... (16,4)=82 ...
+ // (16,8)=84.7 winner ~ tied with (8,8)/(8,16)/(32,4); the plateau is just above vLLM (82.4).
+ if (gdn_nw == 4 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 4 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 4, 2, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 4 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 4, 4, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 8, 1, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 8, 2, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 8, 4, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 8, 8, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 16, 1, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 16, 2, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 16, 4, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 16, 8, 2>(GDN_LAUNCH_ARGS);
+ else launch_gdn_variant<128, KDA, keep_rs_t, GDN_DEFAULT_NW, GDN_DEFAULT_CPW, 2>(GDN_LAUNCH_ARGS);
break;
}
default:
GGML_ABORT("fatal error");
break;
}
+
+#undef GDN_LAUNCH_ARGS
}
void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
--
2.43.0

@@ -1,144 +0,0 @@
From f7409c2de2868a6a048d3c333329468b4cc9e483 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 23:47:25 +0200
Subject: [PATCH] feat(paged): qwen35moe NVFP4 activation-quantize de-dup
(patch 0023)
Bit-exact decode/prefill lever for the MoE (qwen3.5moe) NVFP4 path. ggml's
mul_mat_id quantizes the EXPERT-GATHERED activation rows (ne11_flat =
ne12*n_expert_used). For the broadcast up/gate projections (ne11 == 1) every
expert of a token receives the SAME token activation, so the stock path
re-quantizes each token n_expert_used times. quantize_mmq_nvfp4 produces each
block as a pure per-thread function of its 16 consecutive inputs (no cross-thread
reduction), so the gathered blocks are byte-identical across the experts.
Lever: when ne11 == 1, quantize the ne12 UNIQUE token activations once, then
gather the resulting block_fp4_mmq rows into the expert-gathered layout keyed by
ids_src1 with a coalesced uint4 copy (block_fp4_mmq == 9 uint4 == 144 B). Pure
byte copy of identical blocks, so the gathered buffer is byte-for-byte identical
to re-quantizing each gathered row; the GEMM is untouched. down_proj
(ne11 == n_expert_used, distinct per expert) keeps the stock path.
Measured GB10 (sm_121a), on top of HEAD 8a3229f (patch 0022), q36-35b-a3b-nvfp4:
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), new
gather_mmq_fp4 +32 ms; net -379 ms of decode GPU-time.
- S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%; prefill T_PP -4%.
- Dense q36-27b-nvfp4 byte-flat (no mul_mat_id): 373.24 t/s unchanged.
Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022):
q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 (unchanged)
q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (de-dup on == off)
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805.
On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores the stock re-quantize path.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/mmq.cu | 21 +++++++++++++++++--
ggml/src/ggml-cuda/quantize.cu | 37 +++++++++++++++++++++++++++++++++
ggml/src/ggml-cuda/quantize.cuh | 4 ++++
3 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
index e1add5e..9933fa6 100644
--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
@@ -1,3 +1,4 @@
+#include <cstdlib>
#include "common.cuh"
#include "mmq.cuh"
#include "quantize.cuh"
@@ -197,8 +198,24 @@ void ggml_cuda_mul_mat_q(
const int64_t s13 = src1->nb[3] / ts_src1;
if (use_native_fp4) {
- quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
- ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+ // 0023: de-dup the broadcast (up/gate) quantize. ne11==1 means src1 is shared
+ // across experts, so quantize the ne12 unique tokens once and gather the blocks.
+ static const bool moe_quant_dedup = []{
+ const char * e = getenv("GGML_CUDA_MOE_QUANT_DEDUP");
+ return e ? atoi(e) != 0 : true; // 0023: on by default; GGML_CUDA_MOE_QUANT_DEDUP=0 disables
+ }();
+ if (moe_quant_dedup && ne11 == 1) {
+ const size_t nbytes_unique = ne12*ne10_padded * sizeof(block_q8_1)/QK8_1 +
+ get_mmq_x_max_host(cc)*sizeof(block_q8_1_mmq);
+ ggml_cuda_pool_alloc<char> src1_unique(ctx.pool(), nbytes_unique);
+ quantize_mmq_fp4_cuda(src1_d, nullptr, src1_unique.get(), src0->type, ne10, s12, 0, 0,
+ ne10_padded, ne12, 1, 1, stream);
+ gather_mmq_fp4_cuda(src1_unique.get(), ids_src1.get(), src1_q8_1.get(),
+ ne11_flat, ne12, ne10_padded, stream);
+ } else {
+ quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
+ ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+ }
} else {
quantize_mmq_q8_1_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
diff --git a/ggml/src/ggml-cuda/quantize.cu b/ggml/src/ggml-cuda/quantize.cu
index 39a500a..a7fd86f 100644
--- a/ggml/src/ggml-cuda/quantize.cu
+++ b/ggml/src/ggml-cuda/quantize.cu
@@ -419,6 +419,43 @@ void quantize_mmq_q8_1_cuda(
}
}
+// MoE NVFP4 quantize de-dup (0023): for the broadcast (up/gate) expert matmuls every
+// gathered row references one of ne12 unique token activations, so the stock path
+// re-quantizes each token n_expert_used times. Quantize the unique tokens once, then copy
+// the resulting block_fp4_mmq rows into the expert-gathered layout keyed by ids. This is a
+// pure byte copy of identical blocks => the gathered buffer is byte-identical to stock.
+static __global__ void gather_mmq_fp4(
+ const uint4 * __restrict__ unique, const int32_t * __restrict__ ids,
+ uint4 * __restrict__ gathered, const int ne11_flat, const int ne12_unique,
+ const int64_t total_words) {
+ constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4)); // 9 uint4 per 144B block
+ const int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
+ if (t >= total_words) {
+ return;
+ }
+ const int w = (int) (t % W);
+ const int64_t ib = t / W; // destination block index = kb*ne11_flat + j
+ const int j = (int) (ib % ne11_flat);
+ const int kb = (int) (ib / ne11_flat);
+ const int src = ids[j];
+ const int64_t ib_u = (int64_t) kb * ne12_unique + src;
+ gathered[t] = unique[ib_u * W + w];
+}
+
+void gather_mmq_fp4_cuda(
+ const void * unique, const int32_t * ids, void * gathered,
+ int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded, cudaStream_t stream) {
+ const int blocks_per_col = (int) ((ne0_padded + QK_K - 1) / QK_K);
+ constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4));
+ const int64_t total_words = ne11_flat * (int64_t) blocks_per_col * W;
+ const int bs = 256;
+ const dim3 block_size(bs, 1, 1);
+ const dim3 num_blocks((unsigned) ((total_words + bs - 1) / bs), 1, 1);
+ gather_mmq_fp4<<<num_blocks, block_size, 0, stream>>>(
+ (const uint4 *) unique, ids, (uint4 *) gathered,
+ (int) ne11_flat, (int) ne12_unique, total_words);
+}
+
void quantize_mmq_fp4_cuda(
const float * x, const int32_t * ids, void * vy, const ggml_type type_src0,
const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
diff --git a/ggml/src/ggml-cuda/quantize.cuh b/ggml/src/ggml-cuda/quantize.cuh
index 768a3ae..7f64069 100644
--- a/ggml/src/ggml-cuda/quantize.cuh
+++ b/ggml/src/ggml-cuda/quantize.cuh
@@ -26,6 +26,10 @@ void quantize_mmq_q8_1_cuda(
ggml_type type_src0, int64_t ne00, int64_t s01, int64_t s02, int64_t s03,
int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3, cudaStream_t stream);
+void gather_mmq_fp4_cuda(const void * unique, const int32_t * ids, void * gathered,
+ int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded,
+ cudaStream_t stream);
+
void quantize_mmq_fp4_cuda(const float * x,
const int32_t * ids,
void * vy,
--
2.43.0

@@ -1,347 +0,0 @@
# A.2 - CUDA-graphing the paged decode: measured lever + gap diagnosis
Phase 1 (measure, do not punt). DGX GB10 (sm_121), dev tree `~/llama-paged-dev`
HEAD 089f78d (patch 0017), `build-cuda`. Model `q36-27b-nvfp4.gguf` (dense),
harness `llama-batched-bench`, fusion held OFF (`LLAMA_FUSE_NVFP4_QUANT=0`) for a
clean stock-kernel baseline. `decode_agg` = the `S_TG t/s` column.
## TL;DR verdict
CUDA-graphing the paged decode is **NOT a real throughput lever** (ceiling well
under 1%). The steady decode step is **GPU-compute-bound: 99.4-99.5% GPU-busy**.
Total GPU idle is ~0.5-0.6% of the step, split into within-step launch gaps
(0.37%, the only thing CUDA graphs remove) and a between-step host-loop gap
(0.24%, one ~2 ms gap per step). Graphs already engage on the default paged
decode and do collapse the launch gaps (0.37% -> 0.11%), but the GPU stays
99.4-99.5% busy either way, so decode_agg is unchanged. The 2.6x gap to vLLM
(148 vs 391) lives in the per-step GPU **kernel work** (FP4 GEMM + attention at
batch 128), not in launch overhead or the host loop.
The premise that "the paged decode runs eager (graphs reused=0)" did not survive
measurement: at the benchmarked context the default paged decode captures and
replays graphs exactly like stock non-paged. Two measurement traps (below)
explain the earlier "reused=0 / gap-bound" reading.
## Method note: a graph-enable trap that was corrected
`GGML_CUDA_DISABLE_GRAPHS` is read with `getenv(...) != nullptr`
(`ggml/src/ggml-cuda/common.cuh:1234`), so setting it to an **empty** string
still disables graphs. A first 4-cell pass that used
`GGML_CUDA_DISABLE_GRAPHS=""` for the "graphs ON" cells therefore ran graphs OFF
in all four cells (an OFF-vs-OFF comparison). The numbers below ("v2") unset the
variable with `env -u` for the ON cells. The `-lv 99` probe is unaffected (it
never set the variable).
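
The trap can be reproduced outside ggml. The following is a minimal sketch, assuming a POSIX environment; `DEMO_DISABLE_GRAPHS` and both helper names are hypothetical, and only the `getenv(...) != nullptr` pattern is taken from the note above.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Presence-based check: the pattern described in the note above.
// Any value, including the empty string, counts as "set".
bool flag_by_presence(const char * name) {
    assert(name != nullptr);
    return std::getenv(name) != nullptr;
}

// Value-based check (hypothetical alternative, not what ggml does):
// unset, empty, and "0" all count as "not set".
bool flag_by_value(const char * name) {
    assert(name != nullptr);
    const char * e = std::getenv(name);
    return e != nullptr && e[0] != '\0' && std::strcmp(e, "0") != 0;
}
```

With the variable exported as an empty string, `flag_by_presence` reports it as set while `flag_by_value` does not, which is why the ON cells have to unset the variable rather than blank it.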
## Step 1 - the 4-cell decode_agg table (corrected, graphs genuinely enabled)
npp 128, ntg 128, npl 32 and 128, c 40960, b/ub 2048, fa on. `S_TG t/s`:
| cell | npl 32 | npl 128 |
|------------------|---------|---------|
| stock_graphon | 116.47 | 148.41 |
| stock_graphoff | 115.17 | 148.21 |
| paged_graphon | 116.21 | 148.60 |
| paged_graphoff | 114.62 | 147.65 |
ON vs OFF (the graph win):
| config | npl 32 | npl 128 |
|--------|--------|---------|
| stock | +1.13% | +0.13% |
| paged | +1.39% | +0.64% |
- (a) Does STOCK get a graph win? Essentially no: +0.13% at npl 128, +1.13% at
npl 32 (small-batch, where per-kernel launch overhead is relatively larger).
All within run-to-run noise (~1% at npl 32, ~0.2% at npl 128).
- (b) Does PAGED get a graph win? Same picture: +0.64% / +1.39%. Paged is NOT
eager at this config (see Step 2); it captures graphs like stock.
- (c) LEVER SIZE (proxy = stock graph win, now genuinely measured): +0.13% at
npl 128, +1.1% at npl 32. Negligible vs the 2.6x (=+164%) gap to vLLM.
All four cells sit at ~148 (npl 128) / ~115 (npl 32) within ~1%. The ~148 wall is
shared by stock and paged; it is not paged-specific. Calibration cross-check
(paged ON, ntg 64): 147.64, matching the reference 148-149.
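
The ON-vs-OFF percentages follow directly from the 4-cell table. A small sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <cassert>

// Relative throughput gain of the graphs-ON cell over the graphs-OFF cell, in percent.
double graph_win_pct(double tps_on, double tps_off) {
    assert(tps_off > 0.0);
    return (tps_on / tps_off - 1.0) * 100.0;
}
```

For example, `graph_win_pct(148.41, 148.21)` is about 0.13 (stock, npl 128) and `graph_win_pct(116.21, 114.62)` is about 1.39 (paged, npl 32).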
## Step 2 - why the "eager" premise is wrong, and what actually mutates
CUDA-graph state machine (`ggml_backend_cuda_graph_compute` in
`ggml/src/ggml-cuda/ggml-cuda.cu`): warmup completes after a step whose node
properties did not change vs the previous step; any later change logs
`CUDA graph warmup reset` and reverts to eager until stable again.
`ggml_cuda_graph_update_required` memcmps every node's full `ggml_tensor` plus
each src's `data` ptr / `ne` / `nb`.
`-lv 99` probe, short context (npp 64, ntg 32, ctx <= 96):
- stock: `warmup complete` x2, `warmup reset` x0.
- paged: `warmup complete` x2, `warmup reset` x0.
Both capture and then replay silently. The `CUDA Graph id N reused` line stays 0
for both because llama rebuilds the cgraph each ubatch (new `cgraph->uid`), so
the uid fast-path never fires; the graph is still replayed via the
`instance != nullptr` path, which logs nothing. **"reused=0" is a false negative,
not evidence of eager execution.** (Trap #1.)
Cadence probe (npp 200, ntg 320, npl 4, ctx 200->520, crosses the 256 and 512
token boundaries), counts over ~320 decode steps:
| path | complete | reset | interpretation |
|-------------------------------|----------|-------|-------------------------------|
| paged in-kernel (default) | 10 | 8 | resets only at 256-boundaries |
| paged gather (KV_PAGED_GATHER)| 0 | 0 | never captures -> pure eager |
| stock non-paged | 10 | 8 | identical 256-cadence |
The 8 resets cluster at the two boundary crossings (timestamps ~9.9 s and ~34 s),
not per-step. The default paged decode is therefore captured for ~97% of steps,
re-warming only every ~256 tokens, with the **same cadence as stock**.
What mutates (the block-table / gather input):
- in-kernel decode (default): the block-table graph input
`idx = ggml_new_tensor_2d(ctx0, I32, n_view, n_stream)` with
`n_view = GGML_PAD(n_gather, 256)` (`src/paged-attn.cpp:199,213`). Its `ne[0]`
steps 256 -> 512 -> 768 only when the context crosses a 256-token boundary. The
kq_mask input (ne0 = n_kv, also padded to 256) steps in lockstep. So the
property change is per-256-tokens, not per-step.
- gather fallback (`LLAMA_KV_PAGED_GATHER=1`, transposed-V, or prefill): the
index input `idx = ggml_new_tensor_2d(ctx0, I32, n_gather, n_stream)`
(`src/paged-attn.cpp:106`) has `ne[0] = n_gather` (UNPADDED), which grows every
step (the unit's own comment, `src/paged-attn.cpp:28-30`: "n_gather grows every
step"). That changes a node property every step, warmup never completes, and
the path runs pure eager. This is the only "graphs reused=0" path, and it is
not the default decode path.
`LLAMA_KV_PAGED_DEBUG` dump at ctx 201 (first 2 decode calls, identical across
the pair): `in-kernel decode n_stream=4 n_kv=256 n_gather=201` -> block-table
`ne[0] = GGML_PAD(201,256) = 256`, stable until n_gather crosses 256.
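
The difference between the padded and unpadded index inputs can be illustrated with a toy model. This sketch only counts how often `ne[0]` would change between consecutive decode steps; `pad_up` and `count_shape_changes` are hypothetical helpers, and the model does not try to reproduce how many warmup resets the backend logs per change.

```cpp
#include <cassert>
#include <cstdint>

// Round x up to a multiple of n (n must be a power of two).
// Stand-in for the padding described above, not the ggml macro itself.
int64_t pad_up(int64_t x, int64_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);
    return (x + n - 1) & ~(n - 1);
}

// Count decode steps at which the index input's ne[0] differs from the previous step.
// ctx_first / ctx_last are the first and last n_gather values;
// pad == 1 models the unpadded gather fallback.
int count_shape_changes(int64_t ctx_first, int64_t ctx_last, int64_t pad) {
    int changes = 0;
    int64_t prev = pad_up(ctx_first, pad);
    for (int64_t n = ctx_first + 1; n <= ctx_last; ++n) {
        const int64_t cur = pad_up(n, pad);
        if (cur != prev) {
            ++changes;
        }
        prev = cur;
    }
    return changes;
}
```

Over the cadence probe's range, `count_shape_changes(200, 520, 256)` returns 2 (the 256 and 512 crossings), while `count_shape_changes(200, 520, 1)` returns 320, one change per step.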
## Step 3 - where the step time goes (nsys), and a second trap
npl 128, ntg 24, ctx 56 (< 256, so the ON run stays captured after warmup).
Idle split by gap size: within-step launch gaps < 1 ms, between-step host
gaps >= 1 ms. Steady window = 40%-97% of the trace span (excludes model load /
graph reserve / prefill one-offs).
Trap #2: `nsys --trace=cuda` does NOT emit the kernels INSIDE a replayed CUDA
graph into `cuda_gpu_trace` by default. The graphs-ON trace had only 15,279 GPU
rows vs 84,946 for the identical OFF workload and reported a bogus 0.3% GPU-busy.
Re-profiling the ON case with `--cuda-graph-trace=node` restored all 84,946 rows
and 99.5% busy. **Any "decode is idle/gap-bound" reading taken from a graphs-ON
nsys trace without `--cuda-graph-trace=node` is artifactually idle-inflated** -
the likely source of the earlier "freed GPU time became idle gaps" conclusion.
Reliable steady-state numbers:
| trace | GPU rows | busy | within-step idle | between-step idle | host gap/step |
|--------------------------------|----------|--------|------------------|-------------------|---------------|
| OFF (eager) | 84,946 | 99.4% | 0.37% | 0.24% | ~2.0 ms |
| ON (captured, node-trace) | 84,946 | 99.5% | 0.11% | 0.38% | ~1.9 ms |
- CUDA graphs replay (cudaGraphLaunch=46) and collapse the launch path: ON has
~15k kernel launches/run vs OFF ~80k (`cudaLaunchKernel` 6,024 vs 31,764, plus
`cudaLaunchKernelExC` 9,049 vs 48,165). That cuts within-step launch idle from 0.37% to 0.11%.
- But the GPU is 99.4-99.5% busy in both, so decode_agg is unchanged.
- Between-step host idle is one ~2 ms gap per decode step (the 128-way sample +
update_slots + batch build), 0.24-0.38% of the ~896 ms step.
Per-step decomposition at npl 128: ~896 ms/step, of which ~890 ms is GPU kernel
compute, ~2 ms host-loop gap, ~3 ms (eager) / ~1 ms (captured) launch gaps.
## The load-bearing question, answered
Within-step or between-step? **Neither is large.** The steady decode is 99.4%
GPU-busy; the entire idle budget is ~0.6% of the step. CUDA graphs already remove
most of the within-step launch fraction (0.37% -> 0.11%), and the between-step host
gap is ~2 ms/step (0.24%). A host-loop rewrite therefore has no large gap to
reclaim either: the host loop is currently **hidden under GPU compute** (the GPU
stays busy while the host syncs/schedules). It would become a lever only once the
kernels are fast enough to drop GPU-busy time below the host time; it is a
second-order floor, not the present bottleneck.
## Verdict
1. CUDA-graphing the paged decode is not the lever. Graphs already engage on the
default decode; capturing reduces within-step launch idle from 0.37% to 0.11%
but leaves the GPU 99.4-99.5% busy, so decode_agg moves by ~0% (measured
+0.1% to +0.6% at npl 128, +1.1% to +1.4% at npl 32, all within noise).
2. The between-step host loop is not the present lever either (0.24%, ~2 ms/step,
hidden under GPU compute). It is the candidate floor only after the kernels
speed up.
3. The decode is GPU-compute-bound at this NVFP4 fusion-OFF baseline. The 2.6x
gap to vLLM is in the per-step GPU kernel work (FP4 GEMM + attention at batch
128). That, not graphs and not the host loop, is the throughput lever.
4. Corrected premises: paged is not perpetually eager (it captures with a
256-token reset cadence identical to stock); "graphs reused=0" was a uid
fast-path false negative; and a graphs-ON nsys trace under-counts GPU-busy
unless `--cuda-graph-trace=node` is set.
No code patch in Phase 1 (graphs are not the lever, so there is no paged
graph-capture patch to land). Evidence: `~/bench/a2_4cell_v2/`, `~/bench/a2_probe`,
`~/bench/a2_probe2`, `~/bench/a2_nsys/*.nsys-rep` on the DGX.
# Phase 2 - the real decode lever, located (per-kernel decomposition)
Phase 1 ended on "decode is GPU-compute-bound; the 2.6x gap to vLLM lives in the
per-step GPU kernel work (FP4 GEMM + attention at batch 128)." Phase 2 measured
that per-step GPU work directly - per kernel and per memcpy, on the Phase-1 nsys
`.sqlite` reps - and the "FP4 GEMM + attention" attribution does not survive the
measurement. Two corrections, then the lever.
The conditional Phase 2 fix (make the paged decode graph-capturable) is moot:
Phase 1 already showed the default paged decode captures, and the fresh re-check
below reconfirms the graph win is noise. Neither Phase 2 branch (within-step graph
fix / between-step host loop) is the lever; the lever is a third thing, measured
here.
## Fresh re-confirmation: graphs are not the lever
Independent run (npl128, ntg32, paged, fusion off), not reusing Phase 1's table:
| paged decode | S_TG t/s | vs vLLM 391 |
|---------------|----------|-------------|
| graphs ON | 146.03 | 37.3% |
| graphs OFF | 144.90 | 37.1% |
+0.78%, within noise - same verdict as Phase 1's 4-cell. The ON nsys rep is also
99.5% busy with the same ~3267 ms of memcpy as OFF: graphs capture the memcpy
nodes too, so they cannot remove either the copies or the compute.
## Correction 1: the model is a hybrid SSM, not a plain transformer
`q36-27b-nvfp4.gguf` has `general.architecture = qwen35` with
`qwen35.ssm.{conv_kernel,state_size,group_count,time_step_rank,inner_size}`. The
decode-window kernel cadence (per step, ~19.8 steps in the window) is 48
`gated_delta_net_cuda` + 48 `ssm_conv_f32` vs 16 `flash_attn_tile`, i.e. **48
gated-DeltaNet linear-attention layers : 16 full-attention layers** (a 3:1
hybrid, Qwen3-Next family). Paged attention only touches the 16 full-attention
layers.
## Correction 2: the 99.4% "busy" is ~19% D2D memcpy, not compute
Interval-union sweep over the steady decode window (last 17 s of the npl128/ntg24
OFF rep; single CUDA stream; running-max-end so it is overlap-correct):
| activity set | GPU busy | idle |
|------------------------|----------|-------|
| kernels only | 80.2% | 19.8% |
| kernels + memcpy (all) | 99.4% | 0.6% |
The 969 inter-kernel gaps (>=1 ms, ~48/step) that drop kernels-only to 80% are
filled by **D2D memcpy: 1584 copies/run (~80/step), ~230 MB each, ~2 ms each,
356 GB moved in 17 s**. At batch 128 a ~230 MB block is the gated-DeltaNet
recurrent state; these are the per-SSM-layer state copies. (HtoD copies = the
paged block-table/index upload: 731/run but only 3 ms total, negligible; DtoH
47 ms.) Phase 1's `cuda_gpu_trace`-based 99.4% counted these memcpys as "busy"
and lumped them into "GPU kernel compute" - they are memory movement, and they
are addressable.
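
The interval-union sweep can be sketched as follows. This is a minimal host-side illustration, assuming the activities have already been exported as start/end timestamps; the `Interval` type and function names are hypothetical, and the reproducing SQL mentioned later in this note is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Interval {
    int64_t start_ns;
    int64_t end_ns;
};

// Busy time of the union of intervals inside [win_start, win_end).
// Sorting by start and tracking the running maximum end makes overlaps count once.
int64_t union_busy_ns(std::vector<Interval> iv, int64_t win_start, int64_t win_end) {
    assert(win_end > win_start);
    std::sort(iv.begin(), iv.end(),
              [](const Interval & a, const Interval & b) { return a.start_ns < b.start_ns; });
    int64_t busy = 0;
    int64_t run_end = win_start;  // everything before run_end is already accounted for
    for (const Interval & x : iv) {
        const int64_t s = std::max(x.start_ns, run_end);
        const int64_t e = std::min(x.end_ns, win_end);
        if (e > s) {
            busy += e - s;
        }
        run_end = std::max(run_end, e);
    }
    return busy;
}

// Busy share of the window, in [0, 1].
double busy_fraction(const std::vector<Interval> & iv, int64_t win_start, int64_t win_end) {
    return double(union_busy_ns(iv, win_start, win_end)) / double(win_end - win_start);
}
```

Running it once over the kernel intervals alone and once over kernels plus memcpy intervals gives the two rows of the table; the difference between them is the time that only copies were keeping the GPU busy.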
## Decode GPU-time decomposition (% of kernel+memcpy busy)
OFF/eager rep, steady window. `/step` = instances per decode step.
| share | activity | /step | role |
|-------|-----------------------------------|-------|-------------------------------|
| 23.4% | gated_delta_net_cuda | 48 | linear-attn recurrence |
| 21.9% | k_get_rows_float | 97 | SSM state / conv-state gather |
| 18.9% | MEMCPY DtoD | 80 | SSM recurrent-state copy |
| 15.5% | mul_mat_vec_q (nvfp4, ncols=1) | 48 | FP4 GEMV |
| 10.4% | mul_mat_q (nvfp4) | 352 | FP4 GEMM |
| 1.9% | quantize_mmq_nvfp4 | 448 | act requant for MMQ |
| 1.0% | concat_cont | 48 | SSM state glue |
| 0.8% | ssm_conv_f32 | 48 | SSM short conv |
| 0.7% | unary_gated_op silu | 112 | SSM gating |
| 0.4% | flash_attn_tile/_ext | 16 | FULL attention (paged) |
Grouped:
- gated-DeltaNet / SSM machinery (recurrence + get_rows gather + DtoD state copy
+ conv + gating glue): **~67% of decode**.
- FP4 matmul (GEMV + GEMM + requant + stream-k fixup): **~28%**.
- Full attention - everything paged attention optimizes: **~0.4%**.
## Verdict and scope of the real lever
1. CUDA graphs: not the lever (Phase 1, re-confirmed: +0.78%, noise). They capture
the memcpy too, so they cannot touch the copies or the compute.
2. Host loop: not the lever (true host idle in the union is 0.24%, ~41 ms/17 s).
3. FP4 GEMM: secondary, ~28%. Consistent with Track B P2a (making the FP4 GEMM 26%
faster left decode_agg flat) - it was never the long pole.
4. Paged / full attention: ~0.4% of decode. **No paged-attention change (graphs,
block-table stabilization, gather rewrite) can move decode_agg on this model**
- it optimizes under half a percent of the step. This is the structural reason
A.2, and the paged-decode track generally, cannot close the vLLM gap on
q36-27b: the model barely uses the path being optimized.
The throughput lever is the ggml **qwen35 gated-DeltaNet decode**. Per SSM layer
per step it re-materializes and D2D-copies the full recurrent state (~230 MB at
batch 128; ~80 copies/step, ~18 GB/step) and feeds the recurrence through ~2
`get_rows` gathers, so ~64% of decode (state copy + state gather + recurrence) is
SSM state plumbing. vLLM's gated-DeltaNet decode (the flash-linear-attention
`fused_recurrent_gated_delta_rule` path) keeps the state in place and fuses the
gather into the scan, avoiding both the per-layer D2D copy and the gathers.
Next-step scope (the real lever, to be done in the ggml/llama qwen35 SSM path -
not paged-attn, not a graph capture, not a block-table tweak):
1. Eliminate the per-layer recurrent-state D2D copy: update the state tensor
in place (or double-buffer / write-back), so the recurrence consumes and
produces the persistent state without a full-state copy each layer each step.
2. Fuse the `get_rows` state / conv-state gather into the recurrent kernel.
Ceiling from this rep (upper bound; assumes the work is fully removed, not just
overlapped):
- remove the DtoD state copy: reclaim 18.9% -> ~146 to ~180 t/s.
- remove copy + gather: reclaim ~41% -> ~146 to ~247 t/s, which puts llama within
~1.6x of vLLM 391 with the FP4 GEMM still untouched.
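
The ceiling figures are a simple rescaling of the measured throughput. A sketch of that arithmetic, under the same assumption stated above (the removed share disappears from the step entirely); the helper name is hypothetical:

```cpp
#include <cassert>

// Upper-bound throughput if a share of the per-step busy time is removed outright
// (not merely overlapped): the step shrinks to (1 - share) of its length.
double ceiling_tps(double tps_now, double removed_share) {
    assert(removed_share >= 0.0 && removed_share < 1.0);
    return tps_now / (1.0 - removed_share);
}
```

`ceiling_tps(146.0, 0.189)` is about 180 t/s (copy removed), and `ceiling_tps(146.0, 0.189 + 0.219)` is about 247 t/s (copy and gather removed).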
No code patch in Phase 2 either: the lever is a gated-DeltaNet decode rewrite in
the SSM path, too large for this measurement pass and orthogonal to paged
attention. `patches/paged/0018` stays free. Evidence on the DGX:
`~/bench/a2_decompose/decode_decomp.txt` (per-kernel table + reproducing SQL in
its header), `~/bench/a2_decompose/SUMMARY.txt`, and the Phase-1 reps
`~/bench/a2_nsys/paged_off_npl128.sqlite` / `paged_on_npl128_node.sqlite`.
# A.2 final synthesis - the four-point verdict
All numbers measured on the DGX (GB10, sm_121, q36-27b-nvfp4 dense, fusion OFF,
`decode_agg` = `S_TG t/s`), npl 128 unless noted.
**1. CUDA-graph lever size (measured, not guessed).** +0.13% (4-cell, stock
ON-vs-OFF) to +0.78% (fresh paged re-check) at npl 128; +1.1% to +1.4% at npl 32.
All inside run-to-run noise. The earlier grounding GUESSED ~10-20% from a
94.6%-busy reading; direct measurement puts the steady decode at 99.4-99.5% busy,
so the real graph ceiling is < 1%, not 10-20%. The guess was wrong because the
busy-fraction it rested on was under-read (a graphs-ON nsys trace under-counts
GPU-busy unless `--cuda-graph-trace=node` is set - trap #2).
**2. Was "paged decode runs eager" fixed, and what is the decode_agg win?**
There was nothing to fix: the premise was false. At the benchmarked context the
DEFAULT in-kernel paged decode already captures and replays graphs, with a
256-token reset cadence identical to stock non-paged (10 complete / 8 reset over
~320 steps, resets clustered only at the 256/512 token boundaries). "graphs
reused=0" was a uid fast-path false negative, not eager execution (trap #1). The
only genuinely-eager path is the `LLAMA_KV_PAGED_GATHER=1` fallback (unpadded
index grows every step), which is not the default decode. Because graphs were
already engaged, the decode_agg win from "enabling" them is ~0 (+0.1% to +0.8%).
Graphs DID collapse within-step launch idle (0.37% -> 0.11%, ~80k -> ~15k
launches/run), but the GPU stays 99.4-99.5% busy, so throughput is unchanged.
**3. New llama %-of-vLLM @npl128.** Unchanged by A.2: 146-148.6 t/s vs vLLM 391 =
**37.3-38.0%**. Graphs ON vs OFF both land here (146.03 / 144.90 in the fresh
re-check; 148.41 / 148.21 in the 4-cell). A.2 did not move the percentage.
**4. Honest verdict - did A.2 move toward parity; residual + next lever.** No.
A.2 closed zero of the 2.6x gap, and it provably cannot on this model: paged /
full attention is ~0.4% of decode (16 full-attention layers vs 48 gated-DeltaNet
layers, a 3:1 hybrid SSM), so no graph / block-table / gather change to the paged
path can move decode_agg. The residual gap is structural and lives elsewhere:
~67% of decode is gated-DeltaNet / SSM state plumbing (23.4% recurrence + 21.9%
get_rows state gather + 18.9% D2D recurrent-state copy of ~230 MB per SSM layer
per step, ~18 GB/step), and ~28% is FP4 matmul (already shown secondary by Track
B: a 26%-faster GEMM left decode_agg flat). The within-step launch loop is solved
(graphs) and the between-step host loop is a 0.24% second-order floor hidden under
GPU compute - neither is the residual.
The next lever is NOT in this track. It is the ggml qwen35 gated-DeltaNet decode:
(1) eliminate the per-layer recurrent-state D2D copy (in-place / double-buffer
write-back), and (2) fuse the get_rows gather into the recurrent kernel - mirroring
vLLM's `fused_recurrent_gated_delta_rule`, which keeps the state in place and
fuses the gather. Measured ceiling on this rep: remove the copy -> ~146 to ~180
t/s; remove copy + gather -> ~146 to ~247 t/s (within ~1.6x of vLLM with FP4 GEMM
still untouched). That work is orthogonal to paged attention; `patches/paged/0018`
stays free.

@@ -1,107 +0,0 @@
# Additive layout for the paged-KV patch series - "hook, don't edit"
Goal: ship paged KV as a vendored patch series that **survives llama.cpp pin bumps with
minimal rebase pain**. PR #22569 (the upstream draft) was rejected by maintainers as
"slop" and is far too invasive to vendor - it rewrites core attention. Our series must be
the opposite: **additive**. This document is the design rule and the per-patch core-touch
budget.
## The rule
> Every change is either (a) **new code in a new vendored file** under `src/`, or (b) a
> **single, env-gated hook** at one call site in a core file that delegates to the new
> file. No logic lives in a core file. No core struct/signature is edited.
Why it works: a hook is a 1-3 line diff against a core file. When upstream churns that file,
`git apply` either still lands the hook (context unchanged) or fails *only on that tiny
hunk*, which is trivial to re-place. Logic embedded inside a core function (the PR #22569 /
old-0003 approach) conflicts on every bump and must be re-understood each time.
This is enforceable as a **core-touch budget**: each patch declares the core files it
touches and the line count; review rejects anything that grows logic in core.
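
One way such a budget could be checked mechanically is to count changed lines per pre-existing file in each exported patch and compare against the declared numbers. The sketch below is an assumption about how a review helper might look, not an existing tool in the series; `core_touch`, `over_budget`, and the sample diff are all made up for illustration, and it only understands `git`-style unified diffs.

```python
def core_touch(diff_text: str) -> dict[str, int]:
    """Count added/removed lines per pre-existing file in a git-style unified diff.

    Files created by the patch (old side is /dev/null) are additive and skipped.
    """
    counts: dict[str, int] = {}
    current = None
    old_is_null = False
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            current, in_hunk = None, False
        elif not in_hunk and line.startswith("--- "):
            old_is_null = line[4:].strip() == "/dev/null"
        elif not in_hunk and line.startswith("+++ "):
            path = line[4:].strip()
            path = path[2:] if path.startswith("b/") else path
            current = None if old_is_null else path
            if current is not None:
                counts.setdefault(current, 0)
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and current is not None and line.startswith(("+", "-")):
            counts[current] += 1
    return counts


def over_budget(counts: dict[str, int], budget: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {file: (touched, allowed)} for files above budget; undeclared files get 0."""
    return {f: (n, budget.get(f, 0)) for f, n in counts.items() if n > budget.get(f, 0)}


# Made-up sample: one +1 hook in a core file, all logic in a new file.
SAMPLE = """\
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,2 +10,3 @@
 llama-graph.cpp
+paged-attn.cpp
 llama-impl.cpp
diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
new file mode 100644
--- /dev/null
+++ b/src/paged-attn.cpp
@@ -0,0 +1,2 @@
+// all paged-attention logic lives here
+void noop() {}
"""

touched = core_touch(SAMPLE)
print(touched, over_budget(touched, {"src/CMakeLists.txt": 1}))
```

A line count only catches growth in size; whether a hook contains logic still needs a human reviewer.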
## Why it's achievable here (grounded in the pinned source)
The two seams paged KV needs are both already abstract in llama.cpp at the pin
(`LLAMA_VERSION=f3e1828`), so new behavior plugs in without editing core types:
- **KV placement** - `llama_kv_cache::find_slot` already returns a `slot_info` of physical
cell indices. Paged placement is just *different indices*. 0002 already does this as one
gated block (`if (paged_mode) { ... continue; }`, 41 lines, one file). Ideal.
- **Graph inputs** - `llm_graph_input_i` is a pure-virtual base (`set_input()`), and
`llm_graph_result::add_input(llm_graph_input_ptr)` lets *any* code register a new input
subclass. So a paged graph input (the gather index) can be **a new class in a new file**,
added from a one-line hook - no edit to `llm_graph_input_attn_kv` or `llama-graph.h`.
## Per-patch core-touch budget
| # | Patch | New files (additive) | Core hooks (gated, minimal) | Core lines |
|---|-------|----------------------|------------------------------|-----------:|
| 0001 | vendor manager | `paged-kv-manager.{h,cpp}` | `CMakeLists.txt` +1 | 1 |
| 0002 | block placement | - | one `if(paged_mode){...continue;}` in `find_slot` | ~41 |
| 0003 | gather-read | `paged-attn.{h,cpp}` | `CMakeLists.txt` +1; **one** hook in `build_attn`; 2 tiny accessors on `llama_kv_cache_context` | ~8 |
| 0004 | on-demand alloc | (uses 0001 manager) | one branch in `find_slot` calling the manager | ~10 |
| 0005 | continuous batching | - | **LocalAI `grpc-server.cpp`** (already a LocalAI override, not a core patch) | 0 core |
| 0006 | prefix caching | (uses 0001 manager) | one hash-lookup hook in the 0004 alloc branch | ~6 |
Net core surface for the *entire* engine: `find_slot` (placement/alloc - where physical
cells are already chosen) + **one** line in `build_attn` + two accessors. Everything else
is new files or the LocalAI-side server loop.
## 0003 redesigned to the rule (replaces the 4-file-surgery plan)
The old `0003-gather-read-plan.md` edited `llama-kv-cache.{h,cpp}` + `llama-graph.{h,cpp}`
(including a field added to `llm_graph_input_attn_kv` and fill logic in its `set_input`).
The additive form removes the core-struct and core-`set_input` edits entirely:
**New file `src/paged-attn.{h,cpp}`** holds *all* logic:
- `class llm_graph_input_paged_gather : public llm_graph_input_i` - owns the `I32 [n_gather]`
gather-index tensor and a `const llama_kv_cache_context * mctx`. Its `set_input()` fills
the index with the sequence's used cells (`{ i in [0,n_kv) : !cells.is_empty(i) }`, the
same set the `kq_mask` keeps), in the canonical order.
- `paged_attn::gather(ctx0, res, mctx, v_trans, &k, &v, &kq_mask)` - when paged is active,
constructs that input via `res->add_input(...)`, and applies `ggml_get_rows` to `k`, `v`,
and the transposed `kq_mask` by the shared index (mask: `transpose -> get_rows ->
transpose`). When not active it returns immediately -> **stock path byte-identical**.
**Core hooks (the whole core diff for 0003):**
1. `src/llama-graph.cpp`, in `build_attn` right before `build_attn_mha` (~line 2357):
```cpp
paged_attn::gather(ctx0, res, mctx_cur, v_trans, &k, &v, &kq_mask); // no-op unless LLAMA_KV_PAGED
```
One line. No new field on `llm_graph_input_attn_kv`; the gather input is a *separate*
registered input, so `llama-graph.h` is untouched.
2. `src/llama-kv-cache.{h,cpp}`: two thin accessors on `llama_kv_cache_context` so the new
file can read the used-cell set without reaching into internals -
`uint32_t get_n_gather() const;` and `void get_gather_idxs(int32_t * dst) const;`
(delegate to `kv`/`sinfos[i_cur]`, mirroring the existing `get_n_kv` / `set_input_k_idxs`
pattern). ~8 lines total, no signature changes to existing methods.
3. `src/CMakeLists.txt`: `+ paged-attn.cpp`.
First cut: gate to **flash-attn + single-stream** (`GGML_ASSERT` otherwise) - the V-transposed
(non-FA) and multi-stream gathers are a localized follow-up entirely inside `paged-attn.cpp`,
with no new core touch. Gate 0 stays the same: `diff` of greedy `llama-simple` output, stock vs
`LLAMA_KV_PAGED=1`, must be identical. This holds because attention is permutation-invariant over
the gathered KV set; observing `n_gather < n_kv` shows the gather really compacted the set rather
than passing it through as an identity.
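
The equivalence that Gate 0 relies on can be illustrated with a toy single-query attention in plain Python. This is a sketch under simplifying assumptions (one query, one head, no flash-attention, no transposed V); `attention` and `gather` are stand-ins written for this note and do not mirror the real `paged_attn::gather` signature.

```python
import math

NEG_INF = float("-inf")


def attention(q, keys, values, mask):
    """Single-query softmax attention; mask[i] is 0.0 (keep) or -inf (drop)."""
    scores = [sum(a * b for a, b in zip(q, k)) + m for k, m in zip(keys, mask)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def gather(keys, values, mask, used, paged):
    """Stand-in for the hook: compact K, V and the mask by one shared index."""
    if not paged:
        return keys, values, mask  # stock path: inputs returned untouched
    return ([keys[i] for i in used],
            [values[i] for i in used],
            [mask[i] for i in used])


n_kv = 6
used = [0, 2, 3, 5]  # non-empty cells, so n_gather = 4 < n_kv
keys = [[0.1 * (i + 1), -0.2 * i] for i in range(n_kv)]
values = [[float(i), float(i * i)] for i in range(n_kv)]
mask = [0.0 if i in used else NEG_INF for i in range(n_kv)]
q = [0.5, -1.0]

stock = attention(q, *gather(keys, values, mask, used, paged=False))
paged = attention(q, *gather(keys, values, mask, used, paged=True))
shuffled = attention(q, *gather(keys, values, mask, list(reversed(used)), paged=True))
print(stock, paged, shuffled)
```

Masked cells contribute a weight of exactly zero, so dropping them leaves the result unchanged, and reordering the kept cells only changes floating-point summation order.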
## Anti-drift practices (already in `README.md`, restated as policy)
- **Stacking patches, one concern each**, exported 1:1 from a dev branch via
`git format-patch`. On a pin bump, rebase the branch; only the conflicting small patch
needs a touch, and the failure names the exact step.
- **Default-off (`LLAMA_KV_PAGED`)** until each gate is green, so a partial series never
changes stock behavior - and the hooks compile to a no-op branch when the env is unset.
- **Dev tree:** `git worktree add <dev> <LLAMA_VERSION>` off any checkout that has the pin
(e.g. the existing llama.cpp clone), `git apply` the series, develop the next patch as one
commit, re-export. (Set up and verified for this pin during this work.)
## Status / next step
- 0001, 0002: done, additive, verified token-identical.
- 0003: **redesigned to the additive form above** (this doc). Dev tree at the pin with
0001+0002 applied is ready (`paged` branch). Remaining work is the focused
implement-and-verify block for `paged-attn.{h,cpp}` + the one `build_attn` hook, driven to
the token-identical Gate 0. That is a numerical-correctness task (mask/gather alignment,
FA-first), not a structural one - the structure is settled here.
- 0004-0006: follow the budget above; 0005 lands in LocalAI's `grpc-server.cpp` (no core
patch at all).


@@ -1,92 +0,0 @@
# FINAL apples-to-apples NVFP4 benchmark (GB10 / DGX Spark) - CLEAN env, containers stopped
# llama 0023 clean f7409c2 | LLAMA_KV_PAGED=1, LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget; beats stock 394s->142s TTFT@npl32), CUDA graphs ON, -c 131072 --parallel 128 -b 2048 -ub 512 -fa on
# vLLM 0.23.0 | CUDA graphs ON (no enforce-eager), util 0.85, max-model-len 4096, max-num-seqs 256, tp1
# client h2h_cli3.py: 512-tok UNIQUE-nonce prompt (fresh full prefill, defeats prefix caching), max_tokens=256, temp0, ignore_eos, stream+usage
# llama restarts server PER NPL (paged-pool degrades after high-npl bursts); vllm one server/combo + npl8 re-check. 1 measured pass/npl + ptok8 graph warmup. peak_gb engine = PEAK-PRE.
# started Fri Jun 26 04:43:38 AM CEST 2026 baseline=3.29 GB
[2026-06-26 04:43:38] [dense_llama] ==== START dense_llama (llama) baseline_mem=3.29 ====
[2026-06-26 04:43:38] [dense_llama] NPL=8 launching server PRE_GB=3.29
[2026-06-26 04:43:48] [dense_llama] NPL=8 ready LOADED_GB=47.06
[2026-06-26 04:43:55] [dense_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1. **Analyze User Input:** The user says "The capital of France is". This is a straightforward factual question with a clear answer.\n2. **Identify Key Entity:** France (country)\n3. **Identify Question Type:** Capit
[2026-06-26 04:44:30] [dense_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2040, "prompt_tok_total": 4195, "gen_per_req": 255.0, "agg_tps": 61.8, "decode_agg_tps": 82.5, "decode_perseq_tps": 9.57, "prefill_tps": 507.3, "ttft_mean_ms": 6038.1, "ttft_max_ms": 8270.0, "wall_s": 32.999}
[2026-06-26 04:44:30] [dense_llama] NPL=8 PEAK_GB=53.51
[2026-06-26 04:44:35] [dense_llama] NPL=8 server stopped mem=3.31
[2026-06-26 04:44:35] [dense_llama] NPL=32 launching server PRE_GB=3.31
[2026-06-26 04:44:40] [dense_llama] NPL=32 ready LOADED_GB=46.96
[2026-06-26 04:47:55] [dense_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8180, "prompt_tok_total": 16900, "gen_per_req": 255.6, "agg_tps": 43.2, "decode_agg_tps": 192.6, "decode_perseq_tps": 4.79, "prefill_tps": 115.0, "ttft_mean_ms": 133551.7, "ttft_max_ms": 147007.0, "wall_s": 189.49}
[2026-06-26 04:47:55] [dense_llama] NPL=32 PEAK_GB=69.63
[2026-06-26 04:48:01] [dense_llama] NPL=32 server stopped mem=3.32
[2026-06-26 04:48:01] [dense_llama] NPL=64 launching server PRE_GB=3.32
[2026-06-26 04:48:11] [dense_llama] NPL=64 ready LOADED_GB=46.97
[2026-06-26 04:55:10] [dense_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16382, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 39.8, "decode_agg_tps": 277.8, "decode_perseq_tps": 3.09, "prefill_tps": 95.9, "ttft_mean_ms": 321618.8, "ttft_max_ms": 352633.6, "wall_s": 411.603}
[2026-06-26 04:55:10] [dense_llama] NPL=64 PEAK_GB=83.96
[2026-06-26 04:55:16] [dense_llama] NPL=64 server stopped mem=3.30
[2026-06-26 04:55:16] [dense_llama] NPL=128 launching server PRE_GB=3.30
[2026-06-26 04:55:21] [dense_llama] NPL=128 ready LOADED_GB=47.09
[2026-06-26 05:13:18] [dense_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32767, "prompt_tok_total": 67969, "gen_per_req": 256.0, "agg_tps": 30.9, "decode_agg_tps": 384.6, "decode_perseq_tps": 1.86, "prefill_tps": 69.7, "ttft_mean_ms": 902762.7, "ttft_max_ms": 975832.6, "wall_s": 1061.031}
[2026-06-26 05:13:18] [dense_llama] NPL=128 PEAK_GB=93.82
[2026-06-26 05:13:25] [dense_llama] NPL=128 server stopped mem=3.31
[2026-06-26 05:13:25] [dense_llama] ==== DONE dense_llama POST_GB=3.31 ====
[2026-06-26 05:13:25] [dense_vllm] ==== START dense_vllm (vllm) baseline_mem=3.31 ====
[2026-06-26 05:13:25] [dense_vllm] launching vllm PRE_GB=3.31
[2026-06-26 05:21:15] [dense_vllm] vllm ready LOADED_GB=110.48
[2026-06-26 05:21:27] [dense_vllm] GATE='Here\'s a thinking process:\n\n1. **Analyze User Input:** The user says "The capital of France is"\n2. **Identify Key Entity/Question:** The question is asking for the capital city of France.\n3. **Retrieve Knowledge:** I know from general knowledge that t
[2026-06-26 05:21:59] [dense_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1959, "prompt_tok_total": 4195, "gen_per_req": 244.9, "agg_tps": 65.6, "decode_agg_tps": 70.4, "decode_perseq_tps": 8.76, "prefill_tps": 2096.2, "ttft_mean_ms": 1861.1, "ttft_max_ms": 2000.6, "wall_s": 29.843}
[2026-06-26 05:21:59] [dense_vllm] NPL=8 PEAK_GB=110.92
[2026-06-26 05:22:47] [dense_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8165, "prompt_tok_total": 16900, "gen_per_req": 255.2, "agg_tps": 176.3, "decode_agg_tps": 211.8, "decode_perseq_tps": 6.28, "prefill_tps": 2182.6, "ttft_mean_ms": 5353.2, "ttft_max_ms": 7741.4, "wall_s": 46.302}
[2026-06-26 05:22:47] [dense_vllm] NPL=32 PEAK_GB=110.87
[2026-06-26 05:23:59] [dense_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16314, "prompt_tok_total": 33828, "gen_per_req": 254.9, "agg_tps": 236.5, "decode_agg_tps": 309.1, "decode_perseq_tps": 4.38, "prefill_tps": 2088.9, "ttft_mean_ms": 9512.4, "ttft_max_ms": 16191.0, "wall_s": 68.976}
[2026-06-26 05:23:59] [dense_vllm] NPL=64 PEAK_GB=110.88
[2026-06-26 05:25:57] [dense_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32640, "prompt_tok_total": 67969, "gen_per_req": 255.0, "agg_tps": 288.4, "decode_agg_tps": 418.8, "decode_perseq_tps": 2.79, "prefill_tps": 1929.1, "ttft_mean_ms": 18449.5, "ttft_max_ms": 35227.7, "wall_s": 113.162}
[2026-06-26 05:25:57] [dense_vllm] NPL=128 PEAK_GB=110.95
[2026-06-26 05:26:27] [dense_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 2044, "prompt_tok_total": 4187, "gen_per_req": 255.5, "agg_tps": 68.1, "decode_agg_tps": 73.4, "decode_perseq_tps": 9.07, "prefill_tps": 1921.9, "ttft_mean_ms": 1877.6, "ttft_max_ms": 2178.1, "wall_s": 30.018}
[2026-06-26 05:26:35] [dense_vllm] ==== DONE dense_vllm POST_GB=3.53 ====
[2026-06-26 05:26:35] [moe_llama] ==== START moe_llama (llama) baseline_mem=3.53 ====
[2026-06-26 05:26:35] [moe_llama] NPL=8 launching server PRE_GB=3.53
[2026-06-26 05:26:50] [moe_llama] NPL=8 ready LOADED_GB=36.42
[2026-06-26 05:26:52] [moe_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1. **Analyze User Input:**\n - User says: "The capital of France is"\n - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2. **Identify Key Information:*
[2026-06-26 05:27:06] [moe_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2048, "prompt_tok_total": 4195, "gen_per_req": 256.0, "agg_tps": 156.8, "decode_agg_tps": 211.8, "decode_perseq_tps": 24.45, "prefill_tps": 1236.4, "ttft_mean_ms": 2477.1, "ttft_max_ms": 3392.9, "wall_s": 13.061}
[2026-06-26 05:27:06] [moe_llama] NPL=8 PEAK_GB=39.66
[2026-06-26 05:27:11] [moe_llama] NPL=8 server stopped mem=3.34
[2026-06-26 05:27:11] [moe_llama] NPL=32 launching server PRE_GB=3.34
[2026-06-26 05:27:16] [moe_llama] NPL=32 ready LOADED_GB=36.54
[2026-06-26 05:27:54] [moe_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8192, "prompt_tok_total": 16900, "gen_per_req": 256.0, "agg_tps": 235.6, "decode_agg_tps": 393.0, "decode_perseq_tps": 10.02, "prefill_tps": 1213.9, "ttft_mean_ms": 8225.2, "ttft_max_ms": 13921.9, "wall_s": 34.768}
[2026-06-26 05:27:54] [moe_llama] NPL=32 PEAK_GB=47.11
[2026-06-26 05:28:00] [moe_llama] NPL=32 server stopped mem=3.30
[2026-06-26 05:28:00] [moe_llama] NPL=64 launching server PRE_GB=3.30
[2026-06-26 05:28:05] [moe_llama] NPL=64 ready LOADED_GB=36.39
[2026-06-26 05:29:10] [moe_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16384, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 271.0, "decode_agg_tps": 527.0, "decode_perseq_tps": 6.15, "prefill_tps": 1152.3, "ttft_mean_ms": 15849.5, "ttft_max_ms": 29356.9, "wall_s": 60.449}
[2026-06-26 05:29:10] [moe_llama] NPL=64 PEAK_GB=57.13
[2026-06-26 05:29:16] [moe_llama] NPL=64 server stopped mem=3.28
[2026-06-26 05:29:16] [moe_llama] NPL=128 launching server PRE_GB=3.28
[2026-06-26 05:29:21] [moe_llama] NPL=128 ready LOADED_GB=36.48
[2026-06-26 05:34:19] [moe_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32760, "prompt_tok_total": 67969, "gen_per_req": 255.9, "agg_tps": 112.7, "decode_agg_tps": 726.4, "decode_perseq_tps": 3.73, "prefill_tps": 276.8, "ttft_mean_ms": 213017.2, "ttft_max_ms": 245528.7, "wall_s": 290.634}
[2026-06-26 05:34:19] [moe_llama] NPL=128 PEAK_GB=61.51
[2026-06-26 05:34:25] [moe_llama] NPL=128 server stopped mem=3.28
[2026-06-26 05:34:25] [moe_llama] ==== DONE moe_llama POST_GB=3.28 ====
[2026-06-26 05:34:25] [moe_vllm] ==== START moe_vllm (vllm) baseline_mem=3.28 ====
[2026-06-26 05:34:25] [moe_vllm] launching vllm PRE_GB=3.28
[2026-06-26 05:39:35] [moe_vllm] vllm ready LOADED_GB=109.46
[2026-06-26 05:39:38] [moe_vllm] GATE='Here\'s a thinking process:\n\n1. **Analyze User Input:**\n - User says: "The capital of France is"\n - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2. **Identify Key Information:**\n - C
[2026-06-26 05:39:47] [moe_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1900, "prompt_tok_total": 4195, "gen_per_req": 237.5, "agg_tps": 231.2, "decode_agg_tps": 256.5, "decode_perseq_tps": 31.84, "prefill_tps": 5186.5, "ttft_mean_ms": 768.8, "ttft_max_ms": 808.2, "wall_s": 8.217}
[2026-06-26 05:39:47] [moe_vllm] NPL=8 PEAK_GB=109.62
[2026-06-26 05:40:07] [moe_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 7794, "prompt_tok_total": 16900, "gen_per_req": 243.6, "agg_tps": 426.4, "decode_agg_tps": 500.8, "decode_perseq_tps": 14.9, "prefill_tps": 6223.4, "ttft_mean_ms": 1830.4, "ttft_max_ms": 2714.2, "wall_s": 18.28}
[2026-06-26 05:40:07] [moe_vllm] NPL=32 PEAK_GB=109.63
[2026-06-26 05:40:37] [moe_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 15927, "prompt_tok_total": 33828, "gen_per_req": 248.9, "agg_tps": 550.7, "decode_agg_tps": 686.1, "decode_perseq_tps": 9.83, "prefill_tps": 5926.5, "ttft_mean_ms": 3224.4, "ttft_max_ms": 5704.9, "wall_s": 28.92}
[2026-06-26 05:40:37] [moe_vllm] NPL=64 PEAK_GB=109.63
[2026-06-26 05:41:27] [moe_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 31795, "prompt_tok_total": 67969, "gen_per_req": 248.4, "agg_tps": 650.7, "decode_agg_tps": 882.2, "decode_perseq_tps": 6.05, "prefill_tps": 5300.5, "ttft_mean_ms": 6487.7, "ttft_max_ms": 12817.8, "wall_s": 48.863}
[2026-06-26 05:41:27] [moe_vllm] NPL=128 PEAK_GB=109.64
[2026-06-26 05:41:36] [moe_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 1702, "prompt_tok_total": 4187, "gen_per_req": 212.8, "agg_tps": 207.2, "decode_agg_tps": 226.4, "decode_perseq_tps": 28.06, "prefill_tps": 6021.3, "ttft_mean_ms": 642.7, "ttft_max_ms": 694.8, "wall_s": 8.213}
[2026-06-26 05:41:44] [moe_vllm] ==== DONE moe_vllm POST_GB=3.31 ====
==== ALL 16 ROWS COLLECTED (2 models x 2 engines x 4 npl) ====
decode_agg t/s (llama | vLLM | llama%vLLM):
DENSE q36-27b-nvfp4: npl8 82.5|70.4|117% npl32 192.6|211.8|91% npl64 277.8|309.1|90% npl128 384.6|418.8|92%
MoE q36-35b-a3b: npl8 211.8|256.5|83% npl32 393.0|500.8|78% npl64 527.0|686.1|77% npl128 726.4|882.2|82%
peak_gb (llama on-demand grows | vLLM fixed ~107 pool):
DENSE llama 53.5->93.8 ; vLLM ~110.9 flat
MoE llama 39.7->61.5 ; vLLM ~109.6 flat
Final CSV: final_benchmark.csv ; analysis: QWEN36_NVFP4_BENCH.md (FINAL section).
Cleanup: no leftover server/bench PIDs; GPU free (memnow 3.28 GB); local-ai + local-ai-worker
containers restarted (host returned). DONE.


@@ -1,628 +0,0 @@
# bf16 SSM-state cache: BUILD PLAN (PART C synthesis - hand this to the build agent)
Status: READ-ONLY design. Lands ON TOP of patch 0021 (conv-state in-place fusion, building
concurrently on the GPU). DEFAULT = bf16 SSM recurrent state, f32 opt-out. This PART C is the
executive build brief: ordered edits, acceptance gate, bench targets, semantics/back-compat/risk
register, and the de-risk-first item. PART A (cparams wiring), PART B (kernel/op plumbing) and the
Appendix (upstream precedent + numeric safety) below are the detailed reference each step points into.
The decision (settled by GDN_RECURRENCE_BYTE_GATE.md): the gated-DeltaNet recurrence is the dominant
decode kernel (51.6% of the step, 805 MB f32 state R+W/call at 74% of GB10 peak BW) and is ALREADY
single-pass (measured re-stream ~1.0x, hard-capped <=1.33x). The whole ~2x DRAM gap vs vLLM is purely
f32(llama) vs bf16(vLLM) state-cache WIDTH, not extra passes. Narrowing the persisted SSM state to
bf16 (load->f32, recurrence math in f32 UNCHANGED, store->bf16) halves the dominant term and reaches
vLLM parity-to-ahead. vLLM's own GDN state cache is bf16, so this is a fair equal-precision change.
## C.0 Synthesis decisions that OVERRIDE the per-part text
1. v1 ships `type_s` = BF16 (SSM recurrent state, the 805 MB lever) and KEEPS `type_r` = F32 (conv
state). Reason: `ggml_concat` at prefill (`build_conv_state`, delta-net-base.cpp:472) requires
same-type operands; a bf16 conv cache breaks the f32 `qkv_mixed` concat. Conv state is ~12.6 MB
(launch-bound, ~0 ms byte benefit), so keeping it f32 costs nothing. This OVERRIDES PART A §3a/§3b,
which set BOTH defaults to BF16: in v1 set the `type_r` / `cache_type_conv` DEFAULT to
`GGML_TYPE_F32`. `type_r`=bf16 is a v2 follow-up (needs an f32 staging view before the prefill
concat - PART B §B.6).
2. Keep ALL transient/scratch tensors f32: the GDN op OUTPUT scratch (ggml.c:6327), the 0019 gather
scratch, and the keep_rs_t prefill snapshot. ONLY the PERSISTED cache rows narrow to bf16 (the
src[5] read view and the src[6] in-place write view).
3. The gate REPLACES the bit-exact md5 gate for the bf16 default: bf16 is intentionally non-bit-exact
vs llama f32 (it is equal precision to vLLM's bf16). The 0018/0019 md5 gate STILL applies to (a)
patch 0021's conv fusion and (b) verifying the f32 opt-out path is byte-identical to the pre-bf16
f32 baseline.
## C.1 Ordered file-by-file edit list (build order, on top of 0021)
Order is dependency- and de-risk-driven: prove the kernel dtype-correct in ISOLATION before flipping
any default. Section refs point into PART A / PART B below.
STEP 1 - kernel + op made dtype-generic (the load/store conversion), validated standalone:
- 1a `ggml/src/ggml.c` - relax the F32-only state asserts to {F32,BF16} in the 3 GDN builders:
`ggml_gated_delta_net` (~6308), `_inplace` (~6370), `_inplace_ids` (~6430), on `state` and
`src_state_dst`. KEEP the op OUTPUT scratch F32 (6327). [PART B §B.2]
- 1b `ggml/src/ggml-cuda/ggml-cuda.cu` - `supports_op` `GGML_OP_GATED_DELTA_NET` (~3096): permit a
BF16 `src[5]`/`src[6]`. [PART B §B.3]
- 1c `ggml/src/ggml-cuda/gated_delta_net.cu` - template kernel+gather+launch on `bool STATE_BF16`;
`#include <cuda_bf16.h>`. LOAD `__bfloat162float` (~102), STORE `__float2bfloat16` (~207), GATHER
bf16->f32 scratch (~20). Cast `src_state`/`src_state_dst` pointers to `nv_bfloat16` on bf16; relax
dispatcher asserts (309-311) `sizeof(float)` -> `ggml_type_size(type)`. Keep gather scratch +
keep_rs_t snapshot f32. ALL recurrence math (106-200) UNCHANGED in f32 registers. [PART B §B.4,§B.8]
- 1d `ggml/src/ggml-cpu/ops.cpp` - matching bf16 load/store branch in the GDN reference (10726/10744/
10891 load via `GGML_BF16_TO_FP32`, 10758-10762 store via `GGML_FP32_TO_BF16`); relax `nb[]` asserts
to `ggml_type_size(type)`. [PART B §B.5]
- 1e `tests/test-backend-ops.cpp` - add a BF16-state `GATED_DELTA_NET` case covering BOTH `n_tokens==1`
decode AND a multi-token (prefill/chunk) + `keep_rs_t==true` path, CUDA bf16 vs CPU bf16 reference.
THIS IS THE DE-RISK GATE for Step 1 (see C.5). Build + pass before Step 2.
STEP 2 - cparams selection wiring (llama.cpp core):
- 2a `include/llama.h` (after :366) - add `enum ggml_type type_s;` and `type_r;` adjacent to
`type_k`/`type_v`, marked `[EXPERIMENTAL]`. [PART A §3a]
- 2b `src/llama-context.cpp:3468` (`llama_context_default_params`) - add `/*.type_s =*/ GGML_TYPE_BF16,`
and `/*.type_r =*/ GGML_TYPE_F32,`. THIS IS THE DEFAULT CHANGE (type_r stays F32 per C.0). [PART A §3a]
- 2c `src/llama-memory.h:19` (`struct llama_memory_params`) - add `ggml_type type_r;` and `type_s;`.
[PART A §3a]
- 2d `src/llama-context.cpp:325` (`params_mem` init) - pass `params.type_r` / `params.type_s`. [PART A §3a]
- 2e `src/llama-model.cpp` - replace the 3 hardcoded `GGML_TYPE_F32` pairs (2056-57 recurrent, 2098-99
hybrid_iswa, 2117-18 hybrid = the qwen35/qwen35moe path) with `params.type_r` / `params.type_s`.
[PART A §2/§3a]
STEP 3 - back-compat for saved recurrent state (REQUIRED, the default flips):
- 3a `src/llama-memory-recurrent.cpp` `state_read_data` - on `s_type_i_ref != live type` with both in
{F32,BF16}, CONVERT row-by-row during load instead of returning false (same for `r`). Bump the
recurrent state-file version. [PART A §5, option A]
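
The row-by-row conversion in 3a amounts to widening or narrowing each element of a serialized row. A minimal Python sketch of that idea, assuming little-endian rows and round-to-nearest-even narrowing; the function names are hypothetical, NaN/Inf handling is omitted, and the real change would live in C++ inside `state_read_data`.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Round one float32 value to bfloat16 (ties to even); returns the 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias; NaN/Inf not handled here
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_f32(h: int) -> float:
    """Widen a bfloat16 bit pattern to float32 (exact: low 16 bits are zero)."""
    return struct.unpack("<f", struct.pack("<I", h << 16))[0]


def convert_row(row: bytes, src: str, dst: str) -> bytes:
    """Convert one serialized state row between 'f32' and 'bf16'."""
    if src == dst:
        return row
    if (src, dst) == ("f32", "bf16"):
        n = len(row) // 4
        vals = struct.unpack(f"<{n}f", row)
        return struct.pack(f"<{n}H", *(f32_to_bf16_bits(v) for v in vals))
    if (src, dst) == ("bf16", "f32"):
        n = len(row) // 2
        halves = struct.unpack(f"<{n}H", row)
        return struct.pack(f"<{n}f", *(bf16_bits_to_f32(h) for h in halves))
    raise ValueError(f"unsupported conversion {src} -> {dst}")


# Values chosen to be exactly representable in bf16, so the round trip is lossless.
saved_f32_row = struct.pack("<4f", 1.0, -2.5, 0.15625, 3.0)
live_bf16_row = convert_row(saved_f32_row, "f32", "bf16")
restored = convert_row(live_bf16_row, "bf16", "f32")
print(len(saved_f32_row), len(live_bf16_row), restored == saved_f32_row)
```

Loading an old f32 session into a bf16 build is lossy in general; only the bf16-to-f32 direction is exact.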
STEP 4 - CLI / llama-server surface (needed by the gate harness):
- 4a `common/common.h:566` region - `cache_type_ssm = GGML_TYPE_BF16;` and
`cache_type_conv = GGML_TYPE_F32;` (conv default F32 per C.0). [PART A §3b]
- 4b `common/common.cpp:1589` region - `cparams.type_s = params.cache_type_ssm;` and
`cparams.type_r = params.cache_type_conv;`. [PART A §3b]
- 4c `common/arg.cpp` (after :2074) - add `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv`
via the existing `kv_cache_type_from_str` (arg.cpp:402); confirm `bf16` -> `GGML_TYPE_BF16`. The C.2
harness depends on `--cache-type-ssm {f32,bf16}`. [PART A §3b]
STEP 5 - LocalAI gRPC / YAML (force f32 from model config):
- 5a `backend/backend.proto` - `string CacheTypeSSM` / `CacheTypeConv` (next free tags after 64);
regen proto. [PART A §3c]
- 5b `backend/cpp/llama-cpp/grpc-server.cpp:504` region - `params.cache_type_ssm =
kv_cache_type_from_str(request->cachetypessm());` + conv. [PART A §3c]
- 5c `core/config/model_config.go:935` - `CacheTypeSSM`/`CacheTypeConv` yaml fields. [PART A §3c]
- 5d `core/backend/options.go:247` - map into the request. [PART A §3c]
- 5e `core/config/meta/registry.go` + `build_test.go` - register `cache_type_ssm`/`cache_type_conv`
as static fields (gate). [PART A §3c]
STEP 6 - capability fallback (heterogeneous / CPU-offload safety):
- 6a `src/llama-context.cpp:518-595` - an `auto_fgdn`-style device-match probe: if a participating
device lacks the bf16 GDN load/store specialization (CPU-offloaded GDN layer, non-GB10 backend),
demote `type_s` to F32 BEFORE alloc and log once. [PART A §4]
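
The demotion logic in 6a can be summarized as "keep bf16 only if every participating device supports it". A hypothetical sketch of that decision, with devices modelled as plain dictionaries and a made-up `gdn_bf16` capability flag; the actual probe would query backend `supports_op` in C++.

```python
import logging

log = logging.getLogger("ssm-state")


def select_state_type(requested: str, devices: list[dict]) -> str:
    """Pick the SSM state dtype: keep `requested` only if every device supports it."""
    if requested != "bf16":
        return requested
    lacking = [d["name"] for d in devices if not d.get("gdn_bf16", False)]
    if lacking:
        # Logged once, before any cache allocation happens.
        log.warning("demoting type_s bf16 -> f32 (no bf16 GDN kernel on: %s)",
                    ", ".join(lacking))
        return "f32"
    return "bf16"


gpu_only = [{"name": "CUDA0", "gdn_bf16": True}]
with_cpu_offload = gpu_only + [{"name": "CPU", "gdn_bf16": False}]
print(select_state_type("bf16", gpu_only), select_state_type("bf16", with_cpu_offload))
```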
## C.2 Acceptance gate (REPLACES the bit-exact md5 gate)
bf16 is intentionally non-bit-exact, so the 0018/0019 md5 byte-equality gate does NOT apply to the
bf16 default. The gate is teacher-forced KL-divergence + PPL-delta + greedy coherence + a
long-context drift sweep, vs the SAME model run f32. All commands on `dgx.casa` (DO NOT run during
this design - GPU busy). Binaries `~/llama-paged-dev/build*/bin`; models `~/bench/q36-27b-nvfp4.gguf`
(dense) and `~/bench/q36-35b-a3b-nvfp4.gguf` (MoE); scratch `~/bench/klgate`.
Why teacher-forced (not self-greedy): a self-greedy decode lets each precision pick its own argmax,
so after the first divergence the contexts differ and per-token logits are no longer comparable (you
measure trajectory divergence, not numeric error). `llama-perplexity --kl-divergence` feeds both
precisions the IDENTICAL token stream and compares output distributions position-by-position; the
greedy trajectory is validated SEPARATELY by the Same-top-p metric + a coherence read.
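
What a teacher-forced comparison measures can be sketched in a few lines: both runs see the same token prefix at every position, so their logits can be compared position by position. The code below is an illustration only, not the `llama-perplexity` implementation; the direction of the divergence (base as the reference distribution) and the toy logits are assumptions of this sketch.

```python
import math


def log_softmax(logits):
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]


def kl_nats(base_logits, test_logits):
    """KL(base || test) in nats for one position."""
    lp = log_softmax(base_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def compare(base_positions, test_positions):
    """Mean KLD and top-1 agreement (%) over positions sharing the same prefix."""
    pairs = list(zip(base_positions, test_positions))
    klds = [kl_nats(b, t) for b, t in pairs]
    same = sum(b.index(max(b)) == t.index(max(t)) for b, t in pairs)
    return sum(klds) / len(klds), 100.0 * same / len(klds)


# Toy logits for two positions over a 3-token vocabulary.
base = [[2.0, 1.0, 0.0], [0.0, 0.5, 3.0]]
test = [[2.01, 1.0, 0.0], [0.0, 0.49, 3.0]]
mean_kld, same_top = compare(base, test)
print(f"mean KLD = {mean_kld:.2e} nats, same top = {same_top:.1f}%")
```

In a self-greedy run the second position would be conditioned on different prefixes once the argmax diverged, and the per-position comparison above would no longer be meaningful.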
Corpus (one-time): wikitext-2 raw test (~280k tokens) into `~/bench/klgate`. KL mode needs at
least 2*n_ctx tokens; any fixed >=8k-token UTF-8 file works as long as base AND test share it.
256-token headline gate (per model; shown for dense):
```
M=~/bench/q36-27b-nvfp4.gguf; F=~/bench/klgate/wikitext-2-raw/wiki.test.raw; D=~/bench/klgate
COMMON="-m $M -f $F -c 256 -b 256 -ngl 99 -fa on --seed 1 --chunks 32"
# (a) f32 BASE: reference logits + f32 PPL
llama-perplexity $COMMON --cache-type-ssm f32 --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32.c256.base.log
# (b) bf16 TEST: KL(bf16||f32) + bf16 PPL + Same-top-p
llama-perplexity $COMMON --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.bf16.c256.kl.log
```
Noise floor (run FIRST, mandatory - GPU reductions are not bit-deterministic, so KLD has a non-zero
floor; bf16 is judged against BOTH the absolute threshold AND this floor):
```
llama-perplexity $COMMON --cache-type-ssm f32 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32f32.floor.log
```
Record `Mean KLD_floor` and `Same-top-p_floor` (expect KLD ~1e-6..1e-5, top-p ~100%).
Coherence spot-check (greedy trajectory, reuses the 0018/0019 `--temp 0 --seed 1` convention):
```
P="Explain how a transformer language model generates text, step by step."
for T in f32 bf16; do llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm $T 2>/dev/null > $D/q27.greedy.$T.txt; done
diff $D/q27.greedy.f32.txt $D/q27.greedy.bf16.txt && echo "GREEDY BYTE-IDENTICAL"
```
Long-context drift sweep (verifies the g<1 decay bound: bf16 state-rounding error must stay FLAT, not
accumulate, as context grows - the GDN state spans the whole window):
```
for C in 256 1024 2048 4096; do
CMN="-m $M -f $F -c $C -b $C -ngl 99 -fa on --seed 1 --chunks 8"
llama-perplexity $CMN --cache-type-ssm f32 --kl-divergence-base $D/q27.f32.c$C.kld >/dev/null
llama-perplexity $CMN --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c$C.kld | tee $D/q27.bf16.c$C.kl.log
done
```
f32 opt-out verification (the safety valve must actually select f32 and reproduce the committed f32
greedy md5 from 0018/0019 - the bf16 default must NOT change the f32-path output):
```
llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm f32 2>/dev/null | md5sum # == 0018/0019 f32 baseline md5
```
Repeat the WHOLE gate verbatim for the MoE model (`M=~/bench/q36-35b-a3b-nvfp4.gguf`).
PASS/FAIL (bf16 ships as DEFAULT only if ALL rows pass for BOTH dense and MoE):
| metric | source | PASS threshold |
|---|---|---|
| Mean KLD | 256-gate (b) | **< 1e-3 nats** (hard, the brief) |
| Mean KLD vs floor | (b) vs floor | <= ~5x `Mean KLD_floor` (bounded signal, not pure noise) |
| Same top p | (b) | **>= 99.5%** (100% => greedy byte-identical to f32) |
| PPL-delta `ln(PPL_bf16/PPL_f32)` | (a)+(b) | **abs < 0.005** (PPL within +-0.5%) |
| Max / 99.9% KLD | (b) | report; flag if Max > 0.05 (tail outliers) |
| Coherence | greedy | fluent + on-topic; byte-identical if Same-top-p=100% |
| Long-context drift | sweep | MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0% |
If any row fails for a model: keep THAT model on f32 (gallery YAML `cache_type_ssm: f32`) while the
global default stays bf16; the cparams f32 fallback is the safety valve. MoE has fewer GDN layers
(31 vs 48) and smaller per-head state (H_v=32 vs 48), so expected KLD <= dense; same thresholds.
Same-top-p is the bridge to the old md5 harness: at 100% the bf16 greedy output is byte-identical to
f32 and the 0018/0019 md5 gate would still pass - the strongest possible non-bit-exact result.
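
The numeric rows of the PASS/FAIL table can be expressed as a small check. This is a sketch with hypothetical key names and made-up sample numbers, not parsed `llama-perplexity` output; the Max / 99.9% KLD row (report only) and the coherence read (manual) are left out.

```python
import math


def gate(m: dict) -> dict[str, bool]:
    """Apply the numeric PASS thresholds from the table to one model's metrics."""
    ppl_delta = math.log(m["ppl_bf16"] / m["ppl_f32"])
    return {
        "mean_kld": m["mean_kld"] < 1e-3,
        "kld_vs_floor": m["mean_kld"] <= 5.0 * m["mean_kld_floor"],
        "same_top_p": m["same_top_p"] >= 99.5,
        "ppl_delta": abs(ppl_delta) < 0.005,
        "drift": (m["mean_kld_c4096"] <= 1.5 * m["mean_kld_c256"]
                  and m["same_top_p_c4096"] >= 99.0),
    }


# Made-up numbers, for illustration only.
sample = {
    "mean_kld": 4e-5, "mean_kld_floor": 1e-5, "same_top_p": 99.8,
    "ppl_f32": 10.000, "ppl_bf16": 10.010,
    "mean_kld_c256": 4e-5, "mean_kld_c4096": 5e-5, "same_top_p_c4096": 99.6,
}
result = gate(sample)
print(result, "SHIP bf16" if all(result.values()) else "KEEP f32")
```

Per the rule above, bf16 becomes the default only when every row passes for both the dense and the MoE model; a single failing row keeps that model on f32.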
## C.3 Bench targets + nsys confirmation
Dense q36-27b-nvfp4 (48 GDN layers, S_v=128, H_v=48), npl128, GB10/sm_121, graphs-OFF
apples-to-apples (the measured baseline):
- Recurrence per call: 3.98 ms (f32, 805 MB R+W, 74% peak) -> **~2.0-3.0 ms** (bf16, ~413 MB R+W).
2.0 ms = 74% peak retained; 3.0 ms = conservative 50% peak on the smaller footprint.
- Recurrence per step: 191 ms -> ~96-143 ms (save ~48-95 ms).
- Step time: 384 ms -> **289-339 ms**.
- Decode throughput: ~335 -> **360-443 tok/s** = parity-to-ahead of vLLM (327 ms / 391 tok/s).
MoE q36-35b-a3b-nvfp4 (31 GDN layers, H_v=32): state per (seq,layer) = 128*128*32*4 = 2.0 MiB f32 ->
per-call R+W ~537 MB f32 -> ~268 MB bf16. Fewer layers + smaller state => smaller ABSOLUTE recurrence
savings, and MoE decode is more GEMM-bound (the `MUL_MAT_ID` expert path), so the bf16-state win is a
smaller FRACTION of the MoE step. Target: a measurable per-call halving of the GDN recurrence time
with the C.2 KL gate passing; no absolute MoE step target is asserted here (the MoE step is
MUL_MAT_ID-dominated, a separate lever from this one).
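
The byte counts and the dense step-time band above can be rechecked with a few lines of arithmetic. A sketch under two assumptions: only the persisted state rows are counted (no q/k/v or output traffic), and only the recurrence call gets faster while the rest of the step is unchanged. Function names are illustrative.

```python
def state_rw_bytes(s_v: int, h_v: int, n_seq: int, bytes_per_elem: int) -> int:
    """Bytes read + written per recurrence call: one S_v x S_v state per head per sequence."""
    per_seq = s_v * s_v * h_v * bytes_per_elem
    return 2 * n_seq * per_seq


N_SEQ, S_V = 128, 128
dense_f32 = state_rw_bytes(S_V, 48, N_SEQ, 4)
dense_bf16 = state_rw_bytes(S_V, 48, N_SEQ, 2)
moe_f32 = state_rw_bytes(S_V, 32, N_SEQ, 4)
moe_bf16 = state_rw_bytes(S_V, 32, N_SEQ, 2)


def step_after(step_ms: float, layers: int, call_before_ms: float, call_after_ms: float) -> float:
    """New step time if only the per-layer recurrence call gets faster."""
    return step_ms - layers * (call_before_ms - call_after_ms)


def tok_per_s(n_seq: int, step_ms: float) -> float:
    return n_seq * 1000.0 / step_ms


fast = step_after(384.0, 48, 3.98, 2.0)  # bf16 call at 2.0 ms
slow = step_after(384.0, 48, 3.98, 3.0)  # bf16 call at 3.0 ms

print(f"dense R+W: {dense_f32 / 1e6:.0f} MB f32 -> {dense_bf16 / 1e6:.0f} MB bf16")
print(f"MoE   R+W: {moe_f32 / 1e6:.0f} MB f32 -> {moe_bf16 / 1e6:.0f} MB bf16")
print(f"dense step: {fast:.0f}-{slow:.0f} ms, "
      f"{tok_per_s(N_SEQ, slow):.0f}-{tok_per_s(N_SEQ, fast):.0f} tok/s")
```

The f32 figures reproduce the 805 MB and ~537 MB per-call numbers, and both step estimates fall inside the 289-339 ms band.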
nsys confirmation (the measurement that proves the lever landed):
```
GGML_CUDA_DISABLE_GRAPHS=1 nsys profile -o ssmbf16 --force-overwrite true \
llama-batched-bench -m $M -npp 8 -ntg 12 -npl 128 -ub 2048
nsys stats --report cuda_gpu_kern_sum ssmbf16.nsys-rep | grep -i gated_delta_net
```
Confirm: `gated_delta_net_cuda` mean duration/call drops 3.98 -> 2.0-3.0 ms; step time + tok/s land in
the 289-339 ms / 360-443 tok/s band; the f32 opt-out reproduces the 3.98 ms f32 call. The gate is the
JOINT condition: per-call speed in band AND KL<1e-3 - neither alone ships bf16.
## C.4 Default / opt-out semantics, back-compat, risk register
Semantics:
- DEFAULT `type_s` = `GGML_TYPE_BF16` (SSM recurrent state). `type_r` = `GGML_TYPE_F32` in v1 (conv
state; bf16 is v2). This is the INVERSE of KV (KV is opt-IN to compression at F16 default; SSM is
opt-OUT to f32).
- Opt-out: `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML) -> bit-exact f32
recurrence. Per-model opt-out lives in gallery YAML if a model fails the gate; the global default
stays bf16.
- Silent capability fallback: the C.1 STEP 6 device-match probe demotes `type_s` to F32 before alloc
on devices lacking the bf16 GDN specialization (CPU offload / non-GB10) and logs once.
Back-compat (the ONE real breakage): `llama-memory-recurrent.cpp` serializes the per-layer state
dtype and HARD-matches on restore (mismatch -> `"mismatched s type"` -> returns false). The f32->bf16
default flip makes OLD f32-saved sessions fail to restore against a bf16 build. Fix = STEP 3a: convert
row-by-row on mismatch (both in {F32,BF16}) + bump the recurrent state-file version. KV never hit this
because `type_k`/`type_v` were EXPERIMENTAL and never default-changed; the SSM default FLIP is what
forces the convert/version work.
Risk register:
- **R1 numeric drift (KL gate fails).** Likelihood LOW: the `exp(g)<1` geometric decay contracts per-step bf16
rounding to a bounded series (~`eps/(1-exp(g_mean))`), f32 registers confine rounding to one
per-step cache write, and vLLM ships this exact config in production. Mitigation: C.2 gate +
per-model f32 opt-out + global f32 fallback.
- **R2 prefill / keep_rs_t / gather state path (the silent-corruption landmine).** The conversion
points are documented for DECODE; the SAME kernel also runs the chunked prefill path, the keep_rs_t
snapshot (writes to f32 scratch while the cache is bf16), and the 0019 gather (reads bf16 cache ->
f32 scratch). A dtype mistake on any of these corrupts the state at the prefill->decode handoff and
surfaces ONLY as long-context drift, which a decode-only 256-token gate can mask. Mitigation: STEP
1e test-backend-ops MUST cover the multi-token prefill + keep_rs_t==true path, not just decode; the
C.2 long-context sweep is the second net. (This is C.5, the single biggest risk.)
- **R3 MoE MUL_MAT_ID path.** The GDN recurrence op is IDENTICAL for dense and MoE; the MoE expert
GEMM (`MUL_MAT_ID`) does NOT touch the SSM state, so bf16-state is orthogonal to the expert path.
Residual risk: `qwen35moe` `build_recurrent_attn` must route the same bf16 state view (it shares
delta-net-base.cpp). Mitigation: run the full C.2 gate on the MoE model; the test-backend-ops case
is arch-agnostic.
- **R4 conv-state coupling with patch 0021.** Flipping `type_r` to bf16 breaks `ggml_concat` at
prefill (different types). Mitigation: v1 keeps `type_r`=F32 (C.0); `type_r`=bf16 deferred to v2
with an f32 staging view (PART B §B.6).
- **R5 back-compat restore failure.** Mitigation: STEP 3a convert + version bump (above).
## C.5 Single biggest risk + how the build agent de-risks it FIRST
Single biggest risk: **R2 - silent state corruption on the NON-decode state paths** (chunked prefill,
the keep_rs_t snapshot, the 0019 gather). The 805 MB measurement and every conversion-point in the
cheat-sheet describe the STEADY decode path (`n_tokens==1`, `!keep_rs_t`). But the bf16 cache is ALSO
read/written by the multi-token prefill path and the prefill/rollback snapshot (which targets f32
scratch while the cache is bf16). A dtype bug there does not crash and barely moves the 256-token
decode md5; it corrupts the recurrent state at the prefill->decode boundary and shows up ONLY as
long-context drift - exactly the failure a quick gate misses.
De-risk FIRST (before ANY default flip or wiring): implement STEP 1 (kernel + op dtype-generic) and
STEP 1e (test-backend-ops) ONLY, then prove the kernel is dtype-correct in ISOLATION by forcing a
bf16 state allocation behind a temporary debug flag and running test-backend-ops with a case that
exercises (a) single-token decode, (b) a multi-token prefill chunk, and (c) `keep_rs_t==true`,
comparing CUDA bf16 against the CPU bf16 reference AND against the f32 path within tolerance. Only
after that case is GREEN does the build agent proceed to STEP 2 (flip the default) and the C.2
model-level gate. This decouples kernel dtype-correctness from the cparams wiring, so a Step-1 bug is
caught by a deterministic op test in minutes instead of as a fuzzy long-context regression after the
full stack is wired.
---
# bf16 SSM state cache — cparams wiring (DEFAULT bf16 + f32 opt-out)
Label: cparams-default-fallback (READ-ONLY design). Mirrors the KV-cache `type_k`/`type_v`
precision plumbing exactly. Designed against HEAD-after-patch-0021 (conv-state in-place fusion).
This is lever (2) of GDN_RECURRENCE_BYTE_GATE.md: the recurrent SSM state cache is the dominant
decode byte stream (805 MB R+W/call, 51.6% of step, single-pass f32 = at the BW floor). The whole
~2x DRAM gap vs vLLM is f32(llama) vs bf16(vLLM) state width. Storing the persisted state in bf16
(load→f32, recurrence math in f32 UNCHANGED, store→bf16) halves the dominant term. vLLM's GDN state
cache is bf16, so bf16-default is the fair equal-precision comparison → make it the DEFAULT.
---
## 1. The KV-cache template we mirror (exact chain for type_k / type_v)
```
CLI common/arg.cpp:2052 -ctk/--cache-type-k TYPE → params.cache_type_k
(common_params, common/common.h:566, default GGML_TYPE_F16)
glue common/common.cpp:1589 cparams.type_k = params.cache_type_k (cparams = llama_context_params)
API include/llama.h:365 llama_context_params.type_k // [EXPERIMENTAL]
llama-context.cpp:3468 default in llama_context_default_params() = GGML_TYPE_F16
mem llama-context.cpp:326 llama_memory_params params_mem.type_k = params.type_k
llama-memory.h:19 struct llama_memory_params { ggml_type type_k; type_v; ... }
alloc llama-model.cpp:2030 create_memory(params_mem, cparams) → KV cache uses params.type_k
```
Key facts:
- `type_k`/`type_v` are NOT stored in `struct llama_cparams` (src/llama-cparams.h). They ride in
`llama_context_params` → `llama_memory_params` and are consumed directly at cache-alloc time.
We mirror that: NO new `llama_cparams` field is needed.
- KV default is opt-IN to compression (F16 default, pass `-ctk q8_0` to shrink). SSM is the INVERSE:
bf16 DEFAULT, pass an explicit `f32` to opt out / restore bit-exactness.
## 2. Where the SSM state type is currently hardcoded (the targets)
The recurrent cache constructor already accepts the types — only the model hardcodes F32:
- `src/llama-memory-recurrent.cpp:22-23` ctor params `ggml_type type_r, type_s`
- `r_l` (line 100, `n_embd_r`) = short conv state → `type_r` (TINY: conv_width-1 taps × conv_dim)
- `s_l` (line 101, `n_embd_s`) = SSM recurrent state → `type_s` (THE 805 MB/call dominant)
- `src/llama-memory-hybrid.h:32-33` ctor params `type_r, type_s` (qwen35 / qwen35moe path)
- Hardcoded `GGML_TYPE_F32` call sites in `src/llama-model.cpp::create_memory`:
- 2056-2057 `llama_memory_recurrent(...)` (pure recurrent arches)
- 2098-2099 `llama_memory_hybrid_iswa(...)` recurrent_type_r / recurrent_type_s
- 2117-2118 `llama_memory_hybrid(...)` recurrent_type_k / recurrent_type_v (mislabeled; they are r/s)
Note: `qwen35` / `qwen35moe` are HYBRID (filter_attn/filter_recr, no SWA) → they take the
`llama_memory_hybrid` branch (2108-2118). That is the call site that matters for the parity push.
## 3. New plumbing (parallel chain `type_s` / `type_r`)
### 3a. Public API + cparams glue (llama.cpp side)
| File | Change |
|------|--------|
| `include/llama.h` (after :366) | Add `enum ggml_type type_s; // data type for recurrent SSM state cache [EXPERIMENTAL]` and `enum ggml_type type_r; // data type for recurrent conv state cache [EXPERIMENTAL]`. Place adjacent to `type_k`/`type_v`. |
| `src/llama-context.cpp:3468` (default params) | Add `/*.type_s =*/ GGML_TYPE_BF16,` and `/*.type_r =*/ GGML_TYPE_F32,` (conv state stays f32 in v1, see §6). **The `type_s` line is the DEFAULT change.** |
| `src/llama-memory.h:19` (`struct llama_memory_params`) | Add `ggml_type type_r;` and `ggml_type type_s;` next to `type_k`/`type_v`. |
| `src/llama-context.cpp:325` (`params_mem` init) | Add `/*.type_r =*/ params.type_r,` and `/*.type_s =*/ params.type_s,`. |
| `src/llama-model.cpp` 2056-57 / 2098-99 / 2117-18 | Replace the 3 hardcoded `GGML_TYPE_F32` pairs with `params.type_r` / `params.type_s`. |
### 3b. CLI / llama-server (common side)
| File | Change |
|------|--------|
| `common/common.h:566` region | Add `ggml_type cache_type_ssm = GGML_TYPE_BF16;` and `ggml_type cache_type_conv = GGML_TYPE_F32;` (mirror `cache_type_k/v`; note the SSM DEFAULT is BF16, not F16, and conv stays F32 in v1 per §6). |
| `common/common.cpp:1589` region | Add `cparams.type_s = params.cache_type_ssm;` and `cparams.type_r = params.cache_type_conv;`. |
| `common/arg.cpp` (after :2074) | Add `--cache-type-ssm TYPE` (`-ctssm`) → `params.cache_type_ssm = kv_cache_type_from_str(value)`, and `--cache-type-conv TYPE` (`-ctconv`). Reuse the existing `kv_cache_type_from_str` (arg.cpp:402). Help text: "recurrent SSM state cache type (default bf16; pass f32 for bit-exact recurrence)". |
`kv_cache_type_from_str` is expected to accept `f32`/`bf16`/`f16` already, so no change should be
needed; confirm that `bf16` maps to `GGML_TYPE_BF16` and add the case if it is absent.
### 3c. LocalAI gRPC backend (so users can force f32 from model YAML)
Mirror `CacheTypeKey` exactly:
| File | Change |
|------|--------|
| `backend/backend.proto:419` region | Add `string CacheTypeSSM = NN;` and `string CacheTypeConv = NN;` (next free field tags). Regenerate proto. |
| `backend/cpp/llama-cpp/grpc-server.cpp:504` region | `if (!request->cachetypessm().empty()) params.cache_type_ssm = kv_cache_type_from_str(request->cachetypessm());` and the conv equivalent. (grpc-server already has its own `kv_cache_type_from_str`; ensure it knows `bf16`.) |
| `core/config/model_config.go:935` region | Add `CacheTypeSSM string yaml:"cache_type_ssm,omitempty"` and `CacheTypeConv string yaml:"cache_type_conv,omitempty"`. |
| `core/backend/options.go:247` region | Add `CacheTypeSSM: c.CacheTypeSSM,` and `CacheTypeConv: c.CacheTypeConv,` to the request build. |
| `core/config/meta/registry.go:161` + `core/config/meta/build_test.go:140` | Register `cache_type_ssm` / `cache_type_conv` as static fields (the `staticFields` slice + registry map) so the meta-config gate passes. |
LocalAI semantics: leaving `cache_type_ssm` UNSET in YAML → empty gRPC string → backend keeps its
BF16 default. Setting `cache_type_ssm: f32` → forces the f32 opt-out (bit-exact recurrence).
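The unset-versus-explicit rule above can be sketched as a small resolver. This is a minimal illustration only: `state_type` and `resolve_cache_type_ssm` are hypothetical names invented here, not identifiers from llama.cpp or LocalAI, and the real code would reuse `kv_cache_type_from_str` and `ggml_type`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the two ggml types this design cares about.
enum class state_type { F32, BF16 };

// Hypothetical helper: map the YAML/gRPC string to a state dtype.
// An empty string means "unset", so the backend default is kept untouched.
inline state_type resolve_cache_type_ssm(const std::string & requested,
                                         state_type backend_default) {
    if (requested.empty()) {
        return backend_default;
    }
    if (requested == "f32") {
        return state_type::F32;   // explicit opt-out: bit-exact recurrence
    }
    if (requested == "bf16") {
        return state_type::BF16;
    }
    throw std::invalid_argument("unsupported SSM cache type: " + requested);
}
```

Passing the default in as a parameter keeps the resolver independent of which default the build ships with.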
## 4. Default / fallback semantics
- **DEFAULT = `GGML_TYPE_BF16`** for the SSM state (`type_s`); the conv state (`type_r`) stays `GGML_TYPE_F32` in v1 (§6 decision).
- SSM state (`type_s`) is the lever: f32→bf16 halves 805→413 MB/call → ~3.98→~2.0-3.0 ms/call.
- Conv state (`type_r`) is negligible bytes, so keeping it f32 has zero perf cost and avoids relying on
patch-0021's in-place conv path being dtype-generic; `type_r`=BF16 is a follow-up, see §6.
- **Opt-out = `GGML_TYPE_F32`** via `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML).
Restores bit-exact recurrence; use when the KL gate (<1e-3 / PPL-delta over 256-tok greedy) fails
for a given model, or for deterministic regression baselines.
- **Silent capability fallback**: gate the bf16 path behind a device-match probe modeled on
`auto_fgdn` (llama-context.cpp:518-595). If the GDN recurrence kernel's bf16 load/store
specialization is unavailable on a participating device (e.g. a CPU-offloaded GDN layer with no
bf16 op, or a non-GB10 backend), fall back to `GGML_TYPE_F32` for `type_s` BEFORE cache alloc and
log it once. This keeps "bf16 default" from breaking heterogeneous/CPU setups.
- The kernel contract is unchanged-math: load bf16→f32 into `s_shard` (registers stay f32), all
recurrence arithmetic in f32, store f32→bf16. Only the persisted cache is rounded per step;
geometric decay (`exp(g)<1`) bounds the rounding, so the error does not accumulate unboundedly.
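The silent capability fallback could look roughly like the probe below. It is a sketch under assumptions: `device_caps`, its `gdn_bf16_state` flag and `probe_type_s` are hypothetical names made up for illustration, and the real probe would query the backends the way `auto_fgdn` does.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class state_type { F32, BF16 };

// Hypothetical description of one participating device.
struct device_caps {
    std::string name;
    bool        gdn_bf16_state; // true if the GDN kernel has a bf16 load/store path here
};

// Returns the dtype to allocate the recurrent state with. If bf16 was requested
// but any device lacks the specialization, demote to F32 and fill `reason` so
// the caller can log it once before cache alloc.
inline state_type probe_type_s(state_type requested,
                               const std::vector<device_caps> & devs,
                               std::string * reason) {
    assert(reason != nullptr);
    reason->clear();
    if (requested != state_type::BF16) {
        return requested; // an explicit f32 request is never changed
    }
    for (const auto & d : devs) {
        if (!d.gdn_bf16_state) {
            *reason = "bf16 SSM state unsupported on " + d.name + ", falling back to f32";
            return state_type::F32;
        }
    }
    return state_type::BF16;
}
```

The probe demotes the whole context on the first unsupported device, so the cache never mixes dtypes across layers.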
## 5. Back-compat (the one real breakage — saved sessions / state files)
`src/llama-memory-recurrent.cpp` SERIALIZES the per-layer state tensor dtype and does a HARD match
on restore:
- write: `state_write_data` writes `s_type_i = (int32_t)s_l[il]->type` (line ~900) and the r type.
- read: `state_read_data` reads `s_type_i_ref`, compares to current `s_l[il]->type`, and on
mismatch logs `"mismatched s type (%d != %d, layer %d)"` and **returns false** (restore FAILS).
Same for `r` type.
Consequence of the default flip f32→bf16:
- Sessions SAVED by an old f32-default build will FAIL to RESTORE against a new bf16-default build
(and vice versa), because the serialized `s_type_i_ref` (F32) ≠ the new cache type (BF16).
Required handling (pick one, recommend A):
- **A (convert on mismatch, recommended)**: in `state_read_data`, when `s_type_i_ref != current`
and both ∈ {F32, BF16}, convert row-by-row during load (`ggml_fp32_to_bf16` / `bf16→fp32`) instead
of returning false. Same for `r`. Bump the recurrent state-file version so older readers reject
cleanly. This makes old f32 sessions loadable into bf16 caches and round-trips safely.
- **B (pin precision to the saved file)**: if a session is being restored, read `s_type_i_ref`
first and set `type_s`/`type_r` from it, overriding the default for that context. Keeps restore
working but silently disables the bf16 win for resumed sessions.
- **C (document-only)**: keep the hard match; document that bf16-default invalidates cross-version
saved recurrent states. Lowest effort, worst UX. Not recommended given parity is the goal.
KV-cache parallel: `type_k`/`type_v` were always EXPERIMENTAL and non-default-changing, so the KV
path never had to solve this. The SSM default-FLIP is what forces the convert/version work — call it
out as the single most load-bearing back-compat item.
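Option A (convert on mismatch) can be sketched as follows. This is a self-contained illustration, not the `state_read_data` code: `state_type`, `restore_row` and the local bf16 helpers are hypothetical names, and a real patch would presumably call ggml's own bf16 conversion routines rather than re-implement them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical dtype tag for this sketch; the real file stores the ggml_type value.
enum class state_type : int32_t { F32, BF16 };

// Round-to-nearest-even f32 -> bf16 bit pattern (NaN is kept quiet).
static inline uint16_t f32_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    if ((u & 0x7fffffffu) > 0x7f800000u) {
        return (uint16_t) ((u >> 16) | 0x0040u);
    }
    u += 0x7fffu + ((u >> 16) & 1u);
    return (uint16_t) (u >> 16);
}

// bf16 -> f32 is exact: the 16 bits become the high half of the float.
static inline float bf16_to_f32(uint16_t h) {
    const uint32_t u = (uint32_t) h << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

static inline std::size_t elem_size(state_type t) {
    return t == state_type::F32 ? sizeof(float) : sizeof(uint16_t);
}

// Restore one serialized row into a cache row of dtype `dst_type`.
// Same dtype: plain copy. F32 <-> BF16: element-wise convert instead of failing.
// Returns false only when the blob size does not match the saved dtype.
inline bool restore_row(const std::vector<uint8_t> & saved, state_type saved_type,
                        void * dst, state_type dst_type, std::size_t n_elem) {
    assert(dst != nullptr);
    if (saved.size() != n_elem * elem_size(saved_type)) {
        return false;
    }
    uint8_t * out = static_cast<uint8_t *>(dst);
    if (saved_type == dst_type) {
        std::memcpy(out, saved.data(), saved.size());
        return true;
    }
    for (std::size_t i = 0; i < n_elem; ++i) {
        if (saved_type == state_type::F32) { // old f32 session -> bf16 cache
            float v;
            std::memcpy(&v, saved.data() + i * sizeof(float), sizeof(float));
            const uint16_t h = f32_to_bf16(v);
            std::memcpy(out + i * sizeof(uint16_t), &h, sizeof(h));
        } else {                             // bf16 session -> f32 cache
            uint16_t h;
            std::memcpy(&h, saved.data() + i * sizeof(uint16_t), sizeof(h));
            const float v = bf16_to_f32(h);
            std::memcpy(out + i * sizeof(float), &v, sizeof(v));
        }
    }
    return true;
}
```

Note the asymmetry: bf16→f32 is lossless, while f32→bf16 rounds, so an old f32 session loaded into a bf16 cache starts from a rounded state.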
## 6. Coupling notes / sequencing
- Land ON TOP of patch 0021 (conv-state in-place fusion). If 0021's fused conv write assumes an f32
conv-state tensor, either (a) extend it to the cache tensor's dtype, or (b) keep `type_r` = F32 by
default and make ONLY `type_s` bf16 (conv bytes are negligible, so this loses nothing perf-wise and
de-risks 0021). Decision: ship `type_s`=BF16 first; make `type_r`=BF16 a follow-up gated on 0021's
conv path being dtype-generic.
- Kernel side (separate patch, not this wiring): `ggml/src/ggml-cuda/gated_delta_net.cu` currently
takes `const float * curr_state` / `float * state_dst` and does `s_shard[r] = read_state[i]`
(line 102) — hardcoded f32. The bf16 build needs the dispatch to read `s0->type` and route a
bf16 load/store specialization; the gather kernel `gdn_gather_nonident_kernel` (line 7, `const
float * cache`) likewise needs a bf16 variant. The cparams wiring here only selects the cache
dtype; the kernel patch consumes it. Patches 0018 (in-place) / 0019 (gather) state asserts must be
relaxed from f32-only to {f32,bf16}.
- CPU mirror `ggml-cpu/ops.cpp` GDN path needs the same bf16 load/store for CI parity / fallback.
## 7. Validation gate
- KL < 1e-3 and PPL-delta within tolerance vs the f32-state build over a 256-token greedy run, per
model (dense q36-27b-nvfp4, MoE q36-35b-a3b-nvfp4). If a model fails, that model sets
`cache_type_ssm: f32` in its gallery YAML (per-model opt-out) — the global default stays bf16.
- Add a `test-backend-ops` case for the GDN recurrence with bf16 state (mirror the 0021 harness:
dense text md5 + MoE byte check) to lock the load→f32→store→bf16 contract.
---
# Appendix - label `upstream-bf16-precedent` (READ-ONLY research)
Precedent + numeric-safety justification for the §1-7 wiring above. Sources: paged dev tree
(`dgx.casa:~/llama-paged-dev`, branch `paged`) and the vLLM checkout
(`~/vllm-bench/.../site-packages/vllm`).
## A.1 Upstream llama.cpp: recurrent-cache f32 is HARDCODED (no f16/bf16 path), not a documented numeric guard
The asymmetry to override: the attention KV cache type is user-tunable; the recurrent state cache is not.
- KV cache: `llama_context_params.type_k/type_v` default `GGML_TYPE_F16`
(`src/llama-context.cpp:3468-3469`), `[EXPERIMENTAL]` in `include/llama.h:365-366`, plumbed from
user params (`attn_type_k = params.type_k`).
- Recurrent/SSM cache: `llama_memory_recurrent(... type_r, type_s ...)` and the hybrid wrappers take
the recurrent types as ctor args, but EVERY call site in `src/llama-model.cpp` passes the literal
`GGML_TYPE_F32` (2056-2057 pure-recurrent; 2098-2099 hybrid-iswa `recurrent_type_r/s`;
2117-2118 hybrid `recurrent_type_k/v`). No cparams field feeds these - compile-time constants.
So mamba/mamba2/rwkv/falcon-h1/nemotron-h/qwen3.5 ALL get f32 recurrent + conv state unconditionally.
- Alloc: `r = ggml_new_tensor_2d(ctx, type_r, ...)`, `s = ggml_new_tensor_2d(ctx, type_s, ...)`
(`src/llama-memory-recurrent.cpp:100-101`). No f16 branch anywhere.
Is f32 a deliberate numeric constraint? Structural, not documented:
- `ggml_ssm_conv` / `ggml_ssm_conv_update_inplace` HARD-ASSERT f32 on conv state/kernel/x_cur/dst
plus `nb[0]==sizeof(float)` (`ggml/src/ggml.c:5581-5584,5589,5597`). Conv path is f32-locked at the
builder.
- `ggml_ssm_scan` does NOT assert input state `s` dtype, but hardcodes its OUTPUT as
`GGML_TYPE_F32` (`ggml/src/ggml.c:5662`); scan kernels read `s` as `float *`.
- `ggml/src/ggml-cuda/gated_delta_net.cu` takes `const float * curr_state`, `float * state`,
`float * state_dst`; the per-(seq,head) shard `float s_shard[rows_per_lane]` is loaded/stored as raw
float (34-102). Same in `ggml-cpu/ops.cpp`.
- NO code comment anywhere justifies "f32 for precision". The constraint is that the ops were written
float-only. => recurrent-cache-f32 is a hardcoded implementation default to override deliberately:
the 3 literal `GGML_TYPE_F32` call-site pairs (gate behind `type_s`/`type_r` per §3), the
gated_delta_net.cu load/store convert, and KEEP conv f32 unless its asserts are extended (conv bytes
are negligible - only the temporal `type_s` state needs bf16).
## A.2 vLLM: GDN temporal state cache is bf16 BY DEFAULT, fp32-accumulated in-kernel (the exact design)
- Dtype: `qwen3_next.py:780-787` -> `MambaStateDtypeCalculator.gated_delta_net_state_dtype` ->
`_mamba_state_dtype` (`mamba_utils.py:84-96`):
`conv_state_dtype = get_kv_cache_torch_dtype(mamba_cache_dtype, model_dtype)`;
`if mamba_ssm_cache_dtype == "auto": temporal_state_dtype = conv_state_dtype`.
With both knobs default `"auto"`, `get_kv_cache_torch_dtype("auto", model_dtype)` returns
`model_dtype` (`torch_utils.py:293-297`) = bf16 for Qwen3-Next => BOTH conv and temporal state are
bf16 by default. Explicit opt-out: `--mamba-ssm-cache-dtype float32` (mirror of our f32 fallback).
- In-kernel numerics (decode), `fla/ops/fused_recurrent.py`:
`b_h = tl.load(p_h0).to(tl.float32)` (303) load bf16->fp32; q/k/v/g/beta `.to(tl.float32)` (309-318);
recurrence in fp32 `b_h*=exp(g); b_v-=sum(b_h*b_k); b_v*=beta; b_h+=b_v*b_k; b_o=sum(b_h*b_q)`
(327-331); `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))` (337) store fp32->bf16. Prefill chunk path
identical (`b_h=tl.zeros(...,tl.float32)`, `+= load().to(fp32)`, 102/120).
=> byte-for-byte the proposed llama lever: load bf16->f32, math in f32 (UNCHANGED order, matches
gated_delta_net.cu's v-g*kv -> *beta -> S-update -> S^T q), store f32->bf16; only the persisted cache
crosses the bf16 boundary, once per step.
- vLLM numeric guards: NONE beyond fp32 accumulation - no per-step renorm, no clamp, no Kahan. Optional
`use_qk_l2norm_in_kernel` normalizes q,k (keeps k unit-norm) but does not touch the state.
- KDA nuance: `kda_state_dtype` returns `(state_dtype, torch.float32)` - Kimi Delta Attention keeps an
fp32 secondary component. qwen3.5 is `gated_delta_net` (fully-bf16 temporal state), but this shows
vLLM judged an fp32 component necessary for one delta variant -> reinforces keeping the f32 toggle.
Verdict: vLLM's own GDN state cache is bf16, so bf16-state in llama is a FAIR equal-precision target,
not a regression vs the competitor. bf16 brings llama TO vLLM's precision.
## A.3 Numeric-safety assessment for bf16 gated-DeltaNet state
Update: `S <- S*diag(exp(g)) + beta * k (x) (v - S k)`, with
`g = -exp(A_log)*softplus(a+dt_bias) < 0` so `exp(g) in (0,1)` (strict geometric decay) and
`beta = sigmoid(.) in (0,1)`.
- Decay bounds error accumulation. bf16 = 8 bits of significand precision (7 stored + 1 implicit) -> per-element rel rounding
`eps ~= 2^-8 ~= 3.9e-3`. An error injected at step t is multiplied by `exp(g)<1` every later step ->
carry-error is a CONTRACTING geometric series bounded by ~`eps/(1-exp(g_mean))`, a small constant
multiple of one step's eps, NOT linear/unbounded. The recurrence is a contraction map - no
divergence. (The "per-step renorm" framing is not a literal renorm op in either codebase; the bound
IS the `exp(g)<1` contraction + `beta in (0,1)` + unit-norm k from the l2norm capping `||k (x) delta||`.)
- fp32 register accumulation is the minimal-error placement: load bf16->f32, do `S k`, `v-g*kv`,
`*beta`, the outer-product accumulate and `S^T q` ALL in fp32 (UNCHANGED math), store f32->bf16 once.
Identical to vLLM, which ships this as the Qwen3-Next default with no reported quality regression -
the strongest empirical safety evidence.
- Dominant risk is small KL/PPL drift, not instability. Gate KL<1e-3 + PPL-delta over 256-tok greedy
vs the f32 build; fall back to f32 via the §3c toggle if it fails. Keep conv state f32 (ssm_conv* is
f32-locked, conv bytes negligible) - no reason to risk it.
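The contraction argument can be checked on a toy scalar recurrence. The sketch below is an assumption-laden stand-in, not the GDN update: one state element follows `s <- decay*s + u_t`, once with an f32-persisted state and once with the state rounded to bf16 at every store. `simulate_bf16_drift` and `drift_result` are hypothetical names, and the reported bound `eps * max|s| / (1 - decay)` ignores the small f32 rounding noise in the reference.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Round an f32 value to the nearest bf16 value (ties to even), returned as f32.
// Inputs are assumed finite in this sketch.
static inline float round_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u += 0x7fffu + ((u >> 16) & 1u);
    u &= 0xffff0000u;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

struct drift_result {
    double max_err; // largest |bf16-state - f32-state| seen over the run
    double bound;   // eps * max|state| / (1 - decay)
};

// Scalar stand-in for one state element: s <- decay*s + u_t, with 0 < decay < 1.
inline drift_result simulate_bf16_drift(float decay, int steps) {
    assert(decay > 0.0f && decay < 1.0f);
    const double eps = std::ldexp(1.0, -8); // bf16 unit roundoff, 2^-8
    float  s_f32   = 0.0f; // reference: state persisted in f32
    float  s_bf16  = 0.0f; // state persisted in bf16, math still in f32
    double max_err = 0.0;
    double max_abs = 0.0;
    for (int t = 0; t < steps; ++t) {
        const float u = std::sin(0.37f * (float) t); // deterministic input in [-1, 1]
        s_f32  = decay * s_f32 + u;
        s_bf16 = round_to_bf16(decay * s_bf16 + u);  // rounding happens once, at the store
        max_err = std::max(max_err, (double) std::fabs(s_bf16 - s_f32));
        max_abs = std::max(max_abs, (double) std::fabs(s_bf16));
    }
    return { max_err, eps * max_abs / (1.0 - (double) decay) };
}
```

Running it for more steps does not widen the gap, which is the point of the bounded-series claim. As `decay` approaches 1 the bound grows like `1/(1-decay)`, so slowly decaying heads are where drift would show first.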
Bottom line: (1) upstream recurrent-cache f32 is a hardcoded implementation default (conv asserts f32;
scan/gdn kernels float-only; no numeric-rationale comments) - override via §3's `type_s`/`type_r`
plumbing, bf16-default + f32 opt-out, touching only the temporal state. (2) vLLM's GDN temporal state
is bf16 by default (auto->model_dtype), fp32-accumulated, with `--mamba-ssm-cache-dtype float32`
opt-out - a fair equal-precision target. (3) bf16 GDN state is numerically safe: the `exp(g)<1` decay contracts
rounding to a bounded geometric series, fp32 registers confine bf16 rounding to one per-step cache
write, and vLLM ships this exact config in production. KL<1e-3 / PPL gate + f32 fallback is the right
safety net.
---
# PART B - label `bf16-kernel-plumbing` (the kernel/op edits §6 defers)
Part A wires the cache DTYPE selection (cparams -> memory_params -> `s_l`/`r_l` alloc). Part B is the
consuming half: every kernel/op that reads or writes those caches, and the exact
load->f32->compute(f32, UNCHANGED)->store->bf16 conversion points. Traced against HEAD-after-0021 on
`dgx.casa:~/llama-paged-dev` (branch `paged`).
## B.1 Complete set of state-cache READERS/WRITERS (one op family only)
`s_l` (ssm_states_all) reaches compute through exactly ONE op family - the gated-DeltaNet recurrence -
via a strided VIEW from `build_rs` (graph base) that carries the cache dtype. The cache-touching srcs:
- `src[5]` `src_state` - the s0 read view (the cache, or the 0019 gather scratch).
- `src[6]` `src_state_dst` - the 0018 in-place write-back target (a view INTO the cache).
- `src[7]` `ids` - I32 seq map for the 0019 gather (no dtype concern).
No other op reads `s_l`. `build_rs` only re-strides (dtype rides through); the 0019
`gdn_gather_nonident_kernel` is the only other reader. So bf16 awareness localizes to: the 3 ggml.c
builders (asserts), cuda `supports_op`, `gated_delta_net.cu`, and the CPU mirror in `ops.cpp`.
## B.2 ggml.c builder asserts (relax F32-only -> {F32,BF16})
File `ggml/src/ggml.c`:
- `ggml_gated_delta_net` (6287): line 6308 `GGML_ASSERT(state->type == GGML_TYPE_F32)` ->
`... == GGML_TYPE_F32 || ... == GGML_TYPE_BF16`.
- `ggml_gated_delta_net_inplace` (6349): same `state` assert (~6366-6370) + any `src_state_dst`
type assert -> allow BF16.
- `ggml_gated_delta_net_inplace_ids` (6417): same `state` + `src_state_dst` relax.
- KEEP the op OUTPUT scratch f32: line 6327 `ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne)` stays. The
`[attn_scores | new_states]` output is a TRANSIENT graph tensor; the bf16 persisted write goes
through `src_state_dst`/`state` (in-place). The non-in-place fallback `cpy`s scratch->cache and
`ggml_cpy` already type-converts f32->bf16.
## B.3 CUDA supports_op
`ggml/src/ggml-cuda/ggml-cuda.cu`, `supports_op` case `GGML_OP_GATED_DELTA_NET` (3096): allow a BF16
`src[5]`/`src[6]` (add BF16 to the permitted state-src types).
## B.4 CUDA recurrence kernel `ggml/src/ggml-cuda/gated_delta_net.cu`
Template the kernel + gather + launch on the CACHE-pointer dtype (`bool STATE_BF16`); keep f32 valid so
the f32 opt-out is the SAME kernel. Include `<cuda_bf16.h>`; convert with `__bfloat162float` /
`__float2bfloat16`. ALL recurrence math (lines 106-200) stays in f32 registers, byte-for-byte UNCHANGED.
- Signatures: line 39 `const float * curr_state` -> `const STATE_T * curr_state`; line 57
`float * state_dst` -> `STATE_T * state_dst`; `read_state` (85-88) -> `const STATE_T * read_state`.
- LOAD (s0 -> f32 regs), lines 100-103:
`if constexpr (STATE_BF16) s_shard[r]=__bfloat162float(read_state[i]); else s_shard[r]=read_state[i];`
`s_shard` stays `float`.
- STORE-BACK (f32 regs -> bf16 cache):
- non-keep final write (203-208): `state[col*S_v+i] = STATE_BF16 ? __float2bfloat16(s_shard[r]) : s_shard[r];`
- keep_rs_t snapshot (191-200) targets `dst + attn_score_elems` = the f32 OUTPUT scratch (kept f32
per B.2); this is the prefill/rollback path (n_rs_seq>0), decode is `!keep_rs_t`. KEEP it f32.
Only the CACHE pointers (`curr_state` src[5], `state_dst` src[6]) are STATE_T.
- 0019 gather `gdn_gather_nonident_kernel` (7-30): `const float * cache` -> `const STATE_T * cache`;
`dst[i] = STATE_BF16 ? __bfloat162float(src[i]) : src[i];`. Keep `scratch` OUTPUT f32 (pool alloc
326-333 stays `ggml_cuda_pool_alloc<float>`) so the non-identity read path feeds f32; the identity
in-place path reads bf16 directly. `read_state`'s dtype follows the branch that selected it.
- Dispatcher (270-353):
- casts 299/323 `(const float *)src_state->data`, 312 `(float *)src_state_dst->data` ->
`(const nv_bfloat16 *)` / `(nv_bfloat16 *)` when `type == GGML_TYPE_BF16`; branch launch on type.
- asserts 309-311: `src_state_dst->type == GGML_TYPE_F32` -> allow BF16; `nb[0] == sizeof(float)` ->
`== ggml_type_size(type)`; `nb[1] == S_v*S_v*H*sizeof(float)` -> `... * ggml_type_size(type)`.
- q/k/v/g/beta strides (348-353) are ACTIVATION (f32) strides - UNCHANGED. Kernel indexes state by
ELEMENT (`col*S_v+i`), so the typed pointer halves the byte stride implicitly.
- `launch_gated_delta_net` (212-) + S_v switch (230-260): thread `STATE_BF16` into the
`gated_delta_net_cuda<S_v, KDA, keep_rs_t, STATE_BF16>` instantiations.
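The `STATE_BF16` templating can be illustrated on the host. The following is a simplified C++17 analogue, not the CUDA kernel: `bf16_t` stands in for `nv_bfloat16`, `state_t` plays the role of the cache-pointer type, and `state_step` replaces the real recurrence with a toy decay-and-add so that only the load and store boundaries differ between the two instantiations. All of these names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

struct bf16_t { uint16_t bits; }; // hypothetical host stand-in for nv_bfloat16

static inline float bf16_to_f32(bf16_t h) {
    const uint32_t u = (uint32_t) h.bits << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

// Round-to-nearest-even; inputs are assumed finite in this sketch.
static inline bf16_t f32_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u += 0x7fffu + ((u >> 16) & 1u);
    return bf16_t{ (uint16_t) (u >> 16) };
}

// Cache element type selected by the template flag.
template <bool STATE_BF16>
using state_t = std::conditional_t<STATE_BF16, bf16_t, float>;

// Toy step over n state elements: s <- s*decay + upd. Only the load and the
// store depend on STATE_BF16; the arithmetic in between is the same f32 code.
template <bool STATE_BF16>
void state_step(const state_t<STATE_BF16> * read_state, state_t<STATE_BF16> * state_dst,
                const float * upd, float decay, int n) {
    assert(decay > 0.0f && decay <= 1.0f);
    for (int i = 0; i < n; ++i) {
        float s; // register copy is always f32
        if constexpr (STATE_BF16) {
            s = bf16_to_f32(read_state[i]);
        } else {
            s = read_state[i];
        }
        s = s * decay + upd[i];
        if constexpr (STATE_BF16) {
            state_dst[i] = f32_to_bf16(s);
        } else {
            state_dst[i] = s;
        }
    }
}
```

Because the f32 instantiation compiles to the original load and store, the f32 opt-out stays the same kernel body rather than a separate code path.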
## B.5 CPU reference `ggml/src/ggml-cpu/ops.cpp` (parity / CI / CPU-offload fallback)
`ggml_compute_forward_gated_delta_net_one_chunk` (10662) + `_f32` (10847), dispatch (10915):
- LOAD: 10726 `const float * state_in_base = (const float *)src_state->data`, the rs_head/gather read
10744-10745, and 10891 `const float * cache = (const float *)src_state->data` -> when
`src_state->type == GGML_TYPE_BF16`, read `GGML_BF16_TO_FP32(((const ggml_bf16_t*)..)[..])`.
- STORE: 10758-10762 `inplace_state_base = (float *)src_state_dst->data` -> store
`((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`; relax asserts `nb[0]`/`nb[1]`
to `ggml_type_size(type)`. Keep ONE impl, branch load/store on `src_state->type`.
## B.6 Conv state (`r_l`) -> bf16 : DEFER (optional, low-value, prefill snag)
Conv state ~12.6 MB total, LAUNCH-bound (0021 removed concat/cpy); bf16 saves ~0 ms, adds complexity:
- DECODE (0021 fused) `ggml_ssm_conv_update_inplace` (ggml.c:5566) asserts 5581-5584
`conv_states/conv_state_dst->type == F32`; CUDA `ssm_conv_update_f32` (ssm-conv.cu:131) + CPU
`ggml_compute_forward_ssm_conv_update_f32` (ops.cpp:9471) read/write f32. To bf16: relax the 2
asserts, template tap LOAD (`__bfloat162float`) + ring write-back STORE (`__float2bfloat16`), cast
`conv_states`/`conv_state_dst` ptrs in both dispatchers.
- PREFILL (non-fused) `build_conv_state` (delta-net-base.cpp:449-524): `conv_states=build_rs(...)`
(bf16 view) then `ggml_concat(conv_states, qkv_mixed, 0)` (472). **`ggml_concat` requires same type**
- qkv_mixed is f32 -> bf16 conv cache BREAKS the prefill concat (needs an f32 staging view of the
taps first; the ring write-back `ggml_cpy` at 496/520 already converts; concat is the blocker).
RECOMMENDATION: keep `type_r` = F32 in v1 (matches Part A §6). Ship `type_s`=BF16 first; `type_r`=BF16
is a follow-up that adds the f32 staging view.
## B.7 Confirm UNTOUCHED: full-attn KV-cache (16 layers) + FP4 weights
- KV-cache: the `llama_kv_cache` half of `llama_memory_hybrid`, alloc with `params.type_k/type_v`
(llama-model.cpp 2030-2031 / 2089-2090 / 2108-2109). Part A changes ONLY the recurrent half's
`type_s`; `attn_type_k`/`attn_type_v` untouched. Paged-KV gather (0003-0011), flash-attn,
`type_k()/type_v()` accessors (kv-cache.h 161-162/381-382) unaffected.
- FP4 weights (nvfp4 dense + MoE): model weights, separate from runtime state caches; recurrence/conv
kernels read STATE not weights. FP4 GEMM (0017/0020) untouched.
- Activations (q/k/v/g/beta, attn-out, z) stay f32 (<1% of bytes). Only persisted `s_l` rows narrow.
## B.8 Conversion-point cheat-sheet (the ONLY numeric-precision boundaries)
1. CUDA load `gated_delta_net.cu` ~102: `s_shard[r] = __bfloat162float(read_state[i])`.
2. CUDA store ~207: `state[col*S_v+i] = __float2bfloat16(s_shard[r])`.
3. CUDA gather ~20: `dst[i] = __bfloat162float(src[i])` (bf16 cache -> f32 scratch).
4. CPU load `ops.cpp` ~10726/10744/10891: `GGML_BF16_TO_FP32(((ggml_bf16_t*)src_state->data)[..])`.
5. CPU store ~10762: `((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`.
Everything between (1)/(4) and (2)/(5) is f32-register math, identical to today's f32 kernel. Only the
persisted cache rounds to bf16 once per step; the `exp(g)<1` geometric decay bounds the rounding.
## B.9 File-by-file edit table (Part B)
| File | Edit |
|---|---|
| `ggml/src/ggml.c` | relax `state`/`src_state_dst` F32 asserts -> allow BF16 in the 3 GDN builders (6308, ~6370, ~6430); keep output scratch F32 (6327) |
| `ggml/src/ggml-cuda/ggml-cuda.cu` | `supports_op` GATED_DELTA_NET (3096): allow BF16 state src |
| `ggml/src/ggml-cuda/gated_delta_net.cu` | template kernel+gather+launch on STATE_BF16; `__bfloat162float` load / `__float2bfloat16` store; cast src_state/src_state_dst ptrs; relax dispatcher asserts (309-311) to `ggml_type_size(type)`; keep gather scratch + keep_rs snapshot f32 |
| `ggml/src/ggml-cpu/ops.cpp` | bf16 load/store branch in GDN ref (10726/10744/10758-10762/10891); relax asserts |
| `tests/test-backend-ops.cpp` | add BF16-state GATED_DELTA_NET case (CUDA bf16 vs CPU bf16) |
| (deferred) conv: `ggml.c:5581-84`, `ssm-conv.cu:131`, `ops.cpp:9471`, `delta-net-base.cpp:472` | v2 only - f32 staging before prefill concat |
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,37 +0,0 @@
# bf16 SSM state - build/de-risk progress
DECISION (user override of plan): f32 DEFAULT + bf16 OPT-IN. type_s default = GGML_TYPE_F32.
Conv state (type_r) stays F32. Recurrence math stays f32 (load->f32, store->cache dtype).
## STEP 1 (dtype-generic kernel + op) - DONE + DE-RISK GATE 1 PASSED
Files (DGX ~/llama-paged-dev):
- ggml/src/ggml.c: 3 GDN builder asserts F32 -> {F32,BF16}; state_dst nb[0] -> ggml_type_size.
- ggml/src/ggml-cuda/gated_delta_net.cu: gdn_state_t<STATE_BF16> alias; gather + recurrence kernel +
launchers templated on STATE_BF16; __bfloat162float load / __float2bfloat16 store; gather scratch
shares cache dtype (uniform read); dispatcher detects src_state->type, GDN_DISPATCH macro 8-way.
- ggml/src/ggml-cpu/ops.cpp: byte-based read base + read_bf16 load conversion; bf16 in-place
convert-store after token loop; bf16 gather widening; relaxed asserts to ggml_type_size.
- ggml/src/ggml-cpu/ggml-cpu.c: work-size +S_v*S_v for bf16 in-place.
- tests/test-backend-ops.cpp: state_type field on test_gated_delta_net; 16 bf16 cases (hs 64+128 x
decode/prefill/keep_rs x kda).
GATE 1: build clean (EXIT=0); test-backend-ops -o GATED_DELTA_NET = 52/52 OK (CUDA bf16 vs CPU bf16).
## STEP 2/3/4 (cparams opt-in wiring) - DONE
f32 DEFAULT everywhere; --cache-type-ssm bf16 opts in.
- llama.h/llama-context.cpp/llama-memory.h/llama-model.cpp: type_r/type_s plumbed, DEFAULT F32.
- common.h/common.cpp/arg.cpp: cache_type_ssm/conv (F32 default) + --cache-type-ssm/-conv CLI.
- llama-memory-recurrent.cpp: convert-on-mismatch f32<->bf16 (r and s) via ggml_*_row API.
## EXTRA FIX (plan B.1 miss): build_rs rs_zero clear used ggml_scale (f32-only) -> bf16 abort.
- llama-graph.cpp: f32 keeps ggml_scale_inplace (bit-exact); non-f32 uses ggml_fill_inplace.
- fill.cu + ops.cpp + ggml.c: added BF16 to ggml_fill. get_rows/cpy already bf16-capable.
## DE-RISK GATE - ALL PASS
- build clean EXIT=0 (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench).
- test-backend-ops -o GATED_DELTA_NET = 52/52 (16 bf16 cases: decode/prefill/keep_rs x kda x hs64/128).
- f32 default md5: dense 5951a5b4... MoE 07db32c2... == 0023 (non-invasive; also --cache-type-ssm f32 matches).
- bf16 opt-in: coherent "Paris", no crash; byte-identical to f32 on 48-tok sample (Same-top-p 100%).
- diff backup: ~/llama-paged-dev/BF16_SSM_STATE.diff (1003 lines, 15 files). NOT committed/pushed.
READY FOR C.2 KL GATE (GateBench).


@@ -1,199 +0,0 @@
# bf16 SSM-state cache - BUILD + DE-RISK RESULTS
Label: bf16-build-derisk (the GPU build agent). Lands on top of patch 0023 (HEAD f7409c2) on the DGX
dev tree `~/llama-paged-dev` (branch `paged`). Status: **DE-RISK GATE PASSED, READY FOR THE C.2 KL
GATE (GateBench).** Work is built into `build-cuda` and saved as `~/llama-paged-dev/BF16_SSM_STATE.diff`
(uncommitted on the dev tree; the 0024 ship/shelve decision is gated on GateBench's KL results).
## DECISION applied (user override of the plan): f32 DEFAULT + bf16 OPT-IN
The plan defaulted to bf16; the user wants f32 to stay the bit-exact DEFAULT and bf16 to be opt-IN via
`--cache-type-ssm bf16`. So `type_s` default = `GGML_TYPE_F32`, `type_r` default = `GGML_TYPE_F32`
(conv stays f32 always, per plan C.0). Only the persisted RECURRENT (temporal) state narrows to bf16
when opted in; recurrence math stays f32 (load->f32, compute f32, store->cache dtype). The opt-in is
non-invasive: with no flag the output is byte-identical to 0023.
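
A minimal sketch of the load->f32 / compute f32 / store->cache-dtype contract described above, in NumPy rather than the real CUDA/CPU kernels. The helper names, the round-to-nearest-even conversion, and the toy `decay*state + update` step are assumptions for illustration only (the real op is the gated-DeltaNet recurrence); NaN inputs are not handled.

```python
import numpy as np

def f32_to_bf16_bits(x):
    """Round f32 values to bf16 (ties to even); return uint16 bit patterns."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounding_bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding_bias) >> 16).astype(np.uint16)

def bf16_bits_to_f32(b):
    """Widen bf16 bit patterns back to f32 (exact, no rounding)."""
    return (np.asarray(b, dtype=np.uint16).astype(np.uint32) << 16).view(np.float32)

def recurrent_step(cache, update, decay, cache_is_bf16):
    """One toy state update that honors the cache dtype (hypothetical helper)."""
    # LOAD: widen the persisted state to f32 if it was stored as bf16.
    state = bf16_bits_to_f32(cache) if cache_is_bf16 else cache
    # COMPUTE: the arithmetic itself always runs in f32.
    state = (np.float32(decay) * state + update).astype(np.float32)
    # STORE: narrow only when writing back to a bf16 cache.
    return f32_to_bf16_bits(state) if cache_is_bf16 else state
```

With `cache_is_bf16=False` the step never touches the conversion helpers, which mirrors why the f32 default can stay byte-identical to the baseline.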
## Files changed (15; full diff = ~/llama-paged-dev/BF16_SSM_STATE.diff, 1003 lines)
STEP 1 - dtype-generic kernel + op (the de-risk core):
- `ggml/src/ggml.c` - 3 GDN builder `state`/`state_dst` asserts F32 -> {F32,BF16}; `state_dst->nb[0]`
`sizeof(float)` -> `ggml_type_size(state_dst->type)`. Also relaxed the `ggml_fill` builder assert to
allow BF16 (needed by the rs_zero clear; see below).
- `ggml/src/ggml-cuda/gated_delta_net.cu` - `gdn_state_t<STATE_BF16>` alias (`nv_bfloat16`/`float`);
recurrence kernel + gather kernel + both launchers + the dispatcher templated on `STATE_BF16`.
LOAD `__bfloat162float`, STORE `__float2bfloat16`; the gather scratch is allocated at the CACHE
dtype so `read_state` is a single uniform dtype (no mixed-dtype read path - eliminates the plan-R2
landmine). The keep_rs snapshot + the non-in-place final write stay f32 (op output scratch); the
bf16 store happens ONLY on the in-place cache path. `supports_op` already returned `true`
unconditionally for GATED_DELTA_NET, so no change there.
- `ggml/src/ggml-cpu/ops.cpp` - byte-based prior-state read base + `read_bf16` load conversion
(`GGML_BF16_TO_FP32`); bf16 in-place convert-store after the per-(head,seq) token loop
(`GGML_FP32_TO_BF16`); bf16-widening non-identity gather; relaxed `nb[]` asserts to
`ggml_type_size`. Added a `ggml_compute_forward_fill_bf16` + dispatch case.
- `ggml/src/ggml-cuda/fill.cu` - BF16 case in the fill kernel switch.
- `ggml/src/ggml-cpu/ggml-cpu.c` - GDN work-size adds the extra `S_v*S_v` f32 buffer when the cache is
bf16 in-place (mirror of `need_work` in ops.cpp).
- `tests/test-backend-ops.cpp` - `state_type` field on `test_gated_delta_net`; 16 bf16-state cases
(head_size 64 + 128 x {decode, multi-token prefill 33/64/100, keep_rs_t K=4} x kda 0/1, n_seqs 1/2).
STEP 2/3/4 - cparams opt-in wiring (f32 DEFAULT):
- `include/llama.h` - `type_r`/`type_s` in `llama_context_params` (adjacent to type_k/type_v).
- `src/llama-context.cpp` - default-params `type_r = type_s = GGML_TYPE_F32`; `params_mem` passes them.
- `src/llama-memory.h` - `type_r`/`type_s` in `llama_memory_params`.
- `src/llama-model.cpp` - the 3 hardcoded `GGML_TYPE_F32` recurrent ctor pairs (recurrent /
hybrid_iswa / hybrid = the qwen35/qwen35moe path) now pass `params.type_r` / `params.type_s`.
- `src/llama-memory-recurrent.cpp` - back-compat: `state_read_data` converts f32<->bf16 on type
mismatch (helper `recurrent_read_convert_rows` via the public `ggml_bf16_to_fp32_row` /
`ggml_fp32_to_bf16_row`) instead of failing, for both r and s; lets an f32-saved session restore
into a bf16 cache and vice versa.
- `src/llama-graph.cpp` - `build_rs` rs_zero clear: f32 keeps the exact `ggml_scale_inplace(.,0)` op
(bit-exactness); bf16 (and any non-f32) state uses `ggml_fill_inplace(.,0)` (CUDA scale is f32-only;
this was the one extra state-touching op the plan's "one op family" claim missed). get_rows + cpy
on the extra-states path already support bf16, so no change needed there.
- `common/common.h` / `common/common.cpp` / `common/arg.cpp` - `cache_type_ssm` / `cache_type_conv`
(default F32) + `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv` CLI (reusing the
existing `kv_cache_type_from_str`, which already maps `f32`/`bf16`).
## DE-RISK GATE - ALL PASS
1. **Build clean** (build-cuda, CUDA arch 121): EXIT=0 for ggml/ggml-cuda/ggml-cpu/llama/llama-common
and the binaries (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench).
2. **test-backend-ops -o GATED_DELTA_NET = 52/52 PASS** (CUDA backend vs CPU reference). Includes all
16 new bf16-state cases (CUDA bf16 vs CPU bf16) covering decode (n_tokens==1), multi-token
prefill/chunk (33/64/100), and keep_rs_t (K=4), with kda on/off and head_size 64 + 128 (production
S_v). The bf16 op test is the deterministic R2 de-risk for the load/compute/store contract.
3. **f32-default md5 == 0023 baseline (opt-in is non-invasive):**
- dense q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == 0023 (no flag AND `--cache-type-ssm f32`)
- MoE q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == 0023
Command: `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`.
4. **bf16 opt-in coherence + engaged (dense, `--cache-type-ssm bf16`):** no crash; coherent + on-topic.
- 48-tok greedy ("The capital of France is"): "**Paris**." - byte-identical to f32 (md5 5951a5b4...),
i.e. Same-top-p = 100% over that short sample (the g<1 decay bounds the per-step rounding so the
argmax trajectory is unchanged at short length).
- 256-tok greedy ("Explain how a transformer LM generates text, step by step"): fluent, well-structured
step-by-step explanation, and the bf16 md5 (`fc82b4cd44f8ec999c3b6843eb3f8c61`) **DIVERGES** from
f32 (`554cc667a2e62a47c34a999a127ac7e5`) - definitive proof that bf16 is genuinely ENGAGED (not a
silent f32 fallback) and behaves as expected (non-bit-exact, coherent). The per-token divergence
is exactly what the C.2 teacher-forced KL gate quantifies.
- Independent proof bf16 is allocated: BEFORE the build_rs fill fix, decode aborted in
`ggml_cuda_op_scale` on the recurrent-state tensor - an f32 cache would never have reached that
bf16-only failure, so the opt-in demonstrably allocates bf16. Wiring is also directly traceable:
`--cache-type-ssm bf16` -> cache_type_ssm -> cparams.type_s -> params_mem.type_s -> the
llama_memory_hybrid recurrent `s_l` alloc.
CONFIRM: ready for the C.2 KL-divergence + PPL-delta + long-context drift gate (GateBench).
## A landmine fixed beyond the plan (record for the gate/ship agents)
The plan B.1 asserted `s_l` reaches compute through ONLY the gated-DeltaNet op. It also flows through
`build_rs`: (a) the rs_zero restart-slot clear used `ggml_scale_inplace(.,0)`, and `ggml_cuda_op_scale`
hard-asserts f32 -> the first bf16 decode aborted in scale. Fixed by routing the bf16 clear through
`ggml_fill` (with a new bf16 fill path). (b) the extra-states `ggml_get_rows` + `ggml_cpy` already
support bf16 (verified) - no change. This is exactly the kind of non-decode state path the de-risk
was meant to surface; it is now covered end-to-end (the bf16 coherence run exercises rs_zero on the
fresh-sequence prompt).
## NOT done in this phase (next agents)
- STEP 5 LocalAI gRPC/YAML (`CacheTypeSSM`/`CacheTypeConv` proto + grpc-server + model_config +
options + meta registry) - needed to force f32/bf16 from a gallery YAML; not on the de-risk gate.
- STEP 6 capability fallback (device-match probe to demote bf16->f32 before alloc on a device lacking
the bf16 GDN/fill path, e.g. CPU-offloaded GDN). The CPU reference DOES implement bf16 (load/store/
gather/fill) so a CPU fallback is numerically correct today, but the probe is the clean guard.
- The C.2 KL/PPL/long-context gate + the C.3 nsys per-call bench - GateBench (GPU gate agent, runs
sequentially after this build phase; binaries are pre-built in build-cuda).
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# C.2/C.3 ACCEPTANCE GATE + PARITY BENCH RESULTS (label bf16-gate-bench)
Status: **GATE FAILS -> NO-SHIP. KEEP SHELVED. patch 0024 NOT created; nothing committed.**
All runs on `dgx.casa` build-cuda binaries, wikitext-2-raw test, `-ngl 99 -fa on --seed 1`.
Corpus: `~/bench/klgate/wikitext-2-raw/wiki.test.raw` (symlink to `~/wikitext-2-raw`, ~280k tokens).
## 1. KL acceptance gate
### Noise floor (f32-vs-f32, c256 chunks32) - the non-determinism floor
| model | Mean KLD | Max KLD | Same-top-p | ln(PPL(Q)/PPL(base)) |
|---|---|---|---|---|
| dense q27 | -1.3e-5 | 1e-6 | 100.000% | +0.001256 |
| MoE q35 | ~0 (-3e-7) | 5.9e-5 | 100.000% | +0.000607 |
### Headline 256-token gate (bf16-vs-f32, c256 chunks32) - PASSES, but vacuously
bf16 c256 is **byte-identical to the floor** for both models (Mean KLD -1.3e-5 dense / ~0 MoE,
Same-top-p 100%, identical PPL). Reason: a single 256-token window is processed in ONE ubatch
(ub512 > 256), so the recurrent state is written to the bf16 cache only ONCE at the chunk end and is
NEVER read back to produce that window's logits. The 256-token gate therefore does NOT exercise the
bf16 round-trip at all - it is blind to the actual cost.
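
A small sketch of the boundary arithmetic behind that argument, assuming a window of `n_ctx` tokens is split into ubatches of `n_ubatch` tokens and the state is written to the cache and read back once per internal boundary. The function name is hypothetical.

```python
import math

def internal_ubatch_boundaries(n_ctx: int, n_ubatch: int) -> int:
    """Count the cache write/read-back handoffs inside one evaluation window."""
    return math.ceil(n_ctx / n_ubatch) - 1

# c256 with ub512: the whole window fits in one ubatch, so no read-back.
headline = internal_ubatch_boundaries(256, 512)
# c1024 at the ubatch sizes used in the discrimination run.
sweep = {ub: internal_ubatch_boundaries(1024, ub) for ub in (1024, 512, 256, 64)}
```

Under this assumption `headline` is 0 and `sweep` gives the 0 / 1 / 3 / 15 boundary counts quoted for ub1024 / ub512 / ub256 / ub64.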
### Long-context drift sweep (bf16-vs-f32, chunks8) - FAILS HARD for BOTH models
| model | ctx | Mean KLD | Same-top-p | Max KLD | 99.9% KLD |
|---|---|---|---|---|---|
| dense | 256 | -1.3e-5 | 100.000% | 1e-6 | 0 |
| dense | 1024 | 0.0647 | 91.54% | 20.17 | 7.69 |
| dense | 2048 | 0.1739 | 90.65% | 24.89 | 18.18 |
| dense | 4096 | 0.1258 | 90.40% | 26.03 | 17.22 |
| MoE | 256 | ~0 | 100.000% | 5.6e-5 | 4.9e-5 |
| MoE | 1024 | 0.0472 | 90.04% | 5.13 | 0.95 |
| MoE | 2048 | 0.0442 | 90.84% | 1.85 | 1.11 |
| MoE | 4096 | 0.0422 | 89.97% | 2.76 | 0.83 |
Gate thresholds: Mean KLD < 1e-3; Same-top-p >= 99.5%; |ln(PPL ratio)| < 0.005;
drift MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0%.
Result: 256-tok PASS (vacuous); **drift FAIL by ~50-170x on Mean KLD and ~9 pts on Same-top-p**
(Same-top-p ~90% = roughly 1 token in 10 flips its argmax at >=1024 ctx). FAIL for both dense and MoE.
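
The thresholds above can be sketched as a checker. This is an illustration, not the gate script that was actually run: the class and function names are hypothetical, the `|ln(PPL ratio)|` criterion is omitted because the drift table does not report it, and the MoE "~0" entry is entered as 0.0.

```python
from dataclasses import dataclass

@dataclass
class DriftRow:
    ctx: int
    mean_kld: float
    same_top_p: float  # percent

MEAN_KLD_MAX = 1e-3
SAME_TOP_P_MIN = 99.5
DRIFT_RATIO_MAX = 1.5
DRIFT_SAME_TOP_P_MIN = 99.0

def check_gate(rows):
    """Return (headline_ok, drift_ok) from the ctx=256 and ctx=4096 rows."""
    by_ctx = {r.ctx: r for r in rows}
    short, long_ = by_ctx[256], by_ctx[4096]
    headline_ok = (short.mean_kld < MEAN_KLD_MAX
                   and short.same_top_p >= SAME_TOP_P_MIN)
    # Applied literally: with a ~0 (or negative) 256-token baseline the ratio
    # test is very strict, but the Same-top-p test fails on its own as well.
    drift_ok = (long_.mean_kld <= DRIFT_RATIO_MAX * short.mean_kld
                and long_.same_top_p >= DRIFT_SAME_TOP_P_MIN)
    return headline_ok, drift_ok

dense = [DriftRow(256, -1.3e-5, 100.0), DriftRow(4096, 0.1258, 90.40)]
moe = [DriftRow(256, 0.0, 100.0), DriftRow(4096, 0.0422, 89.97)]
dense_result = check_gate(dense)
moe_result = check_gate(moe)
```

Both models come out as headline pass, drift fail, matching the result line above.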
### Discrimination (is it a bug or genuine bf16?) - dense c1024 chunks8
- **f32 long-context floor c1024**: Mean KLD -1.2e-5, Same-top-p 100% -> the bf16 divergence is REAL
signal, not a long-context measurement artifact.
- **bf16 KLD is invariant to ubatch-boundary count** (= the cross-ubatch state read-back frequency):
ub1024 (0 internal boundaries) 0.0642 / 91.19%; ub512 (1) 0.0647 / 91.54%; ub256 (3) 0.0639 /
91.17%; ub64 (15) 0.0682 / 90.97%. Flat. -> The error is INTRINSIC to bf16 over the long
recurrence INSIDE a single op call, NOT a chunked-prefill/keep_rs/gather handoff bug (R2 ruled out;
test-backend-ops 52/52 already passed). The error PLATEAUS with context (contraction), i.e. it is
bounded but LARGE: the gated-DeltaNet has long-memory heads (exp(g) ~ 1), so the g<1 decay does NOT
tightly contract the per-step bf16 rounding the way the plan's A.3 optimistically assumed.
Note: this is exactly vLLM's own precision (vLLM's GDN temporal cache is bf16). vLLM users never see
this delta because vLLM has no f32 reference; our gate exposes the full bf16-vs-f32 gap because our
f32 path is a HIGHER bar than vLLM.
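
A toy illustration of why a decay close to 1 does not contract bf16 store rounding tightly. It is a scalar-per-cell recurrence `s <- decay*s + u` with f32 math and a bf16 store each step, NOT the gated-DeltaNet kernel; the function names, the Gaussian inputs, and the two decay values are assumptions chosen only to contrast a short-memory and a long-memory cell.

```python
import numpy as np

def round_to_bf16(x):
    """Round f32 values to the nearest bf16 (ties to even), widened back to f32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bits = (bits + ((bits >> 16) & 1) + np.uint32(0x7FFF)) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

def relative_state_error(decay, n_steps=4096, n_cells=256, seed=0):
    """Mean |bf16-cached state - f32 state| / mean |f32 state| after n_steps."""
    rng = np.random.default_rng(seed)
    s_f32 = np.zeros(n_cells, dtype=np.float32)
    s_bf16 = np.zeros(n_cells, dtype=np.float32)  # bf16 values held widened
    decay = np.float32(decay)
    for _ in range(n_steps):
        u = rng.standard_normal(n_cells).astype(np.float32)
        s_f32 = decay * s_f32 + u
        s_bf16 = round_to_bf16(decay * s_bf16 + u)  # f32 compute, bf16 store
    return float(np.mean(np.abs(s_bf16 - s_f32)) / np.mean(np.abs(s_f32)))

err_short_memory = relative_state_error(0.5)
err_long_memory = relative_state_error(0.999)
```

In this toy the error stays bounded in both cases, but the long-memory cell carries a noticeably larger relative error because each step's rounding is forgotten more slowly.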
## 2. Parity bench - the perf lever IS real
### nsys recurrence per-call (graphs OFF, npp4 ntg32 npl128) - gated_delta_net_cuda Avg
| model | f32 ms/call | bf16 ms/call | delta |
|---|---|---|---|
| dense q27 | 3.381 | 1.726 | **-49.0%** |
| MoE q35 | 2.245 | 1.153 | **-48.6%** |
The predicted 3.49 -> 2-3 ms/call lever LANDED (even beat it). Total GPU time dropped too (dense
~12.05 -> ~9.05 s graphs-off). bf16 halving the persisted SSM-state bytes halves the dominant decode
kernel, exactly as designed.
### End-to-end decode throughput (S_TG aggregate, npp128 ntg128, graphs ON unless noted)
| model | npl | f32 t/s | bf16 t/s | note |
|---|---|---|---|---|
| dense | 32 | 212 | 239 | +12.8% |
| dense | 128 | 371-376 (stable) | 287 / 336 / 487 / 498 (BIMODAL) | clean ~490 = +31%; bad runs from a CUDA-graph instability on the dense path |
| dense | 128 | 371.67 (graphsOFF) | 486.68 (graphsOFF) | clean +31% |
| MoE | 32 | 449 | 509 | +13.4% |
| MoE | 128 | 767 | 958 | +24.9% (clean, nsys-corroborated) |
% of vLLM (391 t/s dense reference): f32 default = 95-96% (bit-exact, higher precision than vLLM);
bf16 clean ~490 = **125%** (but unstable on dense + fails the numeric gate). MoE bf16 +25% is clean.
## 3. DECISION: NO-SHIP / KEEP SHELVED
- The KL gate **fails** the long-context drift criterion for both models: bf16 SSM state changes
~10% of tokens at >=1024 ctx vs our f32 (Same-top-p ~90%, Mean KLD 0.04-0.17). It is therefore NOT
a quality-neutral opt-in and cannot honor the project's "f32 bit-exact default" promise.
- Per the task rule (gate FAIL -> do not ship as usable): **patch 0024 was NOT created and nothing was
committed** (DGX tree stays uncommitted; backup `~/llama-paged-dev/BF16_SSM_STATE.diff`).
- The perf lever is genuinely real (recurrence halves; dense ~490 t/s = 125% of vLLM when clean; MoE
+25%) and bf16 == vLLM's own precision, so it remains a valid FUTURE option - but only if shipped as
an explicitly-labeled "vLLM-precision-class, NON-bit-exact" mode (never quality-neutral), AND the
dense CUDA-graph throughput instability (bimodal 287..498) is fixed first.
- Recommendation: keep the shipped default as f32 bit-exact (95% of vLLM at higher precision). Shelve
bf16. Optional follow-up lever if precision must be cut: bf16 only on the SHORT-memory heads (those
with exp(g) well below 1), keeping long-memory heads f32 - a mixed-precision state that could pass
the gate while still cutting bytes; not implemented/measured here.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,339 +0,0 @@
# Bit-exact vs vLLM, and the f32-preserving-parity hunt (Qwen3.5 gated-DeltaNet)
Label: crossengine-bitexact (READ-ONLY, no GPU). Adversarial source+numerics study.
Model: q36-27b-nvfp4 (dense, `Qwen3_5ForConditionalGeneration`) / q36-35b-a3b-nvfp4
(MoE, `Qwen3_5MoeForConditionalGeneration`). Engines: llama dev `~/llama-paged-dev`,
vLLM 0.23.0 `~/vllm-bench`. Decode B=128, enforce-eager / graphs-off, GB10 (~273 GB/s).
> **CORRECTION NOTICE (supersedes the earlier draft of this file).** A prior pass concluded
> "vLLM's GDN state cache is bf16, so the 2x recurrence-DRAM gap is f32(llama)-vs-bf16(vLLM)
> width" (old B2/B3). **That is wrong.** It read `gated_delta_net_state_dtype(..., mamba_ssm_cache_dtype="auto")`
> as auto->model-dtype=bf16, but it did **not** trace the Qwen3.5-specific config override that
> reassigns `mamba_ssm_cache_dtype` from `"auto"` to `"float32"` *before* the state dtype is
> resolved. **vLLM stores this model's gated-DeltaNet temporal state in float32**, the same width
> as llama. Proof chain in Part B. Everything in Part C is re-derived from the corrected dtype.
>
> **INDEPENDENT RE-VERIFICATION (this pass, live DGX source).** Separate sub-agents reached
> *opposite* dtype readings (one f32, two bf16). The contradiction was resolved by reading every
> link of the chain directly, not by majority vote. All eight links confirm **float32 temporal
> state**: `config.json text_config.mamba_ssm_dtype = "float32"` (both served models);
> `config/cache.py:129` default `mamba_ssm_cache_dtype = "auto"`; the bench scripts
> (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`, `serve_nvfp4.sh`) pass **only**
> `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`,
> no `--dtype`); `config/vllm.py:847 __post_init__` -> `:856 try_verify_and_update_config()` (runs at
> finalize, before any state-dtype resolution); `MODELS_CONFIG_MAP` (`models/config.py:622-623`) maps
> both `Qwen3_5ForConditionalGeneration` and `Qwen3_5MoeForConditionalGeneration` ->
> `Qwen3_5ForConditionalGenerationConfig`; its override body (`config.py:546-549`)
> `if mamba_ssm_cache_dtype=="auto": cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype` **fires**
> (value "float32"); `mamba_utils.py:91-94` then takes the `!= "auto"` branch ->
> `temporal = STR_DTYPE_TO_TORCH_DTYPE["float32"] = torch.float32` (conv stays bf16);
> `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **temporal** (2nd)
> tuple element and allocates the cache (`:1136`) at f32; `ssm_state = self_kv_cache[1]` (`:1316/1596/1664`).
> The two bf16 sub-agent readings are **refuted** - they stopped at the `cache.py` default "auto" and
> never traced the `__post_init__` override. **Numeric corroboration:** at the measured vLLM duration
> 3.62 ms/call, bf16 (402 MB) would imply 111 GB/s = 41% peak (implausibly low for a tuned BW-bound
> Triton kernel); f32 (805 MB) implies 222 GB/s = 81% peak (the expected regime). f32 is the only
> reading consistent with both source *and* the measured time.
## Headline (two answers)
1. **Bit-exact-vs-vLLM (identical logits / probabilities) is IMPOSSIBLE - for this model and for any
two distinct engines.** B4 = CONFIRMED. The sharpest proof is the GDN recurrence itself: the two
kernels evaluate an *algebraically reassociated* expression (`g.Sigma` vs `Sigma.g`) on *different
reduction trees*, so they diverge **even if both ran pure f32 with identical inputs**. On top of
that the FP4 GEMM uses different operand precision (8-bit vs 4/16-bit activations) and different
accumulation - a >>ULP divergence in every projection and the LM head.
2. **bf16 SSM state is NOT the only way to reach vLLM decode throughput, and an f32-preserving lever
was missed.** vLLM reaches its throughput **with an f32 GDN state** (proven). Both engines move the
same ~805 MB f32/recurrence-call; the ~10% per-call gap is a bandwidth-**efficiency** gap on equal
bytes (llama ~74% of peak, vLLM ~81%), i.e. an occupancy/grid/coalescing lever that is **bit-exact
vs llama's own f32**. bf16 state is an *optional over-clock* (goes AHEAD of vLLM on the recurrence),
not a parity requirement. B2/B3 (as "bf16 width is the lever") = REFUTED.
---
# The five questions, answered (synthesis)
**Q1. Can llama be BIT-EXACT with vLLM? NO.** Two *binding* (>>ULP) divergence sources make
bit-identical logits impossible on their own: **(A1)** the FP4 GEMM - llama MMQ quantizes the
activation to **q8_1 (8-bit)** while vLLM runs cutlass **w4a4 (4-bit acts)** or marlin **w4a16
(16-bit acts)**; different operand precision + accumulation order -> ~1e-2 relative error in *every*
projection and the LM head; **(A2)** the GDN recurrence - llama computes `g*(Sigma round(S*k))`
(scalar decay *after* the reduction) while vLLM computes `Sigma round(round(g*h)*k)` (decay rounded
into each element *before* the reduction): an IEEE-754 reassociation on *different reduction trees*
(warp butterfly vs Triton `tl.sum`) that diverges **even with identical pure-f32 state and inputs**.
A dozen further ops (L2/RMSNorm, MRoPE, gate `exp`, flash-attn softmax) add close-but-not-equal
rounding. Cross-engine bit-exactness is impossible *in general* (FP non-associativity across distinct
GEMM/recurrence/norm kernel stacks); the determinism literature only buys run-to-run determinism
*within* one engine. **Weaker form reachable:** greedy **top-1 token agreement** is the right gate
(top-1 / KL / PPL-delta, never md5). It is probabilistic (flips at low-margin steps), **compounds**
with length (once one token differs the SSM/KV states fork), and is *weaker here* than a
same-precision run because of the A8-vs-A4 GEMM gap.
**Q2. Is bf16 SSM state the only path to vLLM decode throughput? NO - an f32-preserving lever exists
and bf16 is not even required for parity.** vLLM carries the **same f32 temporal state** (proven +
re-verified), so the recurrence gap is **bandwidth EFFICIENCY on equal f32 bytes** (llama 74% vs vLLM
81% of GB10 peak), ~10% per call, *not* a 2x width gap. The lever: **retune `gated_delta_net_cuda`
74% -> ~81%** - it launches 196608 tiny one-column blocks (butterfly-reduce per token); fold toward
fewer/larger `BV x BK` tiles + vectorized `f32x4` loads + better row coalescing, **keeping the
per-column reduction order -> BIT-EXACT vs llama's own f32** (md5-gateable). **Cost vs bf16:** zero
precision risk and bit-exact, but it can only **match** vLLM's recurrence BW (81%), never beat it;
worth ~+5% (~335 -> ~351 tok/s, ~90% of vLLM), and it caps below 100% unless stacked with the other
bit-exact levers (conv fusion 0021, activation fold, oproj MMQ 0020). The adversarial sweep of every
other f32 avenue (lossless sub-f32, delta/low-rank/sparse, recompute+checkpoint, 2nd-stream/overlap,
chunked recurrence) **FAILS** to beat it; recompute is bit-exact but only **ties** the irreducible
one-full-state-READ floor and is now moot (vLLM also writes f32, so you match its achieved BW, you
don't need to eliminate the write). bf16 remains the **only** lever that goes *ahead* of vLLM on the
recurrence (~440 tok/s) - an **over-clock**, not a requirement.
**Q3. Does bf16 state MATCH vLLM's precision or DEGRADE below it? It DEGRADES below vLLM.** (This
corrects the `precision-ground-truth` sub-agent's "matching, not degrading" claim, which rested on
the refuted bf16 reading.) vLLM keeps the **temporal/recurrent** state in **f32**; only its small
**conv** state is bf16 (llama keeps conv f32, so llama is *more* precise there). So bf16 **temporal**
state in llama (~8 mantissa bits) sits **below vLLM's f32 temporal** (~24 bits) - it is a deliberate
precision-for-speed trade, KL/PPL-gated vs llama's own f32 *and* a step under vLLM's recurrent-state
precision. A genuine "match vLLM's envelope" change would be f32 temporal (as today) + bf16 conv -
which costs llama precision only on a tiny stream and buys almost no BW.
**Q4. What can "parity" mean here? Throughput at equal precision + a distributional quality bar -
never identical bits.** Bit-identical logits are impossible cross-engine, so "parity" = **(a)**
throughput (tok/s in the harness) at **(b)** a quality bar measured by **top-1 greedy agreement,
KL(llama||vLLM)/step, and PPL-delta**, never md5. Both engines already run the recurrence math in f32
registers; at **equal** precision (llama f32 temporal == vLLM f32 temporal) the *only* open variable
is throughput, and that gap is closable **bit-exactly** (Q2). If llama adopts bf16 temporal, "parity"
must be restated as "throughput >= vLLM at KL/PPL within gate vs llama's own f32" and reported as the
precision-for-speed trade it is.
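
A minimal sketch of two of the quality-bar metrics named above, per-step KL and top-1 greedy agreement, computed from two engines' logits over the same token positions. The function names and the `[steps, vocab]` array layout are assumptions; PPL-delta is omitted because it also needs the target tokens.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def parity_metrics(base_logits, test_logits):
    """Return (mean KL(base||test) per step, top-1 agreement fraction)."""
    lp = log_softmax(np.asarray(base_logits, dtype=np.float64))
    lq = log_softmax(np.asarray(test_logits, dtype=np.float64))
    kl = (np.exp(lp) * (lp - lq)).sum(axis=-1)
    agree = lp.argmax(axis=-1) == lq.argmax(axis=-1)
    return float(kl.mean()), float(agree.mean())

rng = np.random.default_rng(0)
base = rng.standard_normal((64, 1000))
same_kl, same_top1 = parity_metrics(base, base)
noisy_kl, noisy_top1 = parity_metrics(base, base + 0.5 * rng.standard_normal(base.shape))
```

Identical logits give zero KL and full agreement; any cross-engine run is expected to land strictly above zero KL, which is why the bar is a threshold on these metrics rather than an md5 match.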
**Q5. Did the prior analysis get B1-B4 right? B1 mostly; B2/B3 REFUTED; B4 CONFIRMED. Overturn the
"bf16 is required" framing - keep the bit-exact levers.**
- **B1 TRUE** (single-pass f32, load-once/store-once, 74% peak) - but its sub-claim "more efficient
than vLLM (41%)" is **REFUTED** (41% was the bf16 artifact; vLLM is ~81%, *more* efficient).
- **B2 REFUTED** - not a f32-vs-bf16 width gap; equal f32 bytes both sides, ~10% efficiency gap.
- **B3 REFUTED** as written - vLLM reaches its throughput **with f32 state**; a bit-exact f32
occupancy retune reaches vLLM's recurrence BW. bf16 is optional.
- **B4 CONFIRMED** - impossible, on two independent grounds (structural A1+A2; general FP
non-associativity across distinct kernel stacks).
- **Plan disposition:** do **not** overturn the conv-fusion (0021) bit-exact lever - keep it.
**Re-prioritize the bit-exact f32 occupancy/coalescing retune of `gated_delta_net_cuda` as the
parity path.** Treat bf16 temporal state as an explicitly-gated **over-clock for going beyond
vLLM**, reported as a precision-for-speed trade (below vLLM's f32 recurrent precision), NOT as a
parity-matching change.
---
# PART A - Divergence inventory (per source: bit-identical vs close)
Per decode layer the two engines run *different kernels* for: FP4 GEMMs (proj + LM head), depthwise
conv+SiLU, q/k L2-norm, the GDN recurrence, gated RMSNorm; and on the hybrid's full-attention layers:
RMSNorm q/k-norm, MRoPE, flash attention, a sigmoid gate.
## A1. NVFP4 dequant + FP4 GEMM -- NOT bit-identical (diverges >> ULP)
- **llama**: MMQ (`mmq.cuh` `block_fp4_mmq`, nvfp4 block=16, 4x ue4m3 sub-scales). Host path
(`ggml-cuda.cu` ~1955-2014) **quantizes the activation (src1) to q8_1** (`block_q8_1_mmq`, **8-bit**,
block 32) and accumulates over K in the MMQ tile (DP4A / Blackwell FP4-MMA); tile order set by
`mmq_y`/`mmq_x` + the warp-MMA fragment layout.
- **vLLM**: `compressed_tensors_w4a4_nvfp4` -> cutlass FP4 GEMM on Blackwell (**4-bit** activations,
w4a4, per-group act-quant, e4m3 block scale x global FP8 tensor scale) or marlin fp4 fallback
(**16-bit** activations, w4a16, dequant->bf16 then bf16 GEMM). `apply_weights` -> `self.kernel`.
- **Verdict: not close.** (a) *Operand precision differs*: llama 8-bit acts vs vLLM 4-bit (cutlass) or
16-bit (marlin) - per-GEMM outputs differ at ~1e-2 relative, not ULP. (b) Scale-application order
differs. (c) Accumulation tiling/order differs (MMQ fragment vs cutlass/marlin). This is the largest
divergence and is present in every projection + the LM head, so logits differ materially on its own.
## A2. gated-DeltaNet recurrence -- NOT bit-identical, AND provably so even in pure f32
Both single-pass over an **f32** state (Part B). llama: `gated_delta_net.cu`
`gated_delta_net_cuda<128,KDA=false>`; vLLM: `fused_recurrent.py`
`fused_recurrent_gated_delta_rule_packed_decode_kernel`. Scalar-gate (GDA) path, `g.ne0==1`.
With S[k][v] (llama, transposed) == h[v][k] (vLLM):
```
llama: kv[v] = Sigma_k S_old[k][v]*k[k] # OLD state; g applied AFTER the sum
delta = (v[v] - g*kv[v])*beta; S_new = g*S_old + k(x)delta; o[v]=Sigma_k S_new[k][v]*q[k]
vLLM: h' = g*h_old # decay rounded into EVERY element first
kv[v]=Sigma_k h'[v][k]*k[k]=Sigma_k round(g*h_old)*k; b_v=(v[v]-kv[v])*beta
h_new = h' + b_v(x)k; o[v]=Sigma_k h_new[v][k]*q[k]
```
Algebraically identical (g scalar). **Numerically not**, for two structural reasons that survive even
with identical f32 state and identical inputs (the first survives even an identical reduction tree):
- **Reassociation:** llama forms `g*(Sigma round(S*k))` (scalar multiply *after* the reduction);
vLLM forms `Sigma round(round(g*h)*k)` (decay rounded into each element *before* the reduction).
Distributing a multiply across a sum is exact in R, not in IEEE-754. This is not a precision knob.
- **Different reduction trees:** llama `warp_reduce_sum<32>` (4 sequential per-lane FMAs + 5-step
butterfly) vs vLLM `tl.sum(...,1)` (Triton tree over the 128-wide BK axis).
**Verdict: not bit-identical; cannot be made so without rewriting one kernel to the other's op order.**
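
A toy f32 demonstration of the reassociation point alone, assuming NumPy float32 arithmetic (no FMA) and the SAME sequential reduction order on both sides, so any mismatch comes only from where `g` is applied. The function names and the random inputs are hypothetical; this is not either engine's kernel.

```python
import numpy as np

def kv_decay_after(S, k, g):
    """llama-style order: reduce S*k over the k axis, then scale the sum by g."""
    acc = np.zeros(S.shape[1], dtype=np.float32)
    for i in range(S.shape[0]):
        acc = acc + S[i] * k[i]
    return g * acc

def kv_decay_before(S, k, g):
    """vLLM-style order: round g*S into every element, then reduce."""
    acc = np.zeros(S.shape[1], dtype=np.float32)
    for i in range(S.shape[0]):
        acc = acc + (g * S[i]) * k[i]
    return acc

rng = np.random.default_rng(0)
S = rng.standard_normal((128, 128)).astype(np.float32)  # S[k][v]
k = rng.standard_normal(128).astype(np.float32)
g = np.float32(0.97)

a = kv_decay_after(S, k, g)
b = kv_decay_before(S, k, g)
mismatched_columns = int(np.count_nonzero(a != b))
max_abs_diff = float(np.max(np.abs(a - b)))
```

The two results agree to several significant digits but are not bit-equal in every column, even though state, inputs, dtype, and reduction order are all shared.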
## A3. Depthwise conv1d (width 4) + SiLU -- NOT bit-identical
llama `ggml_ssm_conv` (ascending-j f32 FMA) + `ggml_silu`, conv state cached **f32**. vLLM
`causal_conv1d_update` (Triton) + SiLU, conv state cached **bf16** (`conv_state_dtype = bf16`; only the
*temporal* SSM state is forced f32 - Part B). Different kernel + different conv-state width + FMA order.
(Patch 0021 fuses llama's chain bit-exactly vs *llama's own* f32 path, not vs vLLM.)
## A4. q/k L2-norm + RMSNorm/RMSNormGated -- NOT bit-identical (close, ~1e-6)
L2-norm: llama standalone `ggml_l2_norm` (f32 tree) vs vLLM `l2norm_fwd`/in-kernel fold
(`USE_QK_L2NORM_IN_KERNEL`). RMSNorm: llama `ggml_rms_norm` vs vLLM `vllm_c` fused kernel (run log:
`rms_norm=['vllm_c','native']`); gated out-norm `build_norm_gated`=RMS*SiLU(z) vs `RMSNormGated`.
Different variance reduction tree / eps placement / fusion boundary.
## A5. MRoPE + gate scalar pipeline -- NOT bit-identical (close)
MRoPE: `ggml_rope_multi` (ggml sin/cos) vs vLLM rotary cos/sin cache (different theta eval + apply
order). Gate: vLLM computes `-exp(A_log)*softplus(a+dt)` then `exp` **in-kernel**; llama computes
`softplus(alpha+ssm_dt)*ssm_a` as split graph ops with `ssm_a` baking `-exp(A_log)` at GGUF-convert
time (rounded once), writes/reloads the intermediate, `expf` in-kernel. Same algebra, different
rounding points + convert-time vs runtime `exp(A_log)`.
## A6. Flash attention (full-attn layers) -- NOT bit-identical (close)
llama `ggml_flash_attn_ext` -> `fattn-mma-f16`/`fattn-vec` (online softmax, F16/F32 PV accum per
`GGML_PREC`) vs vLLM FlashInfer/FA2. Different tiling => different running max/sum order => different
rounding.
## A7. SiLU/sigmoid primitives + fusion -- equivalent IF inputs matched (they never do)
Both ultimately use the same hardware `expf`/`__nv_expf`; the primitives could match given identical
inputs, but every upstream value has diverged, and vLLM fuses act+quant / norm+quant differently than
llama's separate ops (run log `fuse_act_quant=True`), moving the rounding points.
### Inventory summary
| Source | bit-identical? | divergence size |
|---|---|---|
| FP4 GEMM (proj/LM head): MMQ q8_1(A8) vs cutlass w4a4(A4)/marlin w4a16 | **NO** | **>>ULP (~1e-2)** |
| GDN recurrence: hand-CUDA warp-reduce vs Triton tl.sum | **NO (provable even in f32)** | reassoc + tree |
| conv1d+SiLU: f32 conv-state vs bf16 conv-state | NO | dtype + order |
| L2-norm / RMSNorm | NO | ~1e-6 (tree) |
| MRoPE | NO | ~ULP-1e-6 |
| gate softplus/exp | NO | rounding points |
| flash attention | NO | softmax tiling |
| silu/sigmoid primitive | identical IFF inputs equal | inputs never equal |
Any single NO makes the logits differ. A1 and A2 differ by far more than ULP -> the logit vectors are
not close-to-equal at the bit level; they agree only to a few significant digits.
---
# PART B - The decisive f32-state correction (proof from source)
The byte-gate inferred vLLM's GDN temporal state is **bf16** (402 MB/call, 41% peak) and built the
"bf16-width is the lever" case on it. The byte count was *inferred from the dtype*; ncu byte counters
were blocked, so only the **duration** (3.62 ms/call) was measured. The dtype inference is falsified:
1. `config.json`: `architectures=["Qwen3_5ForConditionalGeneration"]`, `text_config.dtype=bfloat16`,
and **`text_config.mamba_ssm_dtype = "float32"`**.
2. `models/config.py:590 MODELS_CONFIG_MAP` maps `"Qwen3_5ForConditionalGeneration"` (line 622) and
`"Qwen3_5MoeForConditionalGeneration"` (623) to `Qwen3_5ForConditionalGenerationConfig`.
3. `Qwen3_5ForConditionalGenerationConfig.verify_and_update_config` (config.py:536-562):
`mamba_ssm_dtype = getattr(hf_text_config,"mamba_ssm_dtype")` (="float32"); if
`cache_config.mamba_ssm_cache_dtype == "auto"` (the default) it executes
**`cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype`** -> sets it to **"float32"**.
4. This override runs at config finalization: `config/vllm.py:856` -> `try_verify_and_update_config()`
(vllm.py:1880-1900) looks up the arch in `MODELS_CONFIG_MAP` and calls `verify_and_update_config`.
It runs **before** any layer/model state-dtype resolution.
5. The bench left it default: `h2h_dense_vllm.sh` = `vllm serve .../q36-27b-nvfp4-vllm --enforce-eager
--gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`; `dl-logs/vllm_dense.log`
non-default args confirm none). So the override fires and the value is "float32".
6. State dtype resolution reads the **already-overridden** value:
- `gdn/base.py:53-57` `get_state_dtype()` -> `gated_delta_net_state_dtype(model_dtype=bf16,
cache_config.mamba_cache_dtype="auto", cache_config.mamba_ssm_cache_dtype="float32")`.
- `qwen3_5.py:678 get_mamba_state_dtype_from_config` likewise passes
`vllm_config.cache_config.mamba_ssm_cache_dtype` (= "float32", post-override) - **not** a raw "auto".
- `mamba_utils.py _mamba_state_dtype`: conv_state = `get_kv_cache_torch_dtype("auto", bf16)` = **bf16**;
temporal_state, since `mamba_ssm_cache_dtype != "auto"`, = `STR_DTYPE_TO_TORCH_DTYPE["float32"]`
= **torch.float32** (key verified: `torch_utils.py:33 "float32": torch.float32`).
7. `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **second** tuple
element (temporal) = **float32**, allocates the cache `dtype=state_dtype`. The packed_decode kernel
round-trips f32: `b_h = tl.load(p_h0).to(f32)` ... `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))`
with `p_ht.dtype == initial_state.dtype == float32`.
**=> vLLM's gated-DeltaNet temporal (recurrent) state cache for this model is float32, identical width
to llama's f32 state.** The earlier "bf16" reading hardcoded the third arg as `"auto"` and missed the
override in steps 3-4. Only the small *conv* state is bf16 in vLLM (f32 in llama: divergence A3, tiny
byte stream).
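
The resolution order traced in steps 3-7 can be sketched as a small model. This is a simplified stand-in, not vLLM's code: `CacheConfig`, `verify_and_update_config` and `resolve_state_dtypes` here are illustrative re-implementations, dtypes are plain strings, and the rule that an `"auto"` SSM cache dtype falls back to the conv dtype is an assumption inferred from the "bf16" misreading described above.

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    # Both fields default to "auto", as in the bench run (no CLI override).
    mamba_cache_dtype: str = "auto"
    mamba_ssm_cache_dtype: str = "auto"


def verify_and_update_config(cache_config, hf_mamba_ssm_dtype):
    """Step 3: an "auto" SSM cache dtype is replaced by the HF config value."""
    if cache_config.mamba_ssm_cache_dtype == "auto":
        cache_config.mamba_ssm_cache_dtype = hf_mamba_ssm_dtype


def resolve_state_dtypes(model_dtype, cache_config):
    """Step 6: return (conv_state_dtype, temporal_state_dtype)."""
    if cache_config.mamba_cache_dtype == "auto":
        conv = model_dtype
    else:
        conv = cache_config.mamba_cache_dtype
    # ASSUMPTION: with "auto", the temporal state follows the conv dtype.
    if cache_config.mamba_ssm_cache_dtype == "auto":
        temporal = conv
    else:
        temporal = cache_config.mamba_ssm_cache_dtype
    return conv, temporal


cfg = CacheConfig()
# The earlier misreading: resolve before the override has run.
before = resolve_state_dtypes("bfloat16", cfg)
# Actual order: the override runs at config finalization, then dtypes resolve.
verify_and_update_config(cfg, "float32")
after = resolve_state_dtypes("bfloat16", cfg)
print(before, after)
```

The point of the sketch is ordering: the same resolver gives a bf16 temporal state if it is evaluated on the raw `"auto"` value and an f32 one once the override has been applied.
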
## Re-derived efficiency table (measured duration + PROVEN f32 byte volume)
| kernel | state dtype (PROVEN) | bytes R+W/call | duration/call | eff. BW | % of 273 peak |
|---|---|---|---|---|---|
| llama `gated_delta_net_cuda` | f32 | 805 MB | 3.98 ms | 202 GB/s | **74%** |
| vLLM `..._packed_decode` | **f32 (not bf16)** | **805 MB (not 402)** | 3.62 ms | **222 GB/s** | **~81%** |
- **B1 (single-pass f32 byte floor): TRUE** (load-once/store-once `s_shard`, coalesced). *Sub-claim
"more BW-efficient than vLLM (41%)" REFUTED* - 41% was the bf16 artifact; at the correct f32 byte
count vLLM is at ~81%, i.e. **more** efficient than llama.
- **B2 ("the gap is f32-vs-bf16 width"): REFUTED.** Equal f32 bytes both sides; the ~10% per-call gap
is bandwidth **efficiency** on equal bytes, not width.
- **B3 ("vLLM throughput REQUIRES bf16 state"): REFUTED.** vLLM reaches it *with f32 state*.
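
The table's effective-bandwidth column follows directly from the state geometry and the measured durations. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 1e6 bytes, 1 GB/s = 1e9 bytes/s) and counting only the state read plus write; the helper names are illustrative:

```python
PEAK_GBPS = 273.0                  # GB10 DRAM peak used throughout the notes
N_SEQS, HEADS, S_V = 128, 48, 128  # batch, v-heads, square state dim


def state_bytes_rw(bytes_per_elem):
    """Bytes moved per kernel call: one read plus one write of every state."""
    return N_SEQS * HEADS * S_V * S_V * bytes_per_elem * 2


def effective_bw_gbps(nbytes, duration_ms):
    return nbytes / (duration_ms * 1e-3) / 1e9


f32_bytes = state_bytes_rw(4)
llama_bw = effective_bw_gbps(f32_bytes, 3.98)
vllm_bw = effective_bw_gbps(f32_bytes, 3.62)
# The refuted reading: same duration, but half the bytes (bf16 state).
vllm_bw_if_bf16 = effective_bw_gbps(state_bytes_rw(2), 3.62)

for name, bw in [("llama f32", llama_bw), ("vLLM f32", vllm_bw),
                 ("vLLM if bf16", vllm_bw_if_bf16)]:
    print(f"{name}: {bw:.0f} GB/s = {100 * bw / PEAK_GBPS:.0f}% of peak")
```

Holding the duration fixed and halving the byte count is exactly what produced the 41% figure; at the f32 byte count the same 3.62 ms corresponds to roughly 81%.
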
---
# PART C - The f32-preserving lever, and where recompute/bf16 land
Since vLLM hits ~81% on the **same f32 byte volume** on which llama reaches only ~74%, the missed lever
is **raising llama's `gated_delta_net_cuda` achieved BW from 74% to ~81%**, bit-exact, NOT dtype width:
- llama grid `(H=48, n_seqs=128, ceil(S_v/4)=32) = 196608` blocks/128 thr, each warp owns ONE state
column + warp-reduces over 128 rows. vLLM grid `(NV=4, B*HV=6144) = 24576` programs (num_warps=1),
each owns a BV=32 x BK=128 tile. llama's far-finer blocking (8x more blocks, one column of work each,
a butterfly reduce/token) is the likely ~7-point deficit. Retune toward fewer/larger blocks (more
columns/block, vectorized f32x4 loads, better row coalescing) - changes thread/tile mapping + load
width only, **keeps the per-column reduction order -> bit-exact vs llama's own f32**.
- Upper bound: 74%->81% on ~50% of the step saves ~17 ms/step (384 -> ~367), ~+5% -> ~351 tok/s (~90% of
vLLM 391), stacking with the landed bit-exact levers (oproj MMQ 0020 @86%, conv fusion 0021).
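
The grid comparison and the upper bound can be checked with a few lines of arithmetic. This is a sketch under two assumptions that are not stated explicitly above: the recurrence time scales inversely with achieved bandwidth, and each decode step emits one token per sequence (128 tokens per step).

```python
# Launch geometry quoted above.
llama_blocks = 48 * 128 * 32      # (H, n_seqs, ceil(S_v / 4))
vllm_programs = 4 * (128 * 48)    # (NV, B * HV)
block_ratio = llama_blocks / vllm_programs

STEP_MS = 384.0                   # measured llama decode step
RECURRENCE_MS = 48 * 3.98         # 48 GDN layers x 3.98 ms/call
TOKENS_PER_STEP = 128             # ASSUMPTION: one token per sequence per step


def step_after_bw_gain(eff_now, eff_target):
    """Step time if only the recurrence speeds up, proportional to achieved BW."""
    new_recurrence = RECURRENCE_MS * eff_now / eff_target
    return STEP_MS - RECURRENCE_MS + new_recurrence


new_step_ms = step_after_bw_gain(0.74, 0.81)
new_tok_s = TOKENS_PER_STEP / (new_step_ms / 1000.0)
print(f"{block_ratio:.0f}x blocks, step {new_step_ms:.0f} ms, {new_tok_s:.0f} tok/s")
```

With unrounded inputs this lands a little under the rounded ~351 tok/s quoted above, which is within the precision of the estimate.
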
**Other f32-preserving avenues (adversarial sweep) - none beats the simple bf16 over-clock, but the
occupancy tune above is the real bit-exact win:**
- *Lossless sub-f32 state:* generic float compression is data-dependent (1.1-1.5x, never a guaranteed
2x) and breaks the 128-consecutive-f32 coalescing a BW-bound kernel depends on. The state is dense,
full-rank, non-symmetric (sum of `k(x)delta`, k!=delta) -> no low-rank/half-storage. FAILS.
- *Recompute (checkpoint every N + rank-1 replay):* eliminates the per-step WRITE; the per-step full
dense f32 READ (the `S^T k` / `S^T q` matvecs need every element; the checkpoint is itself a full
read) is irreducible. Optimal N~=11 -> ~473 MB/step (0.587x), realistically ~0.65-0.75x after
replay/latency overhead. A genuine bit-exact path but it only reaches - never beats - the read floor,
at large kernel/graph complexity. **Note: this was over-weighted before because vLLM was assumed
bf16; now that vLLM is f32 too and runs at 81%, you do NOT need to cut the write to match vLLM - you
need to match vLLM's achieved BW on the same f32 bytes.** Recompute is dominated.
- *2nd stream / overlap / pipelining:* DRAM BW (273) is one shared resource; the whole decode step is
uniformly BW-bound (state traffic + ~13.5 GB/step dense NVFP4 weight traffic both hit 273), so
overlapping two BW-bound phases sums to ~0. FAILS.
- *Equivalent recurrence with less decode traffic:* chunked gated-delta-rule is a prefill lever (C=1 at
decode); attention/materialization-free form is O(t) over the prefix. FAILS.
**bf16 SSM state is therefore an OPTIONAL over-clock**, the only lever that goes *ahead* of vLLM on the
recurrence (halving the 805 MB state traffic -> ~440 tok/s) - but it takes llama below both its own f32
and vLLM's f32 precision, so it must be **KL/PPL-gated vs llama's own f32**, never md5. f32-only
parity-class throughput is plausible from the SUM of bit-exact levers (recurrence occupancy + conv
fusion + oproj MMQ + activation fold); none requires bf16.
---
# PART D - Verdict on B4 + the meaningful weaker form
## Bit-exact-vs-vLLM: IMPOSSIBLE (B4 CONFIRMED) - two independent grounds
1. **Structural (this model):** A1 (FP4 GEMM operand precision + accumulation) and A2 (recurrence
`g.Sigma` vs `Sigma.g` + different reduction trees) make per-layer outputs differ by >>ULP, so logits
cannot be bit-identical. A2 shows it is not a precision knob: the kernels evaluate a *reassociated
expression*, differing **even given identical f32 state and inputs**.
2. **General (any two engines):** IEEE-754 add/mul are non-associative; two engines that tile, reduce,
fuse, and quantize differently cannot produce bit-identical results for a non-trivial transformer.
Field determinism work (batch-invariant / fixed-reduction kernels, "defeating nondeterminism in LLM
inference") delivers **run-to-run determinism WITHIN one engine**; it does **not** and cannot deliver
**cross-engine** bit-exactness (that needs identical kernel+tiling+reduction-order+dtype for *every*
op). Cross-engine bit-exactness is essentially never achieved in practice. Bit-exactness is only a
meaningful gate **within** an engine (how llama patches 0018-0021 are validated by md5).
## Greedy-token match (argmax robustness) - the right weaker form, but probabilistic
Because logits differ mostly in low-order bits (A4-A7) plus a few-significant-digit GEMM/recurrence gap
(A1-A2), the **argmax** frequently coincides whenever the top-1/top-2 logit margin exceeds the
cross-engine noise. This is the only meaningful cross-engine "equivalence"; gate on **top-1 agreement /
KL / PPL-delta**, never md5. Caveats: not guaranteed per-token (low-margin steps can flip); it
**compounds** - once one greedy token differs the sequences fork and the KV/SSM states diverge, so
agreement degrades with length (high on short continuations, drift on long ones); and the FP4 A4-vs-A8
gap (A1) makes the per-step divergence *larger* here than a same-precision bf16-vs-bf16 comparison,
weakening greedy agreement for this model specifically.
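
A minimal sketch of the gate metrics named above (top-1 agreement, KL, and the top-1/top-2 margin). It assumes per-step logits from both engines have been captured on the *same* token prefix, which sidesteps the fork-and-compound effect; the function names and the toy logits are illustrative, not taken from either engine.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) over the vocabulary for one decode step."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def top1_margin(logits):
    """Gap between the best and second-best logit; small gaps can flip."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second


def compare_runs(ref_steps, test_steps):
    n = len(ref_steps)
    agree = sum(argmax(r) == argmax(t) for r, t in zip(ref_steps, test_steps))
    mean_kl = sum(kl_divergence(r, t) for r, t in zip(ref_steps, test_steps)) / n
    return agree / n, mean_kl


# Toy data: step 0 has a wide margin, step 1 a 0.01 margin that flips.
ref = [[2.0, 1.0, 0.1], [0.50, 0.49, -1.0]]
test = [[2.01, 0.99, 0.1], [0.49, 0.50, -1.0]]
agreement, mean_kl = compare_runs(ref, test)
print(agreement, mean_kl, top1_margin(ref[1]))
```

The second toy step shows the low-margin caveat: the distributions are nearly identical (tiny KL) yet the argmax differs, which is why agreement is reported alongside KL rather than instead of it.
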
**Bottom line:** target near-vLLM via KL/PPL/top-1-agreement, not bit-exactness. Reserve bit-exact
gating for intra-llama validation (the f32 recurrence-occupancy lever and the conv fusion qualify;
bf16 state does not and must be KL/PPL-gated vs llama's own f32).
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,53 +0,0 @@
# GDN Recurrence Byte-Gate - Progress (agent: ncu-byte-gate)
## Hard blocker on direct DRAM counters
- ncu HW perf counters: ERR_NVGPUCTRPERM (NVreg_RestrictProfilingToAdminUsers=restricted, root-only).
- nsys --gpu-metrics-devices: same ERR_NVGPUCTRPERM.
- No passwordless sudo on dgx.casa. DRAM byte counters UNOBTAINABLE without root.
- FALLBACK (decisive, no perfcounters needed): CUPTI kernel TIMING (allowed) + exact byte
geometry from kernel source => implied effective BW + a hard mathematical cap on re-stream factor.
## Byte geometry (exact, from gated_delta_net.cu + GGUF)
- Qwen3.5 dense q36-27b-nvfp4: 48 GDN layers, H=48 v-heads, S_v=128 (square state 128x128/head).
- State per (seq,head) = 128*128 f32 = 64 KiB. Per seq, per layer = 48 heads * 64 KiB = 3.0 MiB.
- Kernel is SINGLE-PASS by construction: loads s_shard[] ONCE into regs, recurrence in-register,
writes state ONCE (read_state coalesced 128 consecutive f32/warp; writeback coalesced).
l2norm/sigmoid/softplus/gate act on small q/k/g/beta (NOT the 805MB state); gather no-ops at
steady decode (identity seqs). => NO multi-pass state re-streaming exists to fuse away.
- Minimal bytes/call (B=128): state R+W = 128*48*16384*4*2 = 805.3 MB; +q/k/v/out ~10 MB = ~816 MB.
- Floor time @273 GB/s = 816MB/273 = 2.99 ms/call.
## Measured (clean nsys CUDA timing, graphs OFF, npp8 ntg12 npl128, build-cuda-base df1cc97)
- llama gated_delta_net_cuda steady decode: 480 calls, grid(48,128,32), avg 3.98 ms/call
(min 3.90, max 4.33; very tight => bandwidth-bound). 48 layers => 191 ms/step (50% of 384 ms).
- Implied effective BW @1.0x bytes = 816MB/3.98ms = 205 GB/s = 75% of 273 peak.
- HARD CAP: max bytes movable in 3.98ms @273 peak = 1.087 GB = 1.33x minimal.
=> re-stream factor in [1.0x, 1.33x]. 2x re-streaming PHYSICALLY IMPOSSIBLE.
Source proves single-pass+coalesced => ~1.0x, kernel at ~75% peak.
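
The floor time and the hard cap on the re-stream factor are plain arithmetic on the numbers above. A short sketch, taking the ~816 MB minimal byte count and the 273 GB/s peak as given and using decimal units; the variable names are illustrative:

```python
PEAK_BYTES_PER_S = 273e9
MINIMAL_BYTES = 816e6      # state R+W (805.3 MB) + ~10 MB q/k/v/out, from the notes
MEASURED_MS = 3.98         # avg per call
STEP_MS = 384.0
GDN_LAYERS = 48

floor_ms = MINIMAL_BYTES / PEAK_BYTES_PER_S * 1e3
implied_bw_gbps = MINIMAL_BYTES / (MEASURED_MS * 1e-3) / 1e9

# Most bytes the kernel could possibly have moved in the measured time.
max_bytes = PEAK_BYTES_PER_S * MEASURED_MS * 1e-3
restream_cap = max_bytes / MINIMAL_BYTES

recurrence_ms_per_step = GDN_LAYERS * MEASURED_MS
step_fraction = recurrence_ms_per_step / STEP_MS

print(f"floor {floor_ms:.2f} ms, implied {implied_bw_gbps:.0f} GB/s, "
      f"cap {restream_cap:.2f}x, recurrence {recurrence_ms_per_step:.0f} ms "
      f"({100 * step_fraction:.0f}% of step)")
```

The cap is independent of any profiler counter: it only needs the kernel duration and the DRAM peak, which is why it remains available when the hardware counters are locked.
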
## Conv-path (same trace, steady-decode region kernels, per-call):
- ssm_conv_f32: 672 calls whole-trace avg 135.9us (incl prefill); decode-region TBD
- concat_cont: 576 calls avg 169.6us ; concat_non_cont 96 calls (prefill big)
- cpy_scalar: 896 calls avg 123.7us ; gdn_gather_nonident 672 calls avg 153.9us (mostly no-op)
## vLLM (apples-to-apples: NSEQ=128, enforce_eager=True; postssm_decomp/vllm_decode.sqlite)
- vLLM state dtype = model_dtype = BF16 (_mamba_state_dtype default "auto"; config dtype=bfloat16).
Geometry identical to llama (H=48, k/v head_dim 128, S_v 128).
- vLLM fused_recurrent_gated_delta_rule_packed_decode steady: 3.62 ms/call (grid 4x6144x1),
bf16 state R+W = 402.6 MB => 111 GB/s = 41% peak. SINGLE-PASS (load p_h0 once -> f32 regs ->
store bf16 once).
- llama 3.98 ms/call, f32 805.3 MB => 202 GB/s = 74% peak. llama kernel is MORE BW-efficient.
## Conv-path (llama steady decode, per call x48 layers)
- concat_cont 169.6us (8.14 ms/step) + cpy_scalar 120.1us (5.76) + ssm_conv_f32 115.9us (5.56)
= ~19.5 ms/step. Conv state ~12.6 MB (tiny) => LAUNCH-bound, not byte-bound => fusion lever (~5%).
- l2_norm 6.8us, gdn_gather 1.21us (no-op identity seqs => gather does NOT re-stream state).
## FINAL VERDICT (DONE)
- llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x physically impossible @273 peak).
- NO-BUILD fused single-pass recurrence: already single-pass, coalesced, 74% peak (> vLLM 41%);
gate ops touch tiny q/k/g/beta, not the 805MB state => recovers ~0 state bytes.
- BUILD bf16 SSM state (design lever (2)): the 2x gap vs vLLM is 100% f32-vs-bf16 cache width.
805->413 MB => ~45-95 ms/step => step 384 -> 289-339 ms = parity-to-ahead of vLLM 327.
Non-bit-exact vs llama f32 but equal to vLLM's own bf16 precision.
- Findings written: GDN_RECURRENCE_BYTE_GATE.md (MEASUREMENT + VERDICT section appended).


@@ -1,499 +0,0 @@
# Durable scope: token-granular continuous-batch scheduler for llama-server on GB10
Build-ready plan. **Not implemented in this workflow** (serving-loop rewrite). This
document scopes the durable path to give llama-server's `update_slots()` a vLLM-v1-style
token-granular continuous-batch scheduler, and records the single honest finding that
re-shapes what the change can and cannot buy.
Hardware: NVIDIA GB10 / DGX Spark (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
LPDDR5x ~273 GB/s. Models: dense Qwen3.6-27B NVFP4 (`~/bench/q36-27b-nvfp4.gguf`),
MoE Qwen3.6-35B-A3B NVFP4 (`~/bench/q36-35b-a3b-nvfp4.gguf`). Dev tree `~/llama-paged-dev`
(branch `paged`, HEAD `151343b`, patch 0015), `build-cuda` sm_121, `LLAMA_KV_PAGED=1`.
Scheduler code: `tools/server/server-context.cpp::update_slots()` (LocalAI override that
`#include`s it: `backend/cpp/llama-cpp/grpc-server.cpp`).
## TL;DR (the honest reframe)
Three findings, read directly from the source at HEAD `151343b` and from the committed
NVFP4 re-run (`QWEN36_NVFP4_BENCH.md`), collapse the apparent size of this work and reset
what it is allowed to claim:
1. **The unified mixed batch already exists.** `update_slots()` already builds ONE
`llama_batch` per step = {every ready decode token} **+** {a bounded chunk of prefill
tokens}, in a fixed two-phase order: Phase 1 (lines 2604-2719) appends every
`SLOT_STATE_GENERATING` slot's sampled token **unconditionally** (no budget gate), then
Phase 2 (lines 2753-3330) fills the remaining batch capacity with prompt tokens. Decode
is therefore **already claimed first and never dropped or capped** - the exact property
vLLM's "RUNNING-before-WAITING" pass works to guarantee is **free** here by construction.
2. **The chunked-prefill slot state already exists and already persists across steps.** A
slot in `SLOT_STATE_PROCESSING_PROMPT` with `slot.prompt.n_tokens() < slot.task->n_tokens()`
is a partial prefill; it stays in that state and resumes next step until its prompt is
fully ingested, at which point it flips to `SLOT_STATE_DONE_PROMPT` -> `GENERATING`
(line 3252, then 3502). Multiple slots can be `PROCESSING_PROMPT` and `GENERATING`
simultaneously; there is **no global "one prefill at a time" gate**. So the mission's
"allow a slot to be mid-prefill while others decode in the same step" is **not a state
machine to build - it is already the behaviour.** This is the single biggest de-risking
fact in this document.
3. **What is genuinely missing is the budget POLICY, and it is small.** Patch 0013
(`LLAMA_PREFILL_BUDGET`) is a single **static** per-step prefill cap, consumed greedily by
slots in iteration order. It is not decode-load-aware (does not subtract the live decode
count `D`), not adaptive (one constant across npl 8..128), and not fair (the first
`PROCESSING_PROMPT` slot can eat the whole budget). The durable delta is to convert that
static cap into vLLM's **dynamic, decode-first, per-slot-fair token budget**: one total
per-step budget `T`, decode claims its `D` tokens first, prefill gets the **leftover**
`T - D` distributed across waiting prompts with a per-slot cap. That is ~the only
behavioural change. **No new slot states, no batch-formation rewrite.**
### The honest ceiling (this is load-bearing for how the work is scoped and sold)
The committed re-run and a dedicated profiling pass (`QWEN36_NVFP4_BENCH.md`, plus
`~/bench/stag_128.json`) establish that **the residual ~2.4x high-concurrency decode gap is a
decode-KERNEL batch-scaling ceiling, not a scheduler defect**:
- At npl8 the kernels are **at parity** (dense 99%, MoE 84% of vLLM decode).
- A clean staggered full-batch-128 run, with **all 128 slots cleanly decoding and zero
prefill starvation**, still tops out at **decode_agg 157.4 tok/s** (dense) - the same
~157-161 ceiling that four independent measurements converge on. vLLM does **390.7** at the
same effective batch. With a *perfect* scheduler the kernel still gives ~157. **The
scheduler cannot lift this.**
- Patch 0013 budget-256 **already reaches ~161** (the ceiling) at npl128. So a token-granular
scheduler buys **little additional steady-state decode_agg** over 0013 on the all-at-once
workload.
Therefore this scheduler's deliverable is **NOT "match vLLM's 391/811 decode."** It is:
- **Close the 12x TTFT gap** (dense 305 s @ 0013 / 491 s stock -> vLLM's ~25 s, and ~2 s on
staggered arrival) - the genuine, large win.
- **Robustly HOLD the decode ceiling** (~161 dense / ~333 MoE @npl128) **without
per-workload budget tuning** - 0013 needs a hand-picked constant (256 for dense, costs MoE
TTFT, net-negative at low npl); the dynamic `T - D` budget is self-tuning across the whole
npl range and across dense vs MoE.
- **Burst-robustness**: bounded TTFT for *all* concurrently-arriving prompts (kill the
burst-TTFT spread), and no admission collapse under sustained load.
Closing the residual 2.4x decode-throughput gap is a **separate, named lever**: the
paged-attention **decode-kernel** batch-scaling work (patches 0009-0011 territory) and/or
CUDA-graphed decode. It is called out explicitly in P3 and is **out of this scope's
scheduler mandate**. We must measure and sell this work on **TTFT + burst-robustness +
self-tuning hold of the ceiling**, never on a decode_agg number the kernel forbids.
## The gap, precisely localized (recap of the committed bench)
At matched NVFP4 on one GB10 box (`QWEN36_NVFP4_BENCH.md`), llama (patch 0015) vs vLLM 0.23.0,
decode_agg tok/s | TTFT mean, npl swept 8/32/64/128:
| npl | dense llama (0013 b256) | dense vLLM | MoE llama (0013 b256) | MoE vLLM |
|----:|------------------------:|-----------:|----------------------:|---------:|
| 8 | 63.5 / 4.3 s | 64.3 / 2.6 s | 169.3 / 1.7 s | 202.0 / 0.8 s |
| 32 | 105.7 / 23.1 s | 189.8 / 7.5 s | 239.0 / 9.0 s | 462.0 / 2.3 s |
| 64 | 132.0 / 109 s | 284.2 / 13 s | 277.0 / 16.2 s | 624.5 / 4.1 s |
| 128 | **161.2 / 305 s** | 390.7 / 24.8 s | **333.5 / 98 s** | 811.1 / 8.0 s |
Both models converge to the **same ~41% of vLLM decode at npl128** after 0013. That
convergence is the signal: once prefill starvation is removed, a dense model and a
12x-cheaper-prefill MoE land on the **identical** ceiling, so the residual is **not prefill
cost**, and it is **not a small-batch kernel deficit** (the kernels are at parity at npl8).
What remains is the **quality of the per-step batching decision** (TTFT/robustness) plus the
**kernel decode ceiling** (the throughput residual). This scope addresses the first; it names
the second as the separate lever.
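
The ratios quoted around the table can be recomputed from its cells. A small sketch; the dictionary simply transcribes the table (decode_agg tok/s and mean TTFT in seconds), and the helper name is illustrative:

```python
# npl -> (llama tok/s, llama TTFT s, vLLM tok/s, vLLM TTFT s), from the table.
BENCH = {
    "dense": {8: (63.5, 4.3, 64.3, 2.6), 32: (105.7, 23.1, 189.8, 7.5),
              64: (132.0, 109.0, 284.2, 13.0), 128: (161.2, 305.0, 390.7, 24.8)},
    "moe": {8: (169.3, 1.7, 202.0, 0.8), 32: (239.0, 9.0, 462.0, 2.3),
            64: (277.0, 16.2, 624.5, 4.1), 128: (333.5, 98.0, 811.1, 8.0)},
}


def ratios(model):
    """Per npl: (llama decode as a fraction of vLLM, llama TTFT / vLLM TTFT)."""
    return {npl: (l_tok / v_tok, l_ttft / v_ttft)
            for npl, (l_tok, l_ttft, v_tok, v_ttft) in BENCH[model].items()}


dense, moe = ratios("dense"), ratios("moe")
for npl in (8, 32, 64, 128):
    d, m = dense[npl], moe[npl]
    print(f"npl{npl:>3}: dense {100 * d[0]:.0f}% decode, {d[1]:.1f}x TTFT | "
          f"MoE {100 * m[0]:.0f}% decode, {m[1]:.1f}x TTFT")
```

Both models come out at about 41% of vLLM decode at npl128 and at about 12x its TTFT, while the npl8 row reproduces the 99% / 84% parity figures.
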
## What already exists (reuse, do NOT rebuild)
All line numbers verified at `tools/server/server-context.cpp` HEAD `151343b`.
- **[A] decode-first co-batch** - Phase 1, lines 2604-2719. Iterates `slots`; every
`SLOT_STATE_GENERATING` slot (gated only by `can_batch_with`, line 2611) is pushed to
`generating[]`; line 2715-2719 `for (slot : generating) slot.update_batch(batch)` appends
its sampled token (+ draft tokens) via `common_batch_add`. After this loop,
`batch.n_tokens == D` (the decode-token count). **No budget gate** - decode always goes in.
- **[B] chunked-prefill state per slot** - the pair `slot.prompt.n_tokens()` (=
`num_computed_tokens`) vs `slot.task->n_tokens()` (= `num_tokens`). A `PROCESSING_PROMPT`
slot with `prompt.n_tokens() < task->n_tokens()` resumes next step (Phase 2 re-enters it).
Transition to `DONE_PROMPT` at line 3252 when the prompt is exhausted; to `GENERATING` at
line 3502. **This is exactly vLLM's "leave the request in `running`, advance
`num_computed_tokens` next step" - already implemented.**
- **[C] single shared batch + compute chunking** - one `llama_batch` holds decode+prefill;
the compute loop (lines ~3366-3378) `for (i=0; i<batch.n_tokens; i+=n_tokens){ n_tokens =
min(n_batch, batch.n_tokens-i); llama_decode(batch_view); }` runs it as one `llama_decode`
when `batch.n_tokens <= n_batch`; `n_ubatch` (512) splitting happens inside `llama_decode`.
- **[D] patch 0013 static prefill budget** - the thing to supersede. Read once at lines
2737-2747 (`n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))`, a CONSTANT for
the run); enforced as an extra `while` predicate at line 3188 (`n_prompt_budgeted <
n_prefill_budget`), counter at 3214, outer break at 3326. `0` = disabled = byte-identical
stock.
- **[E] productization seam** - `backend/cpp/llama-cpp/grpc-server.cpp` lines 781-791 parse
the model option `max_prefill_tokens` / `mpt` / `prefill_budget` and `setenv`
`LLAMA_PREFILL_BUDGET` before context init (same pattern as `kv_paged`). New knobs hang off
this seam identically.
- **[F] paged KV (patches 0001-0011)** - on-demand block allocation keyed by sequence
position. Batch formation only changes **which** tokens are in a step; paged alloc is
driven by the per-slot sequence positions, which are unchanged. Orthogonal (see Correctness).
## vLLM v1 reference algorithm (the target, for fidelity)
From `vllm/v1/core/sched/scheduler.py::schedule()` (0.23.0, on the box). The unifying idea:
there is no prefill phase vs decode phase. Every request advances `num_computed_tokens`
toward `num_tokens` by up to N this step; for a decoder N=1, for a prefiller N=remaining
prompt. One per-step `token_budget = max_num_batched_tokens` bounds the TOTAL (decode +
prefill). Pass 1 visits `running` first (decoders cost 1 each -> all decode claimed before
any prefill is sized); Pass 2 admits `waiting` (new prompts) only with leftover budget, each
chunked by `min(remaining_prompt, long_prefill_token_threshold, leftover_budget)`. Caps:
`max_num_seqs` (concurrent sequences), `long_prefill_token_threshold` (~4% of max_model_len,
per-request prompt-chunk cap so one giant prompt cannot monopolize a step). Net: decode batch
maximal every step (-> the GEMM-batching throughput vLLM gets), prefill always makes bounded
progress (-> low, flat TTFT), one `model.forward()` per step.
The mapping to llama is clean because [A]+[B] already give us "running visited first" and
"prefiller resumes next step." We are missing only: **one total budget `T`, leftover `T - D`
sizing, and the per-request chunk cap with fair distribution.**
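
The two-pass budget idea described above can be modeled in a few lines. This is a deliberately simplified sketch, not vLLM's `schedule()`: the `Request` class, the function signature and the list-based queues are illustrative, and preemption, KV-block allocation and speculative tokens are all omitted.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    num_tokens: int                 # tokens that must be computed so far
    num_computed_tokens: int = 0    # tokens already computed


def schedule(running, waiting, token_budget, max_num_seqs,
             long_prefill_token_threshold):
    """Return {request name: tokens scheduled this step}."""
    scheduled = {}
    budget = token_budget
    # Pass 1: requests already running go first. A decoder needs 1 token,
    # a partially prefilled request needs the rest of its prompt (chunked).
    for req in list(running):
        if budget == 0:
            break
        remaining = req.num_tokens - req.num_computed_tokens
        n = min(remaining, long_prefill_token_threshold, budget)
        if n > 0:
            scheduled[req.name] = n
            budget -= n
    # Pass 2: admit new prompts only with whatever budget is left over.
    while waiting and budget > 0 and len(running) < max_num_seqs:
        req = waiting.pop(0)
        n = min(req.num_tokens - req.num_computed_tokens,
                long_prefill_token_threshold, budget)
        scheduled[req.name] = n
        budget -= n
        running.append(req)
    return scheduled


running = [Request(f"d{i}", num_tokens=11, num_computed_tokens=10) for i in range(3)]
waiting = [Request("p0", num_tokens=1000), Request("p1", num_tokens=300)]
plan = schedule(running, waiting, token_budget=512, max_num_seqs=8,
                long_prefill_token_threshold=400)
print(plan)
```

In the example the three decoders are sized before any prompt, the first prompt is held to the per-request threshold, and the second prompt receives only the remaining budget, so the step total never exceeds `token_budget`.
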
## The unified per-step batch-formation algorithm (the design)
New knobs (all default to current behaviour; env set before context init like `LLAMA_KV_PAGED`):
- `T` = `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step token
budget (decode + prefill), the analogue of `max_num_batched_tokens`. Default `n_batch`
(2048). Clamped `T = min(T, n_batch)` so the existing single-`llama_decode` chunking is
unchanged.
- `PREFILL_CAP` = `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot max prompt tokens per
step, the `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04 * n_ctx))`,
floored at `n_ubatch` (512) so a single prompt still makes a full ubatch of progress.
- Back-compat: if only the legacy `LLAMA_PREFILL_BUDGET` is set (new knobs unset), behave
exactly as 0013 (static cap) - 0013 is the degenerate `T = n_batch`, no-leftover case.
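
The knob defaults and the back-compat rule can be written down as one resolution function. This is a sketch of the rules listed above, not the patch itself: `resolve_knobs` and its return shape are hypothetical, and treating "either new knob set" as the switch to the dynamic path is an assumption (the text only says the legacy path applies when the new knobs are unset).

```python
import math


def resolve_knobs(env, n_batch=2048, n_ubatch=512, n_ctx=4096):
    """Map environment settings to a scheduling mode and its limits."""
    mbt = int(env.get("LLAMA_MAX_BATCH_TOKENS", "0") or 0)
    cap = int(env.get("LLAMA_PREFILL_CAP", "0") or 0)
    legacy = int(env.get("LLAMA_PREFILL_BUDGET", "0") or 0)

    if mbt <= 0 and cap <= 0:
        if legacy > 0:
            # Only the legacy knob is set: behave exactly as patch 0013.
            return {"mode": "static-0013", "prefill_budget": min(n_batch, legacy)}
        return {"mode": "stock"}  # nothing set: unchanged behaviour

    # ASSUMPTION: any new knob selects the dynamic path.
    total = min(mbt if mbt > 0 else n_batch, n_batch)  # T, clamped to n_batch
    if cap <= 0:
        # Default per-slot cap: ~4% of the context, at most T, at least one ubatch.
        cap = max(n_ubatch, min(total, math.ceil(0.04 * n_ctx)))
    return {"mode": "dynamic", "T": total, "prefill_cap": cap}


print(resolve_knobs({}))
print(resolve_knobs({"LLAMA_PREFILL_BUDGET": "256"}))
print(resolve_knobs({"LLAMA_MAX_BATCH_TOKENS": "1024"}))
print(resolve_knobs({"LLAMA_MAX_BATCH_TOKENS": "4096"}, n_ctx=32768))
```

At the bench's 4096-token context the 4% rule gives 164 tokens, so the `n_ubatch` floor decides the default cap; the 4% term only takes over at contexts above roughly 12.8k tokens.
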
Pseudocode, mapping to real variables and seams (the `>>` lines are the change vs today):
```
   common_batch_clear(batch);                                  // line 2594
   // PASS 1 - DECODE FIRST (unchanged: lines 2604-2719)
   for (slot : slots) if (slot.state == GENERATING && can_batch_with) generating.push(slot);
   ... speculative draft ...
   for (slot : generating) slot.update_batch(batch);           // appends decode (+draft) tokens
>> D = batch.n_tokens;                      // NEW seam: decode load is now final (after 2719)
>> T = min(LLAMA_MAX_BATCH_TOKENS ? : n_batch, n_batch);
>> prefill_budget_step  = max(0, T - D);    // DYNAMIC leftover, auto-shrinks with D
>> prefill_cap_per_slot = PREFILL_CAP;      // long_prefill_token_threshold analogue
>> n_prompt_budgeted    = 0;                // total prompt tokens added this step (subsumes 0013)
   // PASS 2 - PREFILL FILLS THE LEFTOVER (lines 2753-3330, budget made dynamic + per-slot fair)
   if (cont_batching || batch.n_tokens == 0) {
>>   for (k = 0; k < n_slots; ++k) {        // round-robin start offset (fairness, see P2)
>>     slot = slots[(rr_start + k) % n_slots];
       if (!slot.is_processing() || !can_batch_with) continue;
       if (slot.state == STARTED) slot.state = PROCESSING_PROMPT;  // line 2782 (unchanged)
>>     slot_prompt_added = 0;               // NEW: per-slot chunk counter (reset each slot)
       // inner prompt-fill (lines 3187-3239), guard now triple-bounded:
       while (slot.prompt.n_tokens() < slot.task->n_tokens()
>>            && batch.n_tokens < T                          // was: < n_batch
>>            && n_prompt_budgeted < prefill_budget_step     // was: 0013 static n_prefill_budget
>>            && slot_prompt_added < prefill_cap_per_slot) { // NEW: per-slot cap -> fair distribution
         common_batch_add(batch, cur_tok, pos_next, {slot.id}, need_embd);
         slot.prompt.tokens.push_back(cur_tok);
         slot.n_prompt_tokens_processed++;
         n_prompt_budgeted++; slot_prompt_added++;
         ... checkpoint-boundary breaks (unchanged) ...
       }
       if (slot.prompt.n_tokens() == slot.task->n_tokens()) slot.state = DONE_PROMPT;  // line 3252
       ... checkpoint creation (unchanged) ...
>>     if (batch.n_tokens >= T) break;                       // was: >= n_batch (line 3320)
>>     if (n_prompt_budgeted >= prefill_budget_step) break;  // was: 0013 break (line 3326)
     }
   }
   for (i=0; i<batch.n_tokens; i+=n) { n=min(n_batch,batch.n_tokens-i); llama_decode(view); }  // unchanged
```
The whole change is: (a) compute `prefill_budget_step = T - D` at the new seam after line
2719 instead of reading a static env constant at 2737; (b) bound the inner/outer loops by `T`
and the dynamic budget instead of `n_batch` and the static budget; (c) add `slot_prompt_added`
with `prefill_cap_per_slot` for per-slot fairness; (d) a round-robin start offset so the same
early slots do not always win the leftover.
**Why this holds the decode ceiling without tuning.** `T` bounds total tokens per step ->
bounds step compute time -> decode steps fire at a steady high rate (high decode-steps/sec).
As decode load `D` rises, `prefill_budget_step = T - D` auto-shrinks, so prefill never inflates
the step beyond `T` even at npl128. This is the mechanism by which 0013's hand-tuned 256
reaches 161; here it is reached **automatically across the npl range** because the budget is
`T - D`, not a constant. **Why this closes TTFT.** Prefill gets a positive leftover whenever `D < T`
(`prefill_budget_step = max(0, T - D)`, and `T` is sized so the leftover stays > 0 until the box is
fully decode-saturated), distributed across waiting prompts by `prefill_cap_per_slot`, so every prompt
makes bounded progress every step instead of waiting for a dedicated prefill burst.
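
The budget arithmetic of one step can be exercised in isolation. The following is a toy model of the design, not server code: `form_step` is a hypothetical helper, slots are reduced to a list of remaining prompt lengths, decode is counted as one token per generating slot (draft tokens ignored), and the inner token-by-token loop is collapsed into a single `min`.

```python
def form_step(n_decoding, prompt_remaining, total_budget, prefill_cap, rr_start):
    """Return (D, {slot index: prompt tokens added this step})."""
    decode_tokens = n_decoding                       # Pass 1: ungated, always in
    prefill_budget_step = max(0, total_budget - decode_tokens)
    added = {}
    n_prompt_budgeted = 0
    n_slots = len(prompt_remaining)
    for k in range(n_slots):                         # Pass 2: round-robin order
        if n_prompt_budgeted >= prefill_budget_step:
            break
        slot = (rr_start + k) % n_slots
        take = min(prompt_remaining[slot],           # what the prompt still needs
                   prefill_cap,                      # per-slot fairness cap
                   prefill_budget_step - n_prompt_budgeted)  # leftover T - D
        if take > 0:
            added[slot] = take
            n_prompt_budgeted += take
    return decode_tokens, added


prompts = [4000, 4000, 4000, 100, 4000, 4000]
_, step_a = form_step(100, prompts, total_budget=2048, prefill_cap=512, rr_start=0)
_, step_b = form_step(100, prompts, total_budget=2048, prefill_cap=512, rr_start=4)
_, saturated = form_step(2048, prompts, total_budget=2048, prefill_cap=512, rr_start=0)
print(step_a, step_b, saturated)
```

With 100 decoding slots the leftover is 1948 tokens; the per-slot cap spreads it over several prompts instead of letting the first one take it all, and moving `rr_start` changes which slots are served first. When decode alone fills `T`, prefill gets nothing for that step.
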
## Slot state machine changes (minimal - this is the headline de-risk)
**No new states. No state-transition rewrite.** The existing 6-state machine
(`IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT / GENERATING`, lines 67-72)
already encodes everything:
- "mid-prefill while others decode" = a `PROCESSING_PROMPT` slot coexisting with `GENERATING`
slots in the same step. **Already happens** (Phase 1 and Phase 2 populate one batch).
- "chunked-prefill state per slot" = `(state == PROCESSING_PROMPT) && (prompt.n_tokens() <
task->n_tokens())`. **Already persisted** across `update_slots()` calls; Phase 2 re-enters
the slot and resumes from `prompt.n_tokens()`.
The only **additions** are per-step scheduler scratch, not slot lifecycle state:
1. `slot_prompt_added` - a per-slot, per-step counter (local to the Phase-2 loop body), for
the per-slot chunk cap. Not stored on the slot across steps.
2. A `rr_start` round-robin offset (one `size_t` on the server, advanced each step) so the
leftover budget is distributed fairly across `PROCESSING_PROMPT` slots rather than always
draining the lowest-index slot first (this is what kills the burst-TTFT *spread* - without
it, slot 0's prompt finishes first every time and the last slots starve).
3. Optional, P2: a per-step admission cap `K` on how many `STARTED -> PROCESSING_PROMPT`
transitions begin in one step. This falls out of the budget arithmetic already (a bounded
`prefill_budget_step` with a per-slot floor admits only `~budget/floor` prompts/step), so it
may need no explicit code; if made explicit it is the `max_num_seqs`-style "don't admit a
new prefill if the step is full" guard, mapped onto the pre-allocated `n_parallel` slots.
That is the entire state-machine footprint: two pieces of per-step scratch and an optional cap.
The mission's feared "slot-state rewrite" does not materialize.
## How it supersedes / subsumes patch 0013
| property | 0013 (static cap) | this scheduler (dynamic `T - D`) |
|----------|-------------------|----------------------------------|
| per-step prefill bound | constant `n_prefill_budget` | `T - D`, shrinks as decode load rises |
| decode-load aware | no (ignores `D`) | yes (leftover after decode) |
| works across npl 8..128 with one config | no (256 best @128, net-negative @8) | yes (self-tuning) |
| fair across multiple waiting prompts | no (greedy, slot 0 wins) | yes (`prefill_cap_per_slot` + round-robin) |
| TTFT on bursty arrival | raises it (defers first tokens) | bounded for all prompts |
| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - **kept** |
0013 is the **degenerate case** `T = n_batch` with `prefill_budget_step` pinned to a constant
and no per-slot cap. The patch keeps `LLAMA_PREFILL_BUDGET` working for back-compat (when the
new knobs are unset). When `LLAMA_MAX_BATCH_TOKENS` is set, the static path is replaced by the
dynamic one. **Default (all knobs unset) = byte-identical stock**, exactly like 0013.
## Correctness
- **KV cache during chunked prefill** - unchanged from today. A `PROCESSING_PROMPT` slot already
advances `slot.prompt.tokens` / `pos_next()` chunk by chunk across steps; we only change the
chunk SIZE per step, not how positions or sequence ids are assigned. `common_batch_add`
receives the same `(tok, pos, {slot.id})` tuples in the same order. No new KV state.
- **Determinism** - greedy (temp 0) output can differ from a single-`n_batch`-chunk run only by
the **intrinsic flash-attn chunk-size FP grouping** that 0013 already documented and bounded:
pure stock `-b256` diverges from `-b2048` the same way with this patch inactive; output stays
coherent and answers correctly. The op-level math per token is position-determined and
unchanged; only the FA reduction grouping over a step's token mix shifts. The deterministic
oracle is the CPU backend / the op test (bit-exact); the GB10 CUDA greedy-decode band applies
to end-to-end only, never to the op test.
- **Paged KV (patches 0001-0011)** - **orthogonal**. Paged on-demand block allocation is keyed
by sequence position and slot/stream, which this change does not touch; it changes only which
tokens are in a given `llama_decode`. The in-kernel paged decode read (0009-0011) operates
per-token via the block tables regardless of what prefill tokens are co-batched. Required gate:
run the full P0-P2 suite with `LLAMA_KV_PAGED=1` **and** `=0` and confirm **identical
scheduling decisions** (same per-step token counts, same admission order) - paged must be a
no-op on the scheduler.
- **`can_batch_with` constraint** (line 302) - a batch admits only slots with the same
`task->type` and equal LoRA. Homogeneous-completion serving (the benchmark and the dominant
LocalAI case) satisfies it, so the mixed decode+prefill batch forms freely. Mixed task types /
per-request LoRA fall back to separate batches - a pre-existing bound, not a regression; note
it, do not try to lift it here.
- **Checkpoint interaction (a real, orthogonal serving defect to account for)** - each slot that
reaches `DONE_PROMPT` may call `create_checkpoint` (line 2147), ~149 MiB per checkpoint on the
dense 27B, gated by `n_ctx_checkpoints > 0` (line 3133). Profiling found that under sustained
heavy load the checkpoint subsystem **thrashes**: admission collapsed to one slot every ~13 s,
zero decoding for 290 s, while `/slots` itself serialized behind a 13 s `update_slots` step.
This is **independent** of the decode/prefill mix but it **masks** the scheduler's win if left
on. **P0 must isolate it** (run with `n_ctx_checkpoints=0`), and **P2's admission decision
should be checkpoint-cost-aware** on the 128 GB unified box (do not admit a fresh prefill whose
checkpoint would thrash the pool). Treat as a named co-defect, not part of the core batching
change.
## Phased plan P0 -> P3 (work, payoff, files, risk)
| Phase | Work | Expected payoff (dense / MoE @npl128 unless noted) | Files | Risk |
|-------|------|-----------------------------------------------------|-------|------|
| **P0** baseline + metrics harness | Per-step effective-decode-batch poller (`/slots`), TTFT percentiles (p50/p90/p99/max), `decode_agg` over the fully-overlapped window, decode-ITL (worst freeze / median), **step-time histogram**, admission rate (slots/s reaching GENERATING), checkpoint-event log. Lock the staggered-arrival ceiling (**157.4** dense, all-128 clean) and the all-at-once burst pathology as the two reference traces. Isolate checkpoints (`n_ctx_checkpoints=0`). | dev-tree only: `~/bench/` (reuse `stag.py`, `slot_poll.py`, `h2h_cli.py`, `h2h_moe_sweep.sh`; `stag_128.json`, `h2h_real128b.json`) | **None** (gate). Locks correctness + the 157/333 ceiling so any regression is caught. | Low |
| **P1** unified mixed-batch formation | Replace the static budget read (2737-2747) with the **dynamic `T - D`** computed at the new seam after line 2719; bound the inner/outer Phase-2 loops by `T` (3188, 3320) and `prefill_budget_step` (3326) instead of `n_batch` and the static cap. No per-slot cap, no round-robin yet (that is P2). | `tools/server/server-context.cpp` (seam @2719, knob read, 3188, 3320, 3326); mirror to `0016-paged-continuous-batch-scheduler.patch` | **TTFT**: removes the burst penalty 0013 inflicts - staggered TTFT ~2 s, burst TTFT collapses toward vLLM's ~25 s / 8 s. **Decode**: holds the ceiling **(~161 / ~333)** *without per-workload tuning* (0013 needed 256 hand-picked). No new throughput beyond the ceiling - by design. | Low-Med (loop-bound edits in a hot path; default-off gate makes it byte-identical stock) |
| **P2** scheduling policy / fairness | Add `slot_prompt_added` + `prefill_cap_per_slot` (the `long_prefill_token_threshold` analogue) and the **round-robin start offset**; optional explicit per-step admission cap `K` + checkpoint-cost-aware admission. Tune `T`, `PREFILL_CAP` on GB10 (dense vs MoE, npl 8/32/64/128). | `server-context.cpp` (Phase-2 loop body @2753-3330, server-level `rr_start`); `grpc-server.cpp` (options `max_batch_tokens`/`mbt`, `prefill_cap` @781-791) | **TTFT spread**: bounds first-token latency for **all** concurrently-arriving prompts (kills the burst-TTFT spread, e.g. dense max 305 s -> single-digit-s on staggered, bounded on burst). **Robustness**: no admission collapse under sustained load; decode batch stays maximal so the *time-averaged* decode_agg on real (non-burst) traffic rises toward the staggered 157/333 because slots reach GENERATING fast. | Med (fairness + admission logic; e2e coherence + A/B vs 0013 required) |
| **P3** residual decode throughput | **Honest boundary: this is the decode-KERNEL lever, NOT the scheduler.** The scheduler has delivered TTFT + robustness + ceiling-hold. Closing the residual 2.4x (161 -> 391 dense, 333 -> 811 MoE) requires paged-attention **decode-kernel** batch-scaling (patches 0009-0011 territory) and/or **CUDA-graphed decode** (the now-uniform decode-only step is graph-capturable). Scope/track separately. | (separate scope) `ggml/src/ggml-cuda/` decode-read kernels; optional CUDA-graph capture seam in `update_slots` | This is **where 391/811 would come from**; it is **out of this scope's mandate** and must not be charged against the scheduler. The scheduler makes the decode step uniform (a precondition that *helps* a future graph capture). | High (kernel work; the GB10 occupancy wall, see below) |
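
The TTFT percentile set that P0 locks (p50/p90/p99/max) is small enough to sketch. This is a minimal illustration, assuming TTFT samples have already been collected in seconds; `percentile`, `ttft_summary` and the nearest-rank convention are hypothetical choices, not taken from the `~/bench/` scripts.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(samples)
    # Integer ceil(pct * n / 100) avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def ttft_summary(ttft_seconds):
    """Return the P0 TTFT metric set for one run."""
    return {
        "p50": percentile(ttft_seconds, 50),
        "p90": percentile(ttft_seconds, 90),
        "p99": percentile(ttft_seconds, 99),
        "max": max(ttft_seconds),
    }
```

Reporting `max` next to `p99` matters here because the burst pathology shows up in the tail (the 305 s dense max), not in the median.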
**Per-phase payoff vs the mission targets (TTFT 25 s / 8 s, decode 391 / 811 @npl128):**
- **TTFT 25 s / 8 s** - reached by **P1 + P2** (the 12x gap is the scheduler's to close; on
staggered arrival it goes below the vLLM burst figure to ~2 s).
- **Decode 391 / 811** - **NOT a P1/P2 deliverable.** P1/P2 hold **161 / 333** (= ~41% of vLLM,
the kernel ceiling) robustly and tuning-free. The remaining ~2.4x is **P3 kernel**, a separate
lever. Pre-registering this split is the point: the scheduler is judged on TTFT + holding the
ceiling, the kernel on the throughput residual.
## GB10 considerations
- **Bandwidth floor ~273 GB/s** is the *cause* of the decode ceiling (NVFP4 weight-read +
paged-KV gather per step). The scheduler cannot lift a bandwidth/kernel floor - it can only
keep the batch *at* the ceiling. Size `T ~= n_batch` (2048) so the compute step stays a single
`llama_decode`; `n_ubatch` (512) governs the internal split.
- **`T` is the ITL/TTFT trade knob** (vLLM's `max_num_batched_tokens`): larger `T` = more
prefill/step = faster TTFT but bigger per-step ITL spike; smaller `T` = smoother ITL, slower
TTFT. Because the budget is `T - D`, the spike is bounded at `T` regardless of decode load.
Default `T = n_batch`; expect to tune down toward ~1024 for ITL-sensitive serving.
- **Checkpoint ~149 MiB/slot thrash** on the 128 GB unified box - admission must be
checkpoint-cost-aware (P2); P0 measures with checkpoints off to isolate the batching win.
- **Memory**: paged on-demand KV (dense 52->94 GB, MoE 39->61 GB across npl) vs vLLM's flat
~112 GB pre-reservation - llama's standing multi-tenant advantage, unaffected by this change.
- **Eager mode** both engines today; **CUDA-graphed decode** is the P3 kernel lever, and the
scheduler's uniform decode-only step is a precondition that *helps* a future capture.
## Biggest risks and how to de-risk
1. **"Slot-state rewrite" (the feared big risk) = actually LOW.** The mid-prefill-while-others-
decode state and the chunked-prefill resume already exist ([B]); we add only per-step scratch
(`slot_prompt_added`, `rr_start`), not lifecycle states. **De-risk**: keep all 6 states
untouched; gate every change behind the new knobs; default-off = byte-identical 0013/stock,
verified by an A/B diff of per-step token counts.
2. **Correctness regression in the mixed batch = the FA chunk-grouping nondeterminism.** Already
documented and bounded by 0013 (stock `-b256` vs `-b2048` diverge identically). **De-risk**:
op-test bit-exact where deterministic; greedy-coherence e2e on both models; A/B vs 0013 with
the new knobs set to reproduce 0013 (`T = n_batch`, no leftover) and confirm **byte-identical**
to 0013.
3. **Paged-KV interaction = LOW (orthogonal positions).** **De-risk**: run the whole P0-P2 suite
with `LLAMA_KV_PAGED=1` and `=0`; assert identical scheduling decisions (paged must be a
no-op on batch formation). This is a hard gate, not a spot check.
4. **Checkpoint thrash masks the win = MEDIUM.** A real serving defect that can swamp the
scheduler's signal. **De-risk**: P0 isolates it (`n_ctx_checkpoints=0`); P2 makes admission
checkpoint-cost-aware; report the scheduler metrics both with and without checkpoints so the
batching win is legible independent of the checkpoint co-defect.
5. **Honest-payoff risk = the decode_agg number barely moves over 0013 (kernel ceiling), so the
work can be mis-judged as "no win."** This is the most important risk to manage. **De-risk**:
frame and measure on **TTFT percentiles, burst-TTFT spread, step-time histogram, admission
rate, and tuning-free ceiling-hold across npl/dense/MoE** - the axes the scheduler actually
moves - and **pre-register the decode-kernel as the separate residual-closer** (P3) so the
scheduler is never charged with the 391/811 number the kernel forbids.
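
The "A/B diff of per-step token counts" gate from risk 1 can be expressed as a first-divergence check. A minimal sketch, assuming each run logs one `(n_decode_tokens, n_prefill_tokens)` pair per scheduler step; the trace format and the name `first_divergence` are hypothetical.

```python
def first_divergence(trace_a, trace_b):
    """Return the first step index where two per-step traces differ, or None.

    Each trace is a list of (n_decode_tokens, n_prefill_tokens) tuples,
    one per scheduler step. A length mismatch counts as a divergence at
    the first step that exists in only one trace.
    """
    for step, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return step
    if len(trace_a) != len(trace_b):
        return min(len(trace_a), len(trace_b))
    return None
```

With the new knobs unset, the gate passes only if this returns `None` against the 0013/stock trace; the same helper can serve the `LLAMA_KV_PAGED=1` vs `=0` identical-decisions check in risk 3.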
## Commit / hygiene
Scope doc only (this file). **No engine change committed in this workflow.** Bench and parity
scripts stay dev-tree-only (`~/bench/`, `~/llama-paged-dev/benches/`). When P1/P2 are
implemented they mirror to
`backend/cpp/llama-cpp/patches/paged/0016-paged-continuous-batch-scheduler.patch` (next free
slot after 0015) and the LocalAI option lands in `grpc-server.cpp`
beside `max_prefill_tokens`. Commit with `git commit -s`, trailer
`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes. Do not push
(human pushes).
---
## Review / risk (adversarial, source-verified)
Skeptical staff review against the actual source at HEAD `151343b` (server-context.cpp,
llama-batch.cpp, llama-kv-cache.cpp, paged-*.cpp), grpc-server.cpp in this worktree, and the
committed `QWEN36_NVFP4_BENCH.md` plus the vLLM H2H serve logs/scripts on the box.
### Verdict: the scope is SOUND. GO on P0 -> P1, CONDITIONAL P2, separate-track P3.
The central de-risking claims check out against the code, and the load-bearing honesty (decode
residual is a kernel ceiling, not a scheduler defect) is correct and now further corroborated.
Two calibration fixes are required before P1 (below); neither changes the go decision.
### (1) Tractability - CONFIRMED bounded; zero libllama changes. What enables/blocks it, concretely:
- **Enables (already-exercised path, not new surface).** A mixed prefill+decode ubatch with
per-seq different `n_past` is the *existing* behaviour. `llama_batch` carries per-token `pos`
and `seq_id` (`common_batch_add(batch, tok, pos_next(), {slot.id}, ...)`); `llama_kv_cache` +
`paged_alloc::place()` place each `(seq, pos)` independently; `llama_kv_cache::init_batch`
(line 742) already splits the mixed batch into ubatches. **The server emits exactly this mixed
decode+prefill batch today** - patch 0013 ships it and produces coherent output - so the new
scheduler changes only the *count* of prefill tokens, never the batch *structure*. There is no
`llama_decode`/ubatch/KV rewrite in scope.
- **Blocks: nothing in libllama.** The only constraints are pre-existing and orthogonal to the
target workload: (i) `can_batch_with` (same task type + equal LoRA per batch); (ii)
`split_equal(sequential=true)` errors on *coupled* sequences (shared-prompt parallel sampling),
forcing `-kvu`. Neither is introduced by this change.
- **Correction to fold in:** the scope's [C] and the pseudocode imply contiguous `split_simple`
chunking. The real serving/benchmark config (`--parallel 128`, `kv_unified` default = `false`
-> `n_stream = n_seq_max = 128`) takes the **`split_equal(n_ubatch, sequential=true)`** path
(llama-kv-cache.cpp:742), which balances per-sequence rather than slicing contiguously. This
does not break anything (0013 already hits it) but it means the actual scheduled object is a
split_equal ubatch set; P0 must characterize that ubatch shape (not assume contiguous 512-chunks)
and the determinism band is over split_equal groupings. Lock the split path (unified vs not) in
the A/B so the byte-identical-to-0013 gate is meaningful. grpc seam [E] verified at
grpc-server.cpp:761-786 (`kv_paged`, `max_prefill_tokens`/`mpt`); new `mbt`/`prefill_cap` knobs
hang off it identically.
### (2) Does it close the gap - the 2.4x is NOT CUDA graphs, and the TTFT root is quantified.
- **CUDA graphs ruled out (verified).** Both NVFP4 H2H vLLM servers ran `--enforce-eager`
(`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`; engine logs show `enforce_eager=True`,
`cudagraph_mode=NONE`, `CompilationMode.NONE`). So the npl128 2.4x decode gap is a genuine
**eager-mode kernel + per-step host-overhead** gap (ggml graph rebuild/realloc + ~1k kernel
launches per step on the weak Grace cores, paged-KV gather, MoE expert gather). The scheduler
cannot touch it; the staggered all-128-decoding 157.4 tok/s ceiling is solid. Scope is right to
refuse the 391/811 number. (CUDA graphs are a future *both-sides* lever, not the current cause.)
- **The TTFT gap has a measured root the scope under-uses: prefill_tps collapse.** From the bench,
llama `prefill_tps` falls 1117 -> 752 -> 465 -> **125** (dense, npl 8/32/64/128) while vLLM holds
**flat ~1420** (MoE: 2813 -> 657 vs vLLM flat ~4263). That collapse - not a separate "scheduling
quality" abstraction - is the direct cause of the 491 s / 85 s TTFT, and it is exactly what the
dynamic `T - D` budget attacks: when decode load `D` is low (early in a burst) the leftover
`T - D` lets prefill take ~`n_batch` per step, and because llama's *larger per-step chunk*
compensates for its ~2.4x slower steps, a `T = 2048` budget can sustain prefill_tps at or above
vLLM's ~1420 during the drain. **So burst-TTFT parity is mechanically plausible, not just
"toward"** - the static budget-256 throttles prefill to 256/step (hence its weak 305 s) where the
dynamic budget would not. This strengthens P1's case beyond what the doc claims.
- **Mandatory calibration fix:** that TTFT win **couples to a decode-ITL knob**. Spending the full
`T - D` on prefill during the drain makes those steps full `T`-token (mixed) computes, so
co-batched decoders get 1 token per slow step (ITL spike) *during the drain* - precisely vLLM's
tradeoff, navigated by `T`. The 157/333 ceiling is the **post-drain steady state**, not the
drain phase. Therefore the scope must **co-report drain-phase decode-ITL alongside TTFT** and
treat `T` as the published trade knob; reporting TTFT alone would hide the cost and reporting
decode_agg alone would hide the win (it is averaged across drain + steady state, which is why it
"barely moves"). Soften "P1+P2 reach 25 s / 8 s": the defensible claim is *staggered/realistic
arrival ~2 s, and all-at-once burst approaching vLLM with a tunable decode-ITL cost*.
### (3) Correctness - paged orthogonality confirmed at source; the real risks are config, not code.
- **Paged-KV is the same `llama_kv_cache` class** with `paged_alloc::` hooks inside the existing
find_slot/placement (llama-kv-cache.cpp:1043-1083), driven by per-slot `(seq, pos)` - which this
change does not touch. `init_batch`/split is paged-agnostic. The scope's "orthogonal" claim is
verified, not asserted. Keep the hard `LLAMA_KV_PAGED=1` vs `=0` identical-decisions gate.
- **Determinism**: the FA grouping nondeterminism is over **split_equal** ubatches in the real
config; the `T = n_batch` A/B-must-be-byte-identical-to-0013 gate is the right oracle and is
sound (default-off path is untouched).
- **Low-concurrency regression**: gated to byte-identical when knobs unset; the only live vector is
a **mis-tuned `T`** spiking ITL at low npl (the scope already flags `T` defaults). Config hygiene,
not a code risk. Add a guard/floor so `T` cannot be set below `n_ubatch`.
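
The `T` guard suggested in the last bullet could look like the sketch below. It assumes `T` arrives as an integer option where a non-positive value means "unset" (the default-off path), and it also clamps `T` to `n_batch` on the upper side; both the function name and the upper clamp are assumptions of this sketch, not behaviour of the existing option parser.

```python
def resolve_batch_token_budget(requested_t, n_batch, n_ubatch):
    """Clamp the per-step token budget T to [n_ubatch, n_batch].

    requested_t <= 0 is treated as "unset" and returns None, meaning the
    gate stays off and the stock path runs unchanged.
    """
    if requested_t <= 0:
        return None
    return max(n_ubatch, min(requested_t, n_batch))
```

Clamping rather than rejecting keeps a mis-tuned `T` from spiking ITL at low npl; rejecting with an error would be the stricter alternative.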
### (4) Smaller higher-ROI step - yes, and the scope already contains it (P1).
The minimal high-ROI change is **P1 alone**: replace the static read (server-context.cpp:2737-2747)
with `prefill_budget_step = max(floor, T - batch.n_tokens)` computed after the decode-fill at line
2719, and bound the Phase-2 loops by `T` / that budget (3188, 3320, 3326). That is a handful of
line edits at named seams, default-off, and it captures the self-tuning + the bulk of the TTFT win.
The even-smaller validation spike: a one-line `n_prefill_budget = max(floor, T - batch.n_tokens)`
to confirm the prefill_tps/TTFT mechanism before writing the full P1. **P2** (round-robin +
`prefill_cap_per_slot` + checkpoint-aware admission) is genuinely higher-effort and lower-marginal
(it buys TTFT *spread*/tail and burst robustness, not the median); **gate P2 on P1's measured
burst-TTFT-spread and drain-ITL**, do not commit to it up front. There is no smaller step that also
fixes the static budget's npl-dependence - tuning 0013's constant cannot (256 is net-negative at
npl8 and costs MoE TTFT), so P1 is the floor.
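
To make the static-vs-dynamic difference concrete, the toy model below counts scheduler steps needed to prefill a burst. It is a sketch under stated assumptions, not the server logic: prompts are served strictly in order, every slot that finishes prefill adds one decode token to all later steps, and step duration is ignored (a full `T`-token step is slower than a 256-token one, so step counts are not wall time). `steps_to_drain` and the `floor` value are hypothetical.

```python
def prefill_budget_step(t_budget, n_decode_tokens, floor):
    """Dynamic per-step prefill budget: max(floor, T - D)."""
    return max(floor, t_budget - n_decode_tokens)


def steps_to_drain(prompt_lens, budget_fn):
    """Count steps until every prompt is fully prefilled.

    budget_fn(D) returns the prefill tokens allowed in a step that already
    carries D decode tokens; it must return a positive value.
    """
    remaining = list(prompt_lens)
    n_decoding = 0
    steps = 0
    while any(remaining):
        budget = budget_fn(n_decoding)
        for i, left in enumerate(remaining):
            if budget == 0:
                break
            take = min(left, budget)
            remaining[i] -= take
            budget -= take
            if left > 0 and remaining[i] == 0:
                n_decoding += 1  # slot starts decoding from the next step
        steps += 1
    return steps


burst = [128] * 128  # npl128, 128-token prompts
static_steps = steps_to_drain(burst, lambda d: 256)
dynamic_steps = steps_to_drain(
    burst, lambda d: prefill_budget_step(2048, d, floor=64)
)
```

In this toy setup the static 256-token cap needs 64 steps to admit the burst while the `T - D` budget needs far fewer, which is the mechanism behind the burst-TTFT claim; the drain-phase decode-ITL cost of those larger steps is exactly what the calibration fix in (2) asks to co-report.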
### Realistic effort / payoff and sequencing
- **P0** ~0.5-1 wk (harness largely exists in `~/bench/`): add drain-phase decode-ITL to the metric
set, lock the split path, isolate checkpoints (`n_ctx_checkpoints=0`). Gate only.
- **P1** ~2-4 days: small diff + the A/B-vs-0013 byte-identical gate + the npl/dense/MoE sweep.
Payoff: self-tuning hold of 161/333 with no hand-picked constant; burst-TTFT 3-10x better than
0013 (plausibly approaching vLLM on the burst, parity on staggered), at a published `T`-tunable
decode-ITL cost. **This is the high-ROI core and the clean supersession of 0013.**
- **P2** ~1-2 wk, conditional: fairness/admission + checkpoint-cost-awareness + tuning. Payoff: TTFT
tail/spread + no admission collapse under sustained load. Worth it only if P1 metrics show a
residual spread/robustness problem.
- **P3** separate track, high effort: the *only* path to 391/811 is the eager-kernel + per-step
host-overhead residual. Highest-value probe is a **CUDA-graph capture of the steady-state
pure-decode step** - but note this works *independent of the scheduler* (the all-128-decoding
step is already fixed-shape today); the scheduler neither blocks nor specially enables it, so do
not credit graphs to the scheduler. The scope's "uniform decode step is a precondition" is a mild
over-claim; correct it to "graphs apply to the pure-decode steady state, which the scheduler does
not change."
### Bottom line
GO. The work is correctly localized to `update_slots()` batch-formation policy, requires no
libllama changes (the mixed per-seq batch is the existing, shipping path), and supersedes 0013
cleanly. The honest ceiling is real and well-stated; the two fixes are (a) co-report drain-phase
decode-ITL with TTFT and stop selling/charging the decode_agg number, and (b) acknowledge the
`split_equal`/`n_stream=128` path in the determinism and ubatch-shape analysis. Sequence
P0 -> P1, measure, then decide P2; keep P3 (kernel/CUDA-graph) on its own track as the sole owner
of the 2.4x throughput residual.

@@ -1,106 +0,0 @@
# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact)
The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3).
After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at
the f32 byte floor), the conv path was the only remaining bit-exact decode lever.
## What changed
A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a
non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain:
    qkv_mixed transpose -> ggml_concat (build width-K window)     [concat_cont 8.14 ms/step]
      -> ggml_ssm_conv (depthwise conv)                           [ssm_conv_f32 ~8.6 ms/step]
      -> ggml_silu                                                [folded into ssm_conv on CUDA]
      -> ggml_cpy of the shifted ring state into the conv cache   [cpy_scalar 5.76 ms/step]
with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from
the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME
ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes
the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot
the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is
vLLM's `causal_conv1d_update`.
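
The equivalence the fused kernel relies on can be shown for a single channel in plain Python. This is an illustrative reference only, assuming a kernel width K >= 2 and a state list holding the K-1 cached taps oldest-first; the function names are hypothetical and the real op works on ggml tensors.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def conv_update_unfused(state, kernel, x_cur):
    """Baseline chain for one channel: concat -> conv -> silu -> shifted copy."""
    window = state + [x_cur]            # concat: K-1 cached taps + new token
    acc = 0.0
    for tap, w in zip(window, kernel):  # ascending-tap multiply-add
        acc += tap * w
    return silu(acc), window[1:]        # new state drops the oldest tap


def conv_update_fused(state, kernel, x_cur):
    """Fused variant: same ascending-tap order, state updated in place."""
    k = len(kernel)
    acc = 0.0
    for i in range(k - 1):
        acc += state[i] * kernel[i]
    acc += x_cur * kernel[k - 1]
    for i in range(k - 2):              # shift the ring state by one token
        state[i] = state[i + 1]
    state[k - 2] = x_cur
    return silu(acc)
```

Because both variants accumulate the taps in the same order, their floating-point results are identical, which is the property the bit-exactness gate below checks end to end.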
Files:
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace`
(src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs],
src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu).
- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32<apply_silu,d_conv>` (one thread per
(channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of
`ggml_cuda_op_ssm_conv`.
- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over
channels) + branch in `ggml_compute_forward_ssm_conv`.
- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs
conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write
target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed.
- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path
(`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/
rollback keep the existing concat+ssm_conv+silu+cpy chain.
- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output
vs the CPU reference across backends.
## Gate: test-backend-ops (CUDA0 vs CPU reference)
- SSM_CONV: 45/45 OK (unchanged path intact)
- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
- SSM_CONV_BIAS_SILU: 90/90 OK
## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on)
Byte-identical to the clean Lever-1 (0019/0020) baseline, both models:
| model | baseline md5 | fused md5 | result |
|--------------------|----------------------------------|----------------------------------|-----------------|
| q36-27b-nvfp4 | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL |
| q36-35b-a3b-nvfp4 | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL |
## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after
Dense q36-27b-nvfp4:
| mode | npl | baseline | fused | delta |
|-----------|-----|----------|--------|---------|
| CUDA-graph| 32 | 199.76 | 202.99 | +1.6% |
| CUDA-graph| 128 | 336.35 | 347.14 | +3.2% |
| eager | 32 | 196.07 | 197.61 | +0.8% |
| eager | 128 | 333.62 | 342.97 | +2.8% |
MoE q36-35b-a3b-nvfp4:
| mode | npl | baseline | fused | delta |
|-----------|-----|----------|--------|---------|
| CUDA-graph| 32 | 421.72 | 432.39 | +2.5% |
| CUDA-graph| 128 | 689.74 | 713.54 | +3.5% |
| eager | 32 | 421.05 | 432.46 | +2.7% |
| eager | 128 | 689.15 | 713.87 | +3.6% |
Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at
**88.8% of vLLM 391** (up from 86.0%). The lift holds in BOTH graph and eager modes.
## Step time + nsys kernel delta
Per-step decode time (dense npl128, T_TG / ntg=128):
- baseline 48.711 s / 128 = 380.6 ms/step
- fused 47.197 s / 128 = 368.7 ms/step -> **-11.9 ms/step** (close to the predicted 12-14 ms/step saving)
- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step)
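
The per-step figures above follow directly from the batched-bench totals; a short check of the arithmetic (helper name is illustrative):

```python
def step_ms(t_tg_seconds, n_steps):
    """Average decode step time in ms from total token-generation time."""
    return t_tg_seconds / n_steps * 1000.0


baseline_ms = step_ms(48.711, 128)
fused_ms = step_ms(47.197, 128)
saved_ms = baseline_ms - fused_ms
```

Unrounded, the saving is about 11.8 ms/step; the 11.9 figure is the difference of the two rounded per-step values.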
nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels:
| kernel | baseline calls | fused calls | per-step (eager) |
|---------------------|----------------|-------------|------------------|
| concat_cont (decode)| 1152 | 0 (GONE) | 7.95 -> 0 ms |
| cpy_scalar (decode) | 1152 of 3648 | 0 (GONE) | 4.29 -> 0 ms |
| ssm_conv_f32 (decode)| 1152 of 2736 | 0 (prefill-only) | 8.65 -> 0 ms |
| ssm_conv_update | 0 | 1152 | 0 -> 7.56 ms |
Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont
and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel.
prefill keeps the original chain (concat_non_cont 1584, ssm_conv_f32 1584 unchanged).
## Verdict
Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM
391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free;
de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,639 +0,0 @@
# Critical-Path Gap Analysis - GDN decode region
## vllm-gdn-compare (READ-ONLY, no GPU) - vLLM decode GDN kernel inventory vs llama
### Source ground truth
- Local checkout `/home/mudler/_git/vllm` and the DGX's benchmarked venv
`/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm` are STRUCTURALLY
IDENTICAL (same file `qwen_gdn_linear_attn.py`, byte-for-byte same line numbers
1287/1344/1457/1644/1684). So the analysis below is faithful to what was actually
benchmarked on the GB10. Both are a recent dev build (`__version__ = "dev"`), same
era as the "0.23.0" reference; the GDN path is the refactored
`vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`.
### The headline: vLLM runs the entire GDN region at decode as 2 Triton kernels + a gated RMSNorm + 3 GEMMs, ALL fused
Per Qwen3.5 gated-DeltaNet (linear-attn) layer, vLLM decode launches:
| # | Kernel | What is folded in |
|---|--------|-------------------|
| 1 | `in_proj_qkvz` GEMM | (quantized matmul - shared with llama) |
| 2 | `in_proj_ba` GEMM | (quantized matmul - shared with llama) |
| 3 | `_causal_conv1d_update_kernel` (causal_conv1d.py:1193) | conv1d **+ silu activation fused in** (the `activation` arg) |
| 4 | `fused_recurrent_gated_delta_rule_packed_decode_kernel` (fused_recurrent.py:256-336) | **l2norm(q), l2norm(k), scale, softplus gate, A_log decay exp(g), sigmoid(beta), the delta-rule recurrence (b_h*=exp(g); delta update), the output b_o=sum(b_h*b_q), AND the SSM state write-back** - all in one kernel |
| 5 | `RMSNormGated` (gated rms_norm) | **output gate silu/sigmoid * z fused into the rms_norm**; the comment notes the norm+quant is further fusable by the compilation pass (`fuse_norm_quant`) |
| 6 | `out_proj` GEMM | (quantized matmul - shared with llama) |
So the GDN-region "glue" elementwise op count in vLLM is effectively ZERO separate
launches. Everything llama runs as standalone ggml nodes - conv-silu, gate
sigmoid, softplus, l2norm, scale, decay mul, residual add, gather - is absorbed
into kernels #3, #4, and #5.
Verified kernel bodies:
- `fused_recurrent_gated_delta_rule_packed_decode_kernel` lines 313-336:
`b_q/sqrt(sum(b_q^2)+eps)`, `b_k/sqrt(...)` (l2norm), `b_q*scale`,
`softplus_x=where(x<=thr, log(1+exp(x)), x)`, `g_val=-exp(A_log)*softplus_x`,
`beta_val=sigmoid(b)`, `b_h*=exp(g_val)`, `b_v-=sum(b_h*b_k)`, `b_v*=beta_val`,
`b_h+=b_v*b_k`, `b_o=sum(b_h*b_q)`, `tl.store(p_o,...)`, `tl.store(p_ht,...)`.
ONE kernel = recurrence + ALL gating + l2norm + state writeback.
- The non-packed variant `fused_sigmoid_gating_delta_rule_update_kernel`
(fused_sigmoid_gating.py:24-179) is the same fusion (used for the spec-decode /
mixed-batch path); both fold gate+l2norm+recurrence+writeback into one launch.
- Decode dispatch: `_forward_core` (line 1286-1298) routes pure non-spec decode to
`_forward_core_decode_non_spec` (line 1644), which calls exactly
`causal_conv1d_update` (#3) then `fused_recurrent_gated_delta_rule_packed_decode`
(#4). `_output_projection` (line 851) does `self.norm(core_attn_out, z)` (#5,
gated rmsnorm) then `out_proj` (#6).
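
The sequence of operations listed for kernel #4 can be written out as a single-head, single-token reference in plain Python. This is a readability sketch transcribed from the expressions quoted above, not the Triton kernel: the state layout (`h[i][j]`, key dim by value dim), the `eps` and `threshold` defaults, and all function names are assumptions.

```python
import math


def l2norm(vec, eps=1e-6):
    inv = 1.0 / math.sqrt(sum(x * x for x in vec) + eps)
    return [x * inv for x in vec]


def softplus(x, threshold=20.0):
    # Matches where(x <= thr, log(1 + exp(x)), x)
    return math.log1p(math.exp(x)) if x <= threshold else x


def gdn_decode_step(h, q, k, v, x, b, a_log, scale):
    """One decode step for a single head; h is updated in place.

    h: state as h[i][j] (i over key dim, j over value dim)
    q, k: key-dim vectors; v: value-dim vector
    x: gate pre-activation; b: beta pre-activation; a_log: log decay rate
    Returns the output vector (value dim).
    """
    q = [qi * scale for qi in l2norm(q)]
    k = l2norm(k)
    g = -math.exp(a_log) * softplus(x)
    beta = 1.0 / (1.0 + math.exp(-b))
    decay = math.exp(g)
    dk, dv = len(k), len(v)
    for i in range(dk):
        for j in range(dv):
            h[i][j] *= decay                      # b_h *= exp(g)
    # b_v -= sum(b_h * b_k); b_v *= beta
    delta = [
        beta * (v[j] - sum(h[i][j] * k[i] for i in range(dk)))
        for j in range(dv)
    ]
    for i in range(dk):
        for j in range(dv):
            h[i][j] += k[i] * delta[j]            # b_h += b_v * b_k
    # b_o = sum(b_h * b_q)
    return [sum(h[i][j] * q[i] for i in range(dk)) for j in range(dv)]
```

Everything between the two state loops is elementwise prologue work; in llama these are the separate `l2_norm`, `sigmoid`, `k_bin_bcast` and related nodes, which is why folding them into the recurrence removes launches rather than FLOPs.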
### vLLM ALSO captures decode in a FULL CUDA graph - the launch bubbles are gone entirely
`vllm/v1/attention/backends/gdn_attn.py`:
- `_cudagraph_support = AttentionCGSupport.UNIFORM_BATCH` (line 82)
- `use_full_cuda_graph = cudagraph_mode.has_full_cudagraphs()` (line 113)
- `build_for_cudagraph_capture` (line 509): "only decode is supported for full
cudagraphs with Mamba" / "GDN only supports decode-only full CUDAGraph capture".
So at decode vLLM captures the WHOLE forward (all 64 layers of the dense model: the 48 GDN
linear-attn layers + the 16 1-in-4 full-attn layers, plus projections, conv, recurrence, gated rmsnorm)
into a single replayed CUDA graph. Per-kernel host launch latency and the
data-dependent inter-op gaps are eliminated at replay time. Even the 2 Triton
kernels per GDN layer incur no host-side launch bubble during graph replay.
### Why this is the 62%-vs-40% explanation (not GEMM throughput)
- llama runs the GDN region as ~7-9 separate ggml nodes per layer at decode
(`ssm_conv`, `gated_delta_net` recurrence, `gdn_gather`, `k_bin_bcast` mul,
`silu`, `sigmoid`, `l2_norm`, `op_add`, `concat`), each a host-launched kernel,
serially data-dependent (conv -> gate -> recurrence -> gather), with the gating
elementwise wedged between recurrence steps. Each launch + the dependency stall
is a bubble ON the critical path. x48 layers x ~8 ops = ~384 launch bubbles/step.
- vLLM has 2 fused Triton kernels per GDN layer AND wraps them in a CUDA graph, so
the GDN-region inter-op bubble count at decode is ~0. The recurrence kernel
itself is already near-parity in llama (gated_delta_net 1.47 ms/call vs vLLM).
The gap is the surrounding launch/sync overhead, which is exactly the 60% idle
measured (llama ~40% busy vs vLLM 62%).
- This matches why P2a and Lever 2 were FLAT: they shrink GPU-busy kernels that are
already overlapped with the 42% mul_mat_q GEMM. The real wall-clock lever is the
SERIAL GDN gating chain's launch bubbles, which vLLM removed by (a) fusion into
the recurrence kernel and (b) CUDA-graph capture.
### What llama would need to match vLLM (two independent wins, either helps)
1. **Op fusion (Lever 3).** Collapse the GDN per-layer gating chain into the
recurrence kernel: fold conv-silu, q/k l2norm, scale, softplus+A_log gate,
sigmoid-beta, the exp-decay mul, the residual add, and the SSM-state write-back
INTO the `gated_delta_net` CUDA kernel (and fuse the output gate silu*z into the
final rms_norm). Target: from ~8 GDN nodes/layer down to ~2 (conv-fused +
recurrence-fused), mirroring vLLM's `fused_recurrent_gated_delta_rule_packed_decode`.
The conv silu fold and the l2norm/scale/gate fold are the high-value pieces -
they are pure elementwise prologues sitting ON the serial chain between conv and
recurrence.
2. **CUDA-graph the decode step.** Even without fusion, capturing the decode forward
in a CUDA graph removes the per-node host launch latency for all ~384 nodes/step.
(Prior A.2 work flagged ggml-cuda graph capture as the orthogonal lever; the
measured GDN structure here is exactly why it should move the wall.) vLLM gets
BOTH; llama gets neither today.
### Bottom line for the gap-analysis agent
The candidate explanation is confirmed at the source level: vLLM's GDN decode region
is 2 fused Triton kernels under a full CUDA graph vs llama's ~8 separate
host-launched, serially-dependent ggml nodes. That launch/bubble delta - not GEMM
compute - is the 62%-vs-40% busy gap. A timeline gap analysis on the existing nsys
trace should show idle GPU between the GDN sub-ops (conv -> gate -> recurrence ->
gather) per layer; if it does, Lever 3 (gating-into-recurrence fusion) and/or
decode CUDA-graph capture are the levers that will move the wall, unlike P2a/Lever 2.
---
## roofline-decode (READ-ONLY, no GPU) - decode-step roofline + bubble budget + parity target
Goal: bound the q36-27b-nvfp4 decode step by the bandwidth floor and the compute floor,
compare to measured llama 384 ms/step vs vLLM 327 ms/step, size the unexplained "bubble
budget", and state the step-time target for parity. Cross-checks vllm-gdn-compare above.
### Inputs (measured / GGUF metadata, no new GPU work)
- DGX GB10 (sm_121): LPDDR5x **273 GB/s**, dense NVFP4 MMA peak ~**500 TFLOP/s** (sparse ~1 PFLOP/s).
Both numbers are shared identically by llama and vLLM (same HW, same weights).
- q36-27b-nvfp4 GGUF (arch qwen35): block_count **64** (full_attention_interval 4 ->
**16 full-attn + 48 GDN** layers), d_model 5120, FFN 17408, attn 24 heads / 4 kv-heads,
head_dim 256, ssm conv_kernel 4 / state_size 128 / group_count 16 / inner_size 6144.
Weight file = **18.804 GB** (NVFP4 + FP8 block-scales + f16 norms/embd), fully GPU-resident.
- Measured llama decode (dense_base.out, -fa -npp128 -ntg128 -npl128, B=128, 128 TG steps):
T_TG 49.154 s / 128 = **384.0 ms/step** (S_TG 333.3 tok/s; matches STATE "~381 ms").
- vLLM dense ref **391 tok/s @128** => 128/391 = **327.4 ms/step**.
### 1. Bandwidth floor (bytes that MUST cross LPDDR5x per step / 273 GB/s)
| term | bytes/step | basis |
|------|-----------|-------|
| Weights (batched, read ONCE/step, reused across all 128 seqs) | ~18.4 GB | file 18.804 minus ~0.4 GB sparse input-embd lookup; lm_head fully read |
| SSM state R+W (48 GDN layers x 128 seqs) | ~19 GB (bracket 10-38) | ~1.5 MB/layer/seq R+W, kernel-grounded: gated_delta_net 1.47 ms/call -> ~400 MB/call @273 GB/s; theoretical d_inner*d_state f32 doubles it |
| KV cache read (16 attn layers, avg 192 ctx, 128 seqs, f16) | ~1.6 GB | 64 KiB/tok over 16 layers; max-ctx 256 -> 2.1 GB |
| Activation/quantize/gate intermediates R+W | ~3 GB | quantize_mmq_nvfp4 + k_bin_bcast + silu + rms tensors @ batch 128 |
| **TOTAL** | **~42 GB/step** | bracket 32-61 GB |
**Bandwidth floor = 42/273 = ~154 ms/step** (central; bracket ~117-224 ms).
Weight-only HARD sub-floor (unavoidable, both engines pay it): 18.4/273 = **67 ms/step**.
KEY: even at batch 128 the FP4 GEMM is STILL memory-bound, not MMA-bound. AI = 2*128/0.53 B
= ~483 FLOP/byte << GB10 ridge 500e12/273e9 = 1832 FLOP/byte. The 42% `mul_mat_q<NVFP4,m=128>`
GPU time is weight-DRAM streaming, not tensor cores -> first-principles reason P2a (-26% MMA
occupancy) and Lever-2 were FLAT on decode.
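
The bandwidth-floor and arithmetic-intensity numbers above can be reproduced in a few lines. The inputs are the central estimates from the table; the variable names are illustrative.

```python
BW_BYTES_PER_S = 273e9   # LPDDR5x bandwidth
PEAK_FLOPS = 500e12      # dense NVFP4 MMA peak

# Central per-step byte estimates from the table, in GB.
bytes_per_step_gb = {
    "weights": 18.4,
    "ssm_state_rw": 19.0,
    "kv_read": 1.6,
    "activations": 3.0,
}
total_gb = sum(bytes_per_step_gb.values())
bw_floor_ms = total_gb * 1e9 / BW_BYTES_PER_S * 1e3
weight_floor_ms = bytes_per_step_gb["weights"] * 1e9 / BW_BYTES_PER_S * 1e3

# Arithmetic intensity of the batched FP4 GEMM: 2 FLOP per weight per
# sequence, divided by the bytes read per weight.
batch = 128
bytes_per_weight = 0.53
arithmetic_intensity = 2 * batch / bytes_per_weight
ridge = PEAK_FLOPS / BW_BYTES_PER_S
memory_bound = arithmetic_intensity < ridge
```

Since the intensity sits well below the ridge point, the GEMM stays on the bandwidth side of the roofline even at batch 128.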
### 2. Compute floor (FLOPs / ~500 TFLOP/s dense FP4)
| term | FLOPs/step | floor |
|------|-----------|-------|
| FP4 GEMM (all dense matmuls): 2 * ~26e9 params * 128 seqs | 6.66 TFLOP | / 500e12 = **13.3 ms** (6.7 ms @ sparse 1 PFLOP) |
| GDN recurrence (state update + read-out, 48 layers x 128 seqs) | ~0.04 TFLOP | < 0.1 ms (state-bound, not FLOP-bound) |
| **TOTAL** | ~6.7 TFLOP | **~13 ms/step (~4% of the step)** |
### 3. Verdict / bubble budget / parity target
```
compute floor bandwidth floor MEASURED step x above bw-floor
GB10 dense-FP4 ~13 ms ~154 ms (117-224)
vLLM dense @128 327 ms ~2.1x (1.5-2.8x)
llama dense @128 384 ms ~2.5x (1.7-3.3x)
```
- **Binding floor = bandwidth (~130-155 ms), NOT compute (~13 ms).** Compute floor is ~25x
below the wall -> FP4-MMA throughput is irrelevant; matches P2a/Lever-2 flatness exactly.
- **Both engines run ~2-2.8x ABOVE the bandwidth floor.** vLLM itself reaches only ~40-47%
LPDDR5x efficiency -> even the reference is LATENCY/occupancy bound, not byte-bound.
Confirms prior "decode is 2.5x above its bandwidth floor" work.
- **Bubble budget** (wall - bandwidth floor, central 154): vLLM ~**173 ms**, llama ~**230 ms**.
= kernel-launch latency + occupancy gaps + serial data-dependency stalls.
- **The llama-vs-vLLM gap = 384 - 327 = 57 ms/step (14.8% of the step) is 100% BUBBLE.**
Both engines share IDENTICAL bandwidth AND compute floors (same 18.8 GB NVFP4 weights, same
SSM state, same KV, same GB10 273 GB/s + 500 TFLOP). The required bytes and FLOPs are the same for both engines,
so the entire 57 ms differential lives in critical-path bubble - NOT bandwidth, NOT compute.
**Parity target: 327 ms/step (391 tok/s @128). llama must shave 57 ms/step = 14.8% off 384 ms.**
Neither floor can move (both already shared with vLLM), so the 57 ms can ONLY come from
collapsing critical-path bubbles -> structurally-correct case for Lever 3 (fuse the serial GDN
gating chain) and/or decode CUDA-graph capture, exactly the two wins vllm-gdn-compare found vLLM
already has. P2a/Lever-2 were flat because they freed OVERLAPPED GPU-busy time BELOW the floor.
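The budget numbers in this section follow from three inputs: the central bandwidth floor (154 ms) and the two measured steps. A minimal sketch of that arithmetic, with illustrative names; it reproduces the figures only and says nothing about where the time is actually spent.

```python
# Budget arithmetic for this section: wall time minus the central bandwidth floor.
BATCH = 128
BW_FLOOR_MS = 154.0                       # central bandwidth-floor estimate
step_ms = {"vLLM": 327.0, "llama": 384.0}

budget_ms = {name: wall - BW_FLOOR_MS for name, wall in step_ms.items()}
x_above_floor = {name: wall / BW_FLOOR_MS for name, wall in step_ms.items()}
tok_per_s = {name: BATCH / (wall / 1e3) for name, wall in step_ms.items()}

gap_ms = step_ms["llama"] - step_ms["vLLM"]       # what llama must shave for parity
gap_pct = 100 * gap_ms / step_ms["llama"]         # as a share of llama's step

for name in step_ms:
    print(f"{name}: budget {budget_ms[name]:.0f} ms, "
          f"{x_above_floor[name]:.1f}x floor, {tok_per_s[name]:.0f} tok/s")
print(f"gap: {gap_ms:.0f} ms = {gap_pct:.1f}% of the llama step")
```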
### Cross-check / sizing for the gap-analysis (timeline) agent
- GPU-busy sum from nsysab_new (ntg24 window, /24 steps): FP4 GEMM ~243 + gated_delta_net ~76 +
GDN glue (k_bin_bcast mul ~49, silu ~34, concat ~19, gdn_gather ~21, ssm_conv ~12, l2_norm ~6,
op_add ~10) ~152 + quantize ~62 = **~555 ms GPU-busy vs 384 ms wall** -> sum >> wall by ~1.45x,
so heavy overlap is real and GPU-busy% buckets ARE misleading. Do NOT sum kernel times; the
wall is the critical path.
- Concrete budget: if the inter-kernel IDLE gaps + non-overlapped launch latency along the serial
GDN chain (ssm_conv -> gated_delta_net -> gating elementwise -> gdn_gather, x48 layers x N steps)
sum to **>= 57 ms/step**, Lever 3 is justified AND sized. If those critical-path gaps total
< 57 ms, parity is NOT reachable via GDN-gate fusion alone and the gap is elsewhere (GDN core
kernel slower than vLLM fused_recurrent, or scheduler/H2D).
- Structural corroboration (agrees with vllm-gdn-compare): vLLM runs the GDN region as 2 fused
Triton kernels under a full CUDA graph; llama splits it into ssm_conv + gated_delta_net +
gdn_gather + ~6 serially data-dependent elementwise gate kernels. ~384 host-launched nodes/step
on a chain that cannot overlap is precisely the mechanism that produces llama's extra ~57 ms.
Floors are engine-independent lower bounds; the timeline agent owns proving the 57 ms is
recoverable on the critical path. Roofline says: target 327 ms, shave 57 ms, and it can ONLY
come from bubble (not bytes, not FLOPs).
Assisted-by: Claude:opus-4.8 [Claude Code]
## lever3-design (READ-ONLY, no GPU) - concrete fusion of the serial GDN gate chain into the recurrence kernel
### What actually feeds/consumes the recurrence kernel today (qwen35 decode, fused_gdn_ar)
Traced in `src/models/qwen35.cpp::build_layer_attn_linear` ->
`src/models/delta-net-base.cpp::build_recurrent_attn` (fused !keep branch) ->
`ggml/src/ggml-cuda/gated_delta_net.cu`. The model is GDA (g->ne[0]==1, scalar
gate per head; kda=false in the kernel). S_v = ssm_d_state = 128, so the kernel
runs the `<128>` template: warp_size==S_v==128, num_warps=4, rows_per_lane==1,
grid (H, n_seqs, S_v/4=32 z-tiles). Each warp owns one output column `col`; the
128 lanes hold the full head-vector (one element per lane).
Serial pre-GDN gate chain (each a standalone host-launched ggml node, all on the
critical path between the in-proj GEMMs and the recurrence):
1. `beta = ggml_sigmoid(ssm_beta @ cur)` -> kernel reads `beta_val = *beta_t`
2. `alpha = ssm_alpha @ cur`
3. `ggml_add(alpha, ssm_dt)` (k_bin_bcast op_add)
4. `ggml_softplus(...)` (unary_op<softplus>, 1248 inst)
5. `ggml_mul(softplus, ssm_a)` (k_bin_bcast op_mul; ssm_a = -exp(A_log), baked) -> g; kernel does `expf(g_t)`
6. `ssm_conv` then `ggml_silu` (conv path; may already hit the upstream SSM_CONV+SILU fuse) -> v_conv, and the q/k slices
7. `ggml_l2_norm(q_conv)`, `ggml_l2_norm(k_conv)` (l2_norm_f32<32>, 2496 inst = 1248x2) -> kernel reads q_reg/k_reg
Post-GDN gate (consumes kernel output):
8. `build_norm_gated(output, ssm_norm, z)` = rms_norm(output)*ssm_norm (RMS_NORM+MUL fused) then `silu(z)*.` (unary_gated_op<silu>, the 5.9% bucket)
### The fusion: fold steps 1,3,4,5,7 INTO gated_delta_net_cuda (a "fused-gate" mode)
These five are exactly the per-(head) scalar gates (sigmoid beta; softplus+dt+ssm_a
-> g) and the per-head-vector L2 norms of q/k - and the kernel ALREADY loads every
operand it needs:
- It reads `beta_val` (scalar) -> pass RAW beta, do `beta_val = 1.f/(1.f+expf(-raw))` in-kernel. Removes node 1.
- It reads `g_t` (scalar, GDA) and does `expf(g_t)` -> pass RAW alpha + per-head `ssm_dt[h]` + per-head `ssm_a[h]`, compute `g = ssm_a[h]*op_softplus(alpha + ssm_dt[h])` in-kernel, keep the existing `expf(g)`. `op_softplus(x) = (x>20)?x:logf(1+expf(x))` (copy `ggml_compute_softplus_f32` verbatim). Removes nodes 3,4,5.
- It loads the full q/k head-vector into `q_reg[r]`/`k_reg[r]` (one element per lane at S_v==128). L2-normalize in registers: `float qss = warp_reduce_sum<128>(q_reg[0]*q_reg[0]); q_reg[0] *= rsqrtf(qss + eps* ... )` matching the l2_norm formula, same for k. Each warp redundantly recomputes the (identical) norm for its column - cheap, no shared mem, no extra launch. Removes nodes 7 (x2). `eps` (= f_norm_rms_eps) passed as a kernel float param.
That collapses the pre-GDN serial chain to just: in-proj GEMMs -> build_conv_state(concat) -> ssm_conv(+silu) -> [single fused gated_delta_net kernel]. The five folded steps are 6 gate kernels (step 7 is two l2_norm launches), all removed per SSM layer per decode step.
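The scalar part of the fold (steps 1, 3, 4, 5) is small enough to write out as a reference. This is a sketch of the formulas listed above, not the kernel: `fused_gate` is a hypothetical name, and Python floats are fp64, so it models the operation order rather than fp32 bit-exactness.

```python
import math

def softplus(x):
    # Same thresholded form quoted above: (x > 20) ? x : log(1 + exp(x)).
    return x if x > 20.0 else math.log(1.0 + math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fused_gate(alpha_raw, beta_raw, ssm_dt_h, ssm_a_h):
    """Per-head scalar gates computed from the RAW projections.

    Returns (beta, g, decay): beta replaces the standalone sigmoid node,
    g replaces the add -> softplus -> mul chain, and decay = exp(g) is the
    exponent the recurrence already applies.
    """
    beta = sigmoid(beta_raw)                          # step 1
    g = ssm_a_h * softplus(alpha_raw + ssm_dt_h)      # steps 3, 4, 5
    decay = math.exp(g)                               # existing expf(g) in the kernel
    return beta, g, decay

# ssm_a is -exp(A_log), so it is negative and the decay stays in (0, 1).
print(fused_gate(alpha_raw=0.0, beta_raw=0.0, ssm_dt_h=0.0, ssm_a_h=-1.0))
```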
### Why the OUTPUT gate (step 8) is NOT folded into this kernel
The output gated-rmsnorm reduces over the full head_v_dim (S_v=128) per (head,seq).
In this kernel those 128 elements are produced by 128 DIFFERENT (warp x z-tile)
blocks (4 warps x 32 z-tiles), so an in-kernel head-wide reduction would need a
grid-global sync - not feasible without a grid redesign. Leave step 8 as the
existing RMS_NORM+MUL + unary_gated<silu> fusion (already 2 launches, not in scope).
The conv-silu (step 6) is a convolution, structurally separate; rely on the
existing upstream SSM_CONV(+ADD)+SILU fuse rather than pulling it into the
recurrence kernel.
### Implementation scope
- `ggml/include/ggml.h`: new builder `ggml_gated_delta_net_inplace_ids_fused_gate(ctx, q_raw, k_raw, v, alpha_raw, beta_raw, cache4d, state_dst, ids, ssm_a, ssm_dt, rs_head, eps)` (or an op-param flag GDN_FUSE_GATE on the existing builder + 2 extra srcs). src budget: current op uses src[0..7]; add ssm_a -> src[8], ssm_dt -> src[9]. GGML_MAX_SRC==10, so it fits EXACTLY (zero headroom - note for review).
- `ggml/src/ggml.c`: builder + a new op-param i32 flag (e.g. params[2]=fuse_gate) + f32 param for eps; assert shapes (ssm_a/ssm_dt are [num_v_heads]).
- `ggml/src/ggml-cuda/gated_delta_net.cu`: in `gated_delta_net_cuda`, gate the in-kernel sigmoid/softplus-gate/l2norm behind a `bool FUSE_GATE` template param (4th template bool, keeps the non-fused path byte-identical and avoids register bloat when off). Read ssm_a[h_idx], ssm_dt[h_idx]; compute g per head; sigmoid raw beta; warp-reduce q_reg/k_reg sumsq -> rsqrtf scale. Plumb the 2 new src pointers + eps through `launch_gated_delta_net` and `ggml_cuda_op_gated_delta_net` (read src[8],src[9], op_param eps/flag). The `gdn_gather_nonident` path is unaffected (it gathers state, not q/k/g/beta).
- `ggml/src/ggml-cpu/ops.cpp`: mirror in `ggml_compute_forward_gated_delta_net_one_chunk` (host sigmoid/softplus/l2norm before the per-token math) for CPU parity / test-backend-ops.
- `src/models/delta-net-base.cpp::build_recurrent_attn` (the fused !keep + ids branch, and the inplace non-ids branch): call the fused-gate builder, pass raw alpha/beta/q/k + ssm_a + ssm_dt + eps.
- `src/models/qwen35.cpp` / `qwen35moe.cpp` / `qwen3next.cpp` `build_layer_attn_linear`: when the fuse flag is on, DROP `ggml_sigmoid(beta)`, `ggml_add(alpha,dt)`, `ggml_softplus`, `ggml_mul(.,ssm_a)`, and the two `ggml_l2_norm` nodes; hand the raw tensors + `model.layers[il].ssm_a`, `ssm_dt` to build_recurrent_attn. The conv-silu and z/output-gate path are unchanged.
- Guard the whole thing behind `cparams.fused_gdn_gate` / env `LLAMA_FUSE_GDN_GATE` (default OFF) so it A/Bs against the clean Lever-1 build exactly like P2a/Lever-2, and only the recurrent (GDA) qwen35 family path is touched.
### Numeric considerations / bit-exactness
- sigmoid(beta), softplus(alpha+dt), and the `g = ssm_a*softplus` mul/add are pointwise fp32 with the SAME formula/order as the standalone ggml ops -> these can be **bit-exact** (no reduction). softplus must copy `(x>20)?x:logf(1+expf(x))` exactly.
- q/k l2norm is the ONE op with a reduction: the standalone `l2_norm_f32<32>` does its own warp/block reduction; the in-kernel `warp_reduce_sum<128>` tree may differ in the last ULP, and the eps placement (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) must match the ggml l2_norm exactly. Expect **near-bit-exact, not guaranteed byte-identical** greedy output. So unlike Levers 1/2, gate this on a **PPL/KL tolerance** (KL logit delta < ~1e-3, PPL delta within noise) rather than md5 identity. If byte-identity is required, exclude l2norm from the fold (keep nodes 7) and fuse only sigmoid/softplus/gate - but that drops the value to ~0.3% and is probably not worth it.
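The eps-placement point can be made concrete. The sketch below compares the two candidate formulas named above on one small vector; which form the standalone `l2_norm_f32` actually uses is not restated in this note and has to be read from the source, so both function names are illustrative.

```python
import math

def l2norm_rsqrt(x, eps):
    # Candidate A: x * rsqrt(sumsq + eps)
    sumsq = sum(v * v for v in x)
    return [v / math.sqrt(sumsq + eps) for v in x]

def l2norm_clamp(x, eps):
    # Candidate B: x / max(sqrt(sumsq), eps)
    sumsq = sum(v * v for v in x)
    return [v / max(math.sqrt(sumsq), eps) for v in x]

x = [3.0, 4.0]
eps = 1e-6
a = l2norm_rsqrt(x, eps)
b = l2norm_clamp(x, eps)

# The two forms agree to many digits but are not identical, so a fold that
# picks the wrong one cannot be byte-identical to the standalone op.
max_diff = max(abs(p - q) for p, q in zip(a, b))
print(a, b, max_diff)
```

The difference is tiny for well-scaled inputs, which is why a KL/PPL tolerance is the appropriate gate if the placement or the reduction order cannot be matched exactly.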
### Estimated kernels-removed-per-layer and the honest ceiling
- Removed per SSM decode layer-step: sigmoid(beta) + add(dt) + softplus + mul(ssm_a) + l2norm(q) + l2norm(k) = **6 host-launched kernels -> 0**, collapsing 7 nodes (incl. recurrence) to 1. Across 48 SSM layers = **~288 launches/step removed** (matches the instance deltas: l2_norm 2496, softplus 1248, sigmoid 1248, plus the alpha-add/ssm_a-mul share of op_add/op_mul).
- GPU-BUSY ceiling of the removed ops is small: l2_norm 1.0% + softplus ~0% + sigmoid 0.3% + (dt add + ssm_a mul share of op_add 1.7% / op_mul 8.5%). The point of Lever 3 is NOT the freed busy-time (P2a/Lever-2 proved freeing overlapped busy-time is flat) - it is removing ~288 LAUNCH BUBBLES/step that sit on the serial conv->gate->recurrence dependency where the GPU is otherwise idle. The win is wall-clock only if those specific bubbles are on the critical path.
### RISK (must be settled before building)
1. **Same trap as P2a/Lever-2 if the bubbles overlap.** If the scheduler already
overlaps these pre-GDN gate kernels with an adjacent layer's 42% mul_mat_q GEMM,
Lever 3 is FLAT. **Precondition: the timeline gap analysis must show idle GPU
between ssm_conv -> (sigmoid/softplus/l2norm) -> gated_delta_net per layer** at
batch=128 BEFORE building. If the trace shows the gate ops back-to-back with no
gap (overlapped), do NOT build op-fusion; go to the CUDA-graph lever in item 2 below.
2. **The bigger bubbles may be elsewhere on the chain.** The large buckets are op_mul
8.5% and unary_gated<silu> 5.9% - much of which is the POST-GDN output gate and
FFN, which this fusion does NOT touch. If the gap analysis pins the dominant idle
to the post-GDN region or to inter-layer launch latency generally, the
higher-leverage Lever 3 is **decode CUDA-graph capture** (removes host launch
latency for ALL ~384 nodes/step at once, exactly what vLLM does), not per-op
fusion. CUDA-graph is the strictly larger hammer here; op-fusion only helps the
pre-GDN slice. Recommend measuring the per-sub-op gap first and preferring the
CUDA-graph lever if the bubbles are spread across the step rather than concentrated
in the pre-GDN gate slice.
3. **src-slot exhaustion** (src[8],src[9] use the last 2 of GGML_MAX_SRC=10) - any
later op needing more srcs on this node has zero headroom; flag for review.
## cudagraph-coverage (READ-ONLY, no GPU) - does the CUDA graph cover the GDN serial chain at B=128?
### Verdict: YES, the graph covers GDN at batch=128 (dense model). No GDN op forces graph-disable or per-step re-instantiation.
Source: `ggml/src/ggml-cuda/ggml-cuda.cu` (graph state machine), `gated_delta_net.cu`
(fused op), `src/models/delta-net-base.cpp` (graph build), `src/llama-memory-recurrent.cpp`
(recurrent head), all on dev tree `~/llama-paged-dev` (HEAD df1cc97, Lever-1). Cross-checked
against the committed A2_CUDAGRAPH_DECODE.md + DECODE_PARITY_EXPLORE.md measurements.
### How graph-disable / re-instantiation are decided (this fork's state machine)
- `ggml_cuda_graph_check_compability` (ggml-cuda.cu:3251) disables the graph for ONLY two
reasons: (a) a split-buffer src, (b) `GGML_OP_MUL_MAT_ID` with non-quantized weights OR
`node->ne[2] > get_mmvq_mmid_max(...)` [TAG_MUL_MAT_ID_CUDA_GRAPHS]. GATED_DELTA_NET,
SSM_CONV, SSM_SCAN, GET_ROWS, CONCAT, the gating elementwise ops are NOT in the disable
list. So no GDN op forces graph-disable.
- `ggml_cuda_graph_update_required` (3297) memcmps, per node, the full `ggml_tensor` struct
(incl. `op_params` and `data`) + each src's `data` ptr / `ne` / `nb`. ANY delta -> the
warmup state machine (ggml_backend_cuda_graph_compute:4464) resets `warmup_complete` and the
WHOLE graph (one key = `cgraph->nodes[0]`) runs eager that step until stable again. Buffer
CONTENTS are NOT compared - a contents-only change (e.g. ids values) is graph-safe.
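The rule in the second bullet (compare node properties, never buffer contents) can be modeled in a few lines. This is a simplified illustration, not the ggml implementation: `NodeProps` and `update_required` are hypothetical names, and the real check memcmps the whole tensor struct.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeProps:
    """Properties the update check compares; buffer CONTENTS are deliberately absent."""
    op: str
    data_ptr: int
    ne: tuple
    nb: tuple
    op_params: tuple
    srcs: tuple = ()      # one (data_ptr, ne, nb) triple per source tensor

def update_required(prev_nodes, cur_nodes):
    # Any difference in count or in any compared property forces a re-warm.
    if len(prev_nodes) != len(cur_nodes):
        return True
    return any(p != c for p, c in zip(prev_nodes, cur_nodes))

ids_src = (0x2000, (128,), (4,))
step_a = [NodeProps("GATED_DELTA_NET", 0x1000, (128, 48, 128), (4, 512, 24576), (0, 0), (ids_src,))]
# Next step: the ids VALUES changed, but address/ne/nb did not, so nothing differs here.
step_b = [NodeProps("GATED_DELTA_NET", 0x1000, (128, 48, 128), (4, 512, 24576), (0, 0), (ids_src,))]
# A changed rs_head op_param (second slot) is a property change.
step_c = [NodeProps("GATED_DELTA_NET", 0x1000, (128, 48, 128), (4, 512, 24576), (0, 1), (ids_src,))]

print(update_required(step_a, step_b), update_required(step_a, step_c))
```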
### Why the GDN region's properties are STABLE across steady decode steps
The fused decode path is `ggml_gated_delta_net_inplace_ids` (delta-net-base.cpp:558-560):
```
state_dst = ggml_view_2d(ctx, ssm_states_all, n_embd_s, n_seqs, nb1,
                         kv_head * n_embd_s * elsize);  // byte offset derives from kv_head
ggml_gated_delta_net_inplace_ids(..., cache4d, state_dst, ids, /*rs_head=*/(int)kv_head);
```
Both the `state_dst` view byte-offset and the `rs_head` op_param (read back as
`ggml_get_op_params_i32(dst,1)` in gated_delta_net.cu:330) derive from
`kv_head = mctx_cur->get_head()`. In `llama_memory_recurrent::find_slot`
(llama-memory-recurrent.cpp:610-689) the n_seqs used cells are SWAPPED into the contiguous
range `[min .. min+n_seqs-1]` and `head = min`. The recurrent cache does NOT grow per token
(one state cell per sequence, unlike the KV cache). For a steady 128-seq continuous batch the
same sequences own the same tails every step, so `min`/`head` are constant (=0) -> state_dst
offset constant, rs_head op_param constant. The GDN inputs (q,k,v,g,b, cache4d, ids) are
fixed-shape (n_seqs=128, n_rs slots), so ne/nb are stable, and ggml-alloc hands out the same
compute-buffer offsets each same-topology ubatch -> data ptrs stable. The `ids` (s_copy)
tensor's CONTENTS change per step but its address/ne/nb do not -> graph-safe.
### The fused GDN op is capture-safe (no host-sync, no capture-time cudaMalloc)
`gated_delta_net.cu`: the op launches `gdn_gather_nonident_kernel` + `gated_delta_net_cuda`
on `ctx.stream()` with NO `cudaStreamSynchronize` / host `cudaMemcpy` / `cudaMalloc`. The
gather scratch is `ggml_cuda_pool_alloc` (VMM pool, served from reserved memory after warmup,
no real cudaMalloc during capture). `gdn_gather_nonident` early-returns for identity sequences
(`ids[s]==rs_head+s`), which is the steady-decode case, so its 3.7% is a launched-but-mostly-
noop kernel - still captured into the graph like any other. Capture succeeds (the build runs,
graphs engage), confirming none of these break stream capture.
### The only re-instantiation is NOT GDN-driven
A2 already measured the re-warm cadence: the graph re-instantiates every ~256 tokens because
the FULL-ATTENTION block-table input `idx` has `ne[0]=GGML_PAD(n_kv,256)` (and kq_mask in
lockstep) - those step at 256-token boundaries (paged-attn.cpp:199-213). ~97% of decode steps
replay the captured graph. This is a full-attention-layer input, not a GDN op. (The unpadded
`LLAMA_KV_PAGED_GATHER` fallback grows `ne[0]` every step and runs pure-eager, but that is not
the default decode path and is not the GDN/SSM path.)
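The 256-token cadence follows directly from the padding. A small sketch, assuming `GGML_PAD(x, n)` rounds up to a multiple of a power-of-two `n`; `shape_change_steps` is a hypothetical helper and models only when `ne[0]` changes, not how many eager steps the re-warm then takes.

```python
def ggml_pad(x, n):
    """Round x up to a multiple of n (n a power of two), as GGML_PAD does."""
    return (x + n - 1) & ~(n - 1)

def shape_change_steps(n_kv_start, n_steps, pad=256):
    """Decode steps at which the padded ne[0] differs from the previous step."""
    changes = []
    prev = ggml_pad(n_kv_start, pad)
    for step in range(1, n_steps + 1):
        cur = ggml_pad(n_kv_start + step, pad)   # one more KV entry per decode step
        if cur != prev:
            changes.append(step)
        prev = cur
    return changes

# Starting from n_kv = 1 and decoding 1023 tokens, ne[0] changes only three times.
print(shape_change_steps(1, 1023))
```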
### Reconciliation with the "~40% util / 60% idle bubbles" premise (it is refuted for GDN)
The committed nsys sweeps (A2_CUDAGRAPH_DECODE.md, DECODE_PARITY_EXPLORE.md) show the steady
decode is ~99.4-99.5% GPU-BUSY with graphs ON (measured with `--cuda-graph-trace=node`; a
graphs-ON trace WITHOUT that flag under-counts GPU rows and falsely reports idle - Trap #2).
Total exposed idle is ~0.65% of the step; the within-step launch fraction graphs remove is
0.34% (0.37%->0.11%) and is ALREADY collapsed - the GDN sub-op launch gaps are inside the
captured region. The "40% utilization" in the STATE is BANDWIDTH-roofline util, not idle SMs:
decode moves ~55.5 GB/step at 2.48x the 273 GB/s floor, SSM state r+w = 66% of step bytes. The
GDN recurrence is memory-bandwidth-bound at low occupancy (~12-16%), not launch-gap-bound. So
"60% idle bubbles on the serial GDN chain" is not supported by the traces; the gap to vLLM is
SSM-state memory traffic, consistent with P2a/Lever-2 being flat (freeing GPU-busy time, not
wall-clock).
### Graph-safe lever for GDN: none new
- GDN is already graph-covered; there is no "make the GDN ops graph-safe" lever to build - they
are already safe and captured.
- The only genuinely graph-NON-covered idle is the BETWEEN-step host gap (~2 ms/step, ~0.4%):
ggml rebuilds/reallocs the cgraph each step with a new `cgraph->uid`, so the uid fast-path in
ggml_cuda_graph_update_required never fires and the host re-dispatches ~3100 launches on the
Grace cores between graph launches (vLLM builds its graph once + persistent device metadata).
A persistent/reused cgraph across decode steps would let the uid fast-path fire and shrink the
host gap - but at 0.4% of the step it is second-order to the SSM bandwidth floor.
- CAVEAT (MoE, qwen35moe): MUL_MAT_ID at B=128 can trip [TAG_MUL_MAT_ID_CUDA_GRAPHS]
(`ne[2] > mmvq_mmid_max`), disabling the WHOLE MoE-decode graph (GDN included) into eager.
That is a MUL_MAT_ID disable, not a GDN break, and does not touch the dense 335 tok/s headline;
worth a separate confirm for the MoE model.
## decode-timeline-gap (GPU, label gap-analysis) - the decisive fresh node-level measurement
This is the new GPU run the analysis was waiting on. It arbitrates between the
roofline/vllm-gdn-compare theory ("57 ms = 100% bubble, Lever 3 closes it") and the
cudagraph-coverage source verdict ("~99.4% busy, bandwidth-bound, bubbles refuted").
The measurement confirms the latter and refutes the former, with per-kernel numbers.
### Capture (the trap the prior `--trace=cuda` fell into is now avoided)
`nsys profile --trace=cuda --cuda-graph-trace=node` on build-cuda-base (clean
Lever-1, HEAD df1cc97, git-clean mmq.cuh), q36-27b-nvfp4 dense, `-fa on -npp 128
-ntg 24 -npl 128 -c 33000`. Artifacts on DGX: `~/llama-paged-dev/nsysgap.{nsys-rep,
sqlite}`. The decode step is a single CUDA graph (graphId=11, 23 replays = steps
2-24; graphId=1 x8 = prefill). Plain `--trace=cuda` recorded each step as ONE opaque
~380 ms block, so the widely-cited `nsysab_new.kern.txt` breakdown (mul_mat_q 42%,
gated_delta_net 13%) is PREFILL + the single eager capture step, NOT decode. With
node-level trace the graph expands: 168201 kernels = 91499 graph-internal + 76702
eager prefill. **All graph kernels on stream 14 (single stream) -> strictly serial,
no overlap, so any inter-kernel gap is pure GPU idle.**
### One steady decode step (window between decode launches 22413.26 / 22796.74 ms, width 383.48 ms)
Exactly 48 `gated_delta_net` + 16 `flash_attn` launches fall inside the window, i.e. one clean step
(48 GDN layers + 16 attention layers), 2965 kernels in total.
| classification | ms/step | % of step |
|---|---|---|
| (a) inter-kernel LAUNCH gaps + (b) SERIAL-DEPENDENCY stalls (LAG sum, single stream) | **0.225** | **0.06%** |
| (c) within-kernel time (GPU running) | 380.4 | 99.94% |
Zero gaps > 5 us. Largest single gap 2.40 us. 1260 sub-1us gaps + 1700 back-to-back.
**The decode step is 99.94% GPU-busy. There are no bubbles.** This independently
confirms cudagraph-coverage's ~99.4% and **refutes** roofline-decode's "57 ms = 100%
bubble" and vllm-gdn-compare's "~384 launch bubbles/step on the critical path".
nvidia-smi's "40% util" = low SM/compute efficiency WITHIN kernels (c) (memory-latency-
bound, ~12-16% achieved occupancy), not wall-clock idle.
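The classification above is a LAG sum over a single stream: each kernel's start minus the previous kernel's end. A minimal sketch with synthetic timestamps (the three-kernel trace is made up for illustration; only the final line uses the measured figures from the table):

```python
def idle_gaps_us(kernels):
    """kernels: (start_us, end_us) pairs on ONE stream, sorted by start time."""
    return [max(0.0, start - prev_end)
            for (_, prev_end), (start, _) in zip(kernels, kernels[1:])]

def summarize(kernels):
    gaps = idle_gaps_us(kernels)
    busy = sum(end - start for start, end in kernels)
    idle = sum(gaps)
    return {
        "busy_us": busy,
        "idle_us": idle,
        "busy_pct": 100 * busy / (busy + idle),
        "max_gap_us": max(gaps, default=0.0),
    }

# Synthetic single-stream trace: back-to-back, then one 2 us gap.
trace = [(0.0, 100.0), (100.0, 250.0), (252.0, 400.0)]
summary = summarize(trace)
print(summary)

# Same ratio applied to the measured step: busy / (busy + idle).
measured_busy_pct = 100 * 380.4 / (380.4 + 0.225)
print(f"{measured_busy_pct:.2f}%")
```

On a single stream this is sufficient because no kernel can overlap another; with multiple streams the gaps would have to be computed on the merged timeline instead.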
### Real decode kernel mix (% of the 380.4 ms step) - corrects the prefill-contaminated kern_sum
| kernel | n/step | ms | % | grid CTAs | waves/48SM |
|---|---|---|---|---|---|
| gated_delta_net_cuda | 48 | **196.37** | **51.6** | 48x128x32 = 196608 | 4096 |
| mul_mat_q (FP4 in/out/qkv/o proj) | 496 | 92.90 | 24.4 | 136 | 2.8 |
| quantize_mmq_nvfp4 | 496 | 17.13 | 4.5 | 483 | 10 |
| nvjet GEMM (lm_head) | 1 | 11.91 | 3.1 | 1944 | 40 |
| flash_attn_ext_f16 (16 attn layers) | 16 | 11.67 | 3.1 | 48 | 1.0 |
| concat_cont (conv-state) | 48 | 8.01 | 2.1 | 20480 | 427 |
| cpy_scalar | 64 | 7.62 | 2.0 | 49152 | 1024 |
| k_get_rows_float | 49 | 7.08 | 1.9 | 15098 | 315 |
| k_bin_bcast (gate mul + add) | 720 | 6.59 | 1.7 | 3169 | 66 |
| ssm_conv_f32 | 48 | 5.64 | 1.5 | 10240 | 213 |
| unary_gated (silu/sigmoid) | 128 | 5.36 | 1.4 | 5888 | 123 |
| mul_mat_q_stream_k_fixup | 304 | 3.94 | 1.0 | 192 | 4 |
| rms_norm_f32 | 209 | 3.52 | 0.9 | 1764 | 37 |
| l2_norm_f32 | 96 | 0.64 | 0.2 | | |
| gdn_gather_nonident | 48 | **0.061** | 0.016 | | |
- `gated_delta_net` is **51.6% of the step**, the single dominant term. The
previously-cited "1.47 ms/call near-vLLM" was the EAGER average over 1248 calls
(range 0.046-4.42 ms = prefill warmups + capture); true steady decode is
**4.08-4.11 ms/call** (gridY=128 = the 128 seqs). 2.8x higher than believed.
- It launches 196608 CTAs / 4096 waves = NOT occupancy-starved; the cost is
bandwidth-bound state traffic (~384 MB read + ~384 MB write per layer for the
48-head x 128-seq x [state 128 x head_v 128] recurrent state, ~190 GB/s effective).
- The Lever-3 narrow target (gating glue) = k_bin_bcast 6.59 + silu/sigmoid 5.36 +
l2_norm 0.64 + softplus 0.13 = **12.76 ms = 3.35%** of the step. `gdn_gather` is
**0.06 ms** (negligible - it early-returns on identity ids as predicted).
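The state-traffic and grid figures in the bullets above can be re-derived from the shapes. A short sketch, assuming an f32 state of 48 heads x 128 seqs x 128 x 128 per layer and the measured 196.37 ms over 48 calls; it lands at roughly 197 GB/s when 384 is read as MiB (about 188 GB/s as decimal MB), bracketing the ~190 GB/s quoted.

```python
HEADS, SEQS, S_K, S_V = 48, 128, 128, 128
BYTES_F32 = 4
NUM_SMS = 48

# Recurrent state touched by one gated_delta_net call (one layer).
state_bytes = HEADS * SEQS * S_K * S_V * BYTES_F32
state_mib = state_bytes / 2**20

# One read plus one write of the state per call, over the measured call time.
ms_per_call = 196.37 / 48
eff_bw_gbs = 2 * state_bytes / (ms_per_call / 1e3) / 1e9

# Launch geometry: grid (H, n_seqs, S_v / 4 z-tiles).
ctas = HEADS * SEQS * (S_V // 4)
waves = ctas / NUM_SMS

print(f"{state_mib:.0f} MiB/layer, {ms_per_call:.2f} ms/call, "
      f"{eff_bw_gbs:.0f} GB/s, {ctas} CTAs, {waves:.0f} waves")
```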
### The three answers (with numbers)
1. **Bubbles on the serial GDN critical path?** NO. 0.225 ms idle/step = 0.06%,
zero gaps > 5 us. CUDA graphs eliminated launch overhead; serial dependencies do
not produce idle (each kernel starts < 1 us after the previous). The premise is
refuted by direct measurement.
2. **Would Lever 3 (fuse the gating chain) shrink the step or overlap away?** It
shrinks it, but only by its hard ceiling **12.76 ms = 3.35%** (380 -> 367 ms, 336
-> ~348 tok/s, 86% -> 89% of vLLM). It does NOT close the 14% / 53-57 ms gap.
IMPORTANT mechanism correction: the step is single-stream and 99.94% busy, so
there is NO overlap to absorb freed time (the lever3-design RISK #1 "same trap as
P2a if overlapped" does NOT apply - nothing overlaps). So removing those kernels'
GPU-time DOES cut wall-clock - but the win is removing their DRAM byte traffic, NOT
launch bubbles (there are none). And the value is the measured ~12.76 ms, not the
"~288 launch bubbles" framing (those launches cost ~0 inside the graph). This also
explains P2a/Lever-2 flatness correctly: NOT "overlapped busy-time" (no overlap),
but P2a tuned the prefill large-M GEMM (decode GEMMs are 136-CTA tail-bound, untouched)
and Lever-2 relocated mandatory quantize work into the GEMM prologue (net zero).
3. **Do CUDA graphs cover the GDN region at B=128?** YES, fully. Whole step = one
graph, 23 replays, ~0.2 ms host gap between steps. `gdn_gather_nonident` and the
in-place state ops are graph-internal nodes (graphNodeId != 0); no fragmentation.
Confirms cudagraph-coverage. Note: lever #2 from vllm-gdn-compare ("CUDA-graph the
decode step") is ALREADY IN EFFECT in this build and did not close the gap - so it
is spent, not pending.
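The sizing in answer 2 is reproducible from three numbers: the 380.4 ms busy step, the 12.76 ms ceiling, and vLLM's 327 ms step. A minimal sketch with illustrative names:

```python
BATCH = 128
STEP_MS = 380.4          # measured busy step (llama)
VLLM_STEP_MS = 327.0     # vLLM step at the same batch
LEVER3_CEILING_MS = 12.76

def tok_per_s(step_ms):
    return BATCH / (step_ms / 1e3)

after_ms = STEP_MS - LEVER3_CEILING_MS          # best case: the whole slice disappears
base_tps = tok_per_s(STEP_MS)
after_tps = tok_per_s(after_ms)
vllm_tps = tok_per_s(VLLM_STEP_MS)

base_pct = 100 * base_tps / vllm_tps
after_pct = 100 * after_tps / vllm_tps
remaining_gap_ms = after_ms - VLLM_STEP_MS      # what Lever 3 alone leaves open

print(f"{base_tps:.0f} -> {after_tps:.0f} tok/s "
      f"({base_pct:.0f}% -> {after_pct:.0f}% of vLLM), "
      f"{remaining_gap_ms:.1f} ms still open")
```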
### Verdict against roofline-decode's own sizing test
roofline-decode stated: "if critical-path gaps total < 57 ms, parity is NOT reachable
via GDN-gate fusion alone and the gap is elsewhere (GDN core kernel slower than vLLM
fused_recurrent)." **Measured gaps = 0.225 ms << 57 ms.** Therefore, by that test, the
53-57 ms / 14% gap is NOT bubble and NOT closable by gating fusion. It lives in
**kernel GPU-time**, dominated by the `gated_delta_net` recurrence (51.6%, bandwidth-
bound) and secondarily the FP4 GEMM + quantize stack (29%). The "57 ms = 100% bubble"
roofline conclusion was an inference from the prefill-contaminated GPU-busy sum
(~555 ms vs 384 ms "implies overlap"); the node-level decode-only measurement shows
per-step GPU-busy = wall (no overlap), so that inference does not hold.
### Recommendation (resized)
- The real lever is the `gated_delta_net` recurrence kernel itself (196 ms, 51.6%):
match vLLM's `fused_recurrent_gated_delta_rule_packed_decode` (vllm-gdn-compare
kernel #4) which folds l2norm + gate + decay + recurrence + state-writeback into a
SINGLE pass over the state, reducing DRAM round-trips of the state. The win is byte
reduction in a memory-bound single-stream step, not bubble removal.
- The lever3-design fusion is still worth doing as a component of that (it removes
~12.76 ms = 3.35% of real byte traffic, and unlike its own RISK section feared, it
will NOT be flat because there is no overlap), but on its own it is a ~3% lever, not
the gap-closer. Build it folded into a single-pass recurrence kernel, not as an
isolated gate fold.
- Next decisive measurement (future GPU-agent run): profile vLLM's decode step at
npl128 with the same node-level method and compare per-region GPU-time (GDN
recurrence vs GEMM vs attention) to localize exactly where vLLM spends its 53-57 ms
less. Both engines move near-identical bytes only if vLLM's fused recurrence does
not re-stream state; the per-kernel A/B will show whether the gap is the recurrence
pass or the GEMM/quantize stack.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## SYNTHESIS (final) - the validated decode-parity picture, ranked plan, and verdict
Reconciles all six investigation sections above plus the three adversarial verdicts
(Verify A/B/C). One sentence: **the "~60% idle" never existed; the decode step is
99.94% GPU-busy single-stream, so the 14% gap to vLLM is kernel GPU-time, dominated by
the bandwidth-bound `gated_delta_net` recurrence (51.6%), and the only gap-closing levers
are byte-reduction inside that kernel - NOT launch-bubble removal.**
### 1. The proven critical-path decomposition of the decode step
Decisive node-level trace (`nsys --cuda-graph-trace=node`, clean Lever-1 build df1cc97,
q36-27b-nvfp4 dense, npl128, GB10/48SM/sm_121, commit a7238525, nsysgap.sqlite). One
steady step = single replayed CUDA graph (graphId=11, 23 replays), all 2965 kernels on
ONE stream (stream 14, strictly serial -> every inter-kernel gap is pure idle). Window
383.48 ms.
BUBBLE CLASSIFICATION (the "where is the ~60% idle" answer - it is NOT idle):
| bucket | ms/step | % step | note |
|---|---|---|---|
| (a) inter-kernel launch bubbles | ~0 | ~0 | graph replay collapses host launch latency |
| (b) serial-dependency stalls (GDN chain) | included in 0.225 | 0.06 | each kernel starts < 1 us after prev; zero gaps > 5 us, max 2.40 us |
| (a)+(b) total exposed idle (LAG sum) | **0.225** | **0.06%** | 1700 kernels back-to-back |
| (d) between-step HOST gap (cgraph rebuild, new uid) | ~0.2 | ~0.05 | the ONLY graph-non-covered idle; ~0.4% in older eager-tail traces |
| (c) within-kernel GPU-busy | **380.4** | **99.94%** | this is the whole step |
The nvidia-smi "40%" is within-kernel SM/bandwidth efficiency (~12-16% achieved
occupancy on memory-latency-bound kernels), NOT wall-clock idle.
KERNEL GPU-TIME DECOMPOSITION of the 380.4 ms busy step (this is where the gap lives):
| kernel | ms | % step | regime |
|---|---|---|---|
| `gated_delta_net_cuda<128>` (48x, 4.08 ms/call) | **196.37** | **51.6** | bandwidth-bound f32 recurrent-state R+W (~384 MB R + 384 MB W/layer) |
| `mul_mat_q` FP4 GEMM (496x) | 92.90 | 24.4 | memory-bound weight stream, 136-CTA tail-bound at decode |
| `quantize_mmq_nvfp4` (496x) | 17.13 | 4.5 | mandatory act-quant (Lever-2 only relocated it) |
| `nvjet` lm_head GEMM | 11.91 | 3.1 | |
| `flash_attn_ext_f16` (16 attn layers) | 11.67 | 3.1 | |
| `concat_cont` (conv-state splice) | 8.01 | 2.1 | Lever-1 target |
| `cpy_scalar` (conv-state writeback + dup) | 7.62 | 2.0 | Lever-1 target (the conv-state share) |
| `k_get_rows_float` | 7.08 | 1.9 | |
| `k_bin_bcast` (gate mul + add) | 6.59 | 1.7 | Lever-3 gate-fold target (partial - rest is residual adds) |
| `ssm_conv_f32` | 5.64 | 1.5 | folds into Lever-1 |
| `unary_gated` (silu/sigmoid) | 5.36 | 1.4 | mostly FFN + output-gate (Lever 3 does NOT touch) |
| `mul_mat_q_stream_k_fixup` | 3.94 | 1.0 | |
| `rms_norm_f32` | 3.52 | 0.9 | |
| `l2_norm_f32` | 0.64 | 0.2 | Lever-3 gate-fold target |
| `gdn_gather_nonident` | 0.061 | 0.016 | negligible (early-returns on identity ids) |
GDN region (recurrence + conv + concat + cpy + gather + l2norm) >= 210 ms = 55%+ of the step.
The widely-cited "gated_delta_net 13%, 1.47 ms/call near-vLLM" from nsysab_new.kern.txt was
PREFILL + the single eager capture step contaminating the average over 1248 calls (range
0.046-4.42 ms); true steady decode is 4.08 ms/call, 2.8x higher, 51.6% of the step.
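The percentages and the GDN-region total can be recomputed from the table. A small sketch; the dictionary simply transcribes the ms column, and the region sum counts all of `cpy_scalar`, so it is an upper bound on the region (the note's ">= 210 ms" is the conservative side).

```python
STEP_BUSY_MS = 380.4

kernel_ms = {
    "gated_delta_net_cuda": 196.37,
    "mul_mat_q": 92.90,
    "quantize_mmq_nvfp4": 17.13,
    "nvjet_lm_head": 11.91,
    "flash_attn_ext_f16": 11.67,
    "concat_cont": 8.01,
    "cpy_scalar": 7.62,
    "k_get_rows_float": 7.08,
    "k_bin_bcast": 6.59,
    "ssm_conv_f32": 5.64,
    "unary_gated": 5.36,
    "mul_mat_q_stream_k_fixup": 3.94,
    "rms_norm_f32": 3.52,
    "l2_norm_f32": 0.64,
    "gdn_gather_nonident": 0.061,
}

pct = {name: 100 * ms / STEP_BUSY_MS for name, ms in kernel_ms.items()}

gdn_region = ["gated_delta_net_cuda", "ssm_conv_f32", "concat_cont",
              "cpy_scalar", "gdn_gather_nonident", "l2_norm_f32"]
gdn_region_ms = sum(kernel_ms[name] for name in gdn_region)
listed_total_ms = sum(kernel_ms.values())   # the table omits a few small kernels

print(f"recurrence {pct['gated_delta_net_cuda']:.1f}%, "
      f"GDN region {gdn_region_ms:.1f} ms = {100 * gdn_region_ms / STEP_BUSY_MS:.1f}%, "
      f"listed kernels {listed_total_ms:.1f} ms")
```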
### 2. Claims A / B / C: which HOLD, which are REFUTED, and the residual uncertainty
**CLAIM A** ("the ~60% decode GPU-idle is inter-op launch bubbles ON the serial GDN
chain"): **REFUTED.** Measured idle = 0.225 ms = 0.06%, not the ~53-57 ms the claim
requires (two-plus orders of magnitude short). Zero gaps > 5 us; CUDA-graph replay
already collapsed launch latency; serial data-dependency does NOT equal idle when the
graph dispatches nodes back-to-back. The "40%" was a misread of within-kernel SM
efficiency; the "555 ms busy-sum > 384 ms wall implies overlap" was a prefill-contaminated
`--trace=cuda` artifact (each step recorded as one opaque ~380 ms block).
**CLAIM B** ("Lever 3 - gate fusion - moves the wall, unlike P2a/Lever-2, by removing
serial launch bubbles"): **REFUTED on mechanism.** (i) There are no bubbles to remove
(0.06%). (ii) The contrast is fictional: the step is single-stream with ZERO overlap
anywhere, so P2a/Lever-2 were NOT flat because they "optimized overlapped work" - P2a
tuned the prefill large-M GEMM (decode GEMMs are a different 136-CTA tail regime) and
Lever-2 merely relocated mandatory quantize work into the GEMM prologue (net zero).
(iii) Where the claim is trivially true (any kernel removal cuts wall in a 99.94%-busy
single-stream step), the slice Lever 3 actually fuses ceilings at **12.76 ms = 3.35%**
(k_bin_bcast 6.59 + silu/sigmoid 5.36 + l2_norm 0.64 + softplus 0.13 - and even that
over-counts, since silu is mostly untouched FFN/output-gate). So the wall DOES move, but
only ~3% (380 -> ~367 ms, 86% -> ~89% of vLLM), and NOT for the claimed reason. Lever 3
is a component, not the gap-closer.
**CLAIM C** ("the residual gap is software-closable LATENCY, not a GB10 hardware floor"):
**REFUTED as worded** (no latency, no idle to close - same data as A). The "not a hardware
floor" half is **UNSETTLED, not proven.** vLLM hits 327 ms on the same silicon, so it is
not an absolute hard floor - but whether the dominant 51.6% `gated_delta_net` term is
software-closable in BIT-EXACT form turns on one unmeasured quantity (below).
RESIDUAL UNCERTAINTY (the single open question that decides everything):
- **The DRAM byte-traffic ratio of llama's recurrence vs vLLM's.** Every section above
ESTIMATED the GDN state bytes (~190 GB/s effective, ~70% of 273 GB/s peak); none MEASURED
it. If llama's `gated_delta_net_cuda<128>` moves ~2x the minimal (s0-read + s1-write)
bytes because the un-fused gate/l2norm/writeback/gather ops re-stream state through DRAM,
then the 51.6% is software-closable by a single-pass fused recurrence (Claim C spirit
HOLDS). If llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same,
the recurrence is at the GB10 LPDDR5x floor for this state size -> the gap is a
hardware/architecture floor and is NOT closable in bit-exact form (Claim C REFUTED on
both halves). This is the one measurement that converts the verdict from "refuted as
worded" to a definitive yes/no.
- **The MoE model (qwen35moe) is untested.** At B=128 MUL_MAT_ID can trip
[TAG_MUL_MAT_ID_CUDA_GRAPHS] (`ne[2] > mmvq_mmid_max`) and disable the WHOLE MoE-decode
graph into eager, where the ~3100 per-step launches re-dispatch serially on the Grace
cores and inter-op bubbles WOULD reappear. For MoE only, Claim A could partially hold.
The dense 335 tok/s headline is fully settled.
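The first open question reduces to a decision rule over two measured quantities. The sketch below encodes it; the 85% figure comes from the text, while `REDUNDANT_RATIO` and `NEAR_MINIMAL_RATIO` are assumed cut-offs for "~2x" and "~minimal", and the function name is hypothetical.

```python
REDUNDANT_RATIO = 1.8      # assumption: at or above this counts as "~2x the minimal bytes"
NEAR_MINIMAL_RATIO = 1.1   # assumption: at or below this counts as "~minimal bytes"
PEAK_FRACTION = 0.85       # from the text: "> 85% of peak"

def classify_recurrence(llama_byte_ratio, llama_frac_of_peak, vllm_byte_ratio):
    """Byte ratios are measured DRAM bytes / minimal (s0-read + s1-write) bytes."""
    if llama_byte_ratio >= REDUNDANT_RATIO:
        # State is being re-streamed: a single-pass fused recurrence can remove bytes.
        return "software-closable"
    if (llama_byte_ratio <= NEAR_MINIMAL_RATIO
            and llama_frac_of_peak > PEAK_FRACTION
            and vllm_byte_ratio <= NEAR_MINIMAL_RATIO):
        # Minimal bytes at near-peak bandwidth on both engines: nothing left to cut.
        return "hardware-floor"
    return "inconclusive"

print(classify_recurrence(2.0, 0.70, 1.0))
print(classify_recurrence(1.0, 0.90, 1.0))
print(classify_recurrence(1.3, 0.70, 1.0))
```

The "inconclusive" branch is deliberate: the text only defines the two extremes, so anything between them needs the per-kernel A/B rather than a threshold.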
### 3. Ranked implementation plan for the remaining ~14% (57 ms/step, 384 -> 327)
Every win must come from kernel GPU-time (bytes), because bubbles = 0 and both engines
share identical bandwidth/compute floors. Ranked by expected recovery.
| # | Lever | ms/step recovered | -> % of vLLM | bit-exact | tractability | gate |
|---|---|---|---|---|---|---|
| **1** | **Single-pass fused GDN recurrence** (fold l2norm+gate+decay+recurrence+state-writeback+gather into ONE pass over state, mirroring vLLM `fused_recurrent_gated_delta_rule_packed_decode`) - cuts state HBM round-trips | **0 to ~40** (= the byte-delta; UNKNOWN until ncu) | 86% -> up to ~98% | near (l2norm reduction; KL < ~1e-3) | HIGH (kernel rewrite) | **ncu byte-ratio test FIRST** |
| 2 | **Conv-state concat -> ssm_conv fusion** (Lever 1): pass conv-state + new token as separate srcs, update conv state in place (vLLM `causal_conv1d_update`); removes concat_cont + the conv-state cpy | **~8-12** (concat 8.01 + cpy share of 7.62) | +2-3% | YES | MEDIUM | no-regret, build regardless |
| 3 | **Gate-chain fold** (Lever 3 as designed): sigmoid-beta + softplus+dt+ssm_a gate + q/k l2norm into the recurrence kernel | **~12.76 ceiling** (3.35%) - but SUBSUMED by #1 | +3% | near (l2norm) | MEDIUM | build as a COMPONENT of #1, not standalone |
| 4 | **bf16 recurrent + conv state** (Lever 5): halve the 196 ms recurrence + conv traffic; keep f32 in-register accumulation | **~70-90** (if floor-bound) | could reach/exceed parity | NO (parity-tolerance decision; must match vLLM stored dtype) | HIGH (rewrite + parity validation) | the ONLY lever that moves the floor kernel; separate precision track |
| 5 | gdn_gather skip-launch at steady decode | ~0.06 | ~0 | YES | trivial | not worth it (micro) |
| 6 | GDN occupancy split | 0 | 0 | - | - | NOT a lever: 196608 CTAs / 4096 waves, already saturated, bandwidth-bound |
| 7 | quantize_mmq attack (Lever 2) | 0 | 0 | - | - | SPENT - relocated mandatory work, proven flat |
| 8 | decode CUDA-graph capture | 0 | 0 | - | - | SPENT - ALREADY in effect (graphId=11), did not close gap |
| 9 | persistent cgraph (uid fast-path) | ~0.2 (0.05-0.4%) | ~0 | YES | MEDIUM | second-order to the SSM floor |
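Levers #1, #3, and the gather of #5 are the SAME kernel rewrite: build them together as a
single-pass recurrence. Levers #6/#7/#8 are dead (at-floor or already-shipped). Lever #4 is a
distinct, bit-exactness-breaking precision track. (Lever numbers in this paragraph and in
section 4 refer to the # column; the parenthesized "Lever N" labels inside the table cells
follow a separate numbering.)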
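The "-> % of vLLM" column can be reproduced from the two step times in the section heading (384 ms llama, 327 ms vLLM). The following is a minimal sketch of that conversion; the function name `pct_of_vllm` is illustrative and not part of any profiling script.

```python
LLAMA_STEP_MS = 384.0  # llama step time, from the section heading
VLLM_STEP_MS = 327.0   # vLLM step time, from the section heading


def pct_of_vllm(recovered_ms: float) -> float:
    """Share of vLLM throughput after removing `recovered_ms` from llama's step."""
    return 100.0 * VLLM_STEP_MS / (LLAMA_STEP_MS - recovered_ms)


if __name__ == "__main__":
    # Recovery values taken from the ranked table above.
    for label, ms in [
        ("baseline", 0.0),
        ("gate-fold ceiling", 12.76),
        ("lever #1 upper bound", 40.0),
    ]:
        print(f"{label:>22}: {pct_of_vllm(ms):.1f}% of vLLM")
```

Because throughput is inversely proportional to step time, each recovered millisecond is worth slightly more as the step shrinks.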
### 4. The honest verdict and the single highest-value next step
**Is true (bit-exact) decode parity reachable?** UNCERTAIN, and it hinges entirely on the
unmeasured byte ratio:
- If llama's recurrence re-streams state (~2x bytes from un-fused ops): YES - a single-pass
fused recurrence (Lever 1) plus conv fusion (Lever 2) plausibly recover ~20-40 ms, taking
llama to ~345-365 ms = ~90-95% of vLLM, near-bit-exact (gate on KL tolerance).
- If llama is already at the GB10 bandwidth floor for f32 state: NO in bit-exact form - the
57 ms is a hardware floor, and only bf16 state (Lever 4, non-bit-exact) closes it.
Either way, the gating-fold-alone path tops out at ~89% of vLLM, so the project should NOT
ship the isolated gate fold as "the parity lever."
**SINGLE highest-value next IMPLEMENTATION step:** build the **single-pass fused GDN
recurrence kernel** (Lever 1 = fold gate + l2norm + state-writeback + gather into one pass
over the recurrent state) - BUT gate the build on one cheap measurement first, because it
is a HIGH-effort kernel rewrite that is worthless if the recurrence is already byte-minimal.
**The measurement that confirms it before over-investing (one short GPU run, gap-analysis
agent only):** `ncu` on `gated_delta_net_cuda<128>` at B=128 vs vLLM's
`fused_recurrent_gated_delta_rule_packed_decode_kernel` for identical layer dims, two
counters:
- `dram__bytes.sum` (actual DRAM bytes/call)
- `dram__throughput.avg.pct_of_peak_sustained_elapsed` (achieved % of 273 GB/s)
Decision rule:
- llama moves ~2x minimal bytes OR vLLM moves materially fewer for the same math -> redundant
un-fused state round-trips -> BUILD the single-pass fused recurrence; predicted recovery
scales with the byte delta (up to ~40 ms). This is the gap-closer.
- llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same -> the
recurrence is at the GB10 hardware floor -> do NOT build the fusion for throughput (only
the ~3% gate-fold ceiling remains); the sole remaining lever is bf16 state (Lever 4,
accept non-bit-exact), and bit-exact parity is NOT reachable.
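The decision rule above can be written down as a small function so the ncu result maps mechanically to a verdict. This is a sketch only: the numeric cut-offs for "~2x", "materially fewer", "~minimal", and "the same" are assumptions chosen for illustration (only the 85% of peak figure comes from the text), and the function name is hypothetical.

```python
# Hypothetical cut-offs; only FLOOR_PCT_OF_PEAK is stated in the notes.
REDUNDANT_RATIO = 1.5      # llama bytes / minimal bytes treated as "~2x" redundancy
MATERIALLY_FEWER = 0.8     # vLLM bytes / llama bytes treated as "materially fewer"
NEAR_MINIMAL_RATIO = 1.15  # llama bytes / minimal bytes treated as "~minimal"
SAME_BYTES = 0.9           # vLLM bytes / llama bytes treated as "the same"
FLOOR_PCT_OF_PEAK = 85.0   # achieved % of peak DRAM throughput


def gdn_fusion_decision(llama_bytes, vllm_bytes, minimal_bytes, llama_pct_of_peak):
    """Map the two ncu counters to one of three outcomes."""
    llama_ratio = llama_bytes / minimal_bytes
    vllm_vs_llama = vllm_bytes / llama_bytes
    if llama_ratio >= REDUNDANT_RATIO or vllm_vs_llama <= MATERIALLY_FEWER:
        return "build-fused-recurrence"
    if (
        llama_ratio <= NEAR_MINIMAL_RATIO
        and llama_pct_of_peak > FLOOR_PCT_OF_PEAK
        and vllm_vs_llama >= SAME_BYTES
    ):
        return "hardware-floor"
    return "inconclusive"


if __name__ == "__main__":
    # Made-up inputs in units of the minimal (s0-read + s1-write) byte count.
    print(gdn_fusion_decision(2.0, 1.0, 1.0, 60.0))
    print(gdn_fusion_decision(1.05, 1.05, 1.0, 90.0))
```

The third outcome exists because the two branches in the text do not cover every measurement; a result between them would need a closer look rather than a forced verdict.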
**No-regret parallel work** (build regardless of the ncu outcome, bit-exact, medium effort):
the conv-state concat -> ssm_conv in-place fusion (Lever 2, ~8-12 ms = +2-3% toward parity),
which removes concat_cont (8.01 ms) and the conv-state writeback cpy off a bandwidth-bound,
single-stream step where their full GPU-time is wall-clock.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,185 +0,0 @@
# llama-server vs vLLM: decode-step gap decomposition (DGX Spark, GB10 / sm_121)
Profiling study (no engine changes). Question: matched apples-to-apples (both
batched servers, NVFP4-class weights, prefix caching on, both eager), why is
`llama-server` ~4-6x slower **per decode step** than vLLM on Qwen3-32B at a
1024-token shared-prefix / batch-32 fan-out, and what is closable vs structural.
Hardware: NVIDIA GB10 (sm_121), unified LPDDR5X. Model: Qwen3-32B, 64 layers.
llama side: `~/llama-paged-dev/build-cuda/bin/llama-server`, `q3-32b-nvfp4-dense.gguf`
(NVFP4 weights, type-40 FP4-MMA path), `-ngl 99 --parallel 32 -c 40960 -fa on`,
`GGML_CUDA_DISABLE_GRAPHS=1` (eager). vLLM 0.23.0 NVFP4A16 (W4A16/Marlin),
`--enforce-eager`. Workload: 1024-token shared prefix + unique 32-token suffix,
K=32 concurrent, generate 64. All profiling scripts are dev-tree only
(`~/bench/decode_study/`); minimal in-code timers were not needed (server already
reports per-slot `eval time`, which excludes prompt-eval = pure decode).
## TL;DR
1. **The real-server decode is GPU-BOUND, not host-bound.** During steady decode
the GPU is **~94.6% utilized** (nvidia-smi, real run) / 85-95% busy (nsys).
Per-slot CPU sampling, detokenize, and `update_slots` are fully hidden: a 5-stage
sampler chain gives the same step time as greedy to within noise (1346 vs 1343 ms). The
"GPU stalls on the CPU serving loop" hypothesis is **refuted** for this workload.
2. **At 1024 context the decode step is ~84% KV/attention, ~16% weight GEMM** - the
opposite of the thin-batch-GEMM story. Attention scaling with context length, not
the matmul, is the load-bearing cost.
3. **The worktree's paged KV engine is a decode REGRESSION: ~1.85x slower than
stock** at 1024 ctx (paged 1279-1343 ms/step vs stock 650-729 ms/step). It
gathers K/V/mask into a contiguous buffer (`ggml_get_rows`) every layer every
step, then runs a dense FA kernel - paying a full extra KV read+copy that vLLM's
in-kernel PagedAttention never pays. Paging helps prefix-prefill memory; it hurts
decode latency.
4. Even **stock** llama-server (~650-729 ms/step) is **~4-5x slower than vLLM**
(~120-185 ms/step). The residual gap is the **long-context decode-attention
kernel** and, secondarily, the **thin-batch FP4 weight GEMM** - both kernel-maturity
gaps vs vLLM's FlashInfer/FA paged-decode + Marlin, not serving-loop gaps.
## The measured numbers (batch 32, server-reported pure-decode step time)
`server_decode_step_ms` = max / mean-of-top-8 of per-slot `eval time ms-per-token`
(the most-contended, full-batch-32 slots; excludes prompt eval).
| config | decode step ms (max / top8) | client wall ms/step |
|------------------------------------------|-----------------------------|---------------------|
| paged, ctx 1024, greedy | 1343 / 1279 | 1468 |
| paged, ctx 1024, **heavy 5-sampler** | 1346 / 1280 | 1470 |
| **stock** (no paging), ctx 1024, greedy | **729 / 650** | 768 |
| paged, **ctx 64** (short), greedy | **215 / 215** | 253 |
| vLLM NVFP4A16, ctx 1024 (K=32) | **~120-185** (270 tok/s) | - |
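For reference, the `server_decode_step_ms` metric defined above (max and mean-of-top-8 of the per-slot ms-per-token values) can be sketched as below. The function name and the input values are illustrative; parsing the server log is left out.

```python
def server_decode_step_ms(per_slot_ms_per_token, top_n=8):
    """Return (max, mean of the top_n slowest slots) of per-slot ms-per-token."""
    if not per_slot_ms_per_token:
        raise ValueError("need at least one slot")
    ranked = sorted(per_slot_ms_per_token, reverse=True)
    top = ranked[:top_n]
    return ranked[0], sum(top) / len(top)


if __name__ == "__main__":
    # Made-up per-slot values for 32 slots, not measured data.
    slots = [1200.0 + 4.0 * i for i in range(32)]
    worst, top8 = server_decode_step_ms(slots)
    print(f"max={worst:.0f} ms, top8={top8:.0f} ms")
```

Taking the slowest slots targets the ones that decoded for the whole run at full batch 32, which is the regime the table reports.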
The brief's reference ~828 ms/step sits between the stock (650-729) and paged
(1279-1343) numbers measured here; the decomposition below is what is robust. Our
fan-out shares no prefix across the 32 slots (each slot independently prefills 1056
tokens - confirmed in the log), so the 32 sequences are genuinely concurrent and the
"max" slot is maximally contended, which is why our paged max (1343 ms) runs above 828.
### Context sweep - decode step is attention-scaling, not fixed overhead
Pure-decode step vs shared-prefix length (paged, batch 32):
| prefix ctx | decode step ms |
|-----------|----------------|
| 64 | 215 |
| 128 | ~290 |
| 256 | ~410 |
| 512 | ~660 |
| 1024 | ~1280 |
Roughly linear in context length: ~1 ms of added step time per added context token.
The **215 ms at ctx 64 is the fixed floor** (weight GEMM + activations + norm/rope +
loop + sampling, attention negligible). Everything above it scales with KV length =
attention + KV plumbing. At 1024 ctx the fixed floor is only ~16% of the step.
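The "roughly linear, ~1 ms per context token" reading can be checked directly against the sweep table. A minimal sketch using only the five table rows (approximate values as listed; helper names are illustrative):

```python
# (prefix ctx, decode step ms) rows from the context sweep table.
SWEEP = [(64, 215.0), (128, 290.0), (256, 410.0), (512, 660.0), (1024, 1280.0)]


def segment_slopes(points):
    """ms of added step time per added context token, between adjacent rows."""
    return [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(points, points[1:])]


def overall_slope(points):
    """End-to-end slope between the first and last row."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return (y1 - y0) / (x1 - x0)


def floor_share(points):
    """Shortest-context step as a fraction of the longest-context step."""
    return points[0][1] / points[-1][1]


if __name__ == "__main__":
    print([round(s, 2) for s in segment_slopes(SWEEP)])
    print(round(overall_slope(SWEEP), 2), round(100 * floor_share(SWEEP), 1))
```

Every segment slope stays close to 1 ms/token, which is what supports treating the cost above the ctx-64 step as context-proportional.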
## Where the ~1280 ms paged decode step goes (nsys, pure-decode window)
`nsys profile --delay=70 --duration=25 --trace=cuda` windowed onto steady 32-way
decode (`srv_decode2.nsys-rep`; an earlier 25-60s window was discarded because nsys's
own slowdown stretched the 32 prefills into it, inflating GEMM to a misleading 58%).
GPU busy in-window 85.5% (nsys adds gaps; the real run is ~94.6% by nvidia-smi).
| bucket | % GPU time | abs (of ~1280 ms) | what it is |
|--------------------------------|-----------:|------------------:|------------|
| `flash_attn_ext_f16` ATTENTION | **47.7%** | ~610 ms | decode attention over the 1056-cell KV |
| `cpy_scalar` KV copy/cast | 18.3% | ~234 ms | KV write + f32->f16 casts |
| `get_rows/set_rows` KV gather | 17.8% | ~228 ms | **paged** gather of K/V/mask to contiguous |
| `mul_mat_q` + `quantize_mmq` | 15.7% | ~201 ms | NVFP4 weight GEMM (+ activation requant) |
| rmsnorm / silu / rope / add | ~0.6% | ~8 ms | elementwise |
Cross-check: the GEMM bucket (~201 ms) matches the ctx-64 floor (215 ms) - i.e. the
weight matmul is ~the entire short-context step, and is context-independent, as
expected. KV/attention buckets (47.7+18.3+17.8 = **83.8%**) match the context-sweep
finding that ~84% of the step scales with context.
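The absolute column and the 83.8% cross-check follow from the percentages and the ~1280 ms step. A short sketch of that arithmetic, with the bucket shares copied from the table (dictionary keys are shortened labels, not kernel names from the trace):

```python
STEP_MS = 1280.0  # paged decode step used for the "abs" column

BUCKET_PCT = {
    "attention": 47.7,    # flash_attn_ext_f16
    "kv_copy_cast": 18.3, # cpy_scalar
    "kv_gather": 17.8,    # get_rows / set_rows
    "gemm": 15.7,         # mul_mat_q + quantize_mmq
    "elementwise": 0.6,   # rmsnorm / silu / rope / add
}
KV_BUCKETS = ("attention", "kv_copy_cast", "kv_gather")


def bucket_ms(name, step_ms=STEP_MS):
    """Absolute GPU time of one bucket within a step."""
    return step_ms * BUCKET_PCT[name] / 100.0


def kv_share_pct():
    """Combined share of the context-scaling buckets."""
    return sum(BUCKET_PCT[name] for name in KV_BUCKETS)


if __name__ == "__main__":
    for name in BUCKET_PCT:
        print(f"{name:>14}: {bucket_ms(name):7.1f} ms")
    print(f"KV/attention share: {kv_share_pct():.1f}%")
```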
Power signature: ~33-36 W at 94% "utilization" (GB10 can pull far more). High util%
+ low power = the kernels are **memory/latency-bound, not compute-saturated** - the
classic decode signature (stream 19 GB of NVFP4 weights + a growing KV every step).
### Stock vs paged decomposition
- **Stock** (~650 ms): ~215 ms GEMM floor + ~435 ms attention/KV (contiguous KV read
directly by the FA kernel, **no gather**).
- **Paged** (~1280 ms): same ~215 ms floor + ~610 ms attention + **~455 ms paged
gather/copy overhead** (the `get_rows` of K/V/mask plus the extra KV copy that
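feeds the dense FA kernel). That ~455 ms (~36% of the step) is the paged engine's
self-inflicted cost and accounts for most of the stock->paged regression (~630 ms,
~1280 vs ~650 ms); the remaining ~175 ms is the larger attention bucket (~610 ms vs
stock's ~435 ms attention/KV).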
## vLLM decode architecture mapped onto each llama bucket
vLLM at ~120-185 ms/step is faster on **every** bucket:
| llama bucket (paged) | ms | vLLM equivalent | does vLLM avoid it? |
|-----------------------------|-------|-----------------|---------------------|
| paged KV gather (get_rows) | ~228 | PagedAttention reads blocks **in-kernel** via a block table | **Yes - entirely.** No gather op exists. |
| KV copy/cast | ~234 | KV written once into block pool; FA reads it in place | Mostly - no per-step recopy |
| decode attention | ~610 | FlashInfer / FA paged-decode GQA kernel, split over KV | Same op, far faster kernel on sm_121 |
| weight GEMM + act quant | ~201 | fused Marlin/Machete W4A16 dequant+MMA, no separate quant pass | Faster + removes the requant kernel |
| CPU sampling / loop | ~0 (hidden) | on-GPU batched sampling | N/A here - already hidden on llama side too |
vLLM's whole-step (~150 ms) is **less than llama's GEMM floor alone (~215 ms)**, so
vLLM is ahead on the matmul *and* the attention *and* avoids the gather. The gap is a
stack of kernel-efficiency wins, not one silver bullet.
## Ranked levers - closable vs structural
1. **Remove the paged gather regression. [Tractable, ~455 ms / ~36% on the paged
path; net-zero risk - it is a regression]** The worktree's paged engine makes
decode 1.85x slower than stock by gathering K/V/mask to contiguous every layer
every step (patch 0003 `ggml_get_rows`). For latency-bound decode, **do not enable
paged KV** - it only ever helps prefix-prefill *memory*, never decode latency.
Fully recovering this *and* keeping paging requires reading paged blocks
in-kernel like vLLM (a from-scratch paged-attention CUDA kernel) - see lever 2.
2. **Long-context decode-attention kernel. [Biggest real lever, ~435 ms of stock /
~610 ms of paged; partly structural]** Even stock is attention-bound at 1024 ctx.
llama.cpp's `flash_attn_ext_f16` decode path is ~4-5x slower than vLLM's
FlashInfer/FA paged-decode GQA kernel on this Blackwell-class part. This is the
cost that *grows with context* - exactly the regime the brief targets. Tractable in
principle (a proper flash-decoding / split-K-over-KV kernel, and a true in-kernel
paged read that also kills lever 1's gather), but it is deep CUDA work on a new
arch and partly gated by kernel maturity on sm_121. **Highest-impact, hardest.**
3. **Thin-batch FP4 weight GEMM floor. [Tractable, ~201-215 ms / 15-30%; bounded]**
The NVFP4 `mul_mat_q` + separate `quantize_mmq` activation pass is memory-bound and
less efficient than vLLM's fused Marlin/Machete W4A16. Fusing dequant into the MMA
and folding the activation quant into the GEMM is tractable kernel work. Bounded
impact: the floor cannot drop below the weight-read bound (~19 GB / LPDDR5X bandwidth per step).
4. **Host serving loop / per-slot sampling. [NOT a lever]** Measured zero: greedy ==
heavy-sampler step time; GPU 94.6% busy. On-GPU/batched sampling buys nothing until
the kernels (levers 1-3) get fast enough to expose host overhead. Refutes the
"host-bound serving loop" hypothesis for this decode-bound workload.
5. **Continuous-batch scheduler. [NOT the gap / structural elsewhere]** llama-server
already fuses all 32 slots into one decode step (one set of kernels per step over
batch 32 - confirmed in the trace). vLLM's continuous/chunked-prefill batching wins
on *mixed* prefill+decode overlap, but the steady decode-step gap measured here is
kernel-bound, not scheduler-bound.
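Lever 3's weight-read bound can be put in milliseconds with one division. The sketch below assumes a peak bandwidth of ~273 GB/s, a figure taken from the companion decode-parity notes rather than from this study, and treats "19 GB" as decimal gigabytes; both are assumptions.

```python
WEIGHT_BYTES = 19e9      # NVFP4 weights streamed once per step (from the text)
PEAK_BW_BYTES_S = 273e9  # assumed GB10 LPDDR5X peak, from the companion notes


def weight_read_floor_ms(weight_bytes=WEIGHT_BYTES, bandwidth=PEAK_BW_BYTES_S):
    """Lower bound on step time if the step only had to read the weights once."""
    return 1000.0 * weight_bytes / bandwidth


def headroom(measured_floor_ms, bound_ms):
    """How many times slower the measured floor is than the read bound."""
    return measured_floor_ms / bound_ms


if __name__ == "__main__":
    bound = weight_read_floor_ms()
    print(f"weight-read bound: {bound:.1f} ms/step")
    print(f"measured ctx-64 floor is {headroom(215.0, bound):.1f}x that bound")
```

Peak bandwidth is never fully achievable, so the real bound sits somewhat above this number; the point is only the order of magnitude of what lever 3 can recover.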
## Honest bottom line
The ~4-6x per-step gap is **GPU-kernel-bound**, and it decomposes as:
- ~36% of the *paged* step is a **self-inflicted gather regression** - remove it
(don't run paged for decode-latency workloads).
- The remaining ~4-5x vs vLLM (true even for stock) is **kernel efficiency**:
llama.cpp's long-context decode-attention and thin-batch FP4 GEMM are slower than
vLLM's PagedAttention + Marlin on GB10. That is a **kernel project** (in-kernel
paged attention + flash-decoding + fused W4A16 GEMM), not a serving-loop project.
- Sampling, detokenize, `update_slots`, and the continuous-batch scheduler are **not**
the gap; the GPU is ~95% busy on memory-bound kernels the whole step.
What is closable: lever 1 (immediately, by not paging), lever 3 (bounded, with kernel
work). What is structural / hard: lever 2 (the decode-attention kernel + a real
in-kernel paged read), which is where the context-scaling gap actually lives and where
any serious effort to approach vLLM on GB10 must go.
## Reproduction (dev-tree only, `~/bench/decode_study/`)
- `launch_srv.sh` / `runcfg.sh` - launch llama-server (paged on/off) and a config.
- `client.py` - K=32 token-id fan-out (1024 prefix + 32 suffix), `SAMP=greedy|heavy`.
- `d2drv.sh` - nsys pure-decode window (delay 70s past prefill) -> `srv_decode2.nsys-rep`.
- `cat2.py` - kernel-time categorization from the sqlite export.
- vLLM side: `~/bench/run_vllm.sh` + `vllm_prefix.py` (K=32, ~270 tok/s).


@@ -1,756 +0,0 @@
# Decode parity exploration (post-SSM-fix) - per-agent findings
Post-SSM-fix decode (patches 0018 in-place state write-back + 0019 fused gather):
dense q36-27b-nvfp4 decode_agg = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact.
The remaining +54% to parity is the question each section below probes. All numbers
DGX GB10 (sm_121), fusion OFF baseline, `decode_agg` = `S_TG t/s`.
---
## Section: per-token-latency (critical path / host-loop) - READ-ONLY
**Verdict: the per-step critical path and host loop are NOT the residual lever.
Post-SSM the GPU is still ~99% busy at npl128; the entire exposed-idle budget is
~0.65% of the step (~2.4 ms), of which graphs already remove the within-step half
(0.34%) and the between-step host gap is ~2 ms/step (~0.4% post-SSM). The 64-layer
sequential chain does NOT under-fill the GPU at batch 128 - every kernel's grid
saturates the SMs on its own. The +54% to parity is GPU kernel work (FP4 GEMM
efficiency + LPDDR5x weight bandwidth), not serialization or host overhead.**
### 1. Measured exposed-idle structure (a2_nsys pre-SSM rep, read-only sqlite sweep)
`paged_off_npl128.sqlite`, steady window 40-97% of trace (14.78 s, ~16.5 decode
steps at the pre-SSM ~896 ms/step). Overlap-correct interval-union sweep:
| activity set | busy % | exposed idle |
|-------------------------|---------|--------------|
| kernels only | 80.25% | 19.74% |
| kernels + memcpy (all) | 99.35% | **0.65%** |
- The 19.4% kernels-only gap = **841 big gaps (median 3.35 ms, ~51/step)** that are
filled by D2D memcpy. These ARE the per-layer gated-DeltaNet recurrent-state copies
(the `gated_delta_net -> ggml_cpy(state->cache) -> next layer reads state` chain).
They were a real critical-path serialization, and **patches 0018/0019 removed exactly
these** (D2D bucket 18.9% -> 0.23%; get_rows gather 18.8% -> 0.7%). Decode rose
+37.8% (186 -> 256 t/s), ~matching the work removed -> the kernels reflowed
back-to-back, so post-SSM these big gaps are CLOSED, not re-exposed (inferred from
the throughput scaling; the post-SSM nsys was not re-profiled by this read-only agent).
- The TRUE exposed idle (kernel+memcpy union) is **0.65%**: 18 host gaps >=0.5 ms,
**median 2.06 ms, max 2.85 ms, ~1.1/step**. This is the single between-step host gap
(sample-128 + `update_slots` + next-batch build) that does NOT overlap GPU compute.
- Within-step launch gaps: 24,190 micro-gaps, median 2.14 us, summing to 50.6 ms =
**0.34%** of the window - the pure launch overhead that CUDA graphs collapse
(measured 0.37% -> 0.11% in A2_CUDAGRAPH_DECODE; graphs already engage on the
default paged decode with a 256-token reset cadence).
### 2. Post-SSM scaling of the FIXED host gap
The ~2 ms/step between-step host gap is FIXED work (independent of GPU kernel time).
As decode accelerated it grew only as a fraction of a shrinking step:
| build | step ms @npl128 | host gap | host gap % of step |
|---------------|-----------------|----------|--------------------|
| pre-SSM (146) | ~877 | ~2 ms | 0.24% |
| post-SSM (256)| ~499 | ~2 ms | **~0.40%** |
| vLLM (391) | ~328 | (n/a) | (would be ~0.6%) |
Even fully removing it (perfect overlap) buys ~0.4%. It is a second-order floor, not
the lever - it only becomes material once the kernels are fast enough to drop GPU-busy
below the host time, which is not the case at 65% of parity.
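The step times and host-gap shares in the table follow from the aggregate throughput at 128 parallel sequences. A minimal sketch, assuming one token per sequence per step and the fixed ~2 ms gap from section 1 (function names are illustrative):

```python
N_PARALLEL = 128   # npl128: sequences decoded per step
HOST_GAP_MS = 2.0  # fixed between-step host gap from section 1


def step_ms(tokens_per_s, n_parallel=N_PARALLEL):
    """Step time implied by aggregate decode throughput."""
    return 1000.0 * n_parallel / tokens_per_s


def host_gap_pct(tokens_per_s, gap_ms=HOST_GAP_MS):
    """Host gap as a percentage of the step."""
    return 100.0 * gap_ms / step_ms(tokens_per_s)


if __name__ == "__main__":
    for label, tps in [("pre-SSM", 146.0), ("post-SSM", 256.0), ("vLLM", 391.0)]:
        print(f"{label:>9}: {step_ms(tps):6.1f} ms/step, host gap {host_gap_pct(tps):.2f}%")
```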
### 3. The 64-layer chain does NOT under-fill the GPU at batch 128
The decode is an intrinsically sequential depth-64 chain (autoregressive: layer N
needs layer N-1; cannot be parallelized across layers). The question is whether each
individual kernel fills the SMs at batch 128. It does:
- **GDN kernel** (`gated_delta_net.cu`): launch grid `dim3(H, n_seqs, ceil(S_v/4))`
= `48 x 128 x 32 = 196,608 blocks` (dense, H=48 value heads, S_v=128). Block
`(warp_size, 4, 1)`. Massively oversubscribes the GB10 SMs. Each warp loads its
state shard into registers once and runs a single `n_tokens==1` iteration - O(1) in
context (confirmed flat across 4x ctx in GDN_DECODE_VERIFY).
- **FP4 GEMM** (`mul_mat_q`, mmq_x=128): M=128 token tile, well into the M-batched
regime, full SM occupancy (and Track B P2a already showed it goes 2 CTA/SM).
- The 99.35% kernel+memcpy busy reading IS the direct proof there is no under-fill at
npl128: if the chain under-filled, busy% would be well below 99%.
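The GDN launch geometry quoted above is easy to recompute for other batch sizes. A small sketch of the grid formula `dim3(H, n_seqs, ceil(S_v/4))` with the dense-model values as defaults (the helper names are illustrative, not taken from `gated_delta_net.cu`):

```python
import math


def gdn_launch_grid(n_heads=48, n_seqs=128, s_v=128, cols_per_block=4):
    """Grid dimensions (x, y, z) for the GDN decode kernel as described above."""
    return (n_heads, n_seqs, math.ceil(s_v / cols_per_block))


def total_blocks(grid):
    """Number of thread blocks in the grid."""
    return math.prod(grid)


def threads_per_block(warp_size=32, warps=4):
    """Block shape (warp_size, 4, 1) flattened to a thread count."""
    return warp_size * warps


if __name__ == "__main__":
    for n_seqs in (4, 32, 128):
        grid = gdn_launch_grid(n_seqs=n_seqs)
        print(n_seqs, grid, total_blocks(grid), threads_per_block())
```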
Under-fill only appears at LOW batch (npl32/npl4), where it manifests as the
weight-bandwidth/GEMV regime (npl32 = 170 t/s vs npl128 = 256): fewer tokens amortize
the same per-step weight read, NOT idle SMs. That is a bandwidth floor, not a
host/scheduler problem.
### 4. What the host actually does per step (eager rep runtime API)
Steady-window `CUPTI_ACTIVITY_KIND_RUNTIME` totals (host-thread wall, overlaps GPU):
| API | n | total | avg |
|---------------------------|-------|---------|---------|
| cudaStreamSynchronize | 1723 | 7775 ms | 4513 us |
| cudaLaunchKernelExC | 30983 | 4045 ms | 131 us |
| cudaLaunchKernel | 20385 | 2694 ms | 132 us |
| cudaMemcpyAsync | 2085 | 96 ms | 46 us |
~104 stream-syncs/step and ~3100 kernel launches/step in eager mode (collapsed by
graphs to ~900 launches/step). The 7.8 s of sync is the host BLOCKING on the busy
GPU (it overlaps GPU compute, it is NOT exposed idle) - the GPU stays 99.4% busy. The
sampled-token path is `cudaMemcpyAsync` (96 ms total, negligible, non-blocking). The
only NON-overlapped residue is the ~2 ms/step between-step gap in section 1.
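The per-step figures (~104 syncs, ~3100 launches) come from dividing the window totals by the ~16.5 decode steps in the steady window. A sketch of that reduction, with the table rows copied in and the step count taken from section 1:

```python
STEPS_IN_WINDOW = 16.5  # decode steps in the steady window (section 1)

# API name -> (call count, total host-thread time in ms), from the table above.
RUNTIME_API = {
    "cudaStreamSynchronize": (1723, 7775.0),
    "cudaLaunchKernelExC": (30983, 4045.0),
    "cudaLaunchKernel": (20385, 2694.0),
    "cudaMemcpyAsync": (2085, 96.0),
}


def calls_per_step(names, steps=STEPS_IN_WINDOW):
    """Combined call count per decode step for the given API names."""
    return sum(RUNTIME_API[name][0] for name in names) / steps


def avg_call_us(name):
    """Average duration of one call in microseconds."""
    count, total_ms = RUNTIME_API[name]
    return 1000.0 * total_ms / count


if __name__ == "__main__":
    print(round(calls_per_step(["cudaStreamSynchronize"])))
    print(round(calls_per_step(["cudaLaunchKernelExC", "cudaLaunchKernel"])))
    print(round(avg_call_us("cudaStreamSynchronize")))
```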
### 5. vLLM host-loop comparison (per VLLM_DECODE_GROUNDING.md)
vLLM's eager decode is host-cheap BY CONSTRUCTION and hides the host fully behind the
async CUDA stream WITHOUT pipelined scheduling (`async_scheduling` was OFF; it won the
2.4x with synchronous scheduling): persistent pre-allocated input buffers updated by
vectorized numpy (no per-token Python), attention metadata `build()` once per step
reused across all layers, no GPU->CPU sync in the hot path, sampled-token D2H
non-blocking + event-gated, and a fixed small launch sequence (~2 ops/Linear). The
next-step host prep overlaps the current-step GPU compute on the async stream. The key
asymmetry vs llama: vLLM builds its graph ONCE and reuses persistent device
KV/block metadata; ggml rebuilds/reallocates the cgraph each decode step (new
`cgraph->uid`) and re-dispatches ~3100 launches from the loop on the weak Grace cores.
But this asymmetry is hidden under GPU compute on BOTH sides at npl128: llama's host
loop is a 0.4% exposed gap, not a 2x lever. vLLM's host cheapness is why ITS step is
328 ms host-free, but llama's 499 ms is also ~99% GPU - the 171 ms difference is GPU
kernel time (FP4 GEMM), not host.
### 6. Is any host/serialization lever CUDA-graph or scheduler addressable?
- **Within-step launch idle (0.34%)**: CUDA-graph addressable, ALREADY captured by
default (0.37 -> 0.11%). Worth ~0% of decode_agg (measured +0.1-0.8%, noise).
Nothing left to win here.
- **Between-step host gap (~2 ms, ~0.4%)**: NOT removed by a graph (the graph replays
the forward; the host still samples + runs `update_slots` + rebuilds the batch
between replays). It is SCHEDULER addressable - overlap step N+1's host prep with
step N's GPU compute, mirroring vLLM's persistent-buffer + build-once-reuse +
non-blocking-D2H pattern (and ideally reuse the ggml cgraph across steps instead of
rebuilding it every ubatch). But the ceiling is ~0.4% of the step, so it is a
cleanup, not a parity lever.
- **The +54% to parity is none of the above.** It is GPU kernel work: post-SSM the FP4
GEMM family is ~48% of decode (the dominant residual), GDN recurrence ~22.5%, and the
decode is weight-bandwidth/latency-bound on LPDDR5x (Track B P2a: a -24.7% FP4-GEMM
kernel left decode_agg FLAT, the freed compute became idle gaps -> decode is not
GEMM-compute-bound but bandwidth/latency-bound). The lever lives in cutting DRAM
traffic per step (fused act-quant to drop the separate `quantize_mmq` pass, native
FP4-MMA, and/or NVFP4-dense weight quant), NOT in the host loop or CUDA graphs.
### Evidence
- Read-only sqlite sweeps on `~/bench/a2_nsys/paged_off_npl128.sqlite` (this agent).
- `gated_delta_net.cu` launch grid (DGX `~/llama-paged-dev`).
- A2_CUDAGRAPH_DECODE.md, SSM_DECODE_FIX_RESULTS.md, GDN_DECODE_VERIFY.md,
VLLM_DECODE_GROUNDING.md, THROUGHPUT_B_P2a_POSTSSM_RESULTS.md.
---
## Section: gdn-source-compare (llama gated_delta_net.cu vs vLLM fused_recurrent_gated_delta_rule)
### Model config (Qwen3.5-27B dense, from vLLM config.json)
- linear_key_head_dim K = 128, linear_value_head_dim V = 128
- linear_num_key_heads = 16, linear_num_value_heads = 48 (GVA 3:1), conv_kernel = 4
- 64 layers, full_attention_interval 4 -> 48 linear (GDN) : 16 full-attn
- Recurrent state per (seq, v-head) = V*K = 128*128 = 16384 f32 = 64 KiB.
Per layer per seq = 48 * 64 KiB = 3 MiB. Both engines store state in f32.
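The state sizes above, and what they imply per decode step, can be worked out as follows. The per-step total is a derived estimate, not a measurement: it assumes B=128 sequences (the batch used elsewhere in these notes) and that every state byte is read once and written once per token.

```python
K_DIM = 128       # linear_key_head_dim
V_DIM = 128       # linear_value_head_dim
V_HEADS = 48      # linear_num_value_heads
GDN_LAYERS = 48   # linear (GDN) layers out of 64
F32_BYTES = 4     # both engines store state in f32


def state_bytes_per_head():
    """Recurrent state for one (seq, v-head)."""
    return K_DIM * V_DIM * F32_BYTES


def state_bytes_per_layer_per_seq():
    """Recurrent state for one sequence in one GDN layer."""
    return V_HEADS * state_bytes_per_head()


def state_traffic_per_step(n_seqs=128):
    """Assumed read-once + write-once state traffic over all GDN layers."""
    return 2 * state_bytes_per_layer_per_seq() * n_seqs * GDN_LAYERS


if __name__ == "__main__":
    print(state_bytes_per_head() // 1024, "KiB per (seq, v-head)")
    print(state_bytes_per_layer_per_seq() // 2**20, "MiB per layer per seq")
    print(state_traffic_per_step() / 2**30, "GiB per step at B=128")
```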
### Which kernels run at decode
- llama: ggml_gated_delta_net_inplace_ids -> gated_delta_net_cuda<S_v=128, KDA=false, keep_rs_t=false>.
Gate is SCALAR per head (graph reshapes gate/beta to ne[0]=1), so the cheaper !KDA branch runs (one expf per token, not per-channel).
- vLLM: enable_packed_recurrent_decode -> fused_recurrent_gated_delta_rule_packed_decode_kernel
(the dedicated single-token decode kernel, NOT the generic varlen fwd kernel).
### The state HBM traffic is IDENTICAL - it is NOT the lever
Per (seq, v-head) per decode token both engines read 64 KiB state + write 64 KiB state, f32, coalesced.
The dominant memory term is equal. llama is NOT moving more state bytes than vLLM.
=> The 1.46 ms/call is llama achieving LOWER effective bandwidth on the SAME bytes,
plus extra non-state work, NOT a fundamental HBM-traffic deficit. Hence closable.
### Algorithmic / parallelization delta (the real differences)
1) Reduction strategy (biggest structural difference)
- llama: WARP-PER-OUTPUT-COLUMN. State stored transposed M[col][i]=S[i][col]. Each warp owns
one V-column; the contraction over the 128 K-rows is a cross-lane warp_reduce_sum.
TWO warp_reduce_sum per token (one for kv = S^T@k, one for attn = S^T@q) = ~10 shuffle
rounds on the critical path, with n_tokens=1 they are NOT amortized.
- vLLM: THREAD-PER-OUTPUT-ROW. b_h is a [BV,BK]=[32,128] tile; each thread owns a FULL K-row
of state. sum(b_h*b_k, axis=K) and sum(b_h*b_q) are THREAD-LOCAL 128-wide reductions -
ZERO cross-thread shuffles. Outer-product update b_h += b_v*b_k is also thread-local.
Same FLOPs, but vLLM has no shuffle-reduction latency in the recurrence.
2) Occupancy / launch geometry (likely the dominant bandwidth gap)
- llama: block = (32 lanes, 4 warps) = 128 threads; grid = (H=48, n_seqs, ceil(128/4)=32).
Per (head,seq) it launches 32 blocks * 128 threads = 4096 threads to touch a 16384-elem state
(only 4 state elems/thread). launch_bounds(128, 2) budgets registers for >=2 blocks/SM; with
s_shard[4]+k_reg[4]+q_reg[4]+addressing the register pressure caps it near ~2 blocks = 8 warps/SM
(~12-16% occupancy on GB10). A memory-bound kernel at ~8 warps/SM cannot generate enough in-flight
loads to saturate 273 GB/s -> low achieved bandwidth on the state read/write.
- vLLM: 1 warp/program (num_warps=1), grid (NV=4, B*HV), small register footprint, num_stages=3
software-pipelines (prefetches) the state load. Far higher memory-level parallelism per SM.
3) Redundant non-state traffic in llama
- q,k re-loaded by EVERY column-warp: 128 column-warps/head each reload the same 128-float q and k
=> ~128x amplified L2 loads of q/k per head/token (vLLM reloads ~4x, once per NV program).
Small (L2-resident) but adds load-issue + L2 pressure competing with the state stream.
- Output store: llama writes attn_data[col] from lane 0 only (31/32 lanes idle), scattered
single-float stores; vLLM stores a contiguous BV=32 vector (coalesced).
4) Fusion delta (per-layer kernel-launch / HBM round-trip count)
- vLLM packed_decode FUSES into ONE kernel: q/k l2norm + q*scale + softplus(a+dt_bias) +
(-exp(A_log)) gate + sigmoid(beta) + the recurrence + state write-back.
- llama computes these as SEPARATE ggml ops/kernels in the graph before the GDN op:
ggml_l2_norm(q), ggml_l2_norm(k), ggml_add(+dt), ggml_softplus, ggml_mul(gate),
ggml_sigmoid(beta) (+ conv/silu), each a launch + small HBM round-trip. Plus a separate
gdn_gather_nonident_kernel launch per layer (a no-op at steady-state decode: every block
early-returns on the identity check, but still a grid launch of n_seqs blocks).
Across 48 linear layers this is ~6-10 extra small kernels/layer (~300-480 extra launches/token).
Whether this dominates depends on CUDA-graph capture (see A2_CUDAGRAPH_DECODE.md); if captured,
launch latency is hidden and the cost reverts to the per-op HBM round-trips + dependency gaps.
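Several of the counts in items 1) to 4) are simple arithmetic on the kernel geometry. The sketch below recomputes them; it assumes a butterfly warp reduction takes log2(32) shuffle steps, and every name is illustrative rather than taken from either code base.

```python
import math

WARP_SIZE = 32           # lanes per warp
S_V = 128                # V-columns per head, one warp per column in llama
STATE_ELEMS = 128 * 128  # state elements per (seq, v-head)
GDN_LAYERS = 48          # linear (GDN) layers


def shuffle_rounds(reductions_per_token=2, warp_size=WARP_SIZE):
    """Shuffle steps on the critical path for the kv and attn warp reductions."""
    return reductions_per_token * int(math.log2(warp_size))


def qk_reload_ratio(llama_loaders=S_V, vllm_loaders=4):
    """How many more times llama reloads q/k per head than vLLM (NV=4 programs)."""
    return llama_loaders / vllm_loaders


def extra_launches_per_token(low=6, high=10, layers=GDN_LAYERS):
    """Range of extra small-kernel launches from unfused gate/norm ops."""
    return low * layers, high * layers


def state_elems_per_thread(blocks=32, threads_per_block=128):
    """State elements each thread touches per (head, seq)."""
    return STATE_ELEMS // (blocks * threads_per_block)


if __name__ == "__main__":
    print(shuffle_rounds(), qk_reload_ratio())
    print(extra_launches_per_token(), state_elems_per_thread())
```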
### What a faster llama GDN decode kernel would need (optimization scope)
- A. Re-parallelize like vLLM: thread/lane owns a full K-row (or K-shard) so the kv and attn
contractions become register-local FMAs, eliminating the two warp_reduce_sum per token.
- B. Raise occupancy for the memory-bound regime: drop/raise the launch_bounds minBlocks hint
(the `,2)` is too low), shrink the block, cut registers, and add a software-prefetch of the next
state shard so more state loads are in flight per SM. This directly lifts achieved bandwidth on
the equal state bytes - the single highest-leverage change.
- C. Load q,k ONCE per (head,seq) into shared memory instead of 128x per-column reload; coalesce
the output store across the warp.
- D. Fuse the gate/l2norm/scale (softplus, exp(A_log), sigmoid, l2norm) INTO the recurrence kernel,
reading raw a/b/A_log/dt_bias from registers, removing ~6 elementwise passes + their HBM round-trips
per layer (matches vLLM's packed_decode). Drop the gather no-op kernel at steady-state decode
(or fold the identity check into the recurrence prologue, which it already partly does).
- E. (Longer term) bf16 state would HALVE the dominant traffic, but vLLM keeps f32 too, so this is a
divergence-from-reference not a parity lever.
### Bottom line
llama's GDN decode kernel is NOT moving more state HBM bytes than vLLM (the dominant term is equal),
so the 1.46 ms/call is an EFFICIENCY gap, not a traffic floor: (1) cross-warp shuffle reductions on
the n_tokens=1 critical path, (2) low occupancy (~8 warps/SM from launch_bounds + register pressure)
starving memory-level parallelism so the equal state bytes move at lower effective bandwidth, plus
(3) 128x redundant q/k L2 loads and (4) ~6-10 unfused gate/norm elementwise kernels per layer that
vLLM folds into one packed-decode kernel. Highest-leverage fixes: raise occupancy + prefetch (B) and
row-local reductions (A); secondary: gate/norm fusion (D) and q/k shared-mem reuse (C).
---
## Section: validate-findings (adversarial re-derivation from raw DGX data) - READ-ONLY
Re-queried `CUPTI_ACTIVITY_KIND_KERNEL` + `CUPTI_ACTIVITY_KIND_MEMCPY` directly (kernel and
memcpy summed separately so D2D is never lumped into compute), not from summary text.
### CLAIM 1 - decode decomposition
PRE-FIX (`a2_nsys/paged_off_npl128.sqlite`, last 17s) vs `decode_decomp.txt`, match <=0.1pp:
gated_delta_net 23.40% (doc 23.43), k_get_rows 21.99% (21.88), MEMCPY-DtoD 18.89% / 382 GB /
1583 ops (18.90 / 356 GB / 1584), mul_mat_vec_q 15.53% (15.51), mul_mat_q 10.48% (10.37).
=> CONFIRMED exactly. gated_delta_net = largest single non-GEMM kernel; FP4-GEMM group ~28%;
full attention 0.37%.
D2D collapse: only on-box post-fix decomp is `ssm_decomp/after.sqlite`; MEMCPY-DtoD there =
526 ops / 0.9 ms / 0.05 GB = 0.008% of busy (from 382 GB / 18.89%). => CONFIRMED, stronger than
the doc's "0.23%" (382 GB state copy-back gone; exact "0.23%/2.93GB/734ops" not reproducible -
my DtoD 0.05 GB, the 2.16 GB is DtoH).
FLAG (refutes part of the Step-2 decomp): `after.sqlite` is a Step-1 build (patch 0018 only),
NOT Step-2. It still shows k_get_rows_float 28.44% (gated_delta_net 28.96%, FP4-GEMM group ~33%),
no `gdn_gather_nonident` kernel, profiled S_TG=164 (~Step-1 180, not Step-2 256); mtime 00:31
predates the 08:48 rebuild that carried patch 0019. The Step-2 split in `SSM_DECODE_FIX_RESULTS`
("get_rows 18.8%->0.7%, FP4-GEMM ->48%, GDN 22.5%") has NO surviving sqlite, and the script meant
to produce it (`ssm_decomp.sh`) CRASHED (Python SyntaxError, see `ssm_decomp_after.out`). So
"FP4-GEMM ~48%" is UNVERIFIED against raw Step-2 data: measured ~33% on Step-1; removing the 28%
get_rows bucket lifts it to ~46% arithmetically, so ~48% is plausible but not directly measured.
Section 1 above and SSM_DECODE_FIX_RESULTS both inherit this unverified Step-2 split.
### CLAIM 2 - 146 -> ~257 ("+66%")
146.23 baseline CONFIRMED (`ssm_decode_baseline.out`); final 256.57 / 252.50 / 254.02 across
SSM_DECODE_FIX_RESULTS + THROUGHPUT_B_P2a, within ~1.6%. Magnitude CONFIRMED. TRAP: 146->257 is
+76% (146->254 = +74%), NOT +66%. "66%" is the % of vLLM (257/391 = 65.7%), not the speedup.
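A minimal sketch of the two ratios that got conflated here. The helper names are illustrative (not from any bench script); the inputs are the t/s figures quoted above.

```python
def speedup_pct(before, after):
    """Relative gain of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


def pct_of_reference(value, reference):
    """`value` expressed as a percentage of `reference`."""
    return value / reference * 100.0


# t/s figures quoted in this section
baseline, post_fix, vllm_ref = 146.23, 256.57, 391.0

print(f"speedup over baseline: +{speedup_pct(baseline, post_fix):.1f}%")
print(f"share of vLLM ref:     {pct_of_reference(post_fix, vllm_ref):.1f}%")
```

The first number is the speedup; the second is the percent-of-vLLM that the "+66%" label actually describes.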
### CLAIM 3 - P2a GEMM-remap FLAT on decode
THROUGHPUT_B_P2a: dense npl128 252.50->254.02 (+0.6% noise), npl32 -0.4%, MoE flat; FP4 GEMM
kernel itself -24.7%, PREFILL +12.7%. Pre-SSM corroborated by THROUGHPUT_B_P1. => CONFIRMED.
### CLAIM 4 - 65% of vLLM (254 vs 391)
254/391 = 64.96%, 256.57/391 = 65.6%; vLLM 391 = enforce_eager apples ref. => CONFIRMED.
### Traps checked
GGML_CUDA_DISABLE_GRAPHS set `=1` explicitly (not the empty-value trap); graphs ON/OFF within
noise. memcpy-in-compute lumping AVOIDED (separate table sums). Decomp reps are ntg24-under-nsys
(S_TG 149/164) - valid for SHARES only; throughput correctly from unprofiled ntg128 logs.
### Net verdict
1 pre-fix decomp CONFIRMED exact; D2D collapse CONFIRMED (stronger); Step-2 0.7%/48% split
UNVERIFIED (producer script crashed, only post-fix sqlite is Step-1). 2 magnitude CONFIRMED,
"+66%" label REFUTED (true +76%; 66% = % of vLLM). 3 CONFIRMED. 4 CONFIRMED.
---
## Section: weight-bandwidth (whole-step DRAM budget, READ-ONLY math)
Agent label: weight-bandwidth. Method: exact GGUF tensor accounting (q36-27b-nvfp4,
arch qwen35, 64 layers) + activation-state math + existing nsys/decode_decomp; no GPU started.
Config = the production decode number: llama-batched-bench -fa on -npp128 -ntg128 -npl 128
(B = n_parallel = 128 sequences, S_TG = 254 t/s post-0019). GB10 LPDDR5x peak ~273 GB/s.
### Exact per-step DRAM byte budget at B=128 (ctx avg ~192 over the ntg128 window)
NVFP4 type-40 = 0.5625 B/weight (4-bit data + e4m3 per-16 micro-scale; verified: 5120*48*0.5625=138240).
WEIGHTS (read ONCE per step, shared across all 128 seqs):
- NVFP4 layer weights (type40, 64 layers): 13,062.7 MB = 12.76 GB
(per SSM layer 205.6 MB x48 = 9867.7 MB ; per full-attn layer 199.7 MB x16 = 3195.0 MB)
- LM head output.weight: type 30 = **bf16, NOT quantized** = 2425 MB = 2.37 GB (read in full each step)
- per-layer norms/conv1d/ssm_a/dt_bias (type0 f32): 10.1 MB
- token_embd: EXCLUDED (get_rows gathers only 128 rows, negligible)
=> WEIGHTS TOTAL = 15.14 GB / step
PER-SEQUENCE STATE (x128 seqs, read + write every step):
- SSM recurrent state: inner_size(6144) x state_size(128) x 4B(f32) = 3.0 MB / layer / seq
x 48 SSM layers x 128 seq = 18.43 GB read + 18.43 GB write = **36.86 GB / step**
- conv state: conv_k(4) x conv_dim(10240) x 4B = 160 KB / layer / seq
x 48 x 128 = 0.96 GB read + 0.96 GB write = 1.92 GB / step
- KV cache (16 full-attn layers, GQA n_kv_head=4, k+v_len=512, f16):
4096 B/tok/layer x 16 x ~192 ctx x 128 seq = ~1.6 GB read / step
TOTAL ~= 15.14 (W) + 36.86 (SSM state) + 1.92 (conv) + 1.6 (KV) = **~55.5 GB / step**
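The per-sequence state terms above can be re-derived from the quoted shapes. This is a sketch under assumptions: the weights total is taken as given (not recomputed from the GGUF), `to_gb` uses MiB/1000, which appears to be the convention behind the quoted figures (my reading, not stated in the source), and the function names are hypothetical.

```python
MIB = 1024 * 1024
N_SEQS = 128


def to_gb(n_bytes):
    # ASSUMED convention: MiB / 1000, chosen because it reproduces 36.86 and 1.92.
    return n_bytes / MIB / 1000.0


def per_step_bytes(bytes_per_layer_per_seq, n_layers, n_seqs, passes):
    """passes=2 for state that is read and written each step, 1 for read-only."""
    return bytes_per_layer_per_seq * n_layers * n_seqs * passes


weights_gb = 15.14  # WEIGHTS TOTAL line above, taken as given
ssm_gb = to_gb(per_step_bytes(6144 * 128 * 4, 48, N_SEQS, passes=2))   # f32 recurrent state
conv_gb = to_gb(per_step_bytes(4 * 10240 * 4, 48, N_SEQS, passes=2))   # f32 conv state
kv_gb = to_gb(per_step_bytes(4096 * 192, 16, N_SEQS, passes=1))        # ~192 avg ctx, read only

total_gb = weights_gb + ssm_gb + conv_gb + kv_gb
print(f"ssm={ssm_gb:.2f} conv={conv_gb:.2f} kv={kv_gb:.2f} total={total_gb:.1f} GB/step")
```

Under this convention KV lands slightly below the quoted ~1.6 GB (it is ~1.61 in decimal GB); either way the total rounds to ~55.5 GB/step.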
### Floor vs measured -- decode is NOT at the bandwidth floor
Bandwidth floor = 55.5 GB / 273 GB/s = **203 ms/step**
Measured llama = 128 tok / 254 t/s = **504 ms/step** => **2.48x the floor** (eff BW 110 GB/s = 40% of peak)
vLLM 391 t/s = 128 / 391 = **327 ms/step** => 1.61x the floor (eff BW 170 GB/s = 62% of peak)
The SAME 55.5 GB/step floor applies to vLLM: identical NVFP4 weights, and its
fused_recurrent_gated_delta_rule reads+writes the identical f32 recurrent state. So both engines
face the same DRAM wall; vLLM simply moves those bytes at 62% of peak vs llama's 40%. The 62/40 =
1.55x utilization gap matches the 254->391 (1.54x) throughput gap, as it must when both engines move the same bytes per step. => Decode-parity is a
bandwidth-UTILIZATION / launch-serialization problem, NOT a DRAM-traffic-volume problem. Bandwidth
is not the binding constraint (we sit 2.5x above the floor); confirms the GDN-kernel section above.
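The floor / multiple-of-floor / effective-bandwidth figures above follow from three inputs. A minimal sketch (`step_stats` is an illustrative name; inputs are the numbers quoted in this section):

```python
def step_stats(traffic_gb, peak_gb_s, batch, tok_s):
    """Bandwidth floor vs measured step time for one decode step."""
    floor_ms = traffic_gb / peak_gb_s * 1e3
    step_ms = batch / tok_s * 1e3
    eff_bw = traffic_gb / (step_ms / 1e3)
    return {
        "floor_ms": floor_ms,
        "step_ms": step_ms,
        "x_floor": step_ms / floor_ms,
        "eff_bw_gb_s": eff_bw,
        "util": eff_bw / peak_gb_s,
    }


for name, tok_s in (("llama", 254.0), ("vLLM", 391.0)):
    s = step_stats(55.5, 273.0, 128, tok_s)
    print(f"{name}: {s['step_ms']:.0f} ms/step, {s['x_floor']:.2f}x floor, "
          f"{s['eff_bw_gb_s']:.0f} GB/s ({s['util']:.0%} of peak)")
```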
### Traffic composition is STATE-dominated at B=128 (qualifies the "weight-quant" verdict)
SSM state r+w = 66% of step traffic; weights = 27%; conv = 3.5%; KV = 3%.
At B=128 weights are a minority of traffic, and we are 2.5x above the floor anyway -> NVFP4-dense
weight quant (the QWEN36_NVFP4 verdict's lever) cannot move batch-128 decode much. Weight-quant
helps PREFILL (compute/weight-bound, already +12.7% from the GEMM remap) and LOW-batch decode.
Cross-check at B=32: traffic ~25.2 GB/step (weights now 60%), floor 92 ms, measured 189 ms = 2.05x
floor. The sublinear scaling 32->128 (4x batch, only 1.5x throughput: 169->254) is fully explained
by per-seq state traffic growing with B while weights stay amortized -> at B=128 the step has become
state-traffic-heavy but is STILL 2.5x off the floor, i.e. latency/overlap-bound, not byte-bound.
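The B=32 cross-check can be reproduced by splitting traffic into a batch-independent weight term and a per-sequence term. Sketch only: it assumes the same ~192 average context at B=32 as at B=128 (the source does not say so), and `traffic_gb` is an illustrative name.

```python
WEIGHTS_GB = 15.14                           # read once per step, shared by all sequences
PER_SEQ_GB = (36.86 + 1.92 + 1.6) / 128      # SSM state + conv state + KV, per sequence
PEAK_GB_S = 273.0


def traffic_gb(batch):
    """Per-step DRAM traffic: fixed weight read plus state that scales with batch."""
    return WEIGHTS_GB + batch * PER_SEQ_GB


for batch, tok_s in ((32, 169.0), (128, 254.0)):
    t = traffic_gb(batch)
    floor_ms = t / PEAK_GB_S * 1e3
    step_ms = batch / tok_s * 1e3
    print(f"B={batch}: {t:.1f} GB/step, weights {WEIGHTS_GB / t:.0%}, "
          f"floor {floor_ms:.0f} ms, measured {step_ms:.0f} ms = {step_ms / floor_ms:.2f}x")
```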
### Redundant traffic llama reads that vLLM avoids (cut list, by impact)
1. (HISTORICAL, FIXED by 0018) Redundant DtoD recurrent-state copy = +18.4 GB/step EXTRA
(pre-fix decode_decomp: MEMCPY-DtoD 18.9%, 80 copies/step ~230 MB each = 18.4 GB; nsys window
356 GB/19.8 steps). This doubled state traffic and was the dominant pre-fix waste. Verified gone
post-fix: the THROUGHPUT_B_P2a A/B kernel sum (npp128 ntg24 npl128) lists gated_delta_net /
mul_mat_q / quantize but NO MEMCPY-DtoD term. (The committed ~/bench/a2_nsys sqlites are all
PRE-fix S_TG~149 traces; re-profiling deferred to the designated profiler.) This single removal
(18.4 GB/273 ~= 67 ms/step of bytes plus the killed overlap stalls) is the bulk of 146->254.
2. conv state as a SEPARATE ssm_conv kernel + separate buffer: 1.92 GB r+w/step AND 48 extra kernel
launches/step. vLLM folds the causal conv into its recurrence kernel. Cut ~= 7 ms bytes + 48
launches/step of serialization.
3. Residual get_rows gather post-0019 (~0.7%, decode_decomp pre-fix k_get_rows was 21.9% / ~96
ops/step = 2/SSM-layer): vLLM indexes the per-seq state in-kernel; llama still does a small
gather/scatter. ~0.13 GB. 0019 already folds most of it; fold the identity check fully into the
recurrence prologue.
4. quantize_mmq_nvfp4: 448 ops/step re-quantizing activations to NVFP4 before each FP4 matmul.
Activation BYTES are negligible, but it is 448 extra kernel launches/step that vLLM fuses into
the GEMM prologue -> pure launch latency, not traffic.
5. NOT redundant: weight bytes (identical NVFP4 to vLLM), SSM-state r+w (inherent, vLLM pays it),
NVFP4 scale scalars (8 B/tensor). Note the LM head is bf16 not quantized (2.37 GB/step, 16% of
weight traffic) -- fp8 LM head would save ~1.2 GB/step but only matters if vLLM also quantizes it.
### Bottom line (weight-bandwidth)
At B=128, decode moves ~55.5 GB/step and runs at 2.48x the 273 GB/s floor (40% util) vs vLLM's 1.61x
(62% util). Same bytes, same floor for both engines -> decode is bandwidth-UTILIZATION-bound, not
traffic-bound. There is NO large redundant-byte stream left to cut post-0018/0019 (the 18.4 GB/step
DtoD redundancy is already gone); the remaining 254->391 is recovered by raising achieved bandwidth
(occupancy + prefetch on the GDN state loads, conv fusion to drop 48 launches/step) so the EXISTING
55.5 GB/step moves at vLLM's 62% instead of 40%. Weight-quant (NVFP4-dense) is a PREFILL / low-batch
lever, largely orthogonal to the batch-128 decode-parity gap.
---
## Section: explore-other-levers (broad sweep for OTHER llama-specific decode inefficiencies) - READ-ONLY, no GPU
Scope handoff: GDN-kernel internals -> `gdn-source-compare`; host loop / graphs / gaps ->
`per-token-latency`; weight-byte / utilization -> `weight-bandwidth` section above (which already
covers the BF16 lm_head and the "same bytes, 40% vs 62% util" framing - I concur, no need to repeat).
This section covers the levers NONE of those own: the FP4 act-quant fusion, the M=128-vs-M=1 ggml
fusion gate, TMA scoping, and the conv-state residual.
**Terminology fix that matters for the whole doc:** in this repo's benches **"fusion OFF" means
`LLAMA_FUSE_NVFP4_QUANT=0`** (Track A's NVFP4 act-quant producer), confirmed in
`a2_nsys.sh`/`a2_4cell.sh`/`trackA_clean.sh`. It does NOT set `GGML_CUDA_DISABLE_FUSION`, so the
**standard ggml-cuda elementwise/GLU/rope fusion is ON** in every result. The header's "fusion OFF
baseline" is only about the act-quant producer.
**Framing (consistent with the sections above, sharpened):** the binder is bandwidth-UTILIZATION /
the kernel-dependency chain, not traffic or per-kernel compute (P2a -24.7% GEMM and graphs both
flat). The thing that raises utilization AND shortens the chain is the same: **fewer, fused kernels
per step** - removing whole passes vLLM doesn't run. So rank by "whole pass eliminated", not "us
shaved".
### L1. Re-test Track A act-quant fusion (`LLAMA_FUSE_NVFP4_QUANT=1`) POST-SSM. [impact ~8-11%, tractability HIGH - code exists, owned by tasks 38-41]
`quantize_mmq_nvfp4` is a standalone full-activation requantize run once per NVFP4 GEMM at M=128
(the weight-bandwidth section counts 448 such launches/step). vLLM has **zero** equivalent:
`rms_quant_fusion.py:98` folds it into RMSNorm, `act_quant_fusion.py:40,128` into SiLU+mul - the
activation never hits a temp buffer. Track A built exactly this fused producer (tasks 38-40 DONE),
but `LLAMA_FUSE_NVFP4_QUANT=1` regressed, and EVERY post-SSM bench ran with it OFF. **The regression
is likely stale:** pre-SSM the GPU was 99% busy on the state-copy chain, so folding act-quant into
the norm only relocated busy work into a lower-occupancy kernel with no idle to reclaim. Post-SSM the
chain has real idle and removing 448 launches/step both shortens the dependency chain and lifts
utilization - exactly the post-0018/0019 bind. Highest-value CLEAN removal; needs only a re-bench
(re-run `trackA_clean.sh` on the post-0019 build), not new code. Do not treat the prior regression
as final.
### L2. M=128 norm->matmul prologue fusion - the ggml fusion gate that does NOT fire at decode batch. [impact ~5-15% aggregate, tractability MEDIUM]
ggml-cuda's built-in `rms_norm+mul+mul_mat_vec_q` fusion (`ggml_cuda_should_fuse_mul_mat_vec_q`,
ggml-cuda.cu:2502) is gated to `dst->ne[1]==1` - it ONLY fires at **npl1** (M=1). At npl128
(`mul_mat_q`, M=128) it does NOT fire, so the per-layer RMSNorm stays a separate kernel feeding the
GEMM and the act path is unfused (L1). vLLM fuses both into the GEMM prologue at all M. This is the
M=128 generalization of the existing M=1 fusion + L1; largest aggregate surface but real kernel work.
Implies a regime split worth stating loudly: **npl1 single-stream latency already gets this fusion;
the npl128 throughput number does not** - tune the two separately.
### L3. TMA weight feed: a PREFILL / npl1-latency lever, NOT an npl128-decode lever.
Answering the brief's question (GEMM idle = FEED problem TMA fixes, or off-critical-path TMA can't?):
P2a cut GEMM compute and the freed time became IDLE, so at npl128 the GEMM finishes early and the
stall is BETWEEN kernels, not inside the GEMM waiting on weight tiles. TMA accelerates a
*feed-stalled* GEMM; at npl128 the GEMM is not the binder, so TMA won't move npl128 S_TG. It pays on
(a) **prefill** (compute/feed-bound; the remap already gave +12.7%) and (b) **npl1 decode**, a pure
weight-feed GEMV (full model / 273 GB/s ~ 19-20 tok/s ceiling). Scope TMA to prefill + low-batch
latency; do not bank it for batch-128 decode parity. (Consistent with the weight-bandwidth section's
"NVFP4-dense is a prefill/low-batch lever".)
### L4. In-place / `ids` conv-state - apply the 0018/0019 pattern to `ssm_conv`. [impact ~1-3%, tractability HIGH, proven pattern, bit-exact-able]
After the SSM fix the residual D2D is the conv-state copy (`build_conv_state`,
delta-net-base.cpp:449-525: `build_rs` reads 3 prior samples, `ggml_concat` the new token, writes
the last 3 back), plus `ssm_conv` (~0.8-1.5%) and a per-GDN-layer `concat_cont` (48/step). The exact
in-place + `ids`-read treatment from 0018/0019 applies to the conv state, and `ssm_conv`+`concat`
can fold into the GDN kernel prologue (it already has `ids` plumbing). Small ceiling but bit-exact,
low-risk, and removes ~48 launches/step from the chain - this is the "conv fusion to drop 48
launches/step" the weight-bandwidth section calls for, made concrete via the proven patch pattern.
### Deferred (covered by other sections, I concur)
- GDN occupancy / row-local reductions / gate-norm fusion -> `gdn-source-compare`. Add only: bf16
state halves the dominant traffic but vLLM keeps f32, so it is a divergence-from-reference, not a
parity lever - last priority, quality-risk.
- BF16 lm_head / weight-byte / 40%-vs-62% utilization -> `weight-bandwidth` section. lm_head NVFP4 is
an absolute ~1-2% trim, not a vLLM-relative gap (vLLM likely keeps it bf16 too).
- Full-attention KV path (16 attn layers, 0.4-1.8%, O(ctx) but tiny) -> CLOSED, not a lever.
### Bottom line (this section's net-new)
Ranked by "whole pass vLLM eliminated": **L1 (re-test act-quant fusion post-SSM - clean removable
pass, code already written, just needs a post-0019 re-bench)** > **L2 (M=128 norm/act prologue
fusion - biggest aggregate surface, real work)** > **L4 (conv-state in-place - cheap, proven 0018/0019
pattern, -48 launches/step)**. **L3 (TMA) is mis-scoped if aimed at npl128 decode** - it is a prefill
/ npl1-latency lever, same bucket as NVFP4-dense weight quant. Caveat inherited from
`validate-findings`: the post-SSM act-quant absolute share (L1) is on an unverified Step-2 decomp
(only clean post-fix sqlite is Step-1); re-measure on a clean Step-2 nsys when the profiler runs.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## Section: profile-both-engines (GROUND-TRUTH post-SSM nsys of llama AND vLLM at npl128) - THE GPU PROFILER
Agent label: profile-both-engines (the only GPU agent). Fresh post-SSM nsys traces of
BOTH engines at the same shape (128-seq decode, 128-token prompts), q36-27b-nvfp4 dense.
llama = `build-cuda-base` (no FP4 flag, byte-identical to stock, HEAD 46d7dd8 = patch 0019
SSM fix), `llama-batched-bench -npp128 -ntg32 -npl128 -fa on`, eager (DISABLE_GRAPHS=1) for
a clean per-kernel trace. vLLM = 0.23.0 in-process offline (`VLLM_ENABLE_V1_MULTIPROCESSING=0`
so cudaProfilerApi controls the worker), enforce_eager, max_num_seqs 256, 128 prompts.
Decode-only windows (prefill excluded), overlap-correct interval-union busy, GPU-accurate
per-call kernel durations. This is the post-SSM **Step-2** trace `validate-findings` flagged
as having no surviving sqlite - it now exists: `~/bench/postssm_decomp/`.
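For readers unfamiliar with "overlap-correct interval-union busy": kernels on concurrent streams can overlap, so summing durations over-counts. A minimal sketch of the union computation follows; it is not the code in `decode_decomp2.py` (which is not shown here), and the function names are hypothetical.

```python
def busy_time(intervals):
    """Total length of the union of (start, end) intervals; overlaps count once."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close out the previous merged run.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps (or touches) the current run: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def busy_pct(intervals, window_start, window_end):
    """GPU-busy share of a window, assuming all intervals lie inside it."""
    return busy_time(intervals) / (window_end - window_start) * 100.0


kernels = [(0, 4), (2, 6), (10, 12)]  # toy timestamps, not trace data
print(busy_time(kernels), sum(e - s for s, e in kernels))
```

On the toy input the union is shorter than the naive sum because the first two intervals overlap.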
### 0. THROUGHPUT GROUND TRUTH (un-profiled, prefill-subtracted) - resolves the 391 reference
The vLLM 391 reference is real and reproduced. Prefill-subtracted decode step (two-length
w16/w64 timing, in-process, batch 128):
| engine / mode | ms/step | decode tok/s | notes |
|--------------------------|---------|--------------|--------------------------------|
| llama post-SSM (graphs) | ~510-522| **245-251** | S_TG @npl128 ntg32 (this run) |
| vLLM enforce_eager | 324.9 | **394.0** | == the ~391 ref (h2h log 371-384)|
| vLLM cuda-graphs | 304.9 | **419.8** | graphs buy only +6% |
- **CUDA graphs are NOT the parity lever**: vLLM is already 394 t/s EAGER; graphs add +6%
(394->420). llama-batched-bench already runs WITH graphs at 245. So the gap is eager-vs-eager
kernel work, confirming `per-token-latency` and `A2_CUDAGRAPH_DECODE`.
- TRAP I hit and corrected: the FIRST vLLM nsys window (0.35-0.99) read 468 ms/step / 273 t/s -
WRONG, contaminated by prefill chunked-GDN kernels AND eager-nsys host overhead. The tight
decode-only window (0.62-0.98) reads **326.5 ms/step**, within 0.5% of the un-profiled 324.9 ms
-> the tight window is faithful; per-kernel numbers below use it.
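The source does not spell out the "two-length w16/w64" method. One plausible reading, ASSUMED here, is timing two runs that differ only in the number of decoded tokens (16 and 64) and dividing the time difference by the token difference, so the shared prefill cost cancels. The wall-clock inputs below are made up to exercise the arithmetic; they are not measurements.

```python
def decode_ms_per_step(t_short_s, t_long_s, n_short, n_long):
    """Per-step decode time with the common prefill cost cancelled out."""
    if n_long <= n_short:
        raise ValueError("n_long must exceed n_short")
    return (t_long_s - t_short_s) / (n_long - n_short) * 1e3


def decode_tok_s(ms_per_step, batch):
    """Aggregate decode throughput for `batch` sequences advancing one token per step."""
    return batch / (ms_per_step / 1e3)


# Synthetic timings (NOT measured): fixed prefill + n decode steps.
prefill_s, step_s = 2.0, 0.325
t16 = prefill_s + 16 * step_s
t64 = prefill_s + 64 * step_s

ms = decode_ms_per_step(t16, t64, 16, 64)
print(f"{ms:.1f} ms/step -> {decode_tok_s(ms, 128):.1f} tok/s")
```

The same `decode_tok_s` conversion links the table's ms/step and tok/s columns.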
### 1. POST-SSM per-step decode decomposition, SIDE BY SIDE (GPU-accurate, prefill-free)
Both at batch 128. llama 510 ms/step (98.7% GPU-busy), vLLM 326 ms/step (97.9% GPU-busy).
ms/step = on-device kernel time per real decode step (nsys host overhead does not inflate GPU
kernel duration; per-step = GPU-ms / real-step-count from the decode-only GDN call count).
| component (per step) | llama ms/step | llama % | vLLM ms/step | vLLM % |
|-----------------------------|---------------|---------|--------------|--------|
| GDN linear-attn recurrence | 193 (48x4.03) | 38% | 174 (48x3.62)| 53% |
| FP4 matmul + act-quant | **236** | **46%** | **117** | **36%**|
| - mul_mat_vec_q (GEMV) | 132 (48x2.75) | 26% | - | - |
| - mul_mat_q (GEMM) | 88 (448 calls)| 17% | cutlass 61 | 19% |
| - quantize_mmq_nvfp4 | 16 (448) | 3% | nvjet 53 + cvt 2 | 17% |
| full attention (16 layers) | 6.6 (16) | 1.3% | 6.2 (16) | 1.9% |
| SSM conv + glue/elementwise | ~45 | 9% | ~22 | 7% |
| MEMCPY (D2D+H2D) | 2.5 (131 MB) | 0.5% | 0.36 (85 MB) | 0.1% |
| **TOTAL** | **~510** | 100% | **~326** | 100% |
### 2. The three load-bearing comparisons (the brief)
**(1) GDN compute: llama vs vLLM = NOT the gap.** Per-call GPU duration:
llama `gated_delta_net_cuda<128>` = **4.03 ms/call**, vLLM
`fused_recurrent_gated_delta_rule_packed_decode` = **3.62 ms/call**. llama is only **+11%**
slower per call (+19 ms/step). GDN is comparable; it is the largest single kernel on BOTH sides
(38% llama, 53% vLLM) but it explains only ~19 ms of the 185 ms gap (~10%). This REFUTES the
framing that the GDN kernel is the dominant residual lever - it is a minor overage post-0018/0019.
(The `gdn-source-compare` occupancy/shuffle deltas are real but worth ~19 ms/step, not 1.5x.)
**(2) DRAM bytes/step: llama does NOT read more.** Explicit memcpy: llama **131 MB/step** vs
vLLM **85 MB/step** - llama moves a hair more in copies but both are <0.5% of the step. The big
per-layer state copies are GONE (pre-SSM 18 GB/step DtoD -> post-SSM 131 MB/step) - **the SSM fix
(0018/0019) is confirmed working in this trace.** Weight DRAM (read inside the GEMM/GEMV kernels,
not memcpy) is the SAME ~15 GB NVFP4 for both engines; at 273 GB/s that is a ~55 ms floor, and
BOTH engines sit far above it (326-510 ms), so BOTH are compute/kernel-bound, NOT
weight-bandwidth-bound, and llama reads no extra bytes. The 254-vs-391 gap is NOT a byte-volume
deficit - it is effective-bandwidth/compute-efficiency in the FP4 matmul kernels (see 3).
**(3) GPU-busy% / idle structure: identical, both ~98% busy.** llama 98.7% busy (1.3% idle),
vLLM 97.9% busy (2.1% idle). Neither engine is idle/gap/host-bound at npl128. The entire gap is
the GPU doing MORE kernel-time per step on llama: llama's non-GDN GPU work = ~310 ms/step vs
vLLM's ~146 ms/step. That 164 ms delta is concentrated in the FP4 matmul path.
### 3. THE single biggest llama-specific overage: the FP4 matmul path (+119 ms/step = 64% of the gap)
llama spends **236 ms/step** on FP4 matmul+quant; vLLM does ALL its matmul (cutlass FP4 GEMM +
cublas nvjet + act cvt) in **117 ms/step** - even though vLLM ALSO carries ~18 ms/step of extra
PyTorch eager elementwise glue that llama's fused ggml kernels avoid. llama is **2.0x slower on
FP4 matmul**, and that +119 ms is **64% of the entire 185 ms/step gap**.
Inside llama's FP4 path the dominant, untouched cost is **`mul_mat_vec_q` = 132 ms/step (26% of
decode), 48 calls/step (exactly one per GDN layer), 2.75 ms/call, grid 5120x128**. This is the
**FP4 GEMV ("vec_q") kernel running at decode batch 128** for the gated-DeltaNet in-projections -
a non-tensor-core, memory-bound-style kernel doing M=128 work without GEMM-grade weight-read
amortization. vLLM runs the equivalent projections through cutlass batched FP4 GEMM (tensor-core,
weight read amortized across the 128-row batch) at a fraction of the cost. **There is no
GEMV-at-batch-128 on the vLLM side at all.**
Key cross-check with Track B P2a: P2a optimized `mul_mat_q` (the 17%/88 ms tensor-core GEMM, made
it -24.7%) and decode stayed FLAT - because the BIG FP4 cost is `mul_mat_vec_q` (26%/132 ms),
which P2a never touched. **Track B optimized the wrong FP4 kernel.** The lever is to route the
GDN in-projection at M=128 through a tensor-core GEMM (mul_mat_q / MMQ) instead of the vec_q path,
and to fuse the act-quant (L1) + the norm prologue (L2) so the 448 `quantize_mmq_nvfp4` launches
fold away - exactly what `explore-other-levers` L1/L2 propose. My measurement RANKS them: the
mul_mat_vec_q->GEMM routing is the single highest-value target (132 ms), then act-quant fusion
(16 ms + 448 launches), then the GDN +19 ms.
### 4. Reconciling with the `weight-bandwidth` section (unification, not contradiction)
weight-bandwidth concluded "same 55.5 GB/step, llama 40% util vs vLLM 62% util -> utilization-bound."
My per-kernel data LOCALIZES that utilization gap: it lives in the **FP4 matmul kernels** (which
do the bulk of the ~15 GB weight read), NOT in the GDN state traffic. GDN moves its (equal) state
bytes at comparable rate on both engines (4.03 vs 3.62 ms/call). So the "40% vs 62%" is the
`mul_mat_vec_q`/`mul_mat_q` weight-read efficiency vs cutlass FP4 GEMM. Raising decode parity =
raise the FP4-matmul achieved bandwidth (tensor-core GEMM routing + act/norm prologue fusion),
not the GDN kernel and not byte-cutting.
### Verdict (profiler)
- Reproduced both engines at their true operating points: llama 245 / vLLM 394 eager / 420 graphs.
Graphs are not the lever (+6%). Both engines ~98% GPU-busy; gap is GPU kernel-time, not idle/host.
- GDN compute is comparable (llama +11%/call, +19 ms/step) - NOT the dominant residual.
- bytes/step: llama does not read more (131 vs 85 MB memcpy; identical weight bytes); SSM fix's
18 GB/step DtoD removal CONFIRMED in-trace.
- **The single biggest llama-specific overage is the FP4 matmul path: 236 vs 117 ms/step (+119 ms
= 64% of the 185 ms gap), dominated by `mul_mat_vec_q` (FP4 GEMV at batch 128, 132 ms/step, 26%,
one per GDN layer).** Highest-value lever = route the GDN in-projection through a tensor-core FP4
GEMM at M=128 + fuse act-quant/norm prologue (L1/L2). Track B optimized the wrong FP4 kernel.
### Evidence (DGX, this agent)
- `~/bench/postssm_decomp/postssm_base.{nsys-rep,sqlite,gpu_trace.csv,run.log}` (llama post-SSM).
- `~/bench/postssm_decomp/vllm_decode.{nsys-rep,gpu_trace.csv}` (vLLM eager decode trace).
- `~/bench/postssm_decomp/vllm_decode_g1.*` (vLLM graphs run), `~/bench/vllm_tps.py` (throughput).
- Scripts: `~/bench/postssm_llama_decomp.sh`, `~/bench/vllm_nsys_run.sh`, `~/bench/decode_decomp2.py`
(decode-only windowed, overlap-correct, MB-memcpy, per-step reconstruction).
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## Section: SYNTHESIS (cross-check + ground-truth + ranked levers + verdict) - FINALIZED
Agent label: synthesize. Read-only (no GPU). Cross-checks all sections above against the
fresh `profile-both-engines` ground-truth, then mechanism-confirms the dominant lever by
reading the model graph + ggml-cuda dispatch source on the DGX (`~/llama-paged-dev`, HEAD
46d7dd8 = patch 0019). All throughput vs the vLLM 391 t/s eager apples-to-apples reference.
### 0. Headline
Post-SSM dense decode = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact. The residual is
NOT a hardware/architecture floor and NOT the GDN recurrence kernel, the host loop, CUDA
graphs, or DRAM byte-volume. It is ONE concrete, llama-specific kernel-routing defect:
**the gated-DeltaNet output projection (`ssm_out`) runs as an FP4 GEMV (`mul_mat_vec_q`)
at decode batch 128 instead of a tensor-core FP4 GEMM (MMQ), costing 132 ms/step = 26% of
decode = the single biggest overage vs vLLM (which packs the same projection into a cutlass
M=128 GEMM).** The fix is a ~2-line reshape, bit-exact, and is the highest-value next step.
### 1. Cross-check: which prior findings HELD, were REFUTED, or are SUPERSEDED
HELD (confirmed by both the adversarial re-derivation and the fresh profile):
- Pre-fix decomposition (gated_delta_net 23.4%, k_get_rows 21.9%, MEMCPY-DtoD 18.9% / 382 GB,
mul_mat_vec_q 15.5%, mul_mat_q 10.5%): reproduced to <=0.1pp (validate-findings).
- SSM-fix D2D collapse: the 18.4 GB/step redundant recurrent-state copy is GONE. Confirmed
three ways: validate (18.9% -> 0.008% on the post-fix sqlite), weight-bandwidth (A/B kernel
sum lists no DtoD term), and IN-TRACE by the profiler (18 GB/step DtoD -> 131 MB/step). The
SSM fix (0018/0019) is the real breakthrough and is working.
- P2a FP4-GEMM occupancy remap FLAT on decode (+0.6% noise) while the `mul_mat_q` kernel itself
shrank -24.7% and prefill rose +12.7%: confirmed. Decode is not GEMM-occupancy-bound.
- 65% of vLLM (254/391 = 64.96%, 256.6/391 = 65.6%): confirmed.
- Decode is NOT at the bandwidth floor: 55.5 GB/step moved at 2.48x the 273 GB/s floor (40% util)
vs vLLM 1.61x (62% util) on the SAME bytes. Confirmed + LOCALIZED below.
- Host loop / 64-layer serialization is NOT the lever: both engines ~98% GPU-busy at npl128
(llama 98.7%, vLLM 97.9%); the entire exposed-idle budget is ~0.65%. Confirmed by the profiler.
- CUDA graphs are NOT the lever: vLLM is 394 t/s EAGER, graphs add only +6% (420); llama already
runs with graphs. Confirmed by the profiler.
REFUTED / CORRECTED:
- "GDN recurrence kernel is the dominant residual lever" (the STATE brief's "gated_delta_net
1.46 ms/call, the largest single kernel" and the gdn-source-compare framing): REFUTED. The
profiler's fresh side-by-side per-call duration is llama 4.03 ms vs vLLM 3.62 ms = only +11% /
+19 ms/step = ~10% of the 184 ms gap. It IS the largest single kernel on both sides (38% llama,
53% vLLM) but the largest GAP is elsewhere. (The brief's "1.46 ms/call" is a stale/narrower
window; the authoritative post-SSM per-call is 4.03 ms.) gdn-source-compare's occupancy/shuffle/
fusion anatomy is correct but addresses a SECONDARY +19 ms target, not parity.
- "+66% SSM-fix gain" label: REFUTED. 146 -> 254-257 is +74 to +76%; "66%" is the percent-of-vLLM,
not the speedup (validate-findings).
SUPERSEDED (the gap validate-findings flagged, now filled by real data):
- The "FP4-GEMM ~48% / get_rows 0.7% / GDN 22.5%" Step-2 split had NO surviving sqlite (the
producer script crashed; only a Step-1 build was on the box). The profiler's fresh Step-2 trace
replaces it with a FINER, load-bearing breakdown: the ~46% "FP4 matmul" bucket is NOT one GEMM
family - it splits into `mul_mat_vec_q` 26% (the o_proj GEMV, the real culprit), `mul_mat_q` 17%
(the tensor-core GEMM P2a already optimized), and `quantize_mmq_nvfp4` 3%. Lumping them as
"48% FP4-GEMM" hid that Track B P2a optimized the 17% MMQ while the 26% MMVQ was the bind. This
is why P2a was flat on decode: **it optimized the wrong FP4 kernel.**
### 2. Ground-truth per-step decode decomposition + the single biggest overage
From the profiler's fresh post-SSM eager nsys, both at batch 128, prefill-free, GPU-accurate:
| component (per decode step) | llama ms | llama% | vLLM ms | vLLM% | gap (llama-vLLM) |
|-----------------------------|----------|--------|---------|-------|------------------|
| GDN recurrence kernel | 193 | 38% | 174 | 53% | **+19** |
| FP4 matmul + act-quant | 236 | 46% | 117 | 36% | **+119** |
| - mul_mat_vec_q (o_proj GEMV) | **132** | **26%** | 0 | - | **+132** |
| - mul_mat_q (MMQ GEMM) | 88 | 17% | 61 (cutlass) | 19% | +27 |
| - quantize_mmq_nvfp4 | 16 | 3% | 55 (nvjet+cvt)| 17% | -39 |
| full attention (16 layers) | 6.6 | 1.3% | 6.2 | 1.9% | +0.4 |
| SSM conv + glue/elementwise | 45 | 9% | 22 | 7% | +23 |
| MEMCPY | 2.5 | 0.5% | 0.36 | 0.1% | +2 |
| **TOTAL** | **~510** | 100% | **~326**| 100% | **+184** |
The +119 ms FP4-matmul gap is dominated by the `mul_mat_vec_q` o_proj GEMV (+132); the MMQ adds
+27, and both are partly offset by llama being -39 on activation-quant (16 vs vLLM's heavier eager 55). So
the one lever that matters is the +132 ms/step o_proj GEMV; everything else nets to ~+52 ms.
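The per-row gaps can be recomputed from the table. Sketch only: the dictionary keys are shorthand invented for this example, and the values are the table's rounded ms/step.

```python
# ms/step per component, copied from the table above
llama = {"gdn": 193, "mmvq": 132, "mmq": 88, "act_quant": 16,
         "attn": 6.6, "glue": 45, "memcpy": 2.5}
vllm = {"gdn": 174, "mmvq": 0, "mmq": 61, "act_quant": 55,
        "attn": 6.2, "glue": 22, "memcpy": 0.36}

gaps = {k: llama[k] - vllm[k] for k in llama}
fp4_gap = gaps["mmvq"] + gaps["mmq"] + gaps["act_quant"]
listed_gap = sum(gaps.values())

for name, g in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {g:+7.1f} ms/step")
print(f"FP4 path   {fp4_gap:+7.1f}")
print(f"all rows   {listed_gap:+7.1f}  (llama {sum(llama.values()):.1f}, vLLM {sum(vllm.values()):.1f})")
```

By this arithmetic the named rows sum to about 483 and 319 ms, a little under the ~510 and ~326 totals, so roughly 20 ms of the +184 is not attributed to any named row.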
**MECHANISM (confirmed by source read, not inferred).** In the dense Qwen3.5-27B GDN block
(`src/models/qwen3next.cpp` `build_recurrent`), the recurrent core keeps the SSM layout
`[feat, n_seq_tokens, n_seqs]`. At decode `n_seq_tokens=1, n_seqs=128`. The output projection is:
```cpp
// current code (qwen3next.cpp, end of the GDN block)
ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm,
head_v_dim * num_v_heads, n_seq_tokens, n_seqs); // [6144, 1, 128]
cur = build_lora_mm(model.layers[il].ssm_out, final_output); // <-- the matmul
cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs); // collapse AFTER
```
`final_output` is 3D `[6144, n_seq_tokens=1, n_seqs=128]`, so `src1->ne[1] = 1`. The ggml-cuda
dispatch (`ggml-cuda.cu:2553`) picks MMVQ when `src1->ne[1] <= MMVQ_MAX_BATCH_SIZE (8)`, with the
128 sequences carried in `ne[2]`. Result: a per-sequence FP4 GEMV, output rows 5120 x 128 seqs =
**`mul_mat_vec_q`, grid 5120x128, 48 calls/step (one per GDN layer)** - matching the profiler's
trace exactly. MMVQ does NOT amortize the `ssm_out` weight read into shared memory across the 128
sequences (it is built for batch <=8), so each of the 128 sequences re-streams the weight tiles -
the "40% vs 62% utilization" the weight-bandwidth section measured lives HERE, in this kernel, not
in the GDN state traffic. vLLM packs all 128 decode tokens into one cutlass M=128 GEMM (its GDN
kernel is literally `..._PACKED_decode`), so it has NO GEMV-at-batch-128 at all.
This also pins WHY it is decode-specific: at prefill the tokens are in `ne[1]` (n_seq_tokens=prompt
len), so `ne[1] >> 8` -> MMQ already; only the decode layout (128 seqs x 1 token, batched in ne[2])
trips the GEMV path. The in-projection (`wqkv`) is unaffected: its input is the 2D residual stream
`[n_embd, 128]` (reshaped to 3D only AFTER the matmul), so `ne[1]=128` -> MMQ today. The o_proj is
the unique 3D-input matmul, which is exactly why the profiler counted one MMVQ per GDN layer (the
profiler section attributed that GEMV to the "in-projections"; per this source read it is the
output projection).
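A toy model of the routing described above, to make the shape dependence concrete. It encodes ONLY the single condition quoted in the text (`src1->ne[1] <= 8` -> MMVQ); the real ggml-cuda dispatch checks more than this, and `pick_kernel` / `collapse_to_2d` are hypothetical names, not ggml functions.

```python
MMVQ_MAX_BATCH_SIZE = 8  # threshold quoted in the text


def pick_kernel(src1_ne):
    """src1_ne = (ne0, ne1, ne2). Simplified: only the ne[1] batch test is modeled."""
    return "mul_mat_vec_q" if src1_ne[1] <= MMVQ_MAX_BATCH_SIZE else "mul_mat_q"


def collapse_to_2d(ne):
    """Fold the sequence dim into the token dim: [feat, tok, seq] -> [feat, tok*seq, 1]."""
    ne0, ne1, ne2 = ne
    return (ne0, ne1 * ne2, 1)


decode_3d = (6144, 1, 128)   # n_seq_tokens=1, n_seqs=128
prefill_3d = (6144, 128, 1)  # tokens already in ne[1]

print("decode, 3D input :", pick_kernel(decode_3d))
print("decode, 2D input :", pick_kernel(collapse_to_2d(decode_3d)))
print("prefill          :", pick_kernel(prefill_3d))
```

Only the decode layout with the batch in `ne[2]` falls under the threshold; collapsing to 2D before the matmul moves the same 128 rows into `ne[1]`.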
### 3. Ranked remaining decode levers (impact x tractability, cumulative ceiling toward 391)
Anchored on llama 256.6 t/s (499 ms/step) -> vLLM 391 (327 ms/step), gap 172 ms/step. Recover
figures past Lever 1 are ESTIMATES (the profiler measured the costs, not the post-fix kernels);
each needs a confirming re-profile. Ceilings are cumulative.
| # | lever | targets (ms/step) | est. recover | cumulative decode_agg | % of vLLM | tractability |
|---|-------|-------------------|--------------|-----------------------|-----------|--------------|
| 1 | **o_proj MMVQ -> MMQ** (collapse final_output to 2D before `ssm_out`) | vec_q 132 | ~100-110 | ~320-330 | **~82-85%** | **VERY HIGH** (2-line reshape, bit-exact, MMQ already proven on NVFP4 at M=128 by the in_proj) |
| 2 | act-quant + norm prologue fusion (explore L1 `LLAMA_FUSE_NVFP4_QUANT=1` re-bench + L2 M=128 gate) | quant 16 + 448 launches/step | ~15-25 | ~345-360 | ~88-92% | MED-HIGH (producer code exists, tasks 38-40; needs post-0019 re-bench, the pre-SSM regression is stale) |
| 3 | GDN-area fusion + occupancy (gdn A-D: row-local reduction, raise launch_bounds occupancy, fold gate/l2norm/softplus into the recurrence) | GDN +19 + glue +23 | ~25-40 | ~375-388 | ~96-99% | MED-LOW (real kernel rewrite + numeric re-validation) |
| 4 | conv-state in-place + conv fuse (explore L4, the proven 0018/0019 pattern on `ssm_conv`/concat) | part of glue, 48 launches/step | ~5-10 | ~388-395 | ~99-101% | HIGH (bit-exact, proven pattern) |
| - | between-step host gap / cgraph reuse | ~2 ms/step | ~2 | +~0.4% | n/a | LOW value (cleanup, not a parity lever) |
| x | CUDA graphs | - | 0 | already on | n/a | NOT a lever (+6% even for vLLM) |
| x | TMA weight-feed / NVFP4-dense weight-quant | prefill / npl1 | 0 at npl128 | n/a | n/a | MIS-SCOPED for batch-128 decode (prefill / low-batch levers; prefill already +12.7%) |
Note on Lever 1+2 coupling: routing the o_proj to MMQ ADDS one activation-quant (q8_1/NVFP4) per
o_proj, so Lever 2 (fusing that quant into the preceding `build_norm_gated`) compounds Lever 1
rather than overlapping it. Lever 3's "glue +23 ms" and Lever 1's quant are the same elementwise
passes vLLM folds into its packed kernel, so 2+3 share surface - treat the estimates as a band,
not a sum.
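How a "recovered ms/step" estimate maps to the throughput columns, shown for Lever 1 only (the later rows are bands that share surface, so a straight sum is not appropriate). Sketch with an illustrative helper name; the inputs are the anchor figures above.

```python
def tok_s_after(base_tok_s, recovered_ms, batch=128):
    """Throughput after shaving `recovered_ms` off each decode step."""
    base_ms = batch / base_tok_s * 1e3
    new_ms = base_ms - recovered_ms
    if new_ms <= 0:
        raise ValueError("recovered more time than the step takes")
    return batch / new_ms * 1e3


VLLM_REF = 391.0
for recovered in (100, 110):  # Lever 1 estimate band, ms/step
    t = tok_s_after(256.6, recovered)
    print(f"recover {recovered} ms/step -> {t:.0f} t/s = {t / VLLM_REF:.0%} of vLLM")
```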
### 4. Verdict: is true decode parity reachable?
**Yes, parity is reachable in software, and the residual is NOT a hardware/architecture floor.**
Proof of "not a floor": both engines read identical NVFP4 weights and read+write identical f32
recurrent state = identical 55.5 GB/step DRAM floor (203 ms) on the identical GB10 LPDDR5x; vLLM
achieves 62% bandwidth utilization (327 ms/step) where llama achieves 40% (499 ms/step). The 1.54x
throughput gap equals the 1.55x utilization gap, and that utilization gap is now LOCALIZED to
specific llama kernels - chiefly the o_proj MMVQ - every one of which is closable in software. The
GDN recurrence (the supposed floor) is only +11%/call between the two engines.
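The utilization argument above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the tree: `bw_utilization` and `utilization_gap` are hypothetical helper names, and the inputs are the 203 ms floor and the 327 ms / 499 ms steps quoted in the paragraph.

```cpp
#include <cassert>
#include <cmath>

// Fraction of the shared DRAM floor that a measured step achieves.
// floor_ms and step_ms are per-step times in milliseconds.
double bw_utilization(double floor_ms, double step_ms) {
    assert(floor_ms > 0.0 && step_ms >= floor_ms);
    return floor_ms / step_ms;
}

// Ratio of the fast engine's utilization to the slow engine's.
// Because both engines share one floor, this is identical to the
// step-time (throughput) ratio slow_step_ms / fast_step_ms.
double utilization_gap(double floor_ms, double fast_step_ms, double slow_step_ms) {
    return bw_utilization(floor_ms, fast_step_ms) /
           bw_utilization(floor_ms, slow_step_ms);
}
```

With the quoted inputs, `bw_utilization(203, 327)` is about 0.62 and `bw_utilization(203, 499)` about 0.41. The floor cancels in `utilization_gap`, which is why the throughput gap and the utilization gap have to agree.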
How far each tier reaches:
- The first ~84% of parity (256 -> ~325) is nearly FREE: Lever 1 is a 2-line reshape that moves
the GDN output projection from a per-sequence FP4 GEMV to a tensor-core M=128 FP4 GEMM, bit-exact,
no new kernel (MMQ already runs the in-projection at this exact shape and type).
- ~84% -> ~92% (Levers 1+2) is low-effort: the fused act-quant producer already exists (tasks
38-40), it just needs a post-0019 re-bench because its pre-SSM regression was measured when the
GPU was 99% busy on the now-removed state-copy chain (no idle to reclaim then; real idle now).
- ~92% -> ~100% (Levers 3+4) is the diminishing-returns tail and the only genuinely HARD work:
matching vLLM's fully-fused `packed_decode` GDN kernel (row-local reductions, higher occupancy,
folding the gate/l2norm/softplus elementwise passes into the recurrence). This last ~8% is "hard
but not floored" - it is kernel engineering, not a hardware wall.
**Single highest-value next step (do this first):** apply Lever 1 - collapse `final_output` to 2D
`[head_v_dim*num_v_heads, n_seq_tokens*n_seqs]` BEFORE the `ssm_out` matmul (drop the now-redundant
post-matmul `reshape_2d`):
```cpp
// route the GDN output projection through tensor-core MMQ at decode:
// M = n_seq_tokens*n_seqs (=128 at decode) instead of ne[1]=1 -> MMVQ GEMV. Free, bit-exact.
ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm,
        head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cur = build_lora_mm(model.layers[il].ssm_out, final_output); // now [n_embd, n_tokens], M=128 MMQ
```
Then the profiler re-measures the realized o_proj-as-MMQ cost on a clean post-0019 nsys (the one
number this synthesis estimates rather than measures) and confirms the 256 -> ~320-330 lift. The
same 3D-input-matmul pattern almost certainly affects the MoE checkpoint (q36-35b-a3b) decode and
any other matmul that consumes a tensor still in the `[feat, 1, n_seqs]` SSM layout - grep those
and apply the same collapse. Levers 2-4 follow in priority order; none requires a model or accuracy
compromise, so bit-exactness is preserved throughout.
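To find the other affected matmuls, it helps to state the dispatch rule explicitly. The sketch below models only the `ne[1] <= 8` test cited in the evidence list; `pick_path`, `ne1_ssm_layout` and `ne1_collapsed` are hypothetical names, and the real dispatcher in `ggml-cuda.cu` may apply further conditions that are not modeled here.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the MMVQ_MAX_BATCH_SIZE value cited for mmvq.cuh (assumed to be 8).
constexpr int64_t kMmvqMaxBatch = 8;

enum class MatmulPath { MMVQ, MMQ };

// Simplified dispatch: only the ne[1] test, nothing else.
MatmulPath pick_path(int64_t ne1) {
    return ne1 <= kMmvqMaxBatch ? MatmulPath::MMVQ : MatmulPath::MMQ;
}

// ne[1] seen by the matmul when the activation is still [feat, n_seq_tokens, n_seqs].
int64_t ne1_ssm_layout(int64_t n_seq_tokens, int64_t /*n_seqs*/) {
    return n_seq_tokens;
}

// ne[1] after collapsing to 2D [feat, n_seq_tokens * n_seqs].
int64_t ne1_collapsed(int64_t n_seq_tokens, int64_t n_seqs) {
    return n_seq_tokens * n_seqs;
}
```

At decode (`n_seq_tokens = 1`, `n_seqs = 128`) the SSM layout yields `ne[1] = 1`, which selects the GEMV path, while the collapsed layout yields 128 and selects MMQ.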
### Evidence (this section)
- Source read (DGX `~/llama-paged-dev`, read-only): `src/models/qwen3next.cpp` (GDN in/out proj
layout, lines ~286-305 and ~518-528), `ggml/src/ggml-cuda/ggml-cuda.cu:2553` (MMVQ dispatch on
`ne[1]<=8`), `ggml/src/ggml-cuda/mmvq.cuh:3` (`MMVQ_MAX_BATCH_SIZE 8`), `mmq.cu:267` (NVFP4 is
MMQ-supported).
- All five prior sections of this doc + the profiler's `~/bench/postssm_decomp/` traces.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,532 +0,0 @@
# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — build-ready scope + honest go/no-go
Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** Track B is the
residual-kernel track after track A (fuse the standalone `quantize_mmq_fp4` activation-requant, the
8.2% decode bucket — tasks 38-41, the fused `rms_norm+mul+nvfp4-quant` producer + prequantized-MMQ
consumer) is handled separately. Track B owns the **weight GEMM**, the ~59% bucket.
**The load-bearing question, restated:** at the decode batch shape (M≈128 tokens fused into one
ubatch, NVFP4 weights), is the weight GEMM **compute-bound** (FP4-MMA throughput is the lever →
parity reachable with a better kernel) or **bandwidth-bound** (273 GB/s weight-read is a hard floor →
parity capped)? And given the GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach
vLLM's **391 (dense) / 811 (MoE)** decode-agg tok/s @npl128, or only partway?
Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x.
Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the
committed nsys decomposition + measured GB10 specs + a source read of the FP4-MMA kernel; **no new GPU
benchmarks were run** (track A is on the box).
## 0. Grounded inputs (measured, committed)
| quantity | value | source |
|---|---|---|
| LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` |
| LPDDR5x bandwidth (achieved, batch-1 weight read) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 study |
| FP4 (NVFP4/MXFP4) dense peak | **~427-500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 |
| BF16 / INT8 peak | ~213 TFLOP/s / ~215 TOPS (INT8 == BF16 on GB10) | same §2 |
| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); **~3% dense / ~35%-of-BW MoE at decode** | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file; ~18 GB matmul tensors | `du` on DGX |
| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step @npl128 (~98% experts hit) | `du` on DGX |
| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
| Measured per-step @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
| Aggregate decode @npl128 (the parity scoreboard) | dense **161** (llama) vs **391** (vLLM); MoE **333** vs **811** | `QWEN36_NVFP4_BENCH.md` |
`decode_agg = npl / step_s = 128 / step_s`. Crossover formula throughout:
`M* = b · peak / (2 · BW)`, `b` = bytes per weight element. Below `M*` bandwidth-bound, above it
compute-bound.
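The crossover follows from equating the two per-step times: compute time is `2·M·P / peak` and weight-read time is `b·P / BW` for `P` parameters, so the parameter count cancels. A minimal sketch, with hypothetical function names and the example values that §2a plugs in (`b = 0.667`, peak 500 TFLOP/s, 273 GB/s):

```cpp
#include <cassert>
#include <cmath>

// M* = b * peak / (2 * BW): the batch size where compute time equals weight-read time.
// bytes_per_weight: b; peak_flops: FLOP/s; bw_bytes_per_s: bytes/s.
double crossover_m(double bytes_per_weight, double peak_flops, double bw_bytes_per_s) {
    assert(bytes_per_weight > 0.0 && peak_flops > 0.0 && bw_bytes_per_s > 0.0);
    return bytes_per_weight * peak_flops / (2.0 * bw_bytes_per_s);
}

// True when a GEMM with M rows of activations sits below the crossover.
bool is_bandwidth_bound(double m, double bytes_per_weight,
                        double peak_flops, double bw_bytes_per_s) {
    return m < crossover_m(bytes_per_weight, peak_flops, bw_bytes_per_s);
}
```

`crossover_m(0.667, 500e12, 273e9)` evaluates to roughly 611, so `M = 128` falls on the bandwidth-bound side for an ideal kernel.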
---
## 1. The kernel-approach decision: TUNE the existing FP4-MMA `mul_mat_q`, do NOT write a cutlass kernel
This is the first thing track B must settle, and the evidence settles it decisively.
| option | verdict | why |
|---|---|---|
| **(A) Tune the existing `mul_mat_q<NVFP4>` FP4-MMA path** | **CHOSEN — the tractable spine** | The kernel already exists, is **bit-exact** (`test-backend-ops MUL_MAT` 1103/1103), is genuine **W4A4** (below), and already **beats vLLM at batch-1 prefill** (MXFP4 1153 t/s vs vLLM's 800 W4A16 — vLLM has no FP4 cubins on sm_121). The deficit is **decode-shape scheduling**, not the math op. Host-side selection + a bounded occupancy tune respects the GB10 lessons and is build-ready against known files/lines. |
| **(B) New cutlass-style SM120 FP4 collective** | **REJECTED** | Repeats the **proven GB10 dead-end**: the from-scratch W4A16 BF16 GEMM hit only ~9-15 TFLOP/s (¼ of MMQ) and was **STOPPED** (`W4A16_MARLIN_KERNEL_PLAN.md`) because deep `cp.async` + XOR-swizzle **collapse GB10 occupancy**. Worse, **CUTLASS's own SM120 grouped block-scaled FP4 GEMM is broken on consumer Blackwell** (garbage/init-fail, CUTLASS #3096/#2800); it is the exact reason vLLM falls back to **BF16 Marlin** for its MoE on sm_121. "Port cutlass" is not even a working option for the MoE arm. |
| **(C) Marlin-style W4A16 (FP4→BF16 dequant + BF16 HMMA)** | **REJECTED for the win, noted for context** | This is what **vLLM's MoE actually runs** on sm_121 (W4A16, BF16 activations, dequant-in-mainloop). On GB10 **INT8 == BF16 == ½ FP4 rate**, so a BF16-HMMA path concedes the 2× FP4 advantage llama already has. We do not want to *descend* to vLLM's slower arithmetic class; we want to keep the FP4-MMA class and schedule it better. |
**Decision: track B = tune `mul_mat_q<NVFP4>` (dense, `mmq.cu`/`mmq.cuh`) + the grouped `mul_mat_q`
id-branch (MoE, `mmid.cu` + the same `mmq.cuh`).** No new kernel, no rewrite, no descent to BF16.
The win is kernel *engineering around an FP4-MMA llama already possesses*, so there is **no
hardware-instruction wall** — but it is gated by whether MMQ's occupancy-bound design can be pushed
to the bandwidth floor at the thin decode M-tile.
### What "the existing path" actually is (source-read, DGX `ggml/src/ggml-cuda/`)
Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens fused into one
ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128× M=1). The NVFP4 path:
`mmq.cu` `use_native_fp4` gate (L125) → `quantize_mmq_fp4_cuda` act-quant (L138 dense / L200 id;
**track A's fuse target**) → `mul_mat_q` → `vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) →
`mma_block_scaled_fp4` (`mma.cuh:1126`).
**Confirmed W4A4 (this corrects an earlier "A is 8-bit-class" framing):** `block_fp4_mmq`
(`mmq.cuh:53`) is `uint32_t d4[4]` (four `ue4m3` block scales) + `int8_t qs[4*32]` = **256 FP4 (e2m1)
values packed 2-per-byte**. `quantize_mmq_fp4_cuda` (`quantize.cu:422`) emits FP4 via
`ggml_cuda_float_to_fp4_e2m1`. The MMA is
`mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`
(`mma.cuh:1145`) — **both operands e2m1, ue4m3 block scales**. So llama's dense FP4-MMA path is
already the *same arithmetic class as vLLM's cutlass W4A4 dense*. The `sizeof(block_fp4_mmq) ==
sizeof(block_q8_1_mmq)` static_assert is a shared-tile-footprint convention, **not** an 8-bit
activation. **Consequence: there is no "make activations 4-bit" work to do and no activation-traffic
halving to win — that is already banked. The entire dense deficit is scheduling/occupancy.**
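The footprint claim is easy to sanity-check on the host. The struct below is a stand-in that only mirrors the field layout quoted above (`uint32_t d4[4]` plus `int8_t qs[4*32]`); it is not the header from the tree. The accessor names are hypothetical, and the nibble order (even index in the low nibble) is an assumption made for illustration. The e2m1 magnitude table is the standard 4-bit float value set.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the layout described in the text, not the real definition.
struct block_fp4_mmq_sketch {
    uint32_t d4[4];       // 16 bytes of packed block-scale storage
    int8_t   qs[4 * 32];  // 128 bytes = 256 e2m1 values, two per byte
};
static_assert(sizeof(block_fp4_mmq_sketch) == 144, "16 B scales + 128 B payload");

// Decode one e2m1 code: bit 3 is the sign, bits 0..2 select the magnitude.
float e2m1_to_float(uint8_t nibble) {
    static const float mag[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    const float v = mag[nibble & 0x7];
    return (nibble & 0x8) ? -v : v;
}

// ASSUMPTION: element 2k lives in the low nibble of byte k, element 2k+1 in the high nibble.
uint8_t get_fp4(const block_fp4_mmq_sketch & b, int i) {
    assert(i >= 0 && i < 256);
    const uint8_t byte = static_cast<uint8_t>(b.qs[i / 2]);
    return (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
}

void set_fp4(block_fp4_mmq_sketch & b, int i, uint8_t nibble) {
    assert(i >= 0 && i < 256 && nibble < 16);
    uint8_t byte = static_cast<uint8_t>(b.qs[i / 2]);
    byte = (i % 2 == 0) ? ((byte & 0xF0) | nibble)
                        : ((byte & 0x0F) | static_cast<uint8_t>(nibble << 4));
    b.qs[i / 2] = static_cast<int8_t>(byte);
}
```

The 144-byte total equals 16 bytes of scales plus 128 one-byte quants, which is how a 4-bit block can share a tile footprint with an 8-bit one without the activations being 8-bit.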
Geometry (`vec_dot_fp4_fp4_mma`): `MMQ_NWARPS=8`, `iter_k=MMQ_ITER_K_FP4=512`, tiles
`tile_A<16,8,int>` (weights, 16 N-rows × 64 FP4-in-K), `tile_B<8,8,int>` (acts, 8 M-cols × 64
FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags = MMQ_TILE_NE_K/tile_A::J`. The M loop
is `for (j0=0; j0<mmq_x; j0 += ntx*tile_C::J)` — M tiled in steps of `tile_C::J=8`.
---
## 2. The roofline — answering the load-bearing question
**Answer: BANDWIDTH-bound on the hardware roofline, but COMPUTE-bound in practice by the kernel's own
under-occupancy. The 273 GB/s is NOT the wall at the parity target.**
### 2a. DENSE Qwen3.6-27B, M=128
`b = 18e9/27e9 = 0.667 B/param`; FLOPs/step `= 2·128·27e9 = 6.91 TFLOP`.
- **Weight-read floor** (18 GB read ONCE for all 128 tokens): @273 GB/s = **65.9 ms → 1,942 tok/s**;
@216 GB/s = 83 ms → 1,542 tok/s.
- **Crossover** at FP4 peak: `M* = 0.667·500e12/(2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM
at decode is BANDWIDTH-bound.** At the kernel's *achieved* ~3% efficiency the effective peak
collapses and drags M* to ≈30, putting the *current* kernel in self-inflicted compute-bound
territory.
- **Where llama sits:** GEMM = 59.2% × 795 ms = **471 ms = 14.7 TFLOP/s = 2.9% of FP4 peak = 7.1×
slower than the 66 ms weight-read floor.** Not a bandwidth wall — a kernel running deep in
compute-bound territory at single-digit efficiency.
- **Where vLLM sits:** step 328 ms ≈ llama's GEMM bucket (471 ms) alone. The **entire 2.42× gap is
the GEMM.**
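The dense figures in this subsection reduce to three formulas, sketched here with hypothetical helper names. Inputs are the values quoted above (18 GB of matmul weights, 27e9 parameters, M = 128, a 471 ms GEMM bucket).

```cpp
#include <cassert>
#include <cmath>

// Time to stream the weights once, in milliseconds.
double weight_read_floor_ms(double weight_bytes, double bw_bytes_per_s) {
    assert(weight_bytes > 0.0 && bw_bytes_per_s > 0.0);
    return weight_bytes / bw_bytes_per_s * 1e3;
}

// decode_agg = npl / step_s, with the step given in milliseconds.
double decode_agg(double npl, double step_ms) {
    assert(step_ms > 0.0);
    return npl / (step_ms * 1e-3);
}

// Achieved GEMM rate in TFLOP/s: 2 * M * params FLOPs over the GEMM time.
double achieved_tflops(double m, double params, double gemm_ms) {
    assert(gemm_ms > 0.0);
    return 2.0 * m * params / (gemm_ms * 1e-3) / 1e12;
}
```

`weight_read_floor_ms(18e9, 273e9)` is about 65.9 ms, `decode_agg(128, 65.9)` about 1,942 tok/s, and `achieved_tflops(128, 27e9, 471)` about 14.7 TFLOP/s, matching the bullets.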
### 2b. MoE Qwen3.6-35B-A3B, M=128
@npl128, 128 tok × top-8 / 256 experts ⇒ ~98% experts read ⇒ ~22 GB/step (the full weight set), per-
expert M ≈ **4 tokens**.
- **Weight-read floor:** 22/273 = **80.6 ms → 1,588 tok/s** (@216: 102 ms → 1,255).
- **Compute floor:** only ~3B active params ⇒ 0.77 TFLOP ⇒ 1.5 ms @peak — **trivial. MoE decode is
purely bandwidth/occupancy-bound, never compute-bound.** The hard part is saturating 273 GB/s while
feeding ragged M≈4 tiles.
- **Where llama sits:** GEMM = 59% × 384 = **227 ms = 97 GB/s = 35% of peak BW** (occupancy/tile-fill
loss, not compute).
- **Where vLLM sits:** step 158 ms ≈ grouped Marlin-NvFp4 at the ~80 ms floor + ~78 ms non-GEMM —
already pushing the MoE BW floor.
**Both weight-read floors (dense ~1,940, MoE ~1,590 tok/s) sit roughly 2-5× ABOVE vLLM's 391/811
(dense ~5×, MoE ~2×). Bandwidth is not the wall; the GB10 FP4-MMA occupancy efficiency is.**
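The "~98% experts read" and "M ≈ 4 per expert" figures in §2b can be reproduced under a simple routing model. The sketch assumes each token picks its top-k experts uniformly at random and independently of other tokens; real routers are not uniform, so treat the result as an estimate. Function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Expected fraction of experts touched by at least one token.
// ASSUMPTION: uniform, independent routing. A given expert is among one token's
// top_k distinct picks with probability top_k / n_experts.
double expected_expert_coverage(int n_tokens, int top_k, int n_experts) {
    assert(n_tokens > 0 && top_k > 0 && top_k <= n_experts);
    const double p_miss_one = 1.0 - static_cast<double>(top_k) / n_experts;
    return 1.0 - std::pow(p_miss_one, n_tokens);
}

// Average number of token rows each expert's GEMM sees.
double mean_tokens_per_expert(int n_tokens, int top_k, int n_experts) {
    assert(n_experts > 0);
    return static_cast<double>(n_tokens) * top_k / n_experts;
}
```

For 128 tokens, top-8 of 256 experts, the coverage comes out just above 98% and the mean per-expert M is exactly 4.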
---
## 3. The code-level inefficiencies, and the M-tile asymmetry that drives the whole plan
The selection is `mul_mat_q_case` (`mmq.cuh:4108`): it loops `mmq_x = 8..mmq_x_max(=128) step 8` and
keeps the `mmq_x` that **minimizes `ntiles_x = ceil(ncols_max/mmq_x)`**, stopping at `ntiles_x==1`.
`mmq_y` (the weight-row tile) is pinned at **128** by `get_mmq_y_host` (L143). This produces the
single most important structural fact for track B:
> **`mmq_x` tiles M (tokens / output columns) — shrinking it RE-READS the weights `ntiles_x` times.
> `mmq_y` tiles N (weight rows / output rows) — shrinking it does NOT re-read weights (each weight row
> lives in exactly one row-tile); it only lowers shared footprint and raises occupancy.** The two
> regimes pick opposite knobs:
| | dense decode (M=128, no `expert_bounds`) | MoE decode (per-expert M≈4) |
|---|---|---|
| selection picks | `mmq_x=128` → `ntiles_x=1` → **weights read ONCE** (the one-read optimum) | `mmq_x=128` applied **per expert** → tile ~3% filled |
| shrink `mmq_x`? | **NO — re-reads 18 GB ×`ntiles_x`**, fatal in the BW-bound regime | **YES, FREE** — 1 col-tile/expert regardless, no re-read → strictly occupancy-positive |
| FP4-MMA M-frag fill | **full** (128/`tile_C::J`=16 frag-groups, all live) → no fragment waste | **wasted** (~1 of 8/16 frag-groups live, rest masked tails) |
| BW-neutral occupancy lever | **`mmq_y`↓** (more resident CTAs, weights still read once) — kernel-structure change | **`mmq_x`↓** (toward density ≈8) — host-side template switch |
| dominant loss | **occupancy** at the heavy 128×128 tile (exposed weight-load latency) | **tile-fill** (dense-tuned M-tile applied to ragged M≈4) |
This asymmetry is the spine of the plan: **MoE's lever is host-only `mmq_x`↓ (already landed as patch
0015 auto-cap→64; ideal ≈8-16); dense's lever is `mmq_y`↓ + occupancy, a bounded kernel change.**
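The selection rule and the re-read cost it implies can be sketched in a few lines. This is a simplified model of the loop as described in this section, not the code in `mmq.cuh`: it omits any granularity or shared-memory filters the real selector applies, and `select_mmq_x` / `weight_gb_read` are hypothetical names.

```cpp
#include <cassert>
#include <climits>

struct TileChoice {
    int mmq_x;     // chosen M-tile width
    int ntiles_x;  // number of M-tiles = number of passes over the weights
};

// Walk mmq_x = 8, 16, ..., mmq_x_max; keep the first width that strictly
// lowers ntiles_x = ceil(ncols_max / mmq_x); stop once a single tile suffices.
TileChoice select_mmq_x(int ncols_max, int mmq_x_max) {
    assert(ncols_max > 0 && mmq_x_max >= 8);
    TileChoice best{0, INT_MAX};
    for (int mmq_x = 8; mmq_x <= mmq_x_max && best.ntiles_x > 1; mmq_x += 8) {
        const int ntiles_x = (ncols_max + mmq_x - 1) / mmq_x;
        if (ntiles_x < best.ntiles_x) {
            best = {mmq_x, ntiles_x};
        }
    }
    return best;
}

// Weight traffic for a dense GEMM: every M-tile streams the full weight set again.
double weight_gb_read(double weight_gb, int m, int mmq_x) {
    assert(m > 0 && mmq_x > 0);
    return weight_gb * ((m + mmq_x - 1) / mmq_x);
}
```

For dense decode, `select_mmq_x(128, 128)` returns a 128-wide tile with one pass, and forcing `mmq_x = 64` doubles the traffic to 36 GB. The MoE problem is that the selector sees the total column count rather than the per-expert count, so the same wide tile is applied to each expert's ~4 rows.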
The five inefficiencies, ranked:
1. **Separate activation-quant pass (track A's bucket, 8.2%).** `quantize_mmq_fp4_cuda` writes the
whole activation tensor to `block_fp4_mmq` in a standalone kernel; vLLM fuses `scaled_fp4_quant`
into the preceding RMSNorm/SiLU epilogue. **Handoff (track A → B):** B must consume A's prequantized
`block_fp4_mmq` y-tile in place of calling `quantize_mmq_fp4_cuda`, so the fusion saves the
activation round-trip, not just the launch (see §4.4).
2. **No weight-load software pipeline → exposed latency at thin M (the #1 dense kernel lever).**
`load_tiles_nvfp4_nvfp4` (`mmq.cuh:946`) does plain global→shared stores → `__syncthreads` →
`vec_dot_fp4_fp4_mma` (`load_ldmatrix` of A + MMA): a **load→sync→compute→repeat** cadence with **no
`cp.async` double-buffering** overlapping the next k-block weight load with the current MMA. At
M=128 the per-tile MMA work is small, so serialized weight-load latency dominates → 2.9% (dense) /
35%-of-BW (MoE). **Caveat (the GB10 wall):** a *deep* pipeline + XOR-swizzle collapses GB10
occupancy (`W4A16_MARLIN_KERNEL_PLAN.md`). The fix is **occupancy-first** (raise resident CTAs to
hide latency via CTA-parallelism), **shallow 2-stage prefetch second**, never Marlin's 4-stage.
3. **`mmq_x` maximized for dense = occupancy-heavy, but pinned by the one-read constraint.** At dense
decode the 128×128 tile (8 warps, large shared) is low-occupancy on the occupancy-dominated GB10 —
but you cannot shrink `mmq_x` without doubling the 18 GB weight read. So the dense occupancy fix is
**`mmq_y`↓** (BW-neutral), not `mmq_x`↓.
4. **MoE per-expert M-tile waste (the structural MoE gap).** The 128-wide (or patch-0015 64-wide)
tile is applied per expert at density ≈4, so the accumulator is ~3-6% filled and ~1 `tile_C` frag-
group is live, the rest masked `need_check` tails. Ideal `mmq_x` ≈ tokens/expert ≈ 8 (= `tile_C::J`).
At ≤1 col-tile/expert this costs **no** extra weight read → strictly occupancy-positive. (This is
the MoE arm of inefficiency 3; scoped in `MOE_GROUPED_GEMM_SCOPE.md`.)
5. **`iter_k=512` (FP4) couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → larger
shared footprint → adverse in the occupancy-bound regime. A P2 tuning knob.
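The tile-fill numbers behind inefficiencies 3 and 4 come from simple ratios. A minimal sketch with hypothetical helper names, taking the fragment width `tile_C::J = 8` from the geometry described earlier:

```cpp
#include <algorithm>
#include <cassert>

// Fraction of an mmq_x-wide M-tile occupied by real token columns.
double tile_fill(int tokens, int mmq_x) {
    assert(tokens >= 0 && mmq_x > 0);
    return static_cast<double>(std::min(tokens, mmq_x)) / mmq_x;
}

// Fragment groups the M loop walks for one tile (steps of tile_j columns).
int frag_groups(int mmq_x, int tile_j = 8) {
    assert(mmq_x > 0 && tile_j > 0);
    return mmq_x / tile_j;
}

// Fragment groups that contain at least one real token column.
int live_frag_groups(int tokens, int mmq_x, int tile_j = 8) {
    const int t = std::min(tokens, mmq_x);
    return (t + tile_j - 1) / tile_j;
}
```

At per-expert density 4, a 128-wide tile is 3.1% full and a 64-wide tile 6.25% full, with one live fragment group out of 16 or 8 respectively. At dense M = 128 all 16 groups are live.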
**Ruled out (do not chase):** redundant weight reads on the *current* selection (none — dense
`ntiles_x=1`, MoE ≤1 col-tile/expert); stream-K fixup (it *helps* fill the small GB10 grid at thin M);
raw FP4-MMA peak rate (already beats Q4-MMQ and is BW-bound at batch 1 — latency-hiding binds first).
---
## 4. The specific build-ready changes
All against DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`. Every change is gated and defaults to exact
stock behavior until proven.
### 4.1 Dense M-tile / occupancy (the make-or-break)
- **Keep `mmq_x=128` at dense decode** (the one-weight-read optimum; do **not** shrink it — that
re-reads 18 GB). Lock this as an invariant in P0.
- **Make `mmq_y` decode-selectable** (`get_mmq_y_host`/`get_mmq_y_device`, L143/L157). Today pinned
128; try **64** (and 96) at decode. `mmq_y` is coupled to `nwarps × tile_C::I` via the MMQ
static_assert, so this is a **warp/fragment remap** (bounded kernel change), not a pure host switch:
fewer N-frags per warp or fewer warps → smaller per-CTA shared → **more resident CTAs → latency
hidden by CTA-parallelism**, with **weights still read once** (BW-neutral). This is the primary
dense occupancy lever and respects every GB10 rule.
- **Host-only knobs first (P1, zero kernel):** the `mmq_get_granularity_host` choice (L274 — sets
`rows_per_warp=2·granularity`, `ntx`), and the stream-k-vs-xy-tiling threshold (`launch_mul_mat_q`
~L3954, `tiles_efficiency_percent` L4001). Plus one **empirical A/B**: does eating a 2× weight
re-read at `mmq_x=64` buy enough occupancy to net positive? (Diagnostic: if yes, occupancy is badly
broken and P2 `mmq_y`↓ has large upside; if no, the tile is already BW-saturated and P2's ceiling is
lower.) All behind `GGML_CUDA_FP4_MMQ_Y` / `GGML_CUDA_FP4_GRAN` / `GGML_CUDA_FP4_FORCE_STREAMK`.
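One way to keep such a knob "gated and defaulting to stock" is a strict environment parse that falls back on anything unexpected. The sketch below is an assumption about how the proposed `GGML_CUDA_FP4_MMQ_Y` variable might be read; the function name, the accepted values (64, 96, 128) and the fallback policy are illustrative, and wiring the result into a compile-time tile parameter is not shown.

```cpp
#include <cassert>
#include <cstdlib>

// Returns the requested FP4 mmq_y, or `stock` when the variable is unset,
// empty, malformed, or not one of the values this sketch allows.
int fp4_mmq_y_from_env(int stock = 128) {
    const char * s = std::getenv("GGML_CUDA_FP4_MMQ_Y");
    if (s == nullptr || *s == '\0') {
        return stock;
    }
    char * end = nullptr;
    const long v = std::strtol(s, &end, 10);
    if (*end != '\0') {
        return stock;  // trailing garbage: ignore the override entirely
    }
    if (v == 64 || v == 96 || v == 128) {
        return static_cast<int>(v);
    }
    return stock;
}
```

Rejecting unknown values rather than clamping them keeps an accidental setting from silently changing the tile shape, which matters when the parity gate requires byte-identical output with the flag unset.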
### 4.2 FP4-MMA fragment usage
- Fragments stay `tile_A<16,8,int>` / `tile_B<8,8,int>` / `tile_C<16,8,float>` — these match the
`m16n8k64` block-scaled FP4 MMA and must not change (they are the instruction shape). At dense M=128
all 16 `tile_C::J`-groups are live → **no dense fragment work needed**. The lever is *how many of
these tiles are resident per SM* (occupancy), set by `mmq_y`/`nwarps`/granularity, not the fragment
shape.
- MoE: shrink `mmq_x` toward `tile_C::J`=8 so the live frag-group count matches density (§4.3).
### 4.3 MoE M-tile (`MOE_GROUPED_GEMM_SCOPE.md`, partly landed)
- **Patch 0015 already auto-caps `mmq_x`→64 at decode** via per-expert density in `mul_mat_q_case`
(the `expert_bounds != nullptr` block, L4118-4165; env `LLAMA_MOE_DECODE_TILE`,
`LLAMA_MOE_DENSITY_MAX`). Tighten the decode tile toward **8-16** (= density) and sweep.
- **Optional [2]: block-padded `mm_ids_helper`** (`mmid.cu`) — pad each expert segment to a multiple
of the tile, removing `need_check` masked tails and tightening the stream-k schedule. Medium risk
(scatter + write-back masking); behind `LLAMA_MOE_BLOCK_ALIGN`.
### 4.4 Scale handling + the act-quant fusion handoff (the track A → B ABI contract)
- **Weight scales** (`ue4m3`, one per 16 weights) load in `load_tiles_nvfp4_nvfp4` into `x_sc`
(`x_u32 + 64 + kbx`), consumed as `scaleA` in `vec_dot_fp4_fp4_mma` and passed as the block-scale
operand to `mma_block_scaled_fp4`. **No change** — already a first-class MMA scale operand.
- **Activation scales** (`ue4m3`) live in the `block_fp4_mmq` y-tile `d4[4]`, consumed as `scaleB`.
- **The handoff contract:** track B must hold the **`block_fp4_mmq` y-tile layout invariant**
(`uint32_t d4[4]` ue4m3 scales + `int8_t qs[128]` = 256 packed FP4, `mmq.cuh:53`). Track A's fused
`rms_norm+mul+nvfp4-quant` producer (task 39) writes exactly this struct; track B's "prequantized
MMQ consumer" (task 40) makes `mul_mat_q` accept a prebuilt `src1_q8_1` buffer and **skip the
`quantize_mmq_fp4_cuda` call** (`mmq.cu:138`/`200`). The numerics must be **bit-identical** to the
unfused path (same `e2m1` rounding, same `ue4m3` block scale per 16) so the parity gate stays green
with the fusion on or off. B owns the consumer seam; A owns the producer kernel; the `block_fp4_mmq`
struct is the frozen interface between them.
### 4.5 GB10-fit rules (binding constraints on every kernel change)
- **Small shared mem + high occupancy.** Do **not** add deep `cp.async` stages or XOR-swizzle shared
layouts — they are exactly what collapsed W4A16 on GB10 (`W4A16_MARLIN_KERNEL_PLAN.md`: a 16 KB
XOR-swizzle dropped q4_K from 6.63→2.84 TFLOPS).
- **Preserve the skew-pad** (`MMQ_MMA_TILE_X_K_FP4 = 2·MMQ_TILE_NE_K + 8 + 4`, the `% 8 == 4`
padding, `mmq.cuh:221/233`) — conflict-free `ldmatrix` at ~zero shared cost.
- **Stay on the FP4-MMA path** (`block_fp4_mmq` / `mma_block_scaled_fp4`) — the only path at GB10's
FP4 = 2× INT8/BF16 rate. Never descend to BF16/INT8 (1:1 on GB10).
- **Occupancy beats a conflict-free-but-wide layout.** Buy latency-hiding with *more resident CTAs*
(smaller `mmq_y`, smaller shared), not a deeper pipeline.
- Tuning is **empirical**: `nsys` (throughput) is available, **`ncu` is not** on the DGX (no driver
perms). Sweep configs, measure decode_agg, bracket thermals (same-session cold A/B only).
---
## 5. Correctness / parity gate (every phase)
- **Primary, bit-exact:** `test-backend-ops test -o MUL_MAT -b CUDA0` and
`test-backend-ops test -o MUL_MAT_ID -b CUDA0` must stay **1103/1103** with the flag set **and**
unset, and **byte-identical** when unset. The CPU reference is the deterministic oracle; the op test
is exact (the GB10 greedy-decode non-determinism band applies only to end-to-end, never to the op
test).
- **Add decode-shape cases if absent:** `type_a ∈ {NVFP4, MXFP4}`, `type_b = F32`, dense **n=128** at
the real FFN K/N; for `_ID`, `n_mats=128, n_expert_used=8, n_tokens ∈ {8,32,64,128}` **plus ragged
small-M** (experts with 0/1/2 tokens, `n_tokens` not a multiple of `mmq_x`) — exactly where `mmq_x`/
`mmq_y` changes and block-pad masking can leak.
- **Fusion-handoff parity (P3):** with track A's fused producer on, the prequantized-consumer path
must produce dst **identical** to the unfused `quantize_mmq_fp4_cuda` path (same `e2m1`/`ue4m3`
rounding).
- **End-to-end:** `llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`
(dense) and `q36-35b-a3b-nvfp4.gguf` (MoE); confirm decode_agg climbs per §6 and output stays within
the documented CUDA batch-shape non-determinism band vs the CPU oracle. All scripts **dev-tree-only**.
---
## 6. Phased plan, with expected decode_agg at each phase
Per-step model used (ms @npl128): **dense 795** = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162;
**MoE 384** = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81. `decode_agg = 128 / step_s`.
### DENSE (parity target 391)
| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 391** | risk |
|---|---|---:|---:|---:|---:|---|
| **P0** harness | Lock baseline: 1103/1103, decode n=128 perf, nsys window, the 471 ms / 2.9% eff datum. Pin `mmq_x=128` one-read invariant. | 471 | 795 | **161** | 41% | low |
| **P1** host-only tile/grid + re-read A/B | granularity + stream-k threshold sweep; the `mmq_x=64` re-read-vs-occupancy diagnostic. **Honest: small**; `mmq_x` is pinned, so this mostly de-risks P2. | ~400 | ~724 | **~177** | ~45% | low |
| **P2** `mmq_y`↓ + occupancy/shallow-prefetch | The make-or-break: raise resident CTAs (`mmq_y` 128→64, granularity, shallow 2-stage weight prefetch, skew-pad), push GEMM toward the **66-81 ms BW floor (17-21% FP4 eff)**. **KILL-GATE: if eff plateaus <15% (GEMM >110 ms) → dense parity OFF, report partial.** | **66-81** | 390-405 | **316-328** | **81-84%** | **med-high** |
| **P3** co-land track A | Consume A's prequantized `block_fp4_mmq` y-tile; the 65 ms act bucket folds away. | 66-81 | **325-340** | **376-394** | **96-101%** | low |
Dense climb: **161 → ~177 → 316-328 → 376-394** tok/s = **41% → 45% → 81-84% → 96-101% of vLLM 391.**
Robust to the 273-vs-216 GB/s uncertainty (@216 GB/s P3 → ~359 tok/s = 92%). **Parity within error,
contingent on P2 clearing the kill-gate and on A landing.**
### MoE (parity target 811)
| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 811** | risk |
|---|---|---:|---:|---:|---:|---|
| **P0** harness | Lock 1103/1103 + the monotonic `85→1771` batched-bench curve + 227 ms / 35%-BW datum. | 227 | 384 | **333** | 41% | low |
| **P1/P4** MoE `mmq_x`↓ (patch 0015 → tighten to 8-16) | Free per-expert tile shrink (no re-read); reclaim the waste from the ~3-6% tile fill, raise occupancy. | ~140 | ~297 | **~431** | ~53% | low |
| **P2** block-pad align + occupancy | Remove `need_check` tails, tighten stream-k; push toward the 80 ms floor. | ~100 | ~257 | **~498** | ~61% | med |
| **P3** co-land track A | act bucket (31 ms) folds away; GEMM at the ~80 ms floor. | 80 | **207** | **618** | **76% — CEILING** | low |
MoE climb: **333 → ~431 → ~498 → 618** tok/s = **41% → 53% → 61% → 76% of vLLM 811.** **The 76% is the
hard ceiling from the GEMM track:** even a *perfect* weight-read-floor grouped GEMM leaves llama's
non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) at **1.6× vLLM's whole ~78 ms non-GEMM**, so the step
cannot drop below ~207 ms. The remaining ~49 ms to vLLM's 158 ms step is elementwise + host-loop
(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md`), **outside track B.**
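Both phase tables follow from the same additive per-step model, so the projections can be recomputed when a bucket is re-measured. A minimal sketch with hypothetical type and function names; the bucket values are the ones given at the top of this section.

```cpp
#include <cassert>
#include <cmath>

// Per-step time buckets in milliseconds.
struct StepModel {
    double gemm_ms, act_ms, gdn_ms, attn_ms, rest_ms;
};

double step_ms(const StepModel & s) {
    return s.gemm_ms + s.act_ms + s.gdn_ms + s.attn_ms + s.rest_ms;
}

// decode_agg = npl / step_s.
double decode_agg(double npl, double ms) {
    assert(ms > 0.0);
    return npl / (ms * 1e-3);
}

// Track B: replace the GEMM bucket with a new value.
StepModel with_gemm(StepModel s, double gemm_ms) {
    s.gemm_ms = gemm_ms;
    return s;
}

// Track A: the standalone activation-quant bucket folds away.
StepModel without_act(StepModel s) {
    s.act_ms = 0.0;
    return s;
}
```

With dense buckets `{471, 65, 83, 14, 162}`, a 66 ms GEMM gives a 390 ms step (about 328 tok/s), and removing the act bucket gives 325 ms (about 394 tok/s). With MoE buckets `{227, 31, 38, 8, 81}`, an 80 ms GEMM and no act bucket leave a 207 ms step, about 618 tok/s, which is the 76% ceiling.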
### Explicitly NOT in scope (and why)
- A from-scratch W4A16 / CUTLASS SM120 collective — repeats the STOPPED occupancy dead-end and
CUTLASS's grouped FP4 is broken on sm_121.
- Deep multi-stage `cp.async` / XOR-swizzle — proven to collapse GB10 occupancy.
- "Make activations 4-bit" — already W4A4; no work, no win there.
- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity but
**separate tracks**; B owns the GEMM only.
---
## 7. The honest ceiling — does B reach TRUE PARITY?
- **DENSE: TRUE PARITY is PLAUSIBLY REACHABLE, conditional, no margin.** The entire 2.42× gap is the
GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is **bandwidth-bound, not
hardware-capped**. **B (GEMM → BW floor) + A (act-fuse) lands 376-394 tok/s = 90-103% of vLLM 391.**
The catch: it needs **~17-21% FP4-MMA efficiency at decode M=128**, and GB10 has only demonstrated
~17% — and that at the *easier* prefill M=512 tile. It is a **reach, not a lock**, gated by the P2
occupancy kill-gate and contingent on track A. **GO (conditional).**
- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~76% of vLLM (618 vs 811)**
even with a perfect weight-read-floor grouped GEMM, because (1) the MoE floor is the hardest
grouped-GEMM regime (M≈4/expert, vLLM ships purpose-built Marlin-NvFp4) and (2) ~24% of the step is
non-GEMM outside this track. Worth doing (333 → ~618, a 1.85× and a real win), but it **cannot
deliver 811 alone.** **PARTIAL / NO-GO for parity-from-B.**
- **The 273 GB/s is not the ceiling — the GB10 FP4-MMA occupancy efficiency is.** Decode M=128 is a
*different* regime from the dead W4A16 path: bandwidth/occupancy-bound (saturate LPDDR5x at a thin
M-tile via resident CTAs), not compute-throughput-bound (pack MMAs). The existing path is already at
the BW floor at batch 1 (88 ms), so the work is **keeping it bandwidth-bound as M grows to 128**
(occupancy via `mmq_y`↓ + shallow prefetch), a **tune of a working path**, not the greenfield
rewrite. The binding risk is whether that occupancy can be bought without tripping the GB10 wall —
which is exactly what the P2 kill-gate measures.
**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity with
vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the top of the demonstrated efficiency envelope
with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track alone (ceiling ~76%);
MoE parity is a B-plus-non-GEMM program. **Verdict: GO for dense (conditional, B+A, kill-gated),
PARTIAL for MoE.**
---
## 8. One-paragraph summary
The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read
floors roughly 2-5× above vLLM, so **273 GB/s is not the wall**, but llama's FP4-MMA kernel runs at ~3% of
FP4 peak, in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). The path is already
**W4A4** and already **beats vLLM at batch-1 prefill**, so the fix is **tuning the existing
`mul_mat_q<NVFP4>`**, not a cutlass rewrite (a proven GB10 dead-end, and broken on sm_121 anyway). The
M-tile asymmetry sets the levers: **dense** is pinned at `mmq_x=128` (one weight read) so its occupancy
win is **`mmq_y`↓ + shallow prefetch** (BW-neutral), while **MoE**'s win is the free per-expert
**`mmq_x`↓** (patch 0015). **Track B (GEMM → BW floor) + track A (fuse act-quant)** plausibly reaches
**90-103% of vLLM dense (391)**: TRUE PARITY on the table for dense, but only at the **top of the
demonstrated GB10 FP4-efficiency envelope (~17-21%)**, with **no margin**, gated by the P2 occupancy
kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor
sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense
(conditional, B+A), PARTIAL for MoE.**
---
## 9. Adversarial review (skeptical staff CUDA engineer, post-W4A16): the parity go / no-go
Reviewer stance: I lived through the W4A16 GB10 effort that plateaued at ~9-15 TFLOP/s (~21% of the
BF16 ceiling) after multi-week work and was STOPPED at the occupancy wall. I read this scope and the
grounding (`QWEN36_NVFP4_BENCH`, `VLLM_DECODE_GROUNDING`, `GDN_DECODE_VERIFY`, `DECODE_GAP_STUDY`,
`BLACKWELL_KERNEL_GAPS`, `W4A16_MARLIN_KERNEL_PLAN`) and stress-tested the verdict against them. Net:
the plan is **directionally right and tractably scoped**, the kernel-approach decision (tune, do not
rewrite) is correct, but the **"GO for dense, TRUE PARITY 96-103%" headline outruns its own caveats**.
The honest landing is **dense ~80-90% (parity is the optimistic tail), MoE ~55-65% (parity not
reachable from B)**. The decision to commit to B is nonetheless sound, for a reason the doc under-sells
(low regret), and there is **one technical gap (TMA) and one sequencing error (A last) that must be
fixed**.
### 9.1 Is this the W4A16 wall again? No - and the batch-scaling signature proves why
The decisive evidence the doc has but does not fully exploit is the **npl-sweep** (`QWEN36_NVFP4_BENCH`):
dense llama-as-%-of-vLLM = **99 / 56 / 46 / 41** at npl 8 / 32 / 64 / 128. At **npl8 the kernels are at
parity** (99%); the gap **opens monotonically as M grows**. Decompose this:
- At M=8 the dense GEMM is weight-read-bound at the floor (~88 ms, same as batch-1). llama == vLLM there,
so **llama's FP4-MMA kernel demonstrably HITS the weight-read floor at small M.** This is the existence
proof the W4A16 path never had: it is a *working, floor-reaching* FP4-MMA kernel, not a greenfield
build stuck at 1/4 of MMQ.
- At M=128 vLLM's GEMM **stays at ~88 ms** (flat: it amortizes the one weight read over 128 tokens and
hides the MMA behind the load), while **llama's balloons to 471 ms** (5.4x). llama **falls off the
floor** as M grows; vLLM **holds it**.
So the problem is **not** "build a fast 4-bit GEMM from scratch on an occupancy-hostile part" (the dead
W4A16 problem). It is **"keep a working FP4-MMA kernel on the bandwidth floor as the M-tile grows from 8
to 128"** - a tune of a working path. **Verdict: this is NOT the W4A16 wall** (different regime, working
path, dual existence proof at M=8 and from vLLM at M=128). **But it shares W4A16's one binding
constraint:** holding the floor as M grows requires hiding LPDDR5x weight-load latency at the larger
tile, which is the same occupancy / latency-hiding game GB10 historically loses. The doc is right that
it is a different and more tractable regime; it under-states that the *binding risk is identical*.
### 9.2 Why is vLLM 2.4x faster if both share 273 GB/s? Compute-side scheduling, and the gap is ~82% (not 100%) GEMM
The load-bearing question, settled by 9.1: at M=128 the gap is **not** that vLLM beats the shared
bandwidth floor - it is that **llama falls off the floor into self-inflicted compute/occupancy-bound
territory while vLLM stays on it.** The lever is therefore latency-hiding at the M=128 tile
(compute-side scheduling: occupancy, prefetch, tile shape), with the 273 GB/s weight-read floor as the
hard target both engines share. This confirms the doc's roofline and its central claim that the kernel,
not the hardware, is the limiter.
**But the doc's "the entire 2.42x dense gap is the GEMM" is an ~82% truth, not a 100% one.** Decompose
the dense step (numbers from the doc's own inputs):
```
llama step @npl128 795 ms (decode_agg 161)
vLLM step @npl128 328 ms (decode_agg 391)
total gap 467 ms
llama GEMM 471 ms
vLLM GEMM (at the floor) ~66-88 ms (66 @273 GB/s spec, 88 @216 GB/s achieved)
=> GEMM gap 383-405 ms = 82-87% of the 467 ms total gap
=> non-GEMM gap 62-84 ms = 13-18% of the total gap
```
So **B alone (GEMM -> floor) caps near ~80-84%** (step 412-390 ms = 311-328 t/s), **not parity.** Parity
needs the non-GEMM 62-84 ms too: ~65 ms of it is track A's act-quant bucket, the residual ~0-19 ms is
elementwise + host outside both A and B. This is the crux of the sequencing answer (9.6): **B is
necessary but on its own lands ~80%; it is track A that tips dense over the parity line, not B.** The
parity story is *entirely* contingent on A, which the P3 framing buries.
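The decomposition above can be re-derived mechanically. The following is a minimal sketch, not part of the original analysis: the constants are copied from the block above, and the function name `decompose` and the dictionary keys are illustrative assumptions.

```python
# Inputs copied from the dense @npl128 decomposition above (milliseconds).
LLAMA_STEP_MS = 795.0
VLLM_STEP_MS = 328.0
LLAMA_GEMM_MS = 471.0
BATCH = 128        # npl128: tokens produced per decode step
VLLM_TPS = 391.0   # vLLM decode_agg, the parity target


def decompose(vllm_gemm_floor_ms):
    """Split the step-time gap into GEMM and non-GEMM parts for one floor value."""
    total_gap = LLAMA_STEP_MS - VLLM_STEP_MS
    gemm_gap = LLAMA_GEMM_MS - vllm_gemm_floor_ms
    # "B alone": only the GEMM gap is removed from llama's step.
    step_b_only = LLAMA_STEP_MS - gemm_gap
    tps_b_only = BATCH / (step_b_only / 1000.0)
    return {
        "total_gap_ms": total_gap,
        "gemm_gap_ms": gemm_gap,
        "non_gemm_gap_ms": total_gap - gemm_gap,
        "gemm_share": gemm_gap / total_gap,
        "step_b_only_ms": step_b_only,
        "pct_of_vllm_b_only": tps_b_only / VLLM_TPS,
    }


spec = decompose(66.0)      # floor at the 273 GB/s spec bandwidth
achieved = decompose(88.0)  # floor at the 216 GB/s achieved bandwidth
```

Evaluating both floors gives the two ends of each range quoted above (GEMM share 82-87%, B-alone step 412-390 ms).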
### 9.3 The sharpest risk the doc misses: vLLM's existence proof uses the technique the doc forbids (TMA)
vLLM holds the M=128 floor with **cutlass SM120 = TMA + a warp-specialized deep async producer/consumer
pipeline** (Research 1). That deep pipeline is **exactly what the doc forbids on GB10** (rule 4.5: "do
not add deep cp.async stages ... they collapsed W4A16"). So **B's chosen GB10-friendly route (`mmq_y`-down
occupancy + a shallow 2-stage prefetch) is a different bet from the one that produced the existence
proof.** Reaching the same floor by a friendlier route is plausible but **unproven**, and if the
occupancy-only route plateaus short of the floor, B underperforms its target with no fallback in scope.
The doc conflates two different things under "deep pipeline":
- **manual `cp.async` + XOR-swizzle** - register/shared-hungry, **collapsed W4A16 occupancy on GB10**
(correctly banned).
- **TMA (tensor-memory-accelerator) bulk async copy** - a single descriptor drives the copy, **far lower
register/occupancy cost**, and it is precisely how cutlass gets pipeline depth **without** the
occupancy hit (Research 1 says this explicitly). TMA is available on sm_120/121.
**Recommendation (binding):** B must put a **TMA-driven weight feed in scope as a first-class P2 option**,
not categorically forbid pipeline depth. The occupancy-only route is the right *first* experiment
(cheapest, respects the W4A16 lesson), but if P2 plateaus below the floor, **TMA is the demonstrated way
to get depth without the occupancy collapse** and is what the vLLM existence proof actually uses.
Declaring the floor "unreachable" without trying TMA would repeat the W4A16 mistake in reverse:
abandoning the path that works because the *manual* version of it failed.
### 9.4 Tractability: bounded tune, confirmed - with the TMA caveat
The proposed changes are genuinely **bounded and build-ready**, not a greenfield kernel:
- **MoE arm = DEMONSTRATED tractable.** Patch 0015 already auto-caps `mmq_x` per-expert and is committed
and measured. Tightening to 8-16 + block-pad is the same lever, lower risk. This is real, banked
evidence that the "tune `mul_mat_q`" approach works on this exact kernel family.
- **Dense arm = plausibly bounded.** `mmq_y`-down is a warp/fragment remap that touches the
`nwarps x tile_C::I == mmq_y` static_assert coupling, so it is a contained *kernel* edit (not a pure
host switch, as the doc itself notes). The host-only P1 knobs are zero-risk. The **prefetch piece is
where the residual occupancy risk lives** - and per 9.3, TMA belongs here.
- **Rejecting (B) cutlass-rewrite and (C) BF16-Marlin-descent is correct.** Cutlass grouped FP4 is broken
on sm_121 (the reason vLLM itself falls to Marlin for MoE); BF16 Marlin concedes GB10's 2x FP4 edge.
**Verdict: tractable, not greenfield.** The MoE arm is proven; the dense arm is a contained edit with a
real but bounded occupancy risk, gated by the P2 kill-gate. The one scope gap is TMA (9.3).
### 9.5 Honest expected outcome (the numbers I would defend)
| | B alone | B + A (median) | B + A (optimistic, spec BW) | parity? |
|---|---:|---:|---:|---|
| **DENSE** (target 391) | ~80-84% (311-328 t/s) | **~92-95% (360-372 t/s)** | ~101% (394 t/s) | **optimistic tail only** |
| **MoE** (target 811) | ~53-61% (431-498 t/s) | **~70-76% (570-618 t/s)** | 76% (618 t/s, CEILING) | **no** |
Reconciliation with the doc: the doc's B+A = "96-103%" uses the **spec-BW (66 ms floor)** end. At the
**achieved 216 GB/s (88 ms floor)** the same arithmetic gives **~94%**, and that still assumes B hits the
floor. So the honest dense median is **~92-95%, with TRUE PARITY as the upside, not the expectation**,
contingent on a conjunction of three things: (a) P2 clears the occupancy kill-gate to the floor, (b) the
GB10-friendly *or* TMA feed actually reaches the cutlass floor (9.3), and (c) track A lands. Three ANDs =
tail, not median.
**The low-regret point the doc under-sells (and the real reason to commit):** even the *kill-gate-tripped*
outcome is a large win. At the doc's own 15%-FP4-eff kill threshold (GEMM ~110 ms), B+A still lands
**~89%** (step 369 ms); at a merely-partial occupancy win (eff 3% -> 5%, GEMM ~276 ms) B+A still lands
**~61%**. Since the M=8 parity proof guarantees the floor is reachable in principle and patch 0015 proves
the tune works, **getting *some* improvement at M=128 is high-probability; the only open question is how
close to the floor.** So the outcome distribution is heavily positive (very likely 60-90%, possibly
parity) with a bounded downside - B is **low-regret**, which matters more for the go decision than whether
the parity tail hits.
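The outcome figures in this section all follow from one formula: replace llama's 471 ms GEMM with the GEMM time B achieves, subtract the ~65 ms act-quant bucket that track A removes, and convert the step time to tokens/s. A hedged sketch of that arithmetic follows; the function name and the assumption that A saves exactly 65 ms are illustrative, not measured.

```python
LLAMA_STEP_MS = 795.0   # dense step @npl128 today
LLAMA_GEMM_MS = 471.0   # GEMM portion of that step
ACT_QUANT_MS = 65.0     # bucket track A is assumed to remove in full
BATCH = 128
VLLM_TPS = 391.0


def dense_outcome(gemm_ms_after_b, with_track_a=True):
    """Return (step_ms, tokens_per_s, fraction_of_vllm) for a given post-B GEMM time."""
    step_ms = LLAMA_STEP_MS - LLAMA_GEMM_MS + gemm_ms_after_b
    if with_track_a:
        step_ms -= ACT_QUANT_MS
    tps = BATCH / (step_ms / 1000.0)
    return step_ms, tps, tps / VLLM_TPS


scenarios = {
    "floor at spec BW (66 ms)": dense_outcome(66.0),
    "floor at achieved BW (88 ms)": dense_outcome(88.0),
    "kill-gate tripped (110 ms)": dense_outcome(110.0),
    "partial occupancy win (276 ms)": dense_outcome(276.0),
}

for name, (step_ms, tps, frac) in scenarios.items():
    print(f"{name}: step {step_ms:.0f} ms, {tps:.0f} t/s, {frac:.0%} of vLLM")
```

The same function with `with_track_a=False` reproduces the "B alone" column, which makes the dependence of the parity story on track A explicit.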
### 9.6 Sequencing vs track A: land A FIRST (the doc has this backwards)
The doc runs A as a parallel track merging at **P3 (last)**. That is backwards for de-risking, for three
reasons:
1. **A defines B's interface.** B's "prequantized-MMQ consumer" consumes A's fused `block_fp4_mmq`
producer (the frozen struct in 4.4). Building B against a not-yet-landed producer means B's consumer
seam is speculative until P3.
2. **A defines B's baseline and the kill-gate threshold.** A alone (act-fuse, folding the 65 ms / 8.2%
bucket, plus any of the elementwise/host it captures) plausibly moves dense **41% -> ~50-55%** before
B touches a kernel. B's *true residual is the GEMM after A removed the act round-trip*, not the raw
59%. Running B's P2 against the stock 41% baseline mis-sizes the required GEMM speedup and the
<15%-eff kill-gate.
3. **A is lower-risk and independently shippable.** It is the safe win; it should not wait behind the
risky kernel tune.
**Recommendation:** land A (tasks 38-41) first, **re-measure** the decode_agg and the GEMM share
post-A, **then** run B's P2 and recompute the kill-gate against the post-A number. This makes the
make-or-break decision cheaper, better-informed, and bankable-either-way.
### 9.7 Verdict (go / no-go)
- **DENSE: CONDITIONAL GO - commit to B, but scope and message it as "close most of the GEMM gap"
(expected ~80-90%, parity the upside), NOT "true parity."** Justified because: the approach is
bounded/tractable (9.4), it is a working-path tune with a dual existence proof (9.1), and the outcome
is low-regret (9.5) - even a tripped kill-gate roughly doubles today's 41%. Conditions: (i) **land A
first** (9.6); (ii) **gate hard at P2** (eff < 15% -> stop chasing parity, but keep the partial win);
(iii) **put TMA in scope** as the floor-reaching fallback before declaring the floor unreachable (9.3).
- **MoE: NO-GO for parity from B (confirmed).** The doc's ~76% ceiling is honest, arguably optimistic
(it assumes the ragged M~4/expert grouped GEMM hits its 80 ms floor, the hardest regime, where vLLM
ships purpose-built Marlin). Realistic B+A landing **~70-76%**, B alone ~53-61%. Still worth doing -
the `mmq_x`-down / block-pad work is cheap and partly landed (patch 0015) - but it must be sold as a
**1.7-1.85x win, not parity**; MoE parity is a **B-plus-non-GEMM** program (elementwise fusion, host
CUDA-graph, GDN bf16 state).
- **One line for the parent:** GB10 can plausibly reach **dense** decode parity with vLLM only at the
**top of its FP4 envelope and only as B + A together** (B alone caps ~80%; A is what tips it over),
and **cannot** reach **MoE** parity from the GEMM track alone (ceiling ~76%). **Commit to B** as a
high-value, low-regret, bounded GEMM-gap-closing tune (honest expected landing **dense ~80-90%, MoE
~55-65%**), **sequence track A first**, **gate at P2**, and **add a TMA weight-feed option** so the
occupancy-only route is not the only shot at the floor that vLLM's TMA pipeline demonstrably reaches.

@@ -1,86 +0,0 @@
# Decode-Parity: Parked Levers (future exploration)
**Context.** The bit-exact decode-parity effort shipped patches **0018-0023**: dense decode
38% -> **95% of vLLM** @npl128 on GB10 / DGX Spark (LPDDR5x ~273 GB/s), every patch
**byte-identical to llama's own f32 output** (md5-gated). The gated-DeltaNet recurrence (the
dominant ~50% kernel) now runs at **84.6% of peak BW = past vLLM's 82.4%**, at the DRAM floor.
bf16 SSM state was fully built and **shelved** (real +25-31% lever but fails the f32 KL gate).
The remaining non-recurrence kernels (FP4 GEMM, attention, lm_head) are at their bit-exact
floor: any knob changes a reduction order vs the f32 reference. So further *bit-exact* decode
gains are marginal; the levers below are the honest pick-up points, ranked by promise.
---
## 1. Hybrid-precision SSM state (the most promising)
The bf16 build (`BF16_SSM_STATE_RESULTS.md`) proved the throughput lever is large -
recurrence **-49%/call** (dense 3.38 -> 1.73 ms), dense decode ~**490 t/s = 125% of vLLM** (clean
runs), MoE @128 **+24.9%** - but bf16 fails the f32 KL gate (KLD 0.06-0.17 at >=1024 ctx,
~10% argmax flips). The discrimination showed the error is **intrinsic to bf16 over the
long-memory heads** (exp(g) ~ 1, where the per-step decay does not contract the rounding);
short/fast-decaying heads are fine.
**Lever:** a per-head (or per-channel) precision split - keep the long-memory heads (exp(g) near 1)
in f32, store the fast-decaying heads (exp(g) well below 1, where rounding contracts) in bf16. Could
capture most of the speedup while passing the KL gate. Needs a g-magnitude classifier at graph
build + a mixed-dtype recurrent-state cache. **HIGH promise, moderate effort.** The bf16 kernel
plumbing already exists (DGX `~/llama-paged-dev/BF16_SSM_STATE.diff`); this adds the per-head
dtype selection on top.
*Note:* plain bf16 (no split) is also a legitimate **opt-in for precision-tolerant deployments** -
it is exactly vLLM's own GDN precision (vLLM's recurrent cache is bf16), so "match vLLM speed at
vLLM precision" is one flag away if a user wants it. We declined it as the *default* because our
f32 is a strictly higher bar.
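To make the per-head split concrete, here is a minimal sketch of the classifier and the resulting state-byte saving. Everything in it is hypothetical: the `keep_f32_above` threshold, the function names, and the assumption that one representative log-decay `g` per head is available at graph build are illustration only and were not measured or implemented.

```python
import math


def split_heads_by_decay(log_decay_per_head, keep_f32_above=0.99):
    """Pick a state dtype per head from its per-step decay exp(g).

    Heads whose decay is close to 1 (long memory) stay in f32; the rest go to
    bf16. The 0.99 threshold is a placeholder, not a tuned value.
    """
    return [
        "f32" if math.exp(g) >= keep_f32_above else "bf16"
        for g in log_decay_per_head
    ]


def state_bytes_fraction(dtypes):
    """State bytes relative to an all-f32 cache (1.0 = no saving, 0.5 = all bf16)."""
    bytes_per_elem = {"f32": 4, "bf16": 2}
    return sum(bytes_per_elem[d] for d in dtypes) / (4 * len(dtypes))


# Four made-up heads: two long-memory (decay ~0.999, ~0.995), two fast-decaying.
example_g = [-0.001, -0.5, -2.0, -0.005]
example_dtypes = split_heads_by_decay(example_g)
example_fraction = state_bytes_fraction(example_dtypes)
```

With half of the heads demoted, the state stream shrinks to 75% of the f32 bytes; how much of the bf16 speedup survives therefore depends on how many heads actually sit near exp(g) = 1, which is the open measurement.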
## 2. Dense CUDA-graph instability
The bf16 dense decode was **bimodal** across runs (287 / 336 / 487 / 498 t/s) - a dense-path
CUDA-graph capture/replay instability (good runs hit ~490). The f32 dense path measured stable
(371-376) but the bimodality is a latent fragility worth root-causing; a robust graph capture on
the dense path could stabilize and possibly lift dense decode. **Moderate promise**, diagnostic.
## 3. Dense rms_norm -> fp4 producer-fold (~1.5-2.5%, parked as flat-risk)
The last bit-exact bucket (`RMSNORM_FP4_FOLD.md`). Folding the standalone `quantize_mmq_nvfp4`
into the rms_norm+mul producer at the FFN boundary (f32 output dead -> droppable) could recover
~1.5-2.5% dense. Parked because: the Lever-2 precedent measured **flat**, it has the worst
gain/plumbing ratio (3-op `{RMS_NORM,MUL,MUL_MAT(NVFP4)}` graph fusion + a pre-quantized-src1
GEMM path + scratch-pool / CUDA-graph-lifetime plumbing), and the gain risks being swallowed by
the ~0.3-0.5% bench noise floor. Revisit only with the inter-node graph-CSE plumbing built and
proven on a same-build flag toggle (decode_agg lift above noise AND md5 == 0023). **LOW promise.**
## 4. Datacenter Blackwell (sm_100)
This effort targeted **consumer** Blackwell sm_12x (sm_120 RTX 50-series, sm_121 GB10). Datacenter
Blackwell (B100/B200/GB200, sm_100 / cc 10.0) has HBM3e (much higher BW) and different MMA
characteristics - the LPDDR5x bandwidth floor that dominates GB10 decode does **not** apply, so the
whole calculus changes (likely compute-bound, not BW-bound; the recurrence would not be the binding
kernel). A separate investigation if datacenter Blackwell becomes a target.
## 5. Prefill / TTFT scheduler + paged-pool burst degradation (HIGH priority - the weakest benchmark number)
The final benchmark (`QWEN36_NVFP4_BENCH.md`) exposed TTFT as the clear weak spot vs vLLM. Two distinct
issues:
- **Static decode-first budget tradeoff:** the QoS budget (patches 0013/0016, `LLAMA_MAX_BATCH_TOKENS=512`)
maximizes decode tok/s + memory but throttles burst-prefill, so under a synchronized 128-way burst TTFT
climbs to **903 s dense / 213 s MoE @npl128** vs vLLM's chunked-prefill 6-18 s. A dynamic/adaptive budget
(by concurrency + queue depth), or matching vLLM's chunked-prefill interleave, would rebalance.
- **Paged-pool burst-degradation BUG (concrete, found in the benchmark):** after a high-npl burst, a
server's *subsequent lower-npl* prefill collapses (fresh npl8 = 507 t/s / 6 s TTFT; npl8 after an npl64
burst = 65 t/s / 64 s). Decode stays robust; only prefill degrades -> root-cause the paged-pool state
that persists across the burst.
**HIGH promise** for the serving experience: decode (dense 90-117%, MoE 77-83% of vLLM) and memory (1.5-3x
lower) are already strong; TTFT is the one number holding back a clean public win.
## 6. MoE-specific recurrence tuning
The occupancy retune (0022) was tuned on the dense path; it lifted MoE +8.3% as a side effect. The
MoE path (`MUL_MAT_ID` grouped GEMM + the shared GDN recurrence, expert routing changes the GEMM
shapes) may have MoE-specific occupancy headroom. Worth a MoE-targeted reprofile.
---
*All shelved per the host handover - experiments parked. Pick up from the linked result docs in this
directory.*

@@ -1,208 +0,0 @@
# GDN decode verify: is llama.cpp's Gated-Delta-Net decode O(1) or an O(ctx) re-scan?
Verdict-first, then the evidence. This closes lever 5 of `VLLM_DECODE_GROUNDING.md` ("Verify
llama's GDN/linear-attention decode path"): on the Qwen3.6 hybrid models, is llama re-scanning the
context (O(ctx)) in the linear-attention layers, or keeping vLLM's O(1)-in-context recurrent state?
Method: GGUF-metadata + source reading on the `paged` dev tree (`~/llama-paged-dev`, build-cuda
sm_121) on `dgx.casa`, plus nsys CUDA-kernel decode traces on `~/bench/q36-27b-nvfp4.gguf`
(GB10 / DGX Spark, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`). Models:
`~/bench/q36-27b-nvfp4.gguf` (dense, arch `qwen35`), `~/bench/q36-35b-a3b-nvfp4.gguf`
(MoE, arch `qwen35moe`).
## TL;DR verdict
**llama.cpp's GDN decode is EFFICIENT: it is O(1)-in-context, a single fused CUDA kernel that
reads + updates a fixed-size cached recurrent state, structurally identical to vLLM's
`fused_recurrent_gated_delta_rule`. It is NOT a re-scan, NOT a context-scaling blowup, and NOT a
major contributor to the ~2.4x eager-decode gap.** There is no GDN-specific bottleneck to fix, so
the cheap model-specific lever this probe was hunting for does not exist. The 2.4x is the general
kernel work (the FP4 weight GEMM, which dominates the step, plus the O(ctx) full-attention decode
kernel in the minority of full-attention layers), exactly as `VLLM_DECODE_GROUNDING.md` concluded.
The decisive datum: at matched batch (npl4), pure decode, 4x more context, the GDN kernel time is
**flat** while the full-attention kernel grows ~3.1x:
| kernel | ctx 1024 | ctx 4096 | ratio | meaning |
|--------|---------:|---------:|------:|---------|
| `gated_delta_net_cuda` (GDN linear-attn) | 10.3 us/launch | 8.0 us/launch | **0.78x (flat)** | **O(1) in ctx** |
| `flash_attn_tile` (full-attn layers) | 27.1 us/launch | 85.0 us/launch | **3.1x** | O(ctx), as expected |
| total ms / decode step | 84.9 | 86.0 | 1.01x | GEMM-bound, ctx-independent |
Near-identical decode-step counts in both windows (~190 steps; 9141 vs 9134 GDN launches), so this is a
per-step like-for-like comparison: the GDN layers do **not** get more expensive as context grows.
## 1. Architecture (confirmed from GGUF metadata + tensor names)
Both Qwen3.6 models are hybrid: a `full_attention_interval` of 4 means every 4th layer is standard
full attention and the other 3/4 are Gated-Delta-Net (GDN) linear attention with a recurrent state.
**Dense Qwen3.6-27B (`general.architecture = qwen35`):**
- `block_count = 64`, `full_attention_interval = 4` -> **16 full-attention layers + 48 GDN layers**.
- Full-attn: `head_count = 24`, `head_count_kv = 4` (GQA), `key_length = value_length = 256`,
rope `freq_base = 1e7`, mrope sections `[11,11,10,0]`.
- GDN/SSM: `ssm.state_size = 128`, `ssm.conv_kernel = 4`, `ssm.group_count = 16`,
`ssm.time_step_rank = 48`, `ssm.inner_size = 6144`. So the recurrent state per GDN layer is
`[S_v=128, S_v=128, H_v=48]` per sequence (`H_v = inner_size/state_size = 6144/128 = 48` value
heads), i.e. a 128x128 state matrix per head, ~3.1 MB (F32) per sequence per layer.
**MoE Qwen3.6-35B-A3B (`general.architecture = qwen35moe`):**
- `block_count = 41`, `full_attention_interval = 4` (~10 full-attn + ~31 GDN layers).
- `head_count = 16`, `head_count_kv = 2`, `key_length = value_length = 256`,
`expert_count = 256`, `expert_used_count = 8`, `expert_feed_forward_length = 512`.
- Same SSM dims: `state_size = 128`, `conv_kernel = 4`, `group_count = 16`,
`inner_size = 4096` -> `H_v = 32` value heads.
**Tensor names confirm the op split (27B, per-layer dump):**
- GDN layers (e.g. `blk.0.*`): `ssm_alpha`, `ssm_beta`, `ssm_conv1d`, `ssm_a`, `ssm_dt.bias`,
`ssm_norm`, `ssm_out`, plus `attn_qkv` / `attn_gate` (the in/out projections of the linear-attn
block). No `attn_k/v/output`, no per-head q/k norm.
- Full-attn layers (e.g. `blk.3.*`, every 4th): `attn_q`, `attn_k`, `attn_v`, `attn_output`,
`attn_q_norm`, `attn_k_norm`. No `ssm_*`.
llama loads the GDN layers through the **recurrent memory** (`llama-memory-recurrent`), not the KV
cache: the conv state and the SSM state live in `conv_states_all` / `ssm_states_all` and are read
and written every step. Only the 16/10 full-attention layers use the (paged) KV cache. This is the
SSM-style recurrent path, not standard attention.
## 2. llama.cpp GDN decode implementation: O(1) recurrent-state update (code-proven)
Graph build (shared by both models): `src/models/delta-net-base.cpp`, dispatched from
`src/models/qwen35.cpp` and `src/models/qwen35moe.cpp` (the MoE class inherits
`llm_build_delta_net_base` and calls the same `build_recurrent_attn`, qwen35moe.cpp:472).
**Decode dispatch (`build_delta_net`, delta-net-base.cpp:425-447):** when `n_seq_tokens == 1`
(decode), it takes `build_delta_net_fused` if `cparams.fused_gdn_ar` (the default, see below), else
`build_delta_net_autoregressive`. Both are O(1):
- `build_delta_net_autoregressive` (delta-net-base.cpp:289-371) is the explicit rank-1 recurrence on
the fixed-size state `s` shaped `[S_v, S_v, H_v, n_seqs]`: `s *= exp(g)` (decay),
`sk = sum_rows(s * k)`, `d = (v - sk^T) * beta`, `s += k (x) d^T` (rank-1 update),
`o = sum_rows(s * q)`. **No loop over past tokens, no KV read** - it touches only the state and
the single new token's q/k/v/g/beta. `GGML_ASSERT(n_tokens == 1)`.
- `build_delta_net_fused` (delta-net-base.cpp:373-423) collapses the same recurrence into one op,
`ggml_gated_delta_net(q, k, v, g, b, s, K=1)`.
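The recurrence listed above can be written out as a small reference implementation for a single head and a single sequence. This is a sketch in plain Python, not the ggml graph or the CUDA kernel; the index convention (`state[i][j]` with `i` over the key dimension and `j` over the value dimension) and the function name are assumptions made for readability.

```python
import math


def gdn_decode_step(state, q, k, v, g, beta):
    """One gated-delta-rule decode step; updates `state` in place, returns the output.

    Cost depends only on the state size, never on how many tokens came before.
    """
    n_k, n_v = len(k), len(v)
    decay = math.exp(g)
    for i in range(n_k):                      # s *= exp(g)
        for j in range(n_v):
            state[i][j] *= decay
    # sk = sum_rows(s * k): what the state currently predicts for this key
    sk = [sum(state[i][j] * k[i] for i in range(n_k)) for j in range(n_v)]
    # d = (v - sk) * beta: the correction to write
    d = [(v[j] - sk[j]) * beta for j in range(n_v)]
    for i in range(n_k):                      # s += k (x) d^T, rank-1 update
        for j in range(n_v):
            state[i][j] += k[i] * d[j]
    # o = sum_rows(s * q): readout with the query
    return [sum(state[i][j] * q[i] for i in range(n_k)) for j in range(n_v)]


state = [[0.0, 0.0], [0.0, 0.0]]
o1 = gdn_decode_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[1.0, 2.0], g=0.0, beta=1.0)
o2 = gdn_decode_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 5.0],
                     g=math.log(0.5), beta=1.0)
```

In the usage above the second step reuses the same key with `beta = 1`, so the delta rule overwrites the stored value: the readout returns the new `v` regardless of the decay applied in between. The function takes no context-length argument, which is the property the section is establishing.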
**State is cached across steps, not rebuilt (`build_recurrent_attn`, delta-net-base.cpp:527-606):**
the input state `s` is read from `ssm_states_all` via `build_rs`, and the new state is copied back
with `ggml_cpy(new_state, view(ssm_states_all, ... kv_head ...))` (lines 555-558). The causal-conv
state is handled the same way in `build_conv_state` (449-525): the previous `conv_kernel-1 = 3`
samples are read from `conv_states_all`, the new token is appended, and the last 3 are written back.
So both pieces of GDN state persist in the recurrent cache exactly like a KV cache persists tokens -
this is the recurrent analogue, fixed size, independent of context length.
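The conv-state handling described above amounts to a fixed-length sliding window per channel. A minimal sketch, assuming one channel, an oldest-first tap ordering, and a hypothetical function name (the actual tap order and memory layout in `ggml_ssm_conv` are not restated here):

```python
def conv_decode_step(conv_state, x_new, weights):
    """Depthwise causal conv for one channel at decode.

    conv_state: the previous len(weights) - 1 samples, oldest first.
    Returns (output sample, updated conv_state of the same fixed length).
    """
    window = conv_state + [x_new]
    if len(window) != len(weights):
        raise ValueError("conv_state must hold kernel_width - 1 samples")
    y = sum(w * x for w, x in zip(weights, window))
    # Drop the oldest sample so the cached state stays at kernel_width - 1.
    return y, window[1:]


# conv_kernel = 4, so the cached state holds 3 samples.
y, new_state = conv_decode_step([1.0, 2.0, 3.0], 4.0, [1.0, 0.0, 0.0, 2.0])
```

The cached state has the same length before and after the step, so this piece is also constant in context length.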
**Defaults (`src/llama-context.cpp:200-201`):** `cparams.fused_gdn_ar = true` and
`fused_gdn_ch = true`. They are only auto-disabled if the fused op cannot be scheduled on the same
device as the layer (`device_gdn != device_kv`, lines 540-595); on a single GB10 with `-ngl 99`
that does not happen, so the **fused single-kernel path is what runs**.
**The CUDA kernel (`ggml/src/ggml-cuda/gated_delta_net.cu`) is the crux, and it is unambiguously
O(1) in context:**
- Launch grid `dim3(H, n_seqs, ceil(S_v/4))` and block `(min(warp,S_v), 4, 1)` (lines 184-185):
the grid spans heads x sequences x state-columns. **There is no context-length dimension and no
context-length argument anywhere in the kernel signature** (q/k/v/g/beta are the new token(s)
`[S_v, H, n_tokens, n_seqs]`; `curr_state` is the fixed `[S_v, S_v, H, n_seqs]`).
- Each warp loads its shard of the fixed-size state into registers **once** (lines 57-61), then
loops `for (t = 0; t < n_tokens; t++)` (line 63). At decode `n_tokens == 1`, so it is a single
iteration: read the one new token, do the rank-1 update
`s_shard[r] = g * s_shard[r] + k[i] * delta_col` and the readout `attn = S^T q` (lines 84-141),
then write the updated state back (lines 161-167). No second loop, no read of any past KV.
- Work per decode step is therefore proportional to `S_v * S_v * H * n_seqs` (the state size x
batch) and **constant in context length**. This is precisely vLLM's
`fused_recurrent_gated_delta_rule_packed_decode_kernel` (one batched launch updating a
fixed-size `[K,V]` state) cited in the grounding doc.
A chunked GPU kernel for prefill is a TODO (delta-net-base.cpp:181 `//TODO: Add chunked kernel`);
the chunked CPU/graph path (`build_delta_net_chunking`) only runs for multi-token ubatches
(prefill), never at decode.
## 3. nsys decode profiling: GDN is a small share and does not scale with context
Qwen3.6-27B NVFP4, sm_121, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`, `llama-server` driven
to steady decode by a looping completion client. Kernel time bucketed by name (full classifier and
sqlites under `~/bench/gdn_study/`).
**(a) Share at the headline batch (npl128, ctx 1024), GPU 92.7% busy:**
| bucket | % of busy | us/launch |
|--------|----------:|----------:|
| GEMM_weight (`mul_mat_q`/`mul_mat_vec_q`) | 59.2 | - |
| **GDN_recurrent (`gated_delta_net_cuda`)** | **8.9** | 369 |
| GEMM_act_quant (`quantize_mmq_nvfp4`) | 8.2 | - |
| elementwise / act_glu / norm / rope | ~13.5 | - |
| embed_gather (`get_rows`) | 2.9 | - |
| **ATTENTION_full (`flash_attn`, 16 layers)** | **1.8** | 107 |
| copy_cast (`cpy`) | 1.8 | - |
| **GDN_conv (`ssm_conv`)** | **1.5** | - |
The whole GDN path (recurrent 8.9% + conv 1.5%) is ~10% of the step; full attention is ~2%; the
**weight GEMM dominates at ~67% (59.2% GEMM + 8.2% act-quant requant)**. This is the dense model,
where the grounding predicted the GEMM would be the lever.
**(b) Share at low batch (npl32, ctx 1024), weight-bandwidth (GEMV) regime, GPU ~100%:**
GEMM_weight 88.7%, GDN_recurrent 0.8%, ATTENTION_full 0.7%, GDN_conv 0.3%. At low batch the
weight-read GEMV swamps everything and GDN is negligible; the GDN share tracks the batch, not the
context.
**(c) Context-scaling control (the decisive test): matched batch npl4, pure decode, ctx 1024 vs
4096.** Small batch -> fast prefill -> a clean pure-decode capture (verified: GEMM is the M=1
`mul_mat_vec_q` decode GEMV, and the client completed decode rounds inside the window). Identical
decode-step counts (~190 steps, gated_delta_net launched 9141 vs 9134 times), so per-launch time is
a true per-step comparison:
| kernel / bucket | ctx 1024 | ctx 4096 | ratio |
|-----------------|---------:|---------:|------:|
| `gated_delta_net_cuda` us/launch | 10.3 | **8.0** | **0.78x (flat)** |
| GDN_recurrent share | 0.6% | 0.4% | flat/down |
| `ssm_conv` (GDN_conv) us/launch | 5.2 | 5.2 | 1.00x |
| `flash_attn_tile` us/launch | 27.1 | **85.0** | **3.14x** |
| ATTENTION_full share | 0.6% | 1.8% | 3.0x up |
| total ms / decode step | 84.9 | 86.0 | 1.01x |
The GDN kernel time is flat (even a hair faster) across a 4x context increase, while the
full-attention kernel grows ~3x, exactly the O(1)-vs-O(ctx) signature. The total step time barely
moves because at this batch the (context-independent) FP4 weight GEMM is 88% of the step. This is
the empirical confirmation of the code analysis: **llama's GDN decode does not re-scan the context.**
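The O(1)-vs-O(ctx) reading of the table can be expressed as a log-log slope: the exponent `p` in `time ~ ctx^p` is about 0 for a context-independent kernel and about 1 for a linear one. The sketch below uses the per-launch times from the table; the 0.5 cut-off used to label a kernel is an arbitrary illustrative choice.

```python
import math

CTX_SHORT, CTX_LONG = 1024, 4096

# Per-launch kernel times in microseconds, copied from the table above.
us_per_launch = {
    "gated_delta_net_cuda": (10.3, 8.0),
    "ssm_conv": (5.2, 5.2),
    "flash_attn_tile": (27.1, 85.0),
}


def scaling(t_short, t_long):
    """Return (time ratio, fitted exponent p in time ~ ctx**p)."""
    ratio = t_long / t_short
    exponent = math.log(ratio) / math.log(CTX_LONG / CTX_SHORT)
    return ratio, exponent


results = {name: scaling(*times) for name, times in us_per_launch.items()}

for name, (ratio, exponent) in results.items():
    label = "O(ctx)-like" if exponent > 0.5 else "O(1)-like"
    print(f"{name}: {ratio:.2f}x over 4x ctx, exponent {exponent:+.2f} -> {label}")
```

Two context lengths give only a two-point fit, so the exponent is an indication rather than a measurement of the asymptotic behaviour.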
(An earlier npl32 ctx4096 attempt was discarded: with 32 parallel slots each independently
prefilling ~4100 tokens, the nsys window caught prefill, not steady decode - the `mul_mat_q(M=128)`
+ `flash_attn_ext_f16(ctx4096)` signature gave it away. The npl4 runs above avoid this by keeping
prefill short.)
## 4. Verdict and fix scope
**Efficient, not a bottleneck.** llama.cpp runs the Qwen3.6 GDN/linear-attention layers as a fused,
single-CUDA-kernel, O(1)-in-context recurrent-state update, with the conv and SSM state cached in
the recurrent memory across decode steps. It is algorithmically the same as vLLM's O(1)
`fused_recurrent` decode. The probe's worst case (llama re-scanning context => GDN layers ballooning
with context and concurrency) is **falsified**: the GDN kernel is flat across 4x context, and the
op carries no context-length parameter at all.
**So the GDN path is not the cheap model-specific lever.** It is a small-to-moderate, context-flat
share of the step (~0.4-0.8% at low batch, ~10% including conv at batch 128), and removing it would
not dent the 2.4x. The gap is the general kernel work, confirming `VLLM_DECODE_GROUNDING.md`:
1. the **FP4 weight GEMM** is the dominant bucket (~59% GEMM + ~8% `quantize_mmq_nvfp4` requant that
vLLM fuses away via native FP4-MMA / grouped Marlin); this is the biggest, hardest lever.
2. the **full-attention decode kernel** is the O(ctx) residual (the only thing that grows with
context, ~3x per-launch over 4x ctx), in the minority of full-attention layers.
If anything on the GDN side is ever worth touching, it is a bounded micro-optimization, not a
complexity fix: the kernel is memory-bound on the F32 recurrent state (state read+write is
`S_v^2 * H * batch` = ~0.79 GB/step over 273 GB/s at batch 128, hence the ~8.9% share), and this
traffic is **intrinsic to the architecture - vLLM pays the identical state I/O**, so it is not a
llama-specific inefficiency. A future win could keep the recurrent state in bf16 or fuse the
`ssm_conv` + gated-norm into the delta-net kernel to shave that ~10%, but the ceiling is small and
it does not close the 2.4x. The throughput effort stays where the grounding put it: the FP4 GEMM
(fused act-quant + native FP4-MMA) and the full-attention decode kernel, with a CUDA-graphed
steady-state step as the bounded host-side add-on.
## Reproduce
- Metadata: `python3 gguf-py/gguf/scripts/gguf_dump.py --no-tensors ~/bench/q36-27b-nvfp4.gguf`.
- Code: `src/models/delta-net-base.cpp` (build_delta_net 425, autoregressive 289, fused 373,
build_recurrent_attn 527, build_conv_state 449); `src/llama-context.cpp:200-201,540-595`
(fused_gdn defaults/guard); `ggml/src/ggml-cuda/gated_delta_net.cu` (kernel 4-168, launch grid
184-185, dispatch 226-312).
- Profiles: `~/bench/gdn_study/drv.sh <label> <P> <K> <ctx> <delay> <dur>` runs `llama-server` under
nsys and drives `clientloop.py`; `catgdn.py <sqlite>` buckets kernels. Sqlites:
`gdn_npl128_ctx1024`, `gdn_npl32_ctx1024`, `gdn_npl4_ctx1024`, `gdn_npl4_ctx4096`.

@@ -1,344 +0,0 @@
# GDN recurrence byte gate + fused single-pass kernel design
Label: llama-fused-recurrence-design (READ-ONLY, no GPU). Source-and-math design only;
the byte-ratio measurement itself is produced by the `ncu-byte-gate` agent.
## TL;DR (the correction the workflow was set up to settle)
**The recurrence kernel is ALREADY single-pass on the f32 state.** `gated_delta_net_cuda<128>`
(after patches 0018 in-place write + 0019 fused gather) loads the whole `s0` column into registers
ONCE (`s_shard[rows_per_lane]`), runs the entire token loop in registers, and writes the new state
back ONCE - directly into the persistent cache slot (0018) or scratch. For decode `n_tokens==1`,
`keep_rs_t==false`: one register load, one register store, no re-read of state from DRAM.
The byte-gate's working hypothesis - "un-fused l2norm/gate/decay/recurrence/state-writeback/gather
each touching the f32 state, so a fused pass halves DRAM bytes" - is **false for the state**. Only
the recurrence kernel touches the 3 MB/seq state. The surrounding ops (`l2_norm`, `silu`, `sigmoid`,
the `gate` exp/softplus, `ssm_conv`, `concat`, `cpy`) all operate on the **small activations**
(q/k/v/g/beta), which are 100-800x smaller than the state. There is no 2x state re-streaming to
recover; the recurrence kernel is byte-minimal on state by construction.
Therefore a fused single-pass kernel **cannot move the dominant 196 ms recurrence** - that cost is
f32-state read+write bandwidth, already a single pass. The two real levers are decoupled:
1. **Fold the surrounding activation ops into the kernel** (MEDIUM effort): recovers the small
per-op buckets (`ssm_conv` 1.5% + `silu`/`sigmoid` 1.4% + 2x `l2_norm` + `concat` 2.1% + conv
`cpy` 2.0%, ~6-8% of the step) plus per-op launch overhead. Bit-exact. Ceiling ~93-96% of vLLM.
2. **bf16 state cache** (HIGH effort, NON-bit-exact): halves the dominant byte stream. The only
large lever on the 196 ms. Target KL < 1e-3 by keeping f32 register accumulation, storing only
the persisted cache in bf16.
Which of (1)/(2) is worth building hinges on the `ncu-byte-gate` byte ratio (below).
## Byte arithmetic (dense q36-27b-nvfp4, decode, npl128, S_v=128, H_v=48, batch=128)
State per (seq, GDN layer) = S_v^2 * H_v = 128*128*48 = 786,432 f32 = **3.0 MiB**.
Per kernel call (one GDN layer, full 128-seq batch), single pass:
- state read = 786,432 * 128 * 4 = 402.65 MB
- state write = 402.65 MB
- **state R+W = 805.3 MB/call** (768 MiB)
- activations (q,k 1 MB each; v 3 MB; attn-out 3 MB; g,beta tiny) ~= 8 MB/call = **<1%**.
Measured 4.08 ms/call (node-level trace) -> effective **197.4 GB/s**.
GB10 / DGX Spark LPDDR5X peak ~= **273 GB/s** -> **~72% of peak.**
48 GDN layers/step -> 38.7 GB of state traffic/step -> 196 ms = 51.6% of the 383.48 ms step. The ~8 MB
of activation traffic is noise; state is 99% of the recurrence bytes.
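The byte arithmetic above is easy to re-check or to re-run for a different batch or state dtype. A minimal sketch using the dimensions and the 4.08 ms/call measurement quoted in this section; the variable names are illustrative, and the bf16 line simply halves the element size, which is an idealization.

```python
S_V = 128            # state rows/cols per head
H_V = 48             # value heads (dense q36-27b)
BATCH = 128          # npl128
GDN_LAYERS = 48
BYTES_F32 = 4
PEAK_GBPS = 273.0    # GB10 LPDDR5X peak
MS_PER_CALL = 4.08   # measured, node-level trace

state_elems = S_V * S_V * H_V                              # per sequence, per layer
state_read_bytes = state_elems * BATCH * BYTES_F32         # one pass over the state
state_rw_bytes = 2 * state_read_bytes                      # read + write, per call

effective_gbps = state_rw_bytes / (MS_PER_CALL * 1e-3) / 1e9
fraction_of_peak = effective_gbps / PEAK_GBPS
state_gb_per_step = GDN_LAYERS * state_rw_bytes / 1e9

# Idealized bf16 state cache: same element count, half the bytes.
state_rw_bytes_bf16 = state_rw_bytes // 2

print(f"state R+W per call: {state_rw_bytes / 1e6:.1f} MB "
      f"({state_rw_bytes / 2**20:.0f} MiB)")
print(f"effective BW: {effective_gbps:.1f} GB/s = {fraction_of_peak:.0%} of peak")
print(f"state traffic per step: {state_gb_per_step:.1f} GB")
```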
### What this means for the open question
- The recurrence is single-pass, coalesced (transposed layout: lane reads `state[col*S_v + i]`,
consecutive lanes -> consecutive `i`), running at ~72% of peak BW. It is NOT at the 85% hardware
floor, but it is NOT re-streaming state either. The 72->85% headroom (~30 ms, bit-exact) is an
occupancy/coalescing tune, NOT a fusion win.
- vLLM `fused_recurrent_gated_delta_rule` does the SAME single-pass recurrence. If vLLM's recurrent
state cache is bf16 (model dtype) while llama's is f32, vLLM moves HALF the bytes on the dominant
stream - that alone is ~98 ms, i.e. essentially the whole residual decode gap. **This is the
single most decision-relevant number for the `ncu-byte-gate` agent to confirm: the dtype/bytes of
vLLM's GDN state cache vs llama's f32, plus llama's measured achieved-BW % on the recurrence
kernel.** If vLLM is bf16-state -> build (2). If vLLM is also f32-state and at ~85% -> llama is
at the floor, only (1) + coalescing remain and bit-exact parity tops out ~95%.
## The fused single-pass kernel design
Two deliverables, layered. Build (1) first (bit-exact, de-risks the graph), gate (2) on the byte
verdict.
### (1) `ggml_gated_delta_net_decode_fused` - fold the activation ops into the kernel
Folds the pre-recurrence activation ops and the post-recurrence gated RMSNorm into the existing
single-pass recurrence kernel, so q/k/v/g are produced and consumed in registers/shared and never
make a separate DRAM round-trip, and the per-op launches collapse to one.
Current decode op chain in `build_layer_attn_linear` (qwen35.cpp 386-461), per GDN layer:
```
wqkv GEMM -> qkv_mixed (keep: GEMM, separate)
wqkv_gate GEMM -> z (keep: GEMM, separate)
ssm_beta GEMM -> beta -> sigmoid [FOLD beta sigmoid]
ssm_alpha GEMM -> alpha -> +ssm_dt -> softplus -> *ssm_a (gate) [FOLD softplus/mul -> per-head g]
build_conv_state: reshape, transpose qkv, CONCAT, cpy [concat/cpy -> conv-state plumbing, see note]
ggml_ssm_conv(conv_input, conv_kernel) [FOLD depthwise conv, K=4]
ggml_silu(conv_output) [FOLD silu]
views q_conv/k_conv/v_conv
ggml_l2_norm(q_conv); ggml_l2_norm(k_conv) [FOLD 2x l2norm]
[repeat_4d skipped on fused path]
ggml_gated_delta_net_inplace_ids(...) <-- THE recurrence kernel (196 ms)
build_norm_gated(output, ssm_norm, z): RMSNorm + silu(z) + mul [FOLD post gated-RMSNorm]
ssm_out GEMM (keep: GEMM, separate)
```
Fold list (what moves INTO the kernel):
- `beta` sigmoid: scalar per (head,seq); apply in-kernel when reading beta.
- `gate` g = softplus(alpha+dt)*a (GDA, g->ne0==1): scalar per (head,seq); compute/exp in-kernel.
The kernel already does `expf(*g_t)` (non-KDA path, line 85) - so feed RAW `alpha`+`dt` and the
`a` scale and do softplus+mul+exp in-kernel; removes the `add`/`softplus`/`mul` launches.
- `ssm_conv` (depthwise causal conv1d, kernel width 4) + `silu`: per channel a length-4 dot of the
conv state with `ssm_conv1d` then silu. This is the prologue: each warp/thread, before loading
state, computes its q/k/v channel by reading 3 cached conv-state taps + the current qkv_mixed
token, dotting the 4-wide kernel, applying silu. The conv state (conv_kernel-1=3 taps x conv_dim)
is tiny and already cached; fold its read here and its 1-token shift write into the epilogue
(replaces the `concat`+`cpy` conv-state update).
- `l2_norm` of q and k: a warp reduction over S_v of the per-head q/k vector. The recurrence kernel
already does warp reductions over S_v (the kv/attn dot products) - the l2norm reuses the same
warp-reduce primitive on q_reg/k_reg right after they are loaded, before the recurrence math.
- Post: `build_norm_gated` = RMSNorm(output, ssm_norm) * silu(z). The kernel already holds the
attn output `attn_col` per (head,seq,col) in registers at the end; fold an S_v warp-reduce RMS,
multiply by `ssm_norm` weight and by `silu(z)` (z read once), and write the final gated output -
removing the `rms_norm`+`silu`+`mul` launches and one activation round-trip.
State traffic UNCHANGED (still one read + one write). Activation traffic for conv/silu/l2norm/norm
collapses into the kernel's register/shared path; ~6 separate launches become 0. Expected recovery:
the ~6-8% surrounding-op buckets + launch overhead. **Bit-exact** if the numeric ordering is held
(see Numeric notes). Conservative ceiling ~365-375 tok/s dense (~93-96% of vLLM 391).
Data flow (per (h_idx=head, sequence=seq) block, decode n_tokens=1, S_v=128, num_warps=4):
1. PDL sync.
2. Prologue (per channel/lane): read 3 conv-state taps + current `qkv_mixed[t]` for this channel,
dot with `ssm_conv1d[0..3]`, add conv bias if any, `silu`. Produces this lane's q/k/v element.
3. l2norm q,k: warp-reduce sum(q^2), sum(k^2) over the S_v dim; scale q_reg,k_reg by rsqrt(.+eps).
4. Load `s0` column into `s_shard` (UNCHANGED single read).
5. Recurrence (UNCHANGED math: g-decay, kv = S^T k, delta = (v - g*kv)*beta, S = g*S + k(x)delta,
attn = S^T q * scale).
6. Write `s_shard` back to cache slot ONCE (UNCHANGED single write). Write the 1-token-shifted conv
state back to the conv cache (replaces concat+cpy).
7. Epilogue gated-RMSNorm: warp-reduce sum(attn^2) over S_v -> RMS; multiply by `ssm_norm[col]` and
by `silu(z[col])` (z loaded once); write final output element. ssm_out GEMM stays separate.
Inputs added to the op: `ssm_conv1d` weight, `ssm_norm` weight, `z`, conv-state cache view, raw
`alpha`/`dt`/`a`, eps. This is a wider op signature (src[8..]) - acceptable; gate it behind a new
`cparams.fused_gdn_decode` resolved exactly like `auto_fgdn` (graph_reserve + device-match probe,
llama-context.cpp 518-595) so it silently falls back to the current op chain if any device lacks it.
### (2) bf16 recurrent-state cache - the dominant-term lever (NON-bit-exact)
Only build if `ncu-byte-gate` shows vLLM moves fewer state bytes (bf16) OR llama's f32 recurrence is
already >=85% of peak (then f32 is at the floor and bf16 is the only way down).
- Store `ssm_states_all` (the recurrent-state cache) as bf16. Halves the dominant 805 MB/call -> at
the same ~197 GB/s -> ~2.04 ms/call -> ~98 ms/step saved (196 -> ~98). Dense projected
335 -> ~440+ tok/s (>= vLLM 391) if BW-bound holds; smaller dtype usually achieves a HIGHER % of
peak, so likely better.
- Kernel change: read state -> convert bf16->f32 into `s_shard` (registers stay f32); all recurrence
arithmetic in f32 (UNCHANGED); on write, convert f32->bf16. Accumulation precision is preserved
within a step; only the PERSISTED state is rounded to bf16 each step.
- Numerics: the recurrent state decays geometrically (g<1), so per-step bf16 rounding does not
accumulate unboundedly, but it is NOT bit-exact. Validate KL < 1e-3 vs the f32-state build over a
256-token greedy run; if KL fails, fall back to f32 state (keep it a cparams toggle). This is the
ONLY path to parity-or-better on the dominant term, and it is near-bit-exact at best: bit-EXACT
parity on the 196 ms is unreachable because the f32 state bytes are irreducible (single pass already).
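A minimal host-side sketch of the load/compute/store pattern described above: widen bf16 to f32 on load, keep the arithmetic in f32, narrow only the persisted value. The helper names and the scalar `g * s + update` stand-in are hypothetical. The real kernel would use the CUDA bf16 intrinsics and the full recurrence.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// f32 -> bf16 bits with round-to-nearest-even; NaN is kept as a quiet NaN.
inline std::uint16_t f32_to_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffffu) > 0x7f800000u) {            // NaN
        return static_cast<std::uint16_t>((u >> 16) | 0x0040u);
    }
    u += 0x7fffu + ((u >> 16) & 1u);                  // round to nearest even
    return static_cast<std::uint16_t>(u >> 16);
}

// bf16 bits -> f32: the bf16 value is the top 16 bits of the f32 pattern.
inline float bf16_to_f32(std::uint16_t h) {
    const std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// One decode step for a single state element (hypothetical scalar stand-in
// for the recurrence): only the persisted value is rounded to bf16.
inline std::uint16_t step_state_elem(std::uint16_t persisted, float g, float update) {
    const float s     = bf16_to_f32(persisted); // widen on load
    const float s_new = g * s + update;         // arithmetic stays f32
    return f32_to_bf16(s_new);                  // narrow on store
}
```

Each store rounds with a relative error of at most 2^-8, which is the per-step perturbation the KL gate has to bound.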
## Numeric / bit-exactness notes (for fold (1))
- l2norm/RMS use f32 warp-reduce accumulation (matches `ggml_l2_norm`/`ggml_rms_norm` f32 sum).
Order of summation across lanes differs from the standalone op's sequential sum -> floating
reassociation. To stay bit-exact, replicate the standalone op's reduction order, OR accept a
tiny reassociation delta and gate on KL<1e-3 (the workflow's near-bit-exact target). Recommend:
ship fold (1) behind the cparams probe and assert greedy md5 match vs the current chain (0019
already established the harness: dense text md5, MoE byte-identical).
- Recurrence math, scale, g-exp order, beta apply: keep EXACTLY as in `gated_delta_net_cuda` /
`ops.cpp` reference (lines 84-141 .cu, 10685-10730 ops.cpp). Do not reorder the
v - g*kv -> *beta -> S update -> S^T q sequence.
- conv: depthwise dot of width-4 kernel in f32, then silu - identical to `ggml_ssm_conv`+`ggml_silu`
if done in the same order.
- gate softplus: `softplus(x)=log1p(exp(x))`; match ggml's `ggml_softplus` (has the >20 fast path)
to stay bit-exact.
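The reassociation concern in the first note can be seen on a four-element toy input. This is a sketch only: `sum_tree` is a stand-in for a strided warp reduction, not the kernel's actual shuffle pattern, and the input is chosen to make the difference visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Left-to-right f32 sum, like a single-threaded loop.
inline float sum_sequential(const std::vector<float> & v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Tree sum with halving strides; the element count must be a power of two.
inline float sum_tree(std::vector<float> v) {
    assert(!v.empty() && (v.size() & (v.size() - 1)) == 0);
    for (std::size_t offset = v.size() / 2; offset > 0; offset /= 2) {
        for (std::size_t i = 0; i < offset; ++i) {
            v[i] += v[i + offset];
        }
    }
    return v[0];
}
```

On `{1e8f, 1.0f, -1e8f, 1.0f}` the sequential sum absorbs the first `1.0f` into `1e8f` and returns 1, while the tree sum cancels the large terms first and returns 2. Same inputs, different order, different bits. That is why an md5 gate only holds if the fused reduction replicates the standalone op's order.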
## Implementation scope
- (1) `.cu`: extend `gated_delta_net_cuda` with a decode-fused template specialization (or a new
kernel) that does conv+silu prologue, q/k l2norm, recurrence, conv-state shift write, gated-RMSNorm
epilogue. Add `ggml_cuda_op` dispatch. CPU mirror in `ops.cpp` for parity/CI.
- (1) `ggml.h`/`ggml.c`: new builder `ggml_gated_delta_net_decode_fused` (extra src: ssm_conv1d,
ssm_norm, z, conv-cache view, alpha/dt/a, eps + op_params for eps).
- (1) graph edits: `delta-net-base.cpp build_recurrent_attn` (add the decode-fused branch alongside
the existing fused/ids branch); `qwen35.cpp` + `qwen35moe.cpp` `build_layer_attn_linear` (route
the pre/post ops into the op when `cparams.fused_gdn_decode`); leave `qwen3next.cpp`,
`kimi-linear.cpp`, the non-fused and rollback (n_rs_seq>0) paths unchanged.
- (1) `llama-context.cpp`: `auto_fgdn`-style device-match probe to enable/disable the decode-fused
op (silent fallback). `cparams.h`/`cparams.fused_gdn_decode`.
- (2) bf16 state: cache dtype change in the recurrent-memory allocation + the kernel load/store
convert + a `cparams` toggle + KL gate. Touches `gated_delta_net.cu` load/store, the inplace/ids
builders' state asserts, and the recurrent cache type.
## Risk register
- (1) is MEDIUM effort, bit-exact-targetable, but bounded upside (~6-8% + launches; ceiling ~95% of
vLLM). Worth it only if the workflow wants >90% and accepts no bf16.
- (2) is the only large lever on the dominant 196 ms but is NON-bit-exact (KL-gated). If vLLM is
f32-state, (2) takes llama BELOW vLLM's precision, not toward parity - a product call, not a perf
call.
- The widened op signature (many srcs) raises maintenance cost and the device-match probe matters
(CPU offload of a GDN layer must fall back cleanly).
- Do NOT expect a fused recurrence to cut the 196 ms: it is already one read + one write of f32
state. Re-confirm with the `ncu-byte-gate` achieved-BW number before committing HIGH effort.
---
# MEASUREMENT + VERDICT (label ncu-byte-gate, THE GPU agent) - GATE SETTLED
The design above predicted the answer; this is the decisive measurement that confirms it.
## VERDICT: NO-BUILD the fused single-pass recurrence. BUILD bf16 SSM state (design's lever (2)).
Deciding number: **llama re-stream factor = ~1.0x** (mathematically capped at <=1.33x; >=1.5x is
physically impossible). llama's recurrence kernel is ALREADY single-pass, coalesced, and at
**74% of GB10 peak BW** - MORE bandwidth-efficient than vLLM's fused triton kernel (41% of peak).
The whole 2x DRAM gap vs vLLM is **f32 (llama) vs bf16 (vLLM) state-cache width**, not re-streaming.
## ncu HW counters were BLOCKED; timing + geometry gave the byte ratio anyway
- `ncu dram__bytes` and `nsys --gpu-metrics-devices` both return `ERR_NVGPUCTRPERM`
(`NVreg_RestrictProfilingToAdminUsers` restricted, root-only; no passwordless sudo on dgx.casa).
DRAM byte counters are unobtainable on this box.
- Decisive fallback (no perf counters): CUPTI kernel TIMING (allowed) + EXACT byte geometry from
the kernel source. bytes_moved <= peak_BW x duration gives a HARD CAP on the re-stream factor;
comparing implied effective BW between llama and vLLM (same model, same B, both eager) settles it.
## Measured (clean nsys CUDA timing; build-cuda-base df1cc97 Lever-1; both B=128, both with CUDA graphs OFF, i.e. eager)
llama: `llama-batched-bench -npp 8 -ntg 12 -npl 128 -ub 2048`, GGML_CUDA_DISABLE_GRAPHS=1.
vLLM: postssm_decomp/vllm_decode.sqlite, NSEQ=128, enforce_eager=True (apples-to-apples).
| kernel | state dtype | bytes R+W/call | duration/call (steady) | eff. BW | % of 273 peak | re-stream |
|---|---|---|---|---|---|---|
| llama gated_delta_net_cuda | f32 | 805.3 MB | **3.98 ms** (min 3.90 max 4.33, grid 48x128x32) | 202 GB/s | **74%** | ~1.0x |
| vLLM fused_recurrent...packed_decode | bf16 | 402.6 MB | **3.62 ms** (min 3.53 max 3.96, grid 4x6144x1) | 111 GB/s | **41%** | ~1.0x |
- llama recurrence/step = 3.98 x 48 = **191 ms** (50% of 384 ms step; matches STATE 196 ms).
- vLLM recurrence/step = 3.62 x 48 = **174 ms**. Per-call gap llama-vs-vLLM is only +10%, NOT 2.8x.
The old "1.47 ms near-vLLM" was prefill-contaminated; clean decode is 3.98 ms (confirms STATE).
- Both kernels verified SINGLE-PASS in source (llama: s_shard load-once/store-once, 128 consecutive
f32/warp = coalesced; vLLM packed_decode: `b_h += load(p_h0).to(f32)` once, `store(p_ht, b_h.to(bf16))`
once). vLLM cache dtype = state_dtype = model_dtype = bf16 (`_mamba_state_dtype` default "auto" ->
model dtype; config.json dtype=bfloat16). Geometry identical (H=48, k/v head_dim 128, S_v 128).
## Why re-stream ~1.0x (the gate number)
Most bytes a 3.98 ms call could move at 273 GB/s peak = 1.087 GB = **1.33x the 816 MB minimal**.
1.5x/2x re-stream would need >peak BW -> impossible. Source proves single-pass+coalesced -> 1.0x end:
~816 MB at 202 GB/s = 74% peak. A fused single-pass rewrite recovers ~0 state bytes => NO-BUILD.
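The cap argument is a single division. A sketch with a hypothetical helper name, using decimal units so that GB/s x ms = MB:

```cpp
#include <cassert>

// Upper bound on how many times the minimal byte footprint could have been
// streamed during one call, from bytes_moved <= peak_BW * duration.
inline double restream_cap(double peak_gbps, double duration_ms, double minimal_mb) {
    assert(peak_gbps > 0.0 && duration_ms > 0.0 && minimal_mb > 0.0);
    const double max_mb = peak_gbps * duration_ms; // GB/s * ms = MB (decimal)
    return max_mb / minimal_mb;
}
```

`restream_cap(273.0, 3.98, 816.0)` comes out just above 1.33, so a 1.5x or 2x re-stream would need more than peak bandwidth.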
## The lever: bf16 SSM state (design (2)) - confirmed, large, parity-to-ahead
2x recurrence bytes vs vLLM = 100% f32-vs-bf16 cache. llama's kernel is the more efficient one
(74% vs 41% peak), so bf16 state (cache + load/store bf16, f32 register compute, exactly as vLLM):
- 805.3 -> ~413 MB => at 74% peak ~2.0 ms/call => 191 -> ~96 ms/step, save ~95 ms => step ~289 ms
(~443 tok/s, AHEAD of vLLM's 327 ms / 391 tok/s). Conservative (50% peak on smaller footprint): ~3.0 ms/call =>
save ~45 ms => step ~339 ms = vLLM parity. Range = parity-to-ahead.
- NON-bit-exact vs llama's f32 reference, but EQUAL precision to vLLM (which is bf16). Gate on
PPL/KL vs the f32 build, not md5. "Bit-exact parity with vLLM" was never on the table - vLLM is bf16.
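The projection above can be recomputed from the measured inputs. This is a sketch under stated assumptions: the struct and function names are hypothetical, only the recurrence term changes, every other part of the step is held constant, and throughput is taken as one token per sequence per step.

```cpp
#include <cassert>

struct Projection {
    double ms_per_call; // projected recurrence kernel time per call
    double step_ms;     // projected full decode step
    double tok_per_s;   // aggregate decode throughput
};

// Project the step time when the recurrence moves `bytes_mb` per call at
// `frac_of_peak` of `peak_gbps`. MB / (GB/s) = ms in decimal units.
inline Projection project_state_width(double bytes_mb, double peak_gbps, double frac_of_peak,
                                      double old_ms_per_call, double old_step_ms,
                                      int n_layers, int n_seqs) {
    assert(bytes_mb > 0.0 && peak_gbps > 0.0 && frac_of_peak > 0.0);
    assert(old_step_ms > 0.0 && n_layers > 0 && n_seqs > 0);
    const double ms_per_call = bytes_mb / (peak_gbps * frac_of_peak);
    const double saved_ms    = (old_ms_per_call - ms_per_call) * n_layers;
    const double step_ms     = old_step_ms - saved_ms;
    const double tok_per_s   = n_seqs / (step_ms * 1e-3);
    return {ms_per_call, step_ms, tok_per_s};
}
```

With 413 MB/call, 3.98 ms/call, a 384 ms step, 48 layers and 128 sequences, the 74% case lands near 291 ms and the 50% case near 338 ms. Both are within a few ms of the rounded figures above.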
## Conv-path (no-regret conv-fusion lever sizing), llama steady decode, per call x48
concat_cont 169.6 us (8.14 ms/step) + cpy_scalar 120.1 us (5.76) + ssm_conv_f32 115.9 us (5.56)
= ~19.5 ms/step (~5%). Conv STATE ~12.6 MB (tiny) -> this is LAUNCH/small-kernel overhead, not bytes
-> a FUSION lever (design (1)), secondary to bf16 state. l2_norm 6.8 us, gdn_gather 1.21 us (no-op,
identity seqs -> confirms gather does NOT re-stream state at steady decode).
## One-line answer
llama: 805 MB/call, 74% peak, re-stream ~1.0x (<=1.33x). vLLM: 402 MB/call (bf16), 41% peak.
conv-path: ~12.6 MB (launch-bound ~19.5 ms/step, not byte-bound).
=> NO-BUILD fused recurrence (already single-pass, more efficient than vLLM); BUILD bf16 state
(halves the dominant 805 MB, ~45-95 ms/step, parity-to-ahead). Deciding number: re-stream ~1.0x.
---
# FINAL DECISION (synthesis of all four agents) - the five points
This closes the workflow. Inputs: `ncu-byte-gate` (measured byte ratio), `vllm-fused-recurrence-study`
(vLLM's single-pass boundary), `llama-fused-recurrence-design` (the fold/levers), `conv-fusion-design`
(the no-regret conv in-place lever). They agree on every number; the decision is unambiguous.
## (1) Byte-ratio verdict - the decisive number
**llama is at the hardware bandwidth floor, NOT re-streaming.** Re-stream factor = **~1.0x**, hard
capped at **<=1.33x** (the most bytes a 3.98 ms call can move at 273 GB/s peak is 1.087 GB = 1.33x
the 816 MB minimal; >=1.5x is physically impossible). The recurrence kernel runs at **74% of GB10
peak BW** (805.3 MB R+W / 3.98 ms = 202 GB/s) - MORE bandwidth-efficient than vLLM's fused triton
`packed_decode` at **41% of peak** (402.6 MB / 3.62 ms = 111 GB/s). Source confirms both are
single-pass and coalesced (llama `s_shard` load-once/store-once, 128 consecutive f32/warp; vLLM
`b_h = load(p_h0)` once -> f32 regs -> `store(p_ht, b_h.to(bf16))` once). The entire 2x DRAM gap
vs vLLM is **100% f32 (llama) vs bf16 (vLLM) state-cache WIDTH**, not extra passes.
## (2) Fused single-pass GDN recurrence: **NO-BUILD**
A fused single-pass rewrite recovers **~0 state bytes** because the kernel is already one read + one
write of the f32 state, and the un-fused l2norm/sigmoid/softplus/gate ops act on the tiny
q/k/g/beta projections (8 MB/call, <1%), not the 805 MB state. There is no second pass to fuse away.
Expected ceiling if built anyway: unchanged 191 ms recurrence -> no movement on the dominant 50% of
the step. **Do not build it.** This refutes the workflow's founding hypothesis with a measured cap.
## (3) Conv-state in-place fusion (`conv-fusion-design`): **GO - confirmed, bit-exact, no-regret**
This is independent of the recurrence verdict and holds regardless. Build a fused
`ggml_ssm_conv_update_inplace` (mirrors the 0018/0019 in-place pattern) that, at decode
(`n_seq_tokens==1 && !keep && fused-AR && n_rs_seq==0`), assembles the width-4 conv window in
registers from the cached K-1=3 taps + the native `qkv_mixed` token, computes the depthwise conv,
folds `silu`, and writes the 1-token-shifted ring state back in place.
- Eliminates `concat_cont` (8.14 ms/step), `cpy_scalar` (5.76 ms/step), the transpose
materialization, and the separate `ggml_silu`; replaces `ssm_conv` with a ~1.6x-byte fused kernel
(5.56 -> ~9 ms). **Net ~12-14 ms/step = +3.1 to +3.7%** -> dense 335 -> ~346-349 tok/s @npl128
(88.5-89.3% of vLLM 391).
- **Bit-exact**: identical ascending-j width-4 FMA order as `ssm_conv_f32` at i==0, same `silu`
primitive, same f32 state bytes written - only the producing node changes. Greedy output is
bit-identical to the 0018/0019 baseline. LOW risk, additive to everything else.
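A single-channel CPU sketch of the fused decode update described above: window assembly, width-4 dot in ascending j, `silu`, and the 1-token shift. The function name and the oldest-first tap layout are assumptions for illustration. The real op works on the ring-state cache and all channels at once.

```cpp
#include <array>
#include <cassert>
#include <cmath>

inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// state: the K-1 = 3 cached taps for one channel, oldest first (assumed layout).
// w:     the 4 depthwise conv weights for that channel.
// x:     the current token's value for that channel.
// Returns silu(conv) and shifts the state by one token in place.
inline float conv_update_inplace(std::array<float, 3> & state,
                                 const std::array<float, 4> & w, float x) {
    const std::array<float, 4> window = {state[0], state[1], state[2], x};
    float acc = 0.0f;
    for (int j = 0; j < 4; ++j) {
        acc += window[j] * w[j]; // ascending j, matching the order named above
    }
    state = {state[1], state[2], x}; // drop the oldest tap, append the new token
    return silu(acc);
}
```

No concat or copy node is needed because the window only ever lives in locals. The state written back is the same three f32 values the `concat`+`cpy` chain would have produced.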
## (4) Recurrence floor-mover: bf16 SSM state - **BUILD (gated product call)**, and the bit-exact question
Since the recurrence is at the f32 byte floor, the **only** lever on the dominant 191 ms (50% of the
step) is narrowing the state-cache width to bf16, exactly as vLLM does.
- Store `ssm_states_all` in bf16; load bf16->f32 into `s_shard`, run ALL recurrence arithmetic in
f32 (UNCHANGED), store f32->bf16. 805.3 -> ~413 MB/call -> ~2.0-3.0 ms/call -> save **~45-95 ms/
step** -> step 384 -> **289-339 ms** = parity-to-ahead of vLLM (327 ms / 391 tok/s; projected
360-443 tok/s @npl128).
- **Bit-exact parity is UNREACHABLE on this term, by construction.** The f32 state bytes are
irreducible (single pass already), so matching vLLM's *speed* on the recurrence requires matching
vLLM's *width* (bf16). bf16 state is non-bit-exact vs llama's own f32 reference, but it is **equal
precision to vLLM** (vLLM's state cache is itself bf16). "Bit-exact parity with vLLM" was never on
the table - vLLM is the less-precise reference here. Gate the build on **KL < 1e-3 / PPL-delta**
over a 256-token greedy run, not on md5, with a `cparams` f32 fallback. The geometric state decay
(g<1) bounds per-step bf16 rounding, so accumulation is well-behaved.
- Bit-exact gains that ARE reachable (vs llama f32): the conv fusion (3) and the activation-fold
lever (1) - together ~9-11% - but they top out near ~93-96% of vLLM and never touch the 50%
recurrence term.
## (5) Ranked build order + the single highest-value next step
1. **Conv-state in-place fusion (BUILD NEXT - no-regret).** Bit-exact, LOW risk, saves 12-14 ms/step (~+3% tok/s),
reuses the proven 0018/0019 in-place op pattern. Build this first because it is risk-free, purely
additive, and de-risks the in-place conv-cache plumbing the bf16 work also touches.
Confirming measurement: nsys decode trace shows `concat_cont` and `cpy_scalar` GONE, step
384 -> ~370-372 ms, and greedy md5 IDENTICAL to the 0019 baseline (dense text md5, MoE
byte-identical).
2. **bf16 SSM state cache (HIGHEST-VALUE lever - gated product call).** The ONLY lever on the
dominant 50% recurrence term: saves 45-95 ms/step, step -> 289-339 ms = parity-to-ahead of vLLM.
Non-bit-exact vs llama f32, equal precision to vLLM. Confirming measurement: `gated_delta_net_cuda`
duration/call drops 3.98 -> 2.0-3.0 ms in nsys; **KL < 1e-3 / PPL-delta vs the f32 build over
256-token greedy** passes; step time and tok/s hit the 289-339 ms / 360-443 tok/s band; cparams
f32 fallback verified.
3. **Activation-op fold, design lever (1) (OPTIONAL, bit-exact, diminishing).** After build item 1 (the conv
fusion) takes the conv/silu buckets, the residual fold (q/k l2norm + gate softplus/sigmoid + gated-RMSNorm epilogue
+ launch overhead) is ~3-5%; bit-exact but bounded. Build only if the goal is >90% of vLLM with
no bf16. Confirming measurement: per-op launch count for the GDN layer collapses to ~1; greedy
md5 unchanged.
**Single highest-value next implementation step: bf16 SSM state cache (#2)** - it is the only change
that moves the dominant 191 ms term and reaches vLLM parity-to-ahead. Its confirming measurement is
the `gated_delta_net_cuda` per-call time dropping to ~2.0-3.0 ms AND the KL<1e-3 gate passing.
**Recommended immediate build: the conv fusion (#1) first** (no-regret, bit-exact) so the bf16 work
lands on an already-cleaned conv path; ship #2 as a `cparams`-gated, KL-validated product option.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,77 +0,0 @@
# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
bit-exact tensor reshape that re-routes the per-layer SSM output projection
from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
## The mechanism (profiled, both engines)
Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
the ssm_out weight read across the 128 sequences. vLLM packs the same projection
into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
only the output projection was in 3D SSM layout.
## The fix
In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
proven by the in-projection.
```
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
...
cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
```
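Why the reshape changes the kernel can be shown with a toy dispatch model. This is a deliberately simplified stand-in, not the ggml-cuda dispatch code: it keeps only the `ne[1] <= 8` rule quoted above, and the enum and function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

enum class MatmulPath { MMVQ, MMQ };

// Toy model of the dispatch rule described above: small-batch GEMV when the
// activation's ne[1] is <= 8, tensor-core GEMM otherwise.
inline MatmulPath choose_path(const std::array<std::int64_t, 4> & src1_ne) {
    return src1_ne[1] <= 8 ? MatmulPath::MMVQ : MatmulPath::MMQ;
}

// Total element count; a reshape must preserve it.
inline std::int64_t nelements(const std::array<std::int64_t, 4> & ne) {
    assert(ne[0] > 0 && ne[1] > 0 && ne[2] > 0 && ne[3] > 0);
    return ne[0] * ne[1] * ne[2] * ne[3];
}
```

The 3D view `{6144, 1, 128, 1}` and the 2D view `{6144, 128, 1, 1}` hold the same 786,432 elements, but only the 2D one presents `ne[1] = 128` and therefore selects the GEMM path.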
## Gates (all PASS)
- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
post-SSM baseline build:
- dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
- MoE q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
- Coherent dense + MoE output (greedy text inspected).
## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
S_TG t/s (decode aggregate):
| model            | npl | baseline | Lever 1 | delta  |
|------------------|-----|----------|---------|--------|
| dense q36-27b    | 32  | 170.52   | 200.00  | +17.3% |
| dense q36-27b    | 128 | 254.92   | 335.80  | +31.7% |
| MoE q36-35b-a3b  | 32  | 373.28   | 420.77  | +12.7% |
| MoE q36-35b-a3b  | 128 | 560.66   | 691.24  | +23.3% |
Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
up from 65% post-SSM).
## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
| kernel                              | baseline            | Lever 1              |
|-------------------------------------|---------------------|----------------------|
| mul_mat_vec_q<NVFP4, m=1> (o_proj)  | 132.8 ms / 48 inst  | 0 ms / 0 inst        |
| mul_mat_q<NVFP4, m=128>             | 5463 ms / 8800 inst | 5827 ms / 10000 inst |
The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
(+1200 instances, +363 ms over the window), and its per-call average DROPS
(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
old GEMV: the amortized weight read is the win.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,143 +0,0 @@
# Patch 0015 findings: expert-density-aware MoE token-tile auto-select
The durable follow-up to patch 0014 (`MOE_TOKEN_TILE_CAP.md`): replace the blunt,
opt-in `LLAMA_MOE_MMQ_X` global cap with a host-side, **default-on** density-aware
`mmq_x` auto-select in `mul_mat_q_case`. Companion to
`0015-paged-expert-density-aware-moe-token-tile-auto-select.patch`. Dev tree
`~/llama-paged-dev` (branch `paged`), `build-cuda` sm_121.
Primary model: **Qwen3.6-35B-A3B NVFP4** (`~/bench/q36-35b-a3b-nvfp4.gguf`),
**256 experts, top-8**, expert FFN 512, GDN linear attention (SSM inner 4096),
41 layers. This is a different beast from 0014's Qwen3-Coder-30B-A3B (128 experts,
larger expert FFN, standard attention).
## What it does (vs 0014)
`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max` (= `ne12`,
the per-expert column upper bound = token count) in one column-tile, i.e. stock
**maximizes** the tile (128 on Blackwell). Applied per expert at MoE decode, where
per-expert density is tiny, that 128-wide tile is mostly padding.
Patch 0014 capped `mmq_x` globally on the ids path via `LLAMA_MOE_MMQ_X` (decode
**and** prefill), which cost ~1.3% prefill. Patch 0015 instead estimates the
per-expert density host-side, from args the ids path already passes:
```
ne_get_rows = ncols_dst = ne12 * n_expert_used (token-expert assignments)
n_experts = nchannels_x = ne02
density = ceil(ne_get_rows / min(ne_get_rows, n_experts)) (tokens/expert)
```
and caps to the small tile (default 64) **only when `density <= density_max`**, so
the high-density prefill ubatch keeps the big 128 tile. Prefill-safe by construction.
No new kernel: the selection only lowers the loop's upper bound to an
already-compiled, granularity- and shared-memory-validated `mmq_x`.
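The estimator and gate can be sketched as two host-side functions. The function names and signatures are hypothetical. The defaults (small tile 64, `density_max` 8) are the ones stated in this document, and the real selection inside `mul_mat_q_case` has additional granularity and shared-memory checks not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Estimated tokens per expert, from values the ids path already passes.
inline std::int64_t expert_density(std::int64_t ne12, std::int64_t n_expert_used,
                                   std::int64_t n_experts) {
    assert(ne12 > 0 && n_expert_used > 0 && n_experts > 0);
    const std::int64_t ne_get_rows = ne12 * n_expert_used;       // token-expert assignments
    const std::int64_t denom = std::min(ne_get_rows, n_experts); // experts that can be hit
    return (ne_get_rows + denom - 1) / denom;                    // ceil division
}

// Cap to the small tile only at low density; otherwise keep the stock choice.
inline int pick_mmq_x(std::int64_t density, int stock_mmq_x,
                      int decode_tile = 64, std::int64_t density_max = 8) {
    return density <= density_max ? std::min(stock_mmq_x, decode_tile) : stock_mmq_x;
}
```

For a 128-expert, top-8 model, 128 decode tokens give density 8 and the 64 tile, while 130 tokens round up to density 9 and keep the 128 tile.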
## The threshold matters: `density_max = 8`, not `tile/4 = 16`
The cap must fire for decode but not for a prefill ubatch. Each has per-expert
density `n_tokens * n_used / n_experts`. At the standard `n_ubatch=512`, `n_used=8`:
```
128 experts 256 experts
prefill ubatch (512) 32 16
decode npl128 (128) 8 4
```
`tile/4 = 16` (0014's first auto-select draft default) **equals the 256-expert
prefill density** and caps prefill: measured -2.0% to -2.9% S_PP on q36-35b-a3b.
`density_max = 8` separates decode (density <= 8, capped) from prefill (density > 8, uncapped) for
every `n_experts` in `[128, 511]`, so it caps decode and leaves prefill on the big tile. This single
default change is what makes the patch prefill-safe on the 256-expert model.
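The `[128, 511]` range can be checked by brute force. The sketch below uses hypothetical helper names and scans expert counts for the ones where npl128 decode is capped and a 512-token prefill ubatch is not. It assumes top-8 routing and a contiguous qualifying range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Same ceil-based density estimate as in the patch description.
inline std::int64_t density_of(std::int64_t n_tokens, std::int64_t n_used, std::int64_t n_experts) {
    assert(n_tokens > 0 && n_used > 0 && n_experts > 0);
    const std::int64_t rows  = n_tokens * n_used;
    const std::int64_t denom = std::min(rows, n_experts);
    return (rows + denom - 1) / denom;
}

struct ExpertRange { std::int64_t lo; std::int64_t hi; }; // inclusive; {0, 0} if empty

// Expert counts for which decode is capped (density <= max) while prefill is not.
inline ExpertRange safe_expert_range(std::int64_t density_max, std::int64_t decode_tokens,
                                     std::int64_t prefill_tokens, std::int64_t n_used,
                                     std::int64_t max_experts) {
    ExpertRange r = {0, 0};
    for (std::int64_t n = 1; n <= max_experts; ++n) {
        const bool decode_capped  = density_of(decode_tokens, n_used, n) <= density_max;
        const bool prefill_capped = density_of(prefill_tokens, n_used, n) <= density_max;
        if (decode_capped && !prefill_capped) {
            if (r.lo == 0) r.lo = n;
            r.hi = n;
        }
    }
    return r;
}
```

`safe_expert_range(8, 128, 512, 8, 1024)` returns 128 to 511. The same call with `density_max = 16` moves the range to 64 to 255, which excludes the 256-expert model. That is the prefill regression described above.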
## Measurements (default-on vs stock, median of 5 reps)
`llama-batched-bench`, q36-35b-a3b-nvfp4.gguf, `-fa on -npp 128 -ntg 128`, GB10
sm_121. STOCK = `LLAMA_MOE_AUTO_TILE=0` (exact stock selection); 0015 = default.
```
npl S_TG stock S_TG 0015 dTG% S_PP stock S_PP 0015 dPP%
8 183.59 183.18 -0.22% 1489.2 1500.1 +0.73%
32 264.02 263.44 -0.22% 2034.5 2033.5 -0.05%
64 311.76 310.41 -0.43% 2028.3 2027.6 -0.03%
128 336.10 337.32 +0.36% 2025.0 2027.7 +0.13%
```
Raw npl128 reps: S_TG 0015 `[337.3, 336.9, 336.4, 338.9, 338.1]` vs stock
`[336.2, 336.1, 335.9, 336.9, 335.8]` (distributions overlap); S_PP 0015
`[2028.6, 2023.0, 2024.9, 2028.0, 2027.7]` vs stock `[2024.9, 2025.0, 2023.2,
2029.4, 2029.0]`.
### Honest read: neutral on this model
On q36-35b-a3b the decode delta is **within run-to-run noise** (npl128 +0.36%,
npl<=64 slightly negative) and prefill is **neutral** (within +/-0.8%, well inside
the 1% target). The `+5%` decode target from the localmaxxing reference does **not**
materialize here. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
lever has nothing to bite on.
### npl128 decode tile sweep confirms 64 is the only useful width
`density_max=8` fixed, varying `LLAMA_MOE_DECODE_TILE`, S_TG @ npl128 vs stock:
```
TILE8 TILE16 TILE32 TILE64 TILE96
-6.31% -3.18% -0.17% +0.70% -0.76%
```
Smaller tiles are **worse**, not better: more column-tiles per expert = more
grid/scheduling overhead, and the FP4-MMA has a minimum efficient width. So matching
the tile to the literal density (4) is counterproductive; 64 is the sweet spot,
same as 0014.
## Why ship it default-on anyway
1. **Removes 0014's prefill cost by construction.** The cap is density-gated, not
global, so prefill keeps its 128 tile (S_PP neutral above).
2. **Banks the col-tile-bound gain for free.** At npl128 the auto-select picks
`tile=64` for a 128-expert model (decode density 8 <= 8), i.e. exactly 0014's
`cap64`, so it reproduces 0014's **+4.8% @npl128 on Qwen3-Coder-30B** without the
-1.3% prefill cost. (That model was unavailable to re-bench here; the tile choice
is identical by construction.)
3. **Prefill-safe and decode-neutral on the SSM model**, so it is harmless where it
does not help.
4. **Correctness-gated** by the P0 harness (below).
## Conservative by design (known limitation)
A pure-density gate cannot separate two cases with the **same** per-expert density:
Qwen3-Coder npl256 decode (density 16) and the 256-expert prefill ubatch (density
16) are identical to the estimator. `density_max=8` therefore **forgoes 0014's
+2.3% @npl256** on the 128-expert model to keep 256-expert prefill safe. Recovering
it needs an `ne12`-aware (absolute token count) gate in addition to density; scoped
as future work, not implemented.
## Knobs
- `LLAMA_MOE_AUTO_TILE=0` : disable the auto-select, exact stock `mmq_x` selection.
- `LLAMA_MOE_MMQ_X=<n>` (patch 0014) : **kept** as a manual override; when > 0 it
forces the old blunt global cap and bypasses the auto-select (explicit A/B knob).
- `LLAMA_MOE_DECODE_TILE=<n>` : the small tile (default 64).
- `LLAMA_MOE_DENSITY_MAX=<n>` : the density ceiling (default 8).
## P0 correctness gate
`tests/test-backend-ops` `test_mul_mat_id` is extended with a ragged small-M
NVFP4/MXFP4 MoE decode-density block: 128 experts, top-8, m=768, k=2048, n in
`{16,33,64,128,130,200,256,512}` spanning the cap boundary (n>=130 keeps the 128
tile at `density_max=8`, n<=128 takes tile 64) and ragged token counts (experts with
0/1/2 tokens, n not a multiple of the tile). All 16 shapes pass the CUDA-vs-CPU
oracle on GB10 both default-on and with `LLAMA_MOE_AUTO_TILE=0`; full `MUL_MAT_ID`
suite 2/2 backends OK. Off the ids path nothing changes (non-MoE `mul_mat`
byte-identical to stock).
## Verdict
- Correct, prefill-safe, default-on density-aware tile select; the durable design
0014's own doc scoped. Supersedes 0014's global cap as the default path; the
`LLAMA_MOE_MMQ_X` knob is retained as a manual override.
- **Net effect on q36-35b-a3b NVFP4: neutral** (decode within noise, prefill neutral)
because the model is SSM/bandwidth-bound, not col-tile-bound. The lever's real win
lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128), banked here at zero
prefill cost.

@@ -1,220 +0,0 @@
# Durable scope: grouped FP4-MMA MoE GEMM for ggml CUDA on GB10 (sm_121)
Build-ready plan. **Not implemented in this workflow** (large kernel work). This
document scopes the durable path to match or beat vLLM MoE grouped-GEMM efficiency
on GB10 for the Qwen3-30B-A3B-class mxfp4 MoE, and records the single honest
finding that re-shapes the whole effort.
Hardware: NVIDIA GB10 (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
LPDDR5X ~273 GB/s. Model: Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`,
HEAD at patch 0013), `build-cuda` sm_121.
## TL;DR (the honest reframe)
**The grouped GEMM the mission scoped to build from scratch already exists in
upstream ggml, and it already runs on GB10 for mxfp4.** For mxfp4 experts on
sm_121 `ggml_cuda_should_use_mmq()` returns true (`turing_mma_available`), so
MUL_MAT_ID takes the **grouped mmq path**, which already contains both vLLM
building blocks:
1. a moe_align / token-sort-by-expert (`mmid.cu` `mm_ids_helper`:
count -> warp-scan/cumsum -> scatter into expert-sorted contiguous buffers),
2. a **single persistent stream-k grouped FP4-MMA GEMM** (one `mul_mat_q` launch;
grid flattened into kbc-continuous space over expert x col-tile x row-tile x
k-block; native FP4 MMA via `block_fp4_mmq` under `BLACKWELL_MMA_AVAILABLE`).
The per-expert host-side row-gather loop in `ggml-cuda.cu`
`ggml_cuda_mul_mat_id()` (~L2632-2790) - the path the mission's root-cause
analysis describes as "the cliff" - is a **fallback only reached when
`should_use_mmq()==false`** (f16/bf16 experts, non-Blackwell). It is **never the
GB10 mxfp4 path.**
Consequence: the "npl128 MoE cliff" does not exist on the current dev HEAD.
Re-measured batched-bench decode (`S_TG` t/s) on the mxfp4 MoE rises monotonically
`85 / 278 / 637 / 950 / 1306 / 1771` at npl `1 / 8 / 32 / 64 / 128 / 256`. The
original `253/505/830/620` cliff was a real high-batch regression that has since
been **fixed upstream** (FP4-native grouped mmq + MoE stream-k balancing), not a
batched-bench artifact.
**Therefore the durable work is NOT "port moe_align + a grouped GEMM."** It is a
**surgical fix to the one place ggml diverges from vLLM: the M-tile (token-tile)
sizing heuristic.** This document scopes that delta, plus the optional
block-padded align, plus the parity gate and phased plan. It also records what is
intentionally NOT built and why (the W4A16 occupancy wall).
## The one structural gap: M-tile sizing
`mul_mat_q_case` / `launch_mul_mat_q` pick `mmq_x` (the token/M tile) by
**minimizing** `ntiles_x = ceil(ncols_max / mmq_x)` over the **aggregate** token
count (`ncols_max = ne12`). On Blackwell `get_mmq_x_max = 128`, so the heuristic
always selects the **largest** `mmq_x` that fits shared memory. vLLM's
CUTLASS/Triton fused_moe does the **opposite**: a small tuned `BLOCK_SIZE_M`
(typ. 16/32/64), padded **per expert**.
ggml then applies its over-large `mmq_x` **per expert**. In MoE decode the number of tokens
per expert is tiny - Qwen3-30B-A3B top-8 of 128: at npl64 ~512 assignments over
~126 activated experts ~= 4 tok/expert; at npl128 ~1024 over ~128 ~= 8 tok/expert.
So each expert's single M-tile of width 128 is **3-6% filled** -> ragged tiny-M
tiles run a dense-GEMM-tuned config, wasting MMA M-throughput, and (with
`need_check`) every expert runs as a masked partial tail.
The FP4 MMA N-fragment (`tile_C::J`) is 8, so the **ideal M-tile ~= tokens/expert
(~8)**, 16x smaller than the 128 ggml picks. This mismatch is the durable gap.
Critically for GB10: at tokens/expert <= 8 there is exactly **one col-tile per
expert**, so a smaller `mmq_x` causes **no extra weight re-read** (weight rows are
re-read only across multiple col-tiles, of which there is one) while it **lowers
shared-mem footprint and raises occupancy** - strictly aligned with the GB10
occupancy lessons.
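
The fill figures above are plain arithmetic. A minimal sketch follows, in Python as executable pseudocode only; the function name `m_tile_fill` is invented for illustration, and the activated-expert counts are the approximate ones quoted above:

```python
def m_tile_fill(npl, top_k, n_active_experts, mmq_x):
    """Return (mean tokens per expert, filled fraction of one mmq_x-wide M-tile)."""
    assignments = npl * top_k                      # each token routes to top_k experts
    tok_per_expert = assignments / n_active_experts
    return tok_per_expert, tok_per_expert / mmq_x


# Decode shapes discussed above: npl64 over ~126 experts, npl128 over ~128 experts.
for npl, active in ((64, 126), (128, 128)):
    tok, fill = m_tile_fill(npl, top_k=8, n_active_experts=active, mmq_x=128)
    print(f"npl={npl}: ~{tok:.1f} tok/expert, tile fill {fill:.1%}")
```

With `mmq_x = 8` instead of 128 the same call returns a fill of 1.0 at npl128, which is the "ideal M-tile ~= tokens/expert" case.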
## What already exists (reuse, do NOT rebuild)
Engine files on DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`:
- **[A] moe_align / scatter** = `mmid.cu` `mm_ids_helper`. One CUDA block per
expert (`gridDim.x = n_experts`); warp counts tokens routed to this expert,
warp-scan for the compaction index, scatters into `ids_src1` (column gather
permutation, expert-sorted contiguous), `ids_dst` (output scatter), and writes
`expert_bounds[expert] = prefix start`, `expert_bounds[n_experts] = total`.
This **is** count -> cumsum -> permute; `expert_bounds` is the analogue of
vLLM's `num_tokens_post_padded` boundaries. No `-1` pad today because segments
are exact (not block-padded).
- **[B] persistent grouped FP4 GEMM** = `mmq.cuh` `mul_mat_q` stream-k
(kernel ~L3542, `process_tile` ~L3447, launch ~L3943, case-select ~L4055).
Single launch, fixed grid (`nsm` CTAs, or `ntiles` when >=90% tile efficiency).
Each CTA walks a contiguous `kbc` slice of (expert `zt` via `expert_bounds`,
col-tile `jt`, row-tile `it`, k-block) space; the weight row-tile (`mmq_y=128`
x K) is loaded once per col-tile in the `process_tile` k-loop; empty col-tiles
past `col_diff` are SKIPPED by advancing `kbc += blocks_per_ne00`; a
`stream_k_fixup` pass recombines split tiles.
- **[C] native FP4-MMA expert weights** = `block_fp4_mmq` + `MMQ_MMA_TILE_X_K_FP4`
(== Q8_1 tile, skew-pad +4) under `BLACKWELL_MMA_AVAILABLE`;
`quantize_mmq_fp4_cuda` quantizes activations to the q8-style y-layout **with
the `ids_src1` gather fused** (one pass, no separate row-copy).
Dispatch seam: `ggml-cuda.cu` `ggml_cuda_mul_mat_id()` (~L2632-2790). For mxfp4
with `ne2`(tokens) > 7, `should_use_mmq()` -> true -> `ggml_cuda_mul_mat_q()`
(`mmq.cu` id-branch ~L162-225) -> `mm_ids_helper` then ONE
`mul_mat_q_switch_type`. The per-expert host loop below it is the gated fallback.
(Below npl8, MXFP4 mmid routes through `mmvq` - `MMVQ_MAX_BATCH_SIZE=8`, mmid max
7 for turing_plus - which is fine for thin batch and out of scope here.)
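
For reasoning about layouts, the count -> cumsum -> scatter behaviour of [A] can be modelled on the host. This is a Python sketch of the description above, not the `mmid.cu` kernel: `moe_align` is a made-up name, `ids` is assumed to be a token-major list of the `n_expert_used` expert indices per token, and the exact encoding of `ids_src1` / `ids_dst` entries in ggml may differ.

```python
def moe_align(ids, n_experts):
    """ids[t] = list of expert indices that token t routes to (top-k).

    Returns (ids_src1, ids_dst, expert_bounds):
      ids_src1[i]      = source token of the i-th expert-sorted row
      ids_dst[i]       = flat (token, slot) output index of that row
      expert_bounds[e] = first sorted row of expert e; last entry = total rows
    """
    top_k = len(ids[0])
    # 1) count rows per expert
    counts = [0] * n_experts
    for row in ids:
        for e in row:
            counts[e] += 1
    # 2) exclusive prefix sum gives each expert's segment start
    expert_bounds = [0] * (n_experts + 1)
    for e in range(n_experts):
        expert_bounds[e + 1] = expert_bounds[e] + counts[e]
    # 3) scatter in token order so every expert segment is contiguous
    cursor = expert_bounds[:-1].copy()
    total = expert_bounds[-1]
    ids_src1 = [0] * total
    ids_dst = [0] * total
    for t, row in enumerate(ids):
        for slot, e in enumerate(row):
            i = cursor[e]
            ids_src1[i] = t
            ids_dst[i] = t * top_k + slot
            cursor[e] += 1
    return ids_src1, ids_dst, expert_bounds
```

For example, `moe_align([[0, 2], [2, 1], [0, 2]], 4)` yields `expert_bounds == [0, 2, 3, 6, 6]`: expert 3 receives no rows, so its segment is empty and the segments are exact rather than block-padded.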
## What to add (the durable delta, priority order)
### [1] Expert-aware M-tile selection (host-side only, zero new kernel)
In `mul_mat_q_case` / `launch_mul_mat_q`, when `ids != nullptr`, choose `mmq_x` from
**per-expert density** (~`ne_get_rows / n_active_experts`, derivable cheaply, or
capped via env) instead of minimizing `ntiles` over aggregate `ncols_max`.
- `mmq_x` is a **compile-time template** (switch 8..128 step 8), so this is a pure
host-side SELECTION change - it picks a different already-compiled instantiation.
**Zero new kernel. Very low risk, high leverage.** Matches vLLM `BLOCK_SIZE_M`.
- Doubles as near-term lever-1: env-gated `LLAMA_MOE_MMQ_X` cap at the knee.
- GB10-aligned: smaller `mmq_x` -> smaller shared mem -> higher occupancy, and at
tokens/expert <= 8 (one col-tile/expert) it costs no extra weight read.
This is the single highest-leverage change and the seed of the durable port.
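
One possible shape for the host-side selection in [1], as a Python sketch. Everything here is an assumption for illustration: the function name `expert_aware_mmq_x`, the use of the mean density, and the rounding rule. The real change would live in `mul_mat_q_case` and would still have to respect the granularity and shared-memory checks already applied there.

```python
def expert_aware_mmq_x(n_rows, n_active_experts, mmq_x_max=128, step=8):
    """Smallest compiled mmq_x (a multiple of `step`) that covers the mean rows per expert.

    n_rows is the number of expert-gathered rows (tokens * n_expert_used).
    """
    density = -(-n_rows // max(n_active_experts, 1))   # ceil(n_rows / n_active_experts)
    mmq_x = -(-density // step) * step                 # round up to the template step
    return max(step, min(mmq_x, mmq_x_max))            # clamp to the compiled 8..128 range


print(expert_aware_mmq_x(128 * 8, 128))   # decode, npl128
print(expert_aware_mmq_x(512 * 8, 128))   # 512-token prefill ubatch
```

Using the mean is a simplification: any expert that receives more rows than the mean spills into a second col-tile, so a production heuristic would probably add headroom above the mean before rounding.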
### [2] Block-padded moe_align (the moe_align_block_size port proper)
Extend `mm_ids_helper` to pad each expert segment up to a multiple of the chosen
block: write a sentinel (`-1`) `ids_dst` for pad lanes, put `expert_bounds` on
block boundaries. Then every col-tile is **full**, which:
- drops the `need_check` masking + per-expert partial-tail MMA,
- makes the stream-k `kbc` space exact (no skipped tiles, cleaner persistent
schedule), removing the `col_diff` skip branch.
Medium risk: touches the scatter, the `col_diff`/`need_check` logic, and the
`write_back` masking (pad rows must not write output). This is the proper
`moe_align_block_size` analogue and the durable second step.
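
The padding step in [2] can be sketched on the host as follows. This is illustrative Python only: `pad_expert_segments` is a hypothetical name, and the input is assumed to be an exact (unpadded) `ids_dst` array with its `expert_bounds`, as described for `mm_ids_helper`.

```python
def pad_expert_segments(ids_dst, expert_bounds, block):
    """Pad every expert segment to a multiple of `block`, using -1 as the sentinel.

    Returns (padded_ids_dst, padded_bounds); padded_bounds always sit on block boundaries.
    """
    padded_dst, padded_bounds = [], [0]
    for e in range(len(expert_bounds) - 1):
        seg = ids_dst[expert_bounds[e]:expert_bounds[e + 1]]
        pad = (-len(seg)) % block             # rows needed to reach the next multiple
        padded_dst.extend(seg + [-1] * pad)   # pad lanes must never be written back
        padded_bounds.append(len(padded_dst))
    return padded_dst, padded_bounds


dst, bounds = pad_expert_segments([0, 4, 3, 1, 2, 5], [0, 2, 3, 6, 6], block=4)
print(dst)
print(bounds)
```

The write-back side then has to skip every row whose `ids_dst` entry is `-1`, which is the masking risk called out above.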
### [3] Bespoke masked-grouped FP4 kernel - ONLY if [1]+[2] insufficient
A CUTLASS/DeepGEMM-style masked-grouped FP4 kernel. **Largest risk, likely
unnecessary** given [B] is already a persistent stream-k grouped GEMM. Listed for
completeness; do not start without [1]+[2] measured as insufficient.
## Integration into ggml_mul_mat_id (dispatch seam + gated fallback)
- The seam is unchanged: `ggml_cuda_mul_mat_id()` -> `should_use_mmq()` ->
`ggml_cuda_mul_mat_q()`. [1] and [2] live entirely inside the mmq id-branch
(`mmq.cu` ~L162-225) and its callees (`mmq.cuh` selection/launch, `mmid.cu`
scatter). No change to the host dispatch decision.
- **Gated fallback preserved**: the existing per-expert host loop
(`should_use_mmq()==false` path) stays as-is for f16/bf16 experts and
non-Blackwell GPUs. The new selection only fires on the grouped path.
- **Env gates** (off = exact current behavior):
- `LLAMA_MOE_MMQ_X=<8..128>` - cap/override the token tile for the id-path
(lever-1 + [1] manual knob).
- `LLAMA_MOE_BLOCK_ALIGN=0|1` - enable block-padded scatter ([2]).
Default both off until parity + throughput proven, then flip [1]'s
auto-selection on by default.
## Correctness / parity gate
Primary: `tests/test-backend-ops.cpp` `test_mul_mat_id` (~L4181). The CPU
reference is **deterministic** - the op test must be **bit-exact**.
- Sweep `type_a` in {`MXFP4`, `NVFP4`}, `type_b = F32`, `n_mats = 128`,
`n_expert_used = 8`, `n_tokens` in {8, 32, 64, 128} (the decode-density band).
- **Add ragged small-M shapes** to the harness if absent (n_tokens not a multiple
of mmq_x; experts with 0/1/2 tokens) - these are exactly where [1]/[2] change
tile geometry and where block-pad masking can leak.
- Pass criterion: new `mmq_x` selection and padded-align produce dst **identical**
to current op-test output (op test is exact; the GB10 CUDA greedy-decode
non-determinism band applies only to end-to-end, never to the op test).
- End-to-end sanity: `llama-batched-bench` on `~/bench/qwen3coder-mxfp4.gguf`,
`-fa on -npp 128 -ntg 128`, npl 8/32/64/128/256; confirm `S_TG` stays monotonic
and `S_PP` flat ~3050-3090. Verify greedy-decode output within the documented
CUDA batch-shape non-determinism band (CPU is the deterministic oracle).
Bench/parity scripts stay **dev-tree-only** (`~/llama-paged-dev/benches/`).
## Phased plan, expected payoff, risk per phase
| Phase | Work | Expected payoff | Risk |
|-------|------|-----------------|------|
| **P0** harness | Add ragged small-M + MXFP4/NVFP4 mmid shapes to `test_mul_mat_id`; capture current bit-exact baseline + the monotonic batched-bench curve as the reference. | None (gate). Locks correctness + the 85->1771 t/s baseline so any regression is caught. | Low. |
| **P1** sort op | Confirm `mm_ids_helper` is the moe_align; if [2] is pursued, prototype the block-pad scatter behind `LLAMA_MOE_BLOCK_ALIGN`. | Enables exact stream-k schedule; removes `need_check` masking (P3 payoff). | Medium (scatter + write-back masking). |
| **P2** grouped GEMM ([1]) | Expert-aware `mmq_x` selection in `mul_mat_q_case`/launch, `LLAMA_MOE_MMQ_X` gate. | The headline: reclaim the 3-6% M-tile fill waste at npl64-128. Modeled as removing wasted MMA M-throughput on every activated expert; net throughput up at high batch with no extra weight read. | **Low** (host-side template selection, no new kernel). |
| **P3** tune ([2] + fixup) | Land block-padded align; tune `mmq_x` per density, profile stream-k `fixup` overhead and `mmq_x`/`mmq_y` tile choice with nsys on the grouped `mul_mat_q<MXFP4>` kernel. | Remove per-expert partial-tail MMA; tighten the persistent schedule. Diminishing vs P2; this is pure micro-efficiency toward/past vLLM's saturated grouped-GEMM. | Medium-high (kernel masking paths). |
**Honest payoff framing:** the npl128 "cliff" is already gone on HEAD, so there is
no broken path to unlock. The durable win is **matching vLLM's saturated
grouped-GEMM M-tiling** (small per-expert block) and erasing the dense-GEMM-tuned
M-tile mismatch - a micro-efficiency gain at large effective batch, not a
step-change. vLLM 0.23.0 cannot even serve this model on GB10 (bf16 MoE-warmup
hang + hard reboot; GGUF loader can't map fused qwen3moe experts), and llama
already uses the same sorted-grouped-GEMM algorithm, so structural parity is
**already met**; this closes the residual kernel micro-gap.
## The biggest risk: the GB10 W4A16 occupancy wall
The dominant risk is **repeating the W4A16 dead-end** that hit only ~9 TFLOPS /
178 t/s on GB10. GB10 is **occupancy-dominated**: deep `cp.async` pipelines and
XOR-swizzle shared layouts **collapse occupancy** there. Any P3 kernel work MUST:
- keep **small shared mem + high occupancy** (do NOT add deep `cp.async` stages
or XOR-swizzle - they are exactly what killed W4A16);
- preserve the **skew-pad (+4)** tile layout already in `MMQ_MMA_TILE_X_K_FP4`;
- stay on the **FP4-MMA path** (`block_fp4_mmq`), the only path that hits Blackwell
FP4 = 2x INT8/BF16 rate;
- respect the ~273 GB/s LPDDR5X weight-read floor (dense decode is already at it;
MoE wins come from occupancy/tile fit, not bandwidth).
Smaller `mmq_x` ([1]) is **strictly consistent** with these lessons: it reduces
shared-mem footprint, raises occupancy, and at tokens/expert <= 8 adds no weight
re-read. So the low-risk lever ([1]) is also the one most aligned with what GB10
rewards - which is why it leads the plan and [3] is gated behind it.
## Commit / hygiene
Scope doc only (this file). No engine change committed in this workflow. Bench and
parity scripts are dev-tree-only. Commit with `git commit -s`, trailer
`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes.
Do not push (human pushes). When [1]/[2] are implemented they mirror to
`backend/cpp/llama-cpp/patches/paged/0014-*` (next free slot).


@@ -1,71 +0,0 @@
# MOE_QUANT_DEDUP_RESULTS.md - patch 0023 (qwen35moe NVFP4 activation-quantize de-dup)
Bit-exact MoE decode/prefill lever. Built + measured on DGX GB10 (sm_121a) on top of HEAD
8a3229f (patch 0022). Companion analysis: NONRECURRENCE_BITEXACT.md (section "nonrec-build").
## What
ggml `mul_mat_id` quantizes the EXPERT-GATHERED activation rows: it allocates
`ne11_flat = ne12 * n_expert_used` rows and quantizes each via `quantize_mmq_nvfp4(..., ids_src1)`.
For the broadcast up/gate projections the activation is the per-token hidden state, the SAME for
every expert that token routes to (`ne11 == 1`). So the stock path re-quantizes each token
`n_expert_used` times (4x for q36-35b-a3b).
`quantize_mmq_nvfp4` computes each `block_fp4_mmq` as a pure per-thread function of its 16
consecutive inputs (per-thread amax, the +/-2 ue4m3 search, the e2m1 packing - NO cross-thread
shfl/reduction). So the quantized block for a given token is byte-identical no matter which
expert slot it lands in.
## Lever
When `ne11 == 1` (broadcast up/gate):
1. Quantize the `ne12` UNIQUE token activations once into a compact buffer
(`quantize_mmq_fp4_cuda(src1_d, nullptr, ..., ne12, 1, 1)`, row stride `s12`).
2. Gather the `block_fp4_mmq` rows into the expert-gathered layout keyed by `ids_src1`
(`gather_mmq_fp4`): `block_fp4_mmq == 9 * uint4 == 144 B`, copied with a coalesced uint4
kernel whose output is written fully contiguously (`gathered[t] = unique[ib_u*9 + w]`).
Pure byte copy of identical blocks => the gathered buffer is byte-for-byte identical to
re-quantizing each gathered row. The MMQ GEMM is UNTOUCHED. `down_proj`
(`ne11 == n_expert_used`, distinct per expert) keeps the stock re-quantize path.
The first gather draft (one thread copies one 144 B struct, scattered) was uncoalesced and cost
478 ms - it ate 84% of the quantize saving and decode stayed flat. The shipped coalesced-uint4
gather costs 32 ms.
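
Why the de-dup is byte-identical can be shown with a toy model. The quantizer below is NOT NVFP4: it is a made-up per-block stand-in (`toy_quantize_row`) that shares only the one property the argument relies on, namely that each output block depends on its own 16 inputs and nothing else. All names are hypothetical.

```python
def toy_quantize_row(row, sub=16):
    """Toy per-block quantizer: each block is a pure function of its own `sub` inputs."""
    out = []
    for i in range(0, len(row), sub):
        block = row[i:i + sub]
        amax = max(abs(v) for v in block)
        scale = amax / 7.0 if amax > 0 else 1.0
        out.append((scale, tuple(round(v / scale) for v in block)))
    return out


def stock_path(acts, ids_src1):
    # Gather first, then quantize every gathered row (each token is quantized repeatedly).
    return [toy_quantize_row(acts[t]) for t in ids_src1]


def dedup_path(acts, ids_src1):
    # Quantize each unique token once, then copy the quantized rows into gathered order.
    unique = [toy_quantize_row(row) for row in acts]
    return [unique[t] for t in ids_src1]


acts = [[float((i * j) % 5 - 2) for j in range(32)] for i in range(3)]
ids_src1 = [0, 2, 1, 0, 1, 2]
print(stock_path(acts, ids_src1) == dedup_path(acts, ids_src1))
```

The equality holds for any quantizer with that purity property; it would not hold if the quantizer reduced across rows or depended on the destination slot.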
## Measurements (MoE = q36-35b-a3b-nvfp4, dense = q36-27b-nvfp4; -fa on, -npp 128 -ntg 128)
nsys decode-isolated (`--cuda-graph-trace=node`, npp8 ntg128 npl128), per-run kernel sums:
| kernel                | dedup off | dedup on | note                    |
|-----------------------|-----------|----------|-------------------------|
| quantize_mmq_nvfp4    | 868 ms    | 457 ms   |                         |
| gather_mmq_fp4        | -         | 32 ms    |                         |
| net quantize path     | 868 ms    | 489 ms   | -379 ms decode GPU-time |
| gated_delta_net (50%) | unchanged | unchanged |                        |
| `mul_mat_q<NVFP4>`    | unchanged | unchanged |                        |
Decode S_TG (t/s), back-to-back same-build A/B (default-on vs GGML_CUDA_MOE_QUANT_DEDUP=0):
| model | npl32 off->on | npl128 off->on |
|-----------------|------------------|-----------------------|
| MoE q36-35b-a3b | 440.3 -> 442.8 (+0.6%) | 745.2 -> 758.1 (+1.73%) |
| dense q36-27b | 207.4 -> 206.9 (flat) | 373.28 -> 373.24 (byte-flat) |
Prefill: MoE T_PP 7.69 -> 7.38 s (~ -4% time). Dense unaffected (no `mul_mat_id`).
## Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022)
| model | md5 (default on) | == 0022 |
|------------------|--------------------------------------|---------|
| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | yes (dense untouched) |
| q36-35b-a3b-nvfp4| 07db32c2bcb78d17a43ed18bc22705cd | yes (on == off == 0022) |
test-backend-ops: MUL_MAT 1115/1115, MUL_MAT_ID 805/805 (default on).
## Knob
On by default. `GGML_CUDA_MOE_QUANT_DEDUP=0` restores the stock per-expert re-quantize path
(byte-identical output, used as the A/B baseline).
Commits: DGX dev tree f7409c2; worktree patch `0023-qwen35moe-nvfp4-quant-dedup.patch`.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,99 +0,0 @@
# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)
Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
`build-cuda` sm_121.
## Headline (honest): there is no npl128 cliff to erase on this build
The mission premise was a 25% decode drop at npl128 (batched-bench 253/505/830/620
@ npl 8/32/64/128). It does **not** reproduce. Stock decode is monotonic:
```
llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
npl 1 8 32 64 128 256
stock 85 282 629 935 1295 1779 <- monotonic, no knee
```
The old cliff was a real high-batch regression since fixed upstream: mxfp4 MoE
decode on GB10 already takes the sorted grouped FP4-MMA GEMM (MUL_MAT_ID ->
`ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper` moe_align/scatter + one
persistent stream-k `mul_mat_q`), i.e. vLLM's algorithm. See
`MOE_GROUPED_GEMM_SCOPE.md`.
## What the knob does
`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
(top-8/128 => ~1/16 of `ne12`), so each expert's `mmq_x`-wide col-tile is only
~6% filled: the MMA accumulator tile is `mmq_x`-wide at compile time and wastes
throughput on the padding columns, and the larger y-tile lowers occupancy.
`LLAMA_MOE_MMQ_X=<n>` caps `mmq_x` on the MUL_MAT_ID path only
(`expert_bounds != nullptr`). It only lowers the selection-loop upper bound and
still chooses from the same granularity/shared-memory-validated `mmq_x` set stock
already uses for smaller batches - no new kernel configuration. Default
(unset/<=0) = disabled => byte-identical to stock.
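
A simplified model of the selection loop and of what the cap changes, written as a Python sketch. It keeps only the "minimize the col-tile count" rule described above; the granularity and shared-memory validation of the real loop is omitted, and `select_mmq_x` is a hypothetical name.

```python
def select_mmq_x(ncols_max, mmq_x_max=128, cap=0, step=8):
    """Return (mmq_x, ntiles_x): the smallest mmq_x that minimizes ceil(ncols_max / mmq_x).

    cap <= 0 means the knob is disabled (stock behaviour).
    """
    upper = mmq_x_max if cap <= 0 else max(step, min(mmq_x_max, cap))
    best_x, best_tiles = 0, None
    for mmq_x in range(step, upper + 1, step):
        ntiles = -(-ncols_max // mmq_x)            # ceil division
        if best_tiles is None or ntiles < best_tiles:
            best_x, best_tiles = mmq_x, ntiles
    return best_x, best_tiles


for cap in (0, 64, 32, 16):
    print(cap, select_mmq_x(128, cap=cap))
```

At `ncols_max = 128` the cap of 64 turns one 128-wide col-tile into two 64-wide ones; with roughly 8 tokens per expert at decode, only the first of those holds any columns.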
## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)
Decode throughput, S_TG t/s:
```
npl stock cap16 cap32 cap64
1 85 85 85 85
8 282 280 282 282
32 629 623 629 628
64 935 915 949 934
128 1295 1204 1344 1357 <- cap64 +4.8% (cap16 -7%)
256 1779 1370 1723 1820 <- cap64 +2.3% (cap16 -23%)
```
Prefill throughput, S_PP t/s (the cost):
```
npl stock cap16 cap32 cap64
128 3083 1817 2559 3038
256 3084 1818 2560 3046
-41% -17% -1.3%
```
Reproducibility (interleaved off/cap64, two reps each):
```
npl off rep1/rep2 cap64 rep1/rep2
128 1300 / 1290 1357.5 / 1357.0
256 1786 / 1782 1826.3 / 1824.5
```
cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
## Why 64 is the only value that helps net
A 512-token prefill ubatch routes ~32 tokens/expert. cap16/cap32 force those into
16/32-wide tiles, overflowing into extra col-tiles + weight re-reads -> prefill
craters (-41% / -17%). cap64 still holds the prefill density in one tile (32 < 64)
so prefill is near-neutral (-1.3%), while decode (~8 tokens/expert at npl128) gets
the fuller, higher-occupancy tile.
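
The col-tile count per expert behind this argument is a ceiling division. A small Python sketch follows; `col_tiles` is a hypothetical helper, and the 40-token "hot expert" is an assumed example of ragged routing, not a measured value.

```python
def col_tiles(tokens_per_expert, mmq_x):
    """Non-empty col-tiles one expert needs; each extra tile re-reads that expert's weights."""
    return -(-tokens_per_expert // mmq_x)


cases = (("prefill, mean expert", 32), ("prefill, hot expert (assumed)", 40), ("decode npl128", 8))
for label, tok in cases:
    print(label, {cap: col_tiles(tok, cap) for cap in (16, 32, 64, 128)})
```

On the mean alone, cap32 still fits 32 tokens in one tile; one plausible reading of the measured -17% is that routing is ragged, so every expert above the mean needs a second tile at cap32 but not at cap64.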
## Verdict
- Real but **modest** high-effective-batch DECODE micro-optimization
(+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
continuous batching already scales). Shipped as an opt-in, default-off knob;
recommended value 64 for decode-heavy high-concurrency deployments.
- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
for single-stream generation and stays coherent; thousands of capped MoE
matmuls at npl128/256 ran with no CUDA error / NaN.
## Durable follow-up (scoped, not implemented)
Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
small tile while prefill keeps its large tile automatically (removes the ~1.3%
prefill cost). Plus the block-padded `moe_align` in `mm_ids_helper`. See
`MOE_GROUPED_GEMM_SCOPE.md`.


@@ -1,323 +0,0 @@
# NONRECURRENCE_BITEXACT.md - bit-exact non-recurrence decode levers (label nonrec-design, READ-ONLY, no GPU)
Post-0022 the gated-DeltaNet recurrence is at 84.6% BW = 102.6% of vLLM (3.488 ms/call), past parity.
The remaining ~5% to vLLM lives in the non-recurrence path. Per the node-level decode trace (nsys
`--cuda-graph-trace=node`, clean build, q36-27b-nvfp4 dense, npl128) the decode step is ONE replayed
CUDA graph, ALL kernels on a SINGLE stream (stream 14), strictly serial, 99.94% GPU-busy, 0.06% idle.
That single-stream-99.94%-busy fact is load-bearing for everything below: there is NO overlap, so any
kernel GPU-time genuinely removed (or any kernel folded away) cuts wall-clock 1:1; and conversely, if a
"faster kernel" leaves wall-clock flat, then the kernel did NOT actually get faster at the decode shape.
Post-recurrence-fix kernel mix of the ~367 ms decode step (was 380.4 pre-0022; recurrence now smaller):
- `mul_mat_q` FP4 GEMM (496 calls/step) ~24% (the biggest non-recurrence bucket)
- `quantize_mmq_nvfp4` (496/step) ~4.5%
- `nvjet` lm_head GEMM ~3.1%
- `flash_attn_ext_f16` (16 attn layers) ~3.1%
- elementwise glue: k_bin_bcast (gate mul+add) ~1.7%, unary_gated silu/sigmoid ~1.4%, rms_norm ~0.9%,
l2_norm ~0.2%, plus conv-state concat_cont/cpy (Lever-1 territory, not in this scope).
Files read on the DGX 0022 tree (HEAD 8a3229f): `mmq.cuh`, `mmq.cu`, `quantize.cu`, `gated_delta_net.cu`,
`fattn.cu`, `fattn-common.cuh`.
---
## RESOLUTION of the P2a puzzle (load-bearing) - mmq_y=64 / minblocks: bit-exact but FLAT on decode
The existing P2a machinery is two NVFP4-gated, default-stock flags in `mmq.cuh`:
- `GGML_CUDA_FP4_MMQ_Y` (L143-163): overrides the weight-row N-tile `mmq_y` 128 -> 64/96 for NVFP4 on
Blackwell. mmq_y tiles N (output rows); each weight row lives in exactly one row-tile, so total weight
traffic is unchanged. **Bit-exact**: the per-output K-reduction is the `for frag` loop in
`vec_dot_fp4_fp4_mma` (L1097-1108, `sum[...] += C.x[l]`), whose order is independent of mmq_y. md5-
verified in prior runs (1115/805 gate, byte-identical).
- `GGML_CUDA_FP4_MINBLOCKS` (L205-216): raises the `__launch_bounds__` min-blocks operand (L3579-3585)
for NVFP4 so >1 CTA co-resides per SM. **Bit-exact**: register allocation / occupancy cannot change
results.
The paradox restated: P2a made a standalone `mul_mat_q<NVFP4,m=128>` -24.7% faster (bit-exact), yet
decode was FLAT (335->336 post-0020). The trace says decode is 99.94% single-stream busy and mul_mat_q
is ~24% of it, so a -24.7% cut should give ~+6%. RESOLUTION (airtight, from the single-stream fact):
> On a 99.94%-busy single stream, freed kernel GPU-time MUST lower the wall 1:1. Decode is flat =>
> mmq_y=64 did NOT free per-call GPU-time at the DECODE shapes => the -24.7% was measured at a
> NON-decode shape (a single large-N or prefill-M GEMM that runs enough waves to reach asymptotic
> throughput). There is no contradiction; the two measurements are at different GEMM shapes.
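
The "~+6%" expectation used in the paradox is the serial-step estimate below. This is a minimal Python sketch under the stated single-stream assumption (no overlap, so kernel time removed shortens the step 1:1); the function name is invented for illustration.

```python
def expected_throughput_gain(kernel_share, kernel_time_cut):
    """Throughput gain when a kernel taking `kernel_share` of a serial step gets
    `kernel_time_cut` (fraction) faster and nothing else changes."""
    new_step = 1.0 - kernel_share * kernel_time_cut
    return 1.0 / new_step - 1.0


gain = expected_throughput_gain(kernel_share=0.24, kernel_time_cut=0.247)
print(f"expected decode gain: {gain:+.1%}")
```

A measured gain near zero therefore implies that the effective `kernel_time_cut` at the decode shapes was near zero, which is the resolution stated above.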
Mechanism (grounded in the launch path, `launch_mul_mat_q` L3989-4088): decode runs ONE `mul_mat_q` per
weight with mmq_x=128 fused tokens => ntx=1, and the grid is `nty = N / mmq_y` CTAs (xy-tiling, or
stream-k at nsm=48 when `tiles_efficiency_percent < 90`, L4044-4047). The 496 decode GEMMs have small N:
- FFN up/gate N=17408 -> nty=136 CTAs (mmq_y=128) = ceil(136/48)=3 waves, last wave 40/48=83% full
- FFN down / qkv / o-proj N~5120-6144 -> nty=40-48 CTAs = 1 wave (and eff<90 => stream-k at 48 CTAs)
So EVERY decode GEMM is a 1-3 wave, 40-136 CTA kernel: it is **ramp + tail (wave-quantization) bound**,
dominated by the first-wave weight-load latency before any MMA can start plus the fractional last wave -
NOT by steady-state occupancy. mmq_y=64 doubles the grid (272 CTAs, 6 waves for the fat FFN) which only
helps the ASYMPTOTIC achieved-BW the microbench measures; at 1-3 waves there is no steady state for it
to act over, and each CTA now carries half the arithmetic-per-weight-load so the ramp is relatively MORE
exposed. minblocks=2 is worse: the FP4 MMA is register-bound at ~255 regs/thread (the `(256,1)` bound),
so forcing 2 CTAs/SM register-caps to ~128 regs => heavy spill => net-negative. Both are the in-wave
occupancy lever, and the decode GEMM has no in-wave occupancy problem - it has a too-few-waves problem.
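
The wave counts quoted above can be reproduced with a few lines. This Python sketch models only the xy-tiled case with `ntx = 1` on 48 SMs; it does not model the stream-k path, and `wave_profile` is a hypothetical helper name.

```python
def wave_profile(n_rows, mmq_y, n_sm=48):
    """Return (CTAs, waves, last-wave fill) for one decode GEMM with a single col-tile."""
    ctas = -(-n_rows // mmq_y)               # nty = ceil(N / mmq_y)
    waves = -(-ctas // n_sm)                 # full passes over the SMs, rounded up
    last = ctas - (waves - 1) * n_sm         # CTAs left for the final wave
    return ctas, waves, last / n_sm


for n_rows in (17408, 6144, 5120):
    for mmq_y in (128, 64):
        print(n_rows, mmq_y, wave_profile(n_rows, mmq_y))
```

For the fat FFN the move from `mmq_y=128` to `mmq_y=64` changes 3 waves into 6, but the final wave stays fractional, so the tail does not go away.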
VERDICT: re-test P2a (mmq_y=64, and 96) and minblocks=2 ON TOP of 0022 because it is a FREE one-build
re-test (flags already exist, default stock). **Design prediction: still ~flat (maybe +1-2% from the
one fat-FFN N=17408 GEMM that has 3->6 waves of room; ~0% from the 1-wave thin GEMMs).** The decisive
measurement for the reprofile agent is NOT a standalone microbench - it is the PER-CALL `mul_mat_q`
GPU-time at the REAL decode shapes (the 496 calls), flag on vs off, summed. If per-call decode time
drops, it ships (free bit-exact win). If per-call decode time is ~unchanged (predicted), the -24.7%
was a large-N artifact and the GEMM has no bit-exact occupancy lever - confirming the structural wall.
WHY the decode GEMM has no high-value bit-exact lever: its bottleneck is wave-quantization at a small
grid. The only knobs that change the grid are (a) mmq_y-down [bit-exact, flat per above], (b) mmq_x-down
[FORBIDDEN: re-reads the 18 GB weights ntiles_x times, strictly worse, and breaks the one-read invariant], (c) the
stream-k-vs-tiling threshold [FORBIDDEN for bit-exactness: stream-k splits each output tile's K-sum
across CTAs and re-adds via the fixup kernel - a DIFFERENT K-accumulation order than one-CTA-full-K
tiling, so flipping the L4047 threshold changes which path a GEMM takes and breaks md5 vs the 0022
baseline]. So at the bandwidth/wave-quant floor for these tiny grids, 3% FP4 efficiency is structural;
no order-preserving change moves it.
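
The accumulation-order point in (c) is ordinary floating-point non-associativity. A tiny Python demonstration follows; Python floats are IEEE doubles whereas the kernels accumulate in f32, so this shows the same effect in kind, not the kernel's actual values. The function names are illustrative.

```python
def sum_sequential(values):
    """One CTA accumulates the whole K range in order."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def sum_split(values, split):
    """Two partial sums over [0, split) and [split, K), combined afterwards (fixup-style)."""
    return sum_sequential(values[:split]) + sum_sequential(values[split:])


k_terms = [0.1, 0.2, 0.3]
print(sum_sequential(k_terms) == sum_split(k_terms, 1))
```

The two results agree to within rounding but are not bit-identical, which is all an md5 gate needs to fail.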
---
## RANKED bit-exact non-recurrence levers
Ranked by expected bit-exact decode gain. "Bit-exact-safe" = keeps the exact reduction/FMA order; the
gate is md5-identity to llama 0022 f32 output on both models (dense + MoE), greedy temp0.
### 1. Quantize producer-fold (Track A) - bit-exact-safe - ceiling 4.5%, realistic ~2-2.5%
Fold `quantize_mmq_nvfp4` (4.5%, ~17 ms, 496/step) into the PRODUCER epilogue (the rms_norm / silu that
emits each GEMM's activation), so the f32 activation is quantized to `block_fp4_mmq` directly from the
producer's registers instead of being written to HBM as f32 and re-read by a standalone quantize kernel.
- **Bit-exactness: SAFE, and unusually clean.** `quantize_mmq_nvfp4` (quantize.cu:78-171) computes
`amax_raw` PER-THREAD over the thread's own QK_NVFP4_SUB=16 values (L108-118) with NO cross-thread
shfl/reduction (unlike `quantize_mmq_q8_1` which does a warp shfl_xor). Each thread independently runs
the +/-2 ue4m3 scale search (L120-150) and `ggml_cuda_float_to_fp4_e2m1` packing (L155-166). So the
output block is a pure per-thread function of its 16 inputs. Copy that arithmetic VERBATIM into the
producer epilogue and the `block_fp4_mmq` bytes are identical => md5-safe. The only requirement is the
producer thread-layout owns contiguous 16-element K-sub-blocks (feasible for an rms_norm/silu epilogue).
- **Expected gain:** the win is removing the standalone kernel's f32 activation READ (the producer already
holds the f32); the quant compute + fp4 write still happen (now folded). So ~the read-half of the 17 ms,
~2-2.5% of the step, and it is REAL because the step is single-stream 99.94% busy (no overlap to hide
the removed kernel).
- **Trap / caveat:** the SPENT "Lever-2" was a DIFFERENT fusion (quantize -> GEMM *consumer* prologue,
measured net-zero because the GEMM still reads the same activation bytes). Track A is the *producer*
fold and removes a true f32 round-trip, so it is not subject to that flatness - but it needs real
producer-kernel surgery + the frozen `block_fp4_mmq` ABI (mmq.cuh:53), more plumbing than the others.
- Ranked #1: largest cleanly-bit-exact non-GEMM bucket, no reduction trap (per-thread quant).
### 2. Activation / op fold - POINTWISE subset only - bit-exact-safe - realistic ~1.5-2.5%
Fold the pure pointwise glue off the single-stream chain into the adjacent kernel's epilogue/prologue:
the GDN residual ADDs and gate MULs (`k_bin_bcast`, ~1.7%), the `silu`/`sigmoid` (`unary_gated`, ~1.4%,
the part that is the output gate, not FFN), and the post-GDN gate MUL after the output rms_norm.
- **Bit-exactness: SAFE for the pointwise ops only.** Add/mul/silu/sigmoid are elementwise fp32 with the
same formula and the same op order whether standalone or folded => byte-identical. This is the bit-exact
half of the prior Lever-3 design.
- **THE TRAP (FORBIDDEN half):** the `rms_norm`/`l2_norm` REDUCTIONS must NOT be re-folded with a
different reduction tree. The standalone `l2_norm_f32<32>`/`rms_norm_f32` use a specific warp/block
reduction; folding the norm into a kernel with a different `warp_reduce_sum` width or eps placement
(`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) changes the last ULP => breaks md5. Fold the MUL that
FOLLOWS the norm (pointwise, safe); do NOT fold the norm's reduction. (This is the direct analog of the
f32x4 lane-remap trap that blocked the recurrence's vectorized state loads: any change to a reduction's
grouping is forbidden.)
- **Expected gain:** ceiling ~3.3% (the Lever-3 slice), realistic ~1.5-2.5% once the norm reductions are
excluded. Real (single-stream, no overlap), bounded, lower plumbing than #1 (no new ABI).
- Ranked #2: smaller than #1 and the high-value pieces (norms) are off-limits.
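
The eps-placement half of the trap in lever 2 can be seen directly: the two normalizations named there are different functions of the same inputs. A minimal Python sketch follows (double precision, scalar, hypothetical function names; the kernels operate on f32 vectors).

```python
import math


def norm_rsqrt(x, sumsq, eps):
    # x * rsqrt(sumsq + eps): eps is added inside the square root
    return x * (1.0 / math.sqrt(sumsq + eps))


def norm_max(x, sumsq, eps):
    # x / max(sqrt(sumsq), eps): eps only acts as a floor on the norm
    return x / max(math.sqrt(sumsq), eps)


x, sumsq, eps = 1.0, 4.0, 1e-6
print(norm_rsqrt(x, sumsq, eps), norm_max(x, sumsq, eps))
```

Any fold that swaps one form for the other, or changes how `sumsq` is reduced, therefore cannot be assumed byte-identical, while folding the pointwise MUL that follows the norm leaves both the formula and the reduction untouched.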
### 3. mul_mat_q occupancy retune (existing P2a: mmq_y=64/96, minblocks=2) - bit-exact-safe - ~FLAT
See the P2a resolution above. Bit-exact-safe (N-tiling / register-cap preserve the K-reduction order;
md5-verified). Design prediction FLAT on decode (decode GEMMs are 40-136 CTA, 1-3 wave, ramp/tail-bound;
the -24.7% was an asymptotic large-N number). **Worth the one-build re-test only because it is free**
(flags exist, default stock). Possible marginal +1-2% from the single N=17408 fat-FFN GEMM (3->6 waves).
Measure PER-CALL decode-shape `mul_mat_q` time, not a microbench. Ranked #3: zero plumbing, but low/zero
expected gain - it is the diagnostic that confirms the GEMM wall is structural, not a shippable lever.
### 4. Attention occupancy (flash_attn_ext_f16) - NO bit-exact lever - NO-GO
`flash_attn_ext_f16` is ~3.1% (11.67 ms, 16 attn layers), grid 48 CTAs = exactly ONE full wave on 48
SMs (trace). There is no occupancy headroom (already 1 wave, perfectly filled, no tail) and no in-wave
under-occupancy to fix. The only knobs that change the attention grid are split-KV / parallel_blocks /
a different KV-tile (the `ncols1`/`ncols2`/`cols_per_block` selection in `fattn.cu`), and EVERY one of
them changes the online-softmax running-max/sum RESCALING ORDER across KV blocks => NOT bit-exact
(forbidden, the softmax-rescale analog of the reduction-tree trap). At 3.1% with one full wave the
attention is effectively at floor. Ranked last: no bit-exact lever exists; do not pursue.
---
## FORBIDDEN levers (require a precision or accumulation-order change - excluded by the gate)
- Stream-k vs plain-tiling threshold flip for the GEMM wave-quant tail: splits + re-adds the K-sum across
CTAs => different f32 accumulation order than one-CTA-full-K tiling => breaks md5.
- Vectorized / lane-remapped tile loads in the GEMM (`load_tiles_nvfp4_nvfp4` / `load_ldmatrix`): any
remap of which lane holds which K-element changes the MMA fragment->accumulator mapping => can change
the per-output sum grouping => forbidden (the f32x4 lane-remap trap, same class that blocked the
recurrence's vectorized state loads).
- mmq_x-down at dense decode: re-reads the 18 GB weights `ntiles_x` times. Order-preserving but strictly
slower and breaks the one-read invariant; not a lever.
- Folding rms_norm / l2_norm with a different reduction tree or eps placement: last-ULP change => md5 break.
- flash-attn split-KV / KV-retile: changes the online-softmax rescale order => not bit-exact.
- bf16 state / bf16 anything: precision change, SHELVED, forbidden by the gate.
---
## One-line summary for the parent
The remaining non-recurrence decode gap has NO single big bit-exact lever. The largest cleanly bit-exact
win is the **quantize producer-fold (Track A, ~2-2.5%, the per-16 NVFP4 quant has no cross-thread
reduction so it copies verbatim into the rms_norm/silu epilogue)**; second is the **pointwise activation
fold (~1.5-2.5%, fold the residual adds / gate muls / silu but NOT the norm reductions)**; the
**mul_mat_q occupancy retune (P2a mmq_y/minblocks) is bit-exact but predicted FLAT** (decode GEMMs are
small-grid wave-quant/ramp-bound, so the -24.7% asymptotic number does not apply per-call - confirmed by
the airtight single-stream-99.94%-busy logic, re-test only because the flag is free); and **attention has
NO bit-exact lever** (already one full wave; every grid knob changes the softmax rescale order). The
P2a puzzle is resolved: not a contradiction - the -24.7% and the flat decode are simply at different GEMM
shapes (large-N asymptotic vs 1-3-wave decode per-call).
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# EMPIRICAL P2a RE-TEST ON 0022 (label reprofile-puzzle, GPU agent) - measured, build + bench + nsys
The design section above PREDICTED P2a flat from the single-stream logic. This section is the GPU
measurement that CONFIRMS it byte-for-byte, plus one load-bearing correction: an early "+11% decode"
A/B was a STALE-BASELINE artifact, not the flag. Box: DGX GB10 (sm_121a), HEAD 8a3229f (patch 0022),
SM+MEM clock pinned 2190 MHz (verified via `nvidia-smi dmon`, identical base vs flag - NOT a clock story).
## (1) Fresh node-level decode decomposition (nsys --cuda-graph-trace=node, dense q36-27b-nvfp4, npl128)
Per-instance trace windowed to one steady decode step (103 steady steps, step = 48 GDN-layer boundaries):
Committed-default build (build-cuda-base, 336 t/s @128) -- step span 383.1 ms, kernel-busy 99.24-99.30%:

| kernel | ms/step | share | note |
|--------|--------:|------:|------|
| gated_delta_net (SSM recurrence) | 193.97 | 51.0% | BINDING KERNEL |
| mul_mat_q<NVFP4,m=128,nc=0> | 93.64 | 24.6% | the P2a target |
| quantize_mmq_nvfp4 | 16.77 | 4.4% | |
| nvjet (cublas lm_head GEMM) | 12.07 | 3.2% | |
| flash_attn_ext_f16 | 11.69 | 3.1% | |

Smaller kernels (ms/step): concat_cont 8.14 / cpy_scalar 7.49 / k_get_rows 7.29 / ssm_conv 6.55 / silu 5.32 / k_bin_bcast 4.67 /
mul_mat_q_stream_k_fixup 3.95 / rms_norm 3.56 / ... ; SUM 380.1 ms = 99.24% of the 383.1 ms wall.
conv-inplace + GDN(16,8) build (the 374 t/s state) -- step span 345.3 ms, kernel-busy 99.0%:
gated_delta_net 167.99 (49.2%), mul_mat_q<NVFP4,128,0> 93.79 (27.5%), quantize 17.66 (5.2%),
nvjet 12.05 (3.5%), flash_attn 11.66 (3.4%), ssm_conv(fused update) 8.44 (2.5%), k_get_rows 7.32 (2.1%).
BINDING KERNEL = gated_delta_net (~49-51% of the step) in BOTH; mul_mat_q<NVFP4,m=128> is #2 (~25-27.5%).
Decode is ~99.0-99.3% GPU-busy single-stream (confirms the 99.94% claim; ~0 idle, strictly serial).
## (2) P2a A/B - the -DGGML_CUDA_FP4_MMQ_Y=64 nwarps-remap, re-applied + built + bit-exact-gated on 0022
The committed 0022 machinery was PARTIAL (patch 0017 templated get_mmq_y_device<type> but left
mmq_get_nwarps_device() stock -> mmq_y=64 + nwarps=8 fails static_assert nwarps*tile_C::I==mmq_y at
mmq.cuh:3280). Re-derived the full threading: templated mmq_get_nwarps_device<type>() -> mmq_y/16 (=4)
for NVFP4+flag; type-aware mmq_get_nwarps_host(...,type); threaded <type> through the NVFP4 loader (998),
write_back_mma (3266), process_tile (3500), mul_mat_q launch_bounds (3579/83/85) + body (3602),
stream_k_fixup launch_bounds (3832) + body (3843), 2 host launch sites (3994/4172). Reverted after.
cuobjdump proof the flag took effect: mul_mat_q<NVFP4,m=128,nc=0> STACK 112 -> 56 (256-thr/8-warp CTA
-> 128-thr/4-warp CTA => 1 -> 2 resident CTAs/SM). REG 255 (HW-capped), no new spill.
BIT-EXACT GATE (HELD): test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805; greedy md5 base==flag
IDENTICAL = 5951a5b4d624ce891e22ab5fca9bc439 (matches the prior P2a gate hash). Byte-identical output.
CLEAN A/B (same build dir, ONLY mmq.cuh toggled => non-mmq .o byte-identical; back-to-back, pinned clocks)
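S_TG t/s, llama-batched-bench -fa on -npp128 -ntg128:

| model | npl | base S_TG t/s | flag S_TG t/s | delta | verdict |
|-------|----:|--------------:|--------------:|------:|---------|
| DENSE q36-27b | 32 | 208.02 | 207.51 | -0.2% | FLAT |
| DENSE q36-27b | 128 | 374.30 | 373.19 | -0.3% | FLAT |
| MoE q36-35b-a3b | 32 | 438.83 | 439.30 | +0.1% | FLAT |
| MoE q36-35b-a3b | 128 | 745.71 | 745.07 | -0.1% | FLAT |
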
Prefill S_PP also flat at 0022 (npp128 1056->1050; npp2048/npl1 1028.85->1024.19).
## (3) RESOLUTION - why FLAT, where the GEMM time goes, and a correction to the prior "-24.7%->+6%" logic
Decode-isolated per-kernel A/B (node trace, same-source toggle, identical non-mmq code):
gated_delta_net 167.99 -> 167.89 ms/step (IDENTICAL - it is byte-identical code, untouched)
mul_mat_q<NVFP4,128,0> 93.79 -> 92.74 ms/step (-1.1%, FLAT) <- the P2a target, decode shape
mul_mat_q_stream_k_fixup 3.96 -> 5.65 ms/step (+1.7ms, REGRESSES at nwarps/2=2)
=> the decode mmq FAMILY is flat-to-slightly-WORSE; the flag delivers ~nothing at the m=128 decode shape.
The "-24.7%" is REAL but it is a PREFILL-shape number. Full-run aggregate (npp128 ntg128, prefill+decode)
mul_mat_q<NVFP4,128>: 19630 -> 17569 ms = -10.5%; subtracting the flat decode portion (93.8x128 vs
92.7x128) leaves the prefill-shape portion at 7625 -> 5699 ms = -25.3% (matches the prior -24.7%). So the
occupancy lever genuinely cuts the COMPUTE/occupancy-bound prefill-shape GEMM ~25%, and ~0 of the
BANDWIDTH-bound m=128 decode-shape GEMM (it reads the full NVFP4 weight matrix from 273 GB/s LPDDR5x; the
mmq_y knob is deliberately bandwidth-neutral - every weight row still read once - so it cannot move a
bandwidth-bound wall). Confirmed at the SOURCE-of-decode level, not inferred.
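
The prefill/decode split of the full-run aggregate is plain subtraction and can be re-derived as below. This is a minimal sketch using the rounded per-step figures quoted above (93.8 and 92.7 ms/step), so it reproduces 7625 -> 5699 ms only to within a few ms; the helper names are hypothetical.

```python
def split_prefill_decode(total_ms, decode_ms_per_step, n_decode_steps):
    """Split a full-run kernel total into (prefill-shape, decode-shape) milliseconds."""
    decode_ms = decode_ms_per_step * n_decode_steps
    return total_ms - decode_ms, decode_ms

def pct_change(before, after):
    """Signed percentage change from before to after."""
    return 100.0 * (after - before) / before

NTG = 128  # decode steps in the -npp128 -ntg128 run

base_prefill, base_decode = split_prefill_decode(19630.0, 93.8, NTG)
flag_prefill, flag_decode = split_prefill_decode(17569.0, 92.7, NTG)

print(f"aggregate     {pct_change(19630.0, 17569.0):+.1f}%")
print(f"decode shape  {pct_change(base_decode, flag_decode):+.1f}%")
print(f"prefill shape {pct_change(base_prefill, flag_prefill):+.1f}%")
```

The aggregate cut of about 10.5% is the blend of a roughly 25% cut on the prefill-shape portion and a roughly 1% cut on the decode-shape portion, which is why the full-run number overstates what decode sees.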
Reconciling with "99.94% busy single stream => a -24.7% cut should give ~+6%": the PREMISE is false. The
flag does NOT cut the decode mul_mat_q by 24.7% (it cuts it 1.1%). There is therefore NO freed time on the
99% busy stream - so the "where does the freed time go (idle gaps?)" question is moot: no time is freed at
the decode shape. The contradiction dissolves: mul_mat_q IS on the critical path AND single-stream-busy, but
the lever simply doesn't accelerate the decode-shape invocation. (Net it slightly hurts via stream_k_fixup.)
CORRECTION to an earlier in-session A/B (recorded so the parent does not chase it): a first pass showed
build-cuda-base 334.6 -> "flag" 372 (+11%). That was a STALE-BASELINE artifact, NOT the flag. build-cuda-base
(binaries 18:46) was compiled from a pre-0021 source - it has NO ssm_conv_update_f32 (cuobjdump symbol count
0 vs 4 in the 0022 build) and the un-retuned GDN default (gated_delta_net 194 vs 168 ms/step). Those ~40 ms
of non-mmq differences (conv fuse ~14 ms + GDN ~26 ms) are the entire 334.6->373 gap. With a correct
same-source baseline (toggle ONLY mmq.cuh in one build dir) the flag is flat (373.19 vs 374.30). Lesson:
the only valid P2a A/B holds every non-mmq .o byte-identical; comparing two independently-built trees mixes
in whatever other flag/patch state each was built from.
## VERDICT
P2a (mmq_y=64 nwarps-remap) is BIT-EXACT (md5-identical, 1115/805) and a genuine ~25% PREFILL-shape FP4-GEMM
kernel win, but it is FLAT on decode (dense and MoE, npl 32 and 128) on 0022, AND flat on end-to-end prefill
S_PP at 0022 (prefill is GDN/other-bound at these sizes, not mmq-bound). It is NOT a decode-parity lever and
the decode commit-gate (lift decode_agg) is NOT met -> do NOT ship for decode. The binding decode kernel is
gated_delta_net (~50%); the only decode levers left are the bit-exact folds in the design section above
(quantize producer-fold ~2-2.5%, pointwise activation fold ~1.5-2.5%) and the GDN-region launch/fusion that
vLLM already has. The mmq P2a machinery was reverted; the 0022 tree is left git-clean.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# nonrec-build (GPU agent) - built + measured. Lever shipped: MoE NVFP4 quantize de-dup (patch 0023)
Box: DGX GB10 (sm_121a), baseline = clean rebuild of HEAD 8a3229f (patch 0022) in build-cuda
(verified: mmq.cu.o rebuilt from clean source; the A/B-left binary was stale). md5 references
locked: q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd.
Baseline decode S_TG: dense 208.7/373.6, MoE 441/746 (npl 32/128). ncu unavailable (no
GPU-counter permission, no sudo) -> all verdicts are nsys + back-to-back same-build A/B.
## Levers EVALUATED
### A. quantize_mmq_nvfp4 occupancy retune (token-packing) - BIT-EXACT, FLAT -> not shipped
The decode quantize at the K=2048 shape is grid (128,1,1) = 128 CTAs = ~2.67 waves on 48 SMs.
Unlike mul_mat_q (bandwidth-bound on LPDDR5x, so P2a was flat), quantize moves trivial memory,
so I tried packing TPB token-rows per CTA (blockDim.y) to cut wave-quant - each thread still
quantizes its own 16 consecutive values, so byte-identical (md5 5951a5b4/07db32c2 held at TPB
1/2/4, after fixing the output ib index to use the token i1 not blockIdx.x). Result: DENSE npl128
DEAD-FLAT 373.25 across TPB 1/2/4; npl32 and MoE flat-to-slightly-WORSE at TPB>1. The decode
quantize is at its best config already (TPB=1 = max CTA parallelism = best latency hiding;
fewer/bigger CTAs hurt). Second bit-exact occupancy lever (after P2a) proven flat. Reverted.
### B. skip-ALL-quantize probe (NON-bit-exact, diagnostic) - the +30% MoE number is an ARTIFACT
Skipping quantize_mmq_fp4_cuda entirely (garbage buffer, FP4-MMA timing data-independent) showed
DENSE +2.7%/+3.7% (npl128/32) but MoE +29.9%/+43.9%. The MoE figure is NOT a valid ceiling: the
garbage activation also corrupts the router (ffn_gate_inp) quantize -> degenerate topk expert
selection -> less / better-localized expert work -> artificially fast. The authoritative
decode decomposition (nsys --cuda-graph-trace=node, npp8 ntg128 npl128) shows quantize is only
3.7% of MoE decode GPU-time, not 23%. Dense +2.7% IS real (rms_norm-fold territory, see D).
### C. SHIPPED - MoE NVFP4 activation-quantize de-dup (patch 0023) - BIT-EXACT, lifts decode+prefill
ggml mul_mat_id quantizes the gathered rows ne11_flat = ne12*n_expert_used. For the broadcast
up/gate proj (ne11==1) every expert of a token sees the SAME token activation, so stock
re-quantizes each token n_expert_used (=4 here) times. quantize_mmq_nvfp4 has NO cross-thread
reduction (per-16-element per-thread), so the gathered blocks are byte-identical across experts.
Lever: quantize the ne12 unique tokens once, then gather the block_fp4_mmq rows into the
expert-gathered layout with a coalesced uint4 copy (block_fp4_mmq = 9 uint4 = 144 B). GEMM
untouched; down_proj (ne11==n_expert_used, distinct) keeps stock.
- Gather v1 (per-thread 144 B struct copy) was UNCOALESCED: gather 478 ms ate 84% of the 570 ms
quantize saving -> flat. Gather v2 (coalesced uint4, output written contiguously) = 32 ms.
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), gather +32 ms, net -379 ms.
- DECODE S_TG: MoE npl128 745.2 -> 758.1 (+1.73%), npl32 +0.6%. PREFILL T_PP -4%. DENSE byte-flat.
- BIT-EXACT GATE (default on): q36-27b 5951a5b4 (unchanged), q36-35b-a3b 07db32c2 (on==off==0022);
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0
restores stock. Committed: DGX f7409c2 + worktree patch 0023.
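
The de-dup argument (a quantizer with no cross-row or cross-thread state gives identical blocks for identical input rows) can be checked on a toy model. The sketch below is an assumption-laden illustration: `quantize_row` is a made-up per-16-value quantizer, not the NVFP4 encoding, and the token-major gather order is chosen for simplicity rather than taken from ggml.

```python
def quantize_row(row, block=16):
    """Toy per-block quantizer: each 16-value block is scaled by its own max
    magnitude and rounded. Stand-in for the real NVFP4 encoding (assumption)."""
    out = []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        scale = max(abs(v) for v in chunk) or 1.0
        out.append((scale, tuple(round(6 * v / scale) for v in chunk)))
    return tuple(out)

def stock_path(tokens, n_expert_used):
    """Quantize every gathered row: each token is quantized n_expert_used times."""
    gathered = [tok for tok in tokens for _ in range(n_expert_used)]
    return [quantize_row(row) for row in gathered], len(gathered)

def dedup_path(tokens, n_expert_used):
    """Quantize each unique token once, then gather the quantized rows."""
    unique = [quantize_row(tok) for tok in tokens]
    gathered = [unique[t] for t in range(len(tokens)) for _ in range(n_expert_used)]
    return gathered, len(unique)

# Three synthetic token activations of 32 values each.
tokens = [[(t * 31 + i * 7) % 13 - 6.0 for i in range(32)] for t in range(3)]
stock_rows, stock_calls = stock_path(tokens, n_expert_used=4)
dedup_rows, dedup_calls = dedup_path(tokens, n_expert_used=4)
print(stock_rows == dedup_rows, stock_calls, dedup_calls)
```

The two paths produce the same gathered rows while the quantizer runs once per token instead of once per (token, expert) pair. In the real patch the remaining cost is the gather copy, which is why its coalescing decided whether the saving survived.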
### D. NOT built - dense quantize producer-fold (rms_norm -> fp4) - real but ~2.7%, needs graph fusion
Dense decode quantize is ~2.7% (skip B, real). Folding it into the rms_norm+mul producer is
bit-exact-feasible (keep the strided sumsq reduction byte-identical, re-partition only the
writeback to 16-consecutive-per-thread + the verbatim per-thread quant) but requires a 3-op
{RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion hoisting the GEMM into the producer node and a
mul_mat_q pre-quantized-src1 path (the scratch is a per-call pool buffer). High plumbing for
~2.7% dense only; left for a follow-up. mul_mat_q (bandwidth wall), flash_attn (softmax rescale
order), lm_head (cublas) have NO bit-exact lever.
## Verdict
The non-recurrence path has ONE shippable bit-exact decode lever found and built: the MoE
quantize de-dup (0023, +1.73% MoE npl128 decode, -4% prefill T_PP, dense untouched, byte-identical).
It is the only redundant-work bucket; the rest of the non-recurrence kernels are at their
bit-exact floor (mul_mat_q bandwidth-bound, quantize occupancy-flat, attention softmax-locked).
The remaining bit-exact headroom is the dense rms_norm->fp4 producer-fold (~2.7% dense, graph-
fusion surgery, not built) and then bf16 state (precision change, shelved) - no other bit-exact
lever moves the LPDDR5x-bandwidth-bound, recurrence-dominated (~50%, past vLLM parity) decode wall.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,119 +0,0 @@
# OCCUPANCY_RETUNE_RESULTS.md - CRUX SETTLED: vLLM recurrence state is FLOAT32 (805 MB/call)
Phase: vllm-f32-confirm (GPU agent). DGX GB10, peak DRAM BW = 273 GB/s.
Checkpoint: ~/bench/q36-27b-nvfp4-vllm (vLLM 0.23.0), ~/bench/q36-27b-nvfp4.gguf (llama HEAD 58426b5, conv-fusion 0021).
NOTE: ncu HW perf-counters are perm-blocked on this node (RmProfilingAdminOnly:1, no passwordless sudo, ERR_NVGPUCTRPERM).
Settled WITHOUT counters: (a) empirical tensor dtype at the kernel boundary, (b) nsys/CUPTI kernel timing (counter-free), (c) source+config chain.
## VERDICT: f32. The close-check is RIGHT. The byte-gate (402 MB/bf16) is WRONG. BUILD THE BIT-EXACT OCCUPANCY RETUNE.
vLLM carries the gated-DeltaNet TEMPORAL/recurrent state in FLOAT32 and moves 805.3 MB/call, NOT 402 MB bf16.
Both engines move the SAME ~805 MB f32 recurrent state per call. The gap is pure BANDWIDTH EFFICIENCY on equal f32 bytes.
## vLLM (kernel: fused_recurrent_gated_delta_rule_packed_decode)
- EMPIRICAL tensor at kernel boundary (initial_state = self.kv_cache[1], qwen_gdn_linear_attn.py:1316/1492):
dtype=torch.float32 elem_bytes=4 shape=(1553, 48, 128, 128) per-slot state = 786432 elems = 3.000 MiB (f32)
- MB/call (B=128, Read+Write) = 128 * 48*128*128 * 4 bytes * 2 = 805,306,368 B = 805.3 MB (bf16 would be 402.7 MB)
- Runtime engine config: cache_config.mamba_ssm_cache_dtype = float32 (mamba_cache_dtype=auto/bf16 for conv)
- Source chain: config.json text_config.mamba_ssm_dtype=float32 -> Qwen3_5ForConditionalGenerationConfig.verify_and_update_config
sets cache_config.mamba_ssm_cache_dtype="float32" -> MambaStateDtypeCalculator._mamba_state_dtype else-branch
-> temporal_state_dtype = torch.float32 (conv state = bf16; temporal/SSM state = f32).
- Kernel timing (CUDA events, eager B=128, 432 steady-decode calls): median 3.578 ms/call, min 3.499, mean 3.593, p90 3.635
BW @ median = 805.3MB / 3.578ms = 225.1 GB/s = 82.4% of 273 peak (min 84.3%, p90 81.1%)
## llama (kernel: gated_delta_net_cuda<128, 0, 0>)
- Kernel signature: all operands const float* (q,k,v,g,beta,curr_state) + float* state_dst => recurrent state is f32. Source-confirmed.
- Identical state geometry (48 value-heads x 128 head_v x 128 head_k, B=128) => MB/call (R+W) = 805.3 MB f32 (same as vLLM).
- Fresh nsys (--cuda-graph-trace=node, build-cuda-base, -npp128 -ntg24 -npl128, q36-27b-nvfp4.gguf):
gated_delta_net = 25.4% of GPU time (#2 kernel after nvfp4 mul_mat_q).
Decode cluster isolated = exactly n=1152 calls (= 24 ntg x 48 GDN layers), B=128 steady state:
median 4.0211 ms/call, mean 4.0315 => 200.3 GB/s = 73.4% of 273 peak.
(Consistent with prior GAP_PROGRESS 4.08ms/~70% and context 3.98ms/202GB/s/74%.)
## THE GAP (equal f32 bytes, different efficiency)
llama 805.3 MB / 4.021 ms = 200.3 GB/s = 73.4% peak
vLLM 805.3 MB / 3.578 ms = 225.1 GB/s = 82.4% peak
=> vLLM takes ~11% less time per recurrence call (3.578 vs 4.021 ms) at IDENTICAL byte volume => ~9 pts more DRAM BW efficiency.
Retune target: 73.4% -> ~82% peak, recurrence 4.02 -> ~3.58 ms/call, KEEPING exact per-column f32
reduction/FMA order (md5-gateable bit-identical). bf16 plan stays SHELVED (optional over-clock only).
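
The byte volume and both bandwidth figures follow from the state geometry and the measured medians. A minimal sketch of that arithmetic, with hypothetical helper names and the 273 GB/s peak taken from the header above:

```python
PEAK_GBPS = 273.0  # GB10 peak DRAM bandwidth quoted above

def state_bytes_per_call(batch, heads, head_v, head_k, elem_bytes):
    """Bytes moved per recurrence call: the full state is read once and written once."""
    return batch * heads * head_v * head_k * elem_bytes * 2

def achieved_gbps(bytes_per_call, ms_per_call):
    """Effective bandwidth in GB/s (decimal GB) for one call."""
    return bytes_per_call / (ms_per_call * 1e-3) / 1e9

f32_bytes = state_bytes_per_call(128, 48, 128, 128, 4)
bf16_bytes = state_bytes_per_call(128, 48, 128, 128, 2)

for engine, ms in (("llama", 4.021), ("vLLM", 3.578)):
    gbps = achieved_gbps(f32_bytes, ms)
    print(f"{engine}: {gbps:.1f} GB/s = {100 * gbps / PEAK_GBPS:.1f}% of peak")
```

Halving `elem_bytes` gives the 402 MB figure that the byte-gate assumed; with the measured f32 dtype both engines sit on the same 805 MB, so the remaining difference is time per call only.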
---
# retune-build (BUILD AGENT) — patch 0022 SHIPPED
vLLM verdict re-checked first: **f32, 805 MB/call** (the close-check is right, the byte-gate's 402 MB/bf16
is wrong). The bf16-state plan stays SHELVED. Built the bit-exact occupancy/coalescing retune.
## The change — bit-exact column folding (Lever A + B + D)
`ggml/src/ggml-cuda/gated_delta_net.cu` `gated_delta_net_cuda`: two new template params
`NUM_WARPS` (template default 4) and `COLS_PER_WARP` (template default 1) plus `MIN_BLOCKS`. Each warp now owns
`COLS_PER_WARP` columns of the 128x128 recurrent state instead of 1, looping the existing per-column
body over `col, col+NUM_WARPS, ...` inside a per-block column tile of `NUM_WARPS*COLS_PER_WARP` columns;
`grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)`.
Why it is bit-exact: the S_v rows of every column stay sharded across the lanes by the SAME strided
mapping `i = r*warp_size + lane`, and every column's per-lane FMA accumulation and
`warp_reduce_sum<warp_size>` XOR-butterfly are byte-for-byte unchanged. Only the
`(warp,block)->column` assignment and the order a warp visits its columns differ, and a column's f32
value provably does not depend on either (columns are fully independent — column c reads only its own
S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). The forbidden `float4`
state load (Lever E) — which would repartition a lane to 4 contiguous rows and change the reduction
grouping — was NOT done; this keeps the md5 invariant. Every global access stays identically coalesced
(32 consecutive lanes -> one 128B sector), so this is a latency-coverage / scheduling win (higher
per-warp memory-level parallelism: COLS_PER_WARP independent state-load bursts issued before any
reduction + the independent butterfly reductions interleave to hide each other's shfl latency), NOT a
coalescing change. The S_v=128 tile is env-selectable via `GDN_NW`/`GDN_CPW` for one-build re-tuning;
default is the measured GB10 winner **(NUM_WARPS=16, COLS_PER_WARP=8)**.
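
The column folding changes only which warp visits which column, so a quick sanity model is to enumerate the assignment and confirm that every column is still visited exactly once. The sketch below assumes the indexing `col = block_z * tile + warp + j * NUM_WARPS`, which matches the `col, col+NUM_WARPS, ...` description but is not copied from the kernel; function names are hypothetical.

```python
def column_schedule(s_v, num_warps, cols_per_warp):
    """Return (grid_z, {(block_z, warp): [columns]}) for one S_v x S_v state matrix."""
    tile = num_warps * cols_per_warp
    assert s_v % tile == 0, "tile must divide S_v"
    grid_z = s_v // tile
    schedule = {}
    for block_z in range(grid_z):
        for warp in range(num_warps):
            first = block_z * tile + warp
            # The warp visits col, col+NUM_WARPS, ... inside its block's tile.
            schedule[(block_z, warp)] = [first + j * num_warps for j in range(cols_per_warp)]
    return grid_z, schedule

def covers_each_column_once(s_v, schedule):
    """True when the union of all warps' columns is exactly 0..s_v-1, no repeats."""
    visited = sorted(c for cols in schedule.values() for c in cols)
    return visited == list(range(s_v))

for nw, cpw in ((4, 1), (8, 8), (16, 8)):
    grid_z, sched = column_schedule(128, nw, cpw)
    print(nw, cpw, grid_z, covers_each_column_once(128, sched))
```

Because each column's reduction is a pure function of that column's data, a schedule that covers every column exactly once leaves every output value unchanged; only the amount of independent work per warp moves.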
## %peak sweep — GB10, CUDA 13, sm_121 (nsys CUPTI timing; HW counters perm-blocked)
Metric: median of the 1152 (=ntg24 x 48 layers) B=128 decode calls, each moving 805.3 MB f32 (R+W),
isolated by the [2.5ms,6ms] band; %peak vs 273 GB/s. Baseline re-isolation reproduced the confirm
agent's 4.021 ms / 73.4% exactly (n=1152).
| NUM_WARPS x COLS_PER_WARP | ms/call | GB/s | %peak |
|---------------------------|---------|------|-------|
| base (0021) | 4.021 | 200.3| 73.4 |
| 4 x 1 (control == base) | 4.034 | 199.7| 73.1 |
| 4 x 2 | 3.887 | 207.2| 75.9 |
| 4 x 4 | 3.775 | 213.3| 78.1 |
| 8 x 1 | 3.837 | 209.9| 76.9 |
| 8 x 2 | 3.749 | 214.8| 78.7 |
| 8 x 4 | 3.699 | 217.7| 79.9 |
| 8 x 8 | 3.586 | 224.6| 82.3 |
| 16 x 2 | 3.665 | 219.8| 80.5 |
| 16 x 4 | 3.585 | 224.7| 82.3 |
| **16 x 8 (WINNER/default)** | **3.488** | **230.9** | **84.6** |
| 32 x 4 | 3.489 | 230.8| 84.6 |
Plateau ~84.5% at the grid.z=1 tiles; (16,8) picked as default (512-thread block, no spill, no
1024-thread .minnctapersm warning). **84.6% > vLLM 82.4%.**
## Gates (both PASS, non-negotiable)
- **md5 BYTE-IDENTICAL to the 0021 baseline**, greedy `--temp 0 --seed 1 -n 48`, both models, winner
(16,8 default) AND (4,1 control):
- q36-27b-nvfp4 (dense): `5951a5b4d624ce891e22ab5fca9bc439` (baseline == winner == control)
- q36-35b-a3b-nvfp4 (MoE): `07db32c2bcb78d17a43ed18bc22705cd` (baseline == winner == control)
- **test-backend-ops -o GATED_DELTA_NET: 36/36 PASS** (covers head_size=128, kda=0/1, prefill K>1).
## Decode throughput — base vs flag(16,8), llama-batched-bench -npp128 -ntg128 -fa on
| model | npl | base S_TG t/s | flag S_TG t/s | gain |
|-------|-----|---------------|---------------|------|
| dense 27b | 32 | 199.2 | 207.6 | +4.2% |
| dense 27b | 128 | 335.9 | 373.2 | +11.1% |
| MoE 35b-a3b | 32 | 420.6 | 440.0 | +4.6% |
| MoE 35b-a3b | 128 | 688.4 | 745.7 | +8.3% |
Prefill S_PP unchanged (dense ~930, MoE ~2185 t/s) — no regression. Stable across 3 samples.
## Parity vs vLLM (recurrence kernel)
Recurrence kernel BW: before 200.3 GB/s = 89.0% of vLLM's 225.1; **after 230.9 GB/s = 102.6% of vLLM**
(3.488 ms/call < vLLM 3.578 ms/call). The recurrence bandwidth gap that this workflow set out to close
is closed and slightly exceeded; the remaining decode-parity delta lives in the non-recurrence path
(matmul/attn), not in gated-DeltaNet.
Shipped: patch 0022, committed on the DGX dev tree and the LocalAI worktree. No push.


@@ -1,305 +0,0 @@
# P1 results: dynamic decode-first prefill-token budget (patch 0016)
Implements **P1** of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`: replace patch 0013's
**static** per-step prefill cap with a **dynamic, decode-first** token budget in
`tools/server/server-context.cpp::update_slots()`. Policy change only, zero
libllama changes, default-off byte-identical. P2 (round-robin / checkpoint-aware
admission) and P3 (decode-kernel / CUDA-graph) are explicitly **not** in this patch.
## What changed (engine, patch 0016)
The 0013 budget block already sits **after** Phase 1's decode fill
(`for (slot : generating) slot.update_batch(batch)`, lines 2716-2720), so at that
point `batch.n_tokens == D` is the live decode load. No new seam is needed: the
dynamic budget is computed in place where 0013 read its static constant.
| seam (post-0015 line) | before (0013) | after (0016) |
|---|---|---|
| budget block @2737-2747 | `n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))` (static constant) | `D = batch.n_tokens`; `T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch)`; `prefill_budget_step = max(n_ubatch, T - D)`; `prefill_cap_per_slot = clamp(min(T, ceil(0.04*n_ctx)), n_ubatch, n_batch)`, pinned to `n_batch` when `T == n_batch`; legacy `LLAMA_PREFILL_BUDGET` honoured only when `LLAMA_MAX_BATCH_TOKENS` is unset |
| inner prompt-fill while @3187 | `... && batch.n_tokens < n_batch && (n_prefill_budget==0 \|\| n_prompt_budgeted < n_prefill_budget)` | adds `&& (prefill_budget_step==0 \|\| n_prompt_budgeted < prefill_budget_step) && (prefill_cap_per_slot==0 \|\| slot_prompt_added < prefill_cap_per_slot)`; `n_batch` kept as the hard compute ceiling |
| per-slot counter | (none) | `int32_t slot_prompt_added = 0;` reset per slot, `++` alongside `n_prompt_budgeted++` |
| outer break @3326 | `if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) break;` | `if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) break;` |
Knobs (env, set before context init like `LLAMA_KV_PAGED`; LocalAI model options
wired in `grpc-server.cpp` beside `max_prefill_tokens`):
- `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step
token budget `T` (decode + prefill), the vLLM `max_num_batched_tokens` analogue.
Default `n_batch`, clamped `[n_ubatch, n_batch]`.
- `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot prompt-chunk cap, the
`long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04*n_ctx))`
floored at `n_ubatch`. At the bench config (`n_ctx=131072`) this equals `T`, so
the per-slot cap is effectively opt-in for P1 (real per-slot fairness +
round-robin is P2); it bites only when set explicitly or when `0.04*n_ctx < T`.
- `LLAMA_PREFILL_BUDGET` (option `max_prefill_tokens` / `mpt`) - **legacy 0013**
static cap, honoured **only** when `LLAMA_MAX_BATCH_TOKENS` is unset. 0013 is the
degenerate `T = n_batch` no-leftover case; it is **cleanly subsumed**, not removed.
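
The dynamic budget reduces to three clamps. The sketch below is a Python model of the formulas in the table above for the case where `LLAMA_MAX_BATCH_TOKENS` is set; the unset and legacy paths and an explicit `LLAMA_PREFILL_CAP` override are deliberately left out, and the function name is hypothetical.

```python
import math

def clamp(value, lo, hi):
    """Restrict value to the closed interval [lo, hi]."""
    return max(lo, min(value, hi))

def prefill_budget(n_batch, n_ubatch, n_ctx, decode_tokens, max_batch_tokens):
    """Return (T, prefill_budget_step, prefill_cap_per_slot) for one scheduler step."""
    total = clamp(max_batch_tokens, n_ubatch, n_batch)
    budget_step = max(n_ubatch, total - decode_tokens)
    if total == n_batch:
        cap_per_slot = n_batch  # pinned in the degenerate T == n_batch case
    else:
        cap_per_slot = clamp(min(total, math.ceil(0.04 * n_ctx)), n_ubatch, n_batch)
    return total, budget_step, cap_per_slot

# Bench config: -b 2048 -ub 512 -c 131072, with D = 128 decoding slots.
for mbt in (512, 1024, 2048):
    print(mbt, prefill_budget(2048, 512, 131072, 128, mbt))
```

At `n_ctx=131072` the `0.04*n_ctx` term is larger than any legal `T`, so the per-slot cap simply equals `T`; it only bites at small contexts, where the `n_ubatch` floor takes over. The hard `batch.n_tokens < n_batch` ceiling still applies in the fill loop and is not modelled here.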
## Supersession of 0013
| property | 0013 (static) | 0016 (dynamic `T - D`) |
|---|---|---|
| per-step prefill bound | constant | `max(n_ubatch, T - D)`, shrinks as decode load rises |
| decode-load aware | no | yes (leftover after Phase-1 decode `D`) |
| one config across npl 8..128 | no (256 best @128, net-negative @8) | yes (self-tuning) |
| long-prompt monopoly guard | no | per-slot `slot_prompt_added` cap |
| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - kept |
| legacy knob | `LLAMA_PREFILL_BUDGET` | preserved when dynamic knob unset |
## Determinism / byte-identical analysis (verified by construction)
The hard ceiling `batch.n_tokens < n_batch` is **kept** in the inner loop (not
replaced by `< T`). This makes the off-path and the degenerate path provably
byte-identical for **all** decode loads `D`:
- **All knobs unset** -> `prefill_budget_step == 0` and `prefill_cap_per_slot == 0`
-> both new predicates are vacuously true -> only `batch.n_tokens < n_batch`
binds -> **bit-for-bit stock**. The outer break is `prefill_budget_step > 0`
guarded, so it never fires. Identical to 0013's off-path by construction.
- **Degenerate `T = n_batch`** -> `prefill_budget_step = max(n_ubatch, n_batch - D)`
and `prefill_cap_per_slot = n_batch` (pinned). The budget bound
`n_prompt_budgeted < n_batch - D` is equivalent to `batch.n_tokens < n_batch`
(since `batch.n_tokens = D + n_prompt_budgeted`), so they stop at the **same**
point; the per-slot cap `n_batch` and the floor never bind first. When `D` is so
large that `n_batch - D < n_ubatch`, the kept `batch.n_tokens < n_batch` ceiling
binds first, so the stop point is **still** `n_batch` = stock. Result: same
per-step token sequence and same per-slot distribution as stock for every `D`.
- **Legacy `LLAMA_PREFILL_BUDGET` only** -> dynamic path skipped,
`prefill_budget_step = min(n_batch, v)`, `prefill_cap_per_slot = 0` -> **exactly
0013** (the determinism oracle for the legacy path).
- **`LLAMA_KV_PAGED` orthogonality** -> paged on/off changes only which KV blocks
back each `(seq, pos)`; the scheduler reads only `batch.n_tokens`, slot states,
and `n_ctx`/`n_batch`/`n_ubatch` - none paged-dependent. Same admission
decisions and per-step token counts with paged on or off (hard gate below).
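
The `T = n_batch` equivalence can also be brute-forced on a toy model of the fill loop. This is a simplified sketch, not the server code: one token is added per iteration, the backlog is hypothetical, and 64 / 16 stand in for `n_batch` / `n_ubatch` to keep the sweep small.

```python
def tokens_in_step(n_batch, decode_tokens, prompt_backlog, budget_step=0, cap_per_slot=0):
    """Simulate the prompt-fill loop for one step.

    prompt_backlog lists how many prompt tokens each waiting slot still needs.
    A value of 0 for budget_step or cap_per_slot means "predicate disabled".
    Returns (final batch size, prompt tokens added per slot).
    """
    batch_tokens = decode_tokens  # Phase 1 already placed the decode tokens
    n_prompt_budgeted = 0
    per_slot = [0] * len(prompt_backlog)
    for slot, pending in enumerate(prompt_backlog):
        slot_prompt_added = 0
        while (pending > 0
               and batch_tokens < n_batch
               and (budget_step == 0 or n_prompt_budgeted < budget_step)
               and (cap_per_slot == 0 or slot_prompt_added < cap_per_slot)):
            pending -= 1
            batch_tokens += 1
            n_prompt_budgeted += 1
            slot_prompt_added += 1
        per_slot[slot] = slot_prompt_added
        if budget_step > 0 and n_prompt_budgeted >= budget_step:
            break  # outer break once the step budget is spent
    return batch_tokens, per_slot

N_BATCH, N_UBATCH = 64, 16  # small stand-ins for 2048 / 512
backlog = [40, 40, 40]
mismatches = 0
for d in range(N_BATCH + 1):
    stock = tokens_in_step(N_BATCH, d, backlog)
    degenerate = tokens_in_step(N_BATCH, d, backlog,
                                budget_step=max(N_UBATCH, N_BATCH - d),
                                cap_per_slot=N_BATCH)
    mismatches += stock != degenerate
print("mismatches:", mismatches)
```

Under this model the stock and degenerate arms agree on both the batch size and the per-slot split for every decode load, including the range where `n_batch - D < n_ubatch` and the kept hard ceiling is what stops the loop. It supports the by-construction argument but does not replace the pending runtime gates.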
## Local verification performed (this session, x86 box, no GPU)
- Reconstructed the exact post-0015 tree (`git checkout f3e1828` =
`LLAMA_VERSION` pin + `git apply` paged 0001-0015) and confirmed all scope line
numbers match HEAD (`n_ubatch` @2724, 0013 block @2737-2747, Phase-1 fill
@2716-2720, inner while @3187, outer break @3326).
- Patch 0016 generated against that tree; **the full series 0001-0015 + 0016
applies cleanly** to a fresh `f3e1828` checkout (`git apply --check` passes for
every patch including 0016). Stat: `1 file changed, 85 insertions(+), 22
deletions(-)`.
- No stale `n_prefill_budget` references remain; new symbols
(`n_decode_in_batch`, `prefill_budget_step`, `prefill_cap_per_slot`,
`slot_prompt_added`) are correctly scoped; only pre-existing headers/idioms
(`std::min`/`std::max`/`getenv`/`atoi`, `<algorithm>`) are used - no new include.
- Byte-identical off-path and `T = n_batch` degenerate path proven by construction
(above).
## Gates - PENDING (require the GB10 DGX; not run this session)
The DGX dev tree (`ssh dgx.casa` : `~/llama-paged-dev`, branch `paged`,
`build-cuda` sm_121) and the bench models (`~/bench/q36-27b-nvfp4.gguf`,
`~/bench/q36-35b-a3b-nvfp4.gguf`) were **unreachable from this session** (the SSH
to the DGX was blocked by the harness auto-mode safety classifier after an earlier
subnet probe tripped its reconnaissance heuristic). The build + the four gates +
the A/B sweep below were therefore **not executed**. Numbers must be filled by a
re-run on the DGX (or with `ssh dgx.casa` allowlisted). Methodology is locked here
so the re-run is mechanical.
Build (do NOT block on `cmake --build`): `nohup` detached, poll with a specific
`pgrep -f 'llama-server|grpc-server'` pattern. Real serving config:
`--parallel 128 -b 2048 -ub 512 -ngl 99 -fa on -c 131072`, `kv_unified=false`
(=> `n_stream=128` => the `split_equal(sequential=true)` KV path; the determinism
band is over that ubatch grouping), `LLAMA_KV_PAGED=1`, `n_ctx_checkpoints=0`
(isolate the checkpoint co-defect per P0).
| # | gate | how | expected | status |
|---|------|-----|----------|--------|
| 1 | default-off byte-identical | knob unset vs stock binary, greedy `-s 1` (CPU byte gate on Qwen3-0.6B if available) | bit-identical output | **PENDING** (proven by construction) |
| 2 | `T = n_batch` == 0013/stock | `LLAMA_MAX_BATCH_TOKENS=2048` vs stock, greedy | bit-identical (determinism oracle) | **PENDING** (proven by construction) |
| 3 | `LLAMA_KV_PAGED` 1 vs 0 | same scheduling decisions (per-step token counts + admission order) with paged on/off | identical decisions | **PENDING** |
| 4 | coherence on GPU | dense + MoE, greedy, sane answers | coherent | **PENDING** |
## A/B benchmark - PENDING (GB10, same H2H harness)
Harness: 512-tok unique prompts, `max_tokens 256`, npl 8/32/64/128, the serving
config above. Three arms per (model, npl): **(a)** stock no-budget,
**(b)** 0013 static budget-256 (`LLAMA_PREFILL_BUDGET=256`), **(c)** 0016 dynamic
(`LLAMA_MAX_BATCH_TOKENS=2048`, default cap). Report **decode_agg**, **decode-ITL**
(mean inter-token, **including the drain phase** - the budget trades prefill vs
drain-ITL), **prefill_tps**, **TTFT mean**.
Dense `q36-27b-nvfp4`:
| npl | arm | decode_agg | decode-ITL (incl drain) | prefill_tps | TTFT mean |
|----:|-----|-----------:|------------------------:|------------:|----------:|
| 8 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
| 32 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
| 64 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
| 128 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
MoE `q36-35b-a3b-nvfp4`: same table, **PENDING**.
Reference ceilings to validate against (from `QWEN36_NVFP4_BENCH.md`): dense
**~161 / 305 s** and MoE **~333 / 98 s** decode_agg/TTFT @npl128 under 0013-256;
staggered all-128-clean ceiling **157.4** dense.
### Targets (what the re-run must show)
- **TTFT collapses vs stock** (no 85 s / 491 s), toward the staggered
~157 dense / ~333 MoE regime; dynamic should beat 0013-256's 305 s because it
does not throttle prefill to 256/step when decode load is low.
- **Ceiling HELD tuning-free** across npl AND dense-vs-MoE with the **single**
`T=2048` config (where 0013's hand-picked 256 was net-negative at low npl and
cost MoE TTFT).
- **No low-concurrency regression** at npl8 vs stock.
- **Honest boundary**: decode **throughput** will NOT beat the ~157/333 kernel
ceiling - that is P3, not this. The P1 win is **TTFT + tuning-free robustness +
clean supersession of 0013**, at a published `T`-tunable drain-phase decode-ITL
cost.
## Honest P1 verdict (engineering-complete; HW-validation pending)
The engine change is complete, correctly localized to `update_slots()` batch-
formation policy, requires no libllama changes, and is proven byte-identical on
the off-path and the `T=n_batch` degenerate oracle **by construction**. It cleanly
supersedes 0013 (legacy knob preserved). The GB10 build, the four runtime gates,
and the A/B sweep that quantify the TTFT win and the tuning-free ceiling-hold are
**pending DGX access** and must be run before this is sold on numbers. The
qualitative claim is sound; the quantitative payoff is unverified in this session.
## Staggered-arrival evaluation
Ran on the GB10 DGX (`dgx.casa`, dev tree `~/llama-paged-dev` @ `253cbae`, patch
0016 BUILT, `build-cuda` sm_121). The prior all-at-once **BURST** H2H (all N
requests at t=0) is structurally adversarial to *any* prefill budget: under a
burst, TTFT is prefill-rate-bound, so a per-step prefill cap can only slow the
drain. That burst showed 0016 ~= 0013, no win. A **STAGGERED** arrival (requests
trickle in while others are already decoding) is the regime 0016 is designed for:
when a new prefill arrives, the decode-first budget should keep the
already-decoding slots flowing (low/flat inter-token latency) while the new
prefill takes only the leftover `T - D`. This section measures exactly that.
### Harness (staggered client, dev-tree-only)
`~/bench/stagger_cli.py` issues N requests at a **fixed inter-arrival rate** (not
all at once) against `/v1/completions`, `stream=true`, `temperature 0`,
`ignore_eos`, 512 unique-prefix tokens per prompt (unique leading token defeats
prefix caching). It records, per request, the send time, the TTFT, and the
absolute timestamp of **every** generated token (full ITL series); raw dumps go to
`~/bench/stag_*/raw_*.json`, analysed by `~/bench/stagger_agg.py`. Server flags are
**identical to the prior H2H** (`abrun.sh`): `--parallel 128 -b 2048 -ub 512 -ngl
99 -fa on -c 131072 --no-kv-unified` with `LLAMA_KV_PAGED=1` (verified
`n_ctx_seq=1024`, i.e. `n_stream=128` per-sequence KV, kv_unified=false; checkpoints
at the default max=32, identical across all arms). Three to four arms per model,
**env-only** difference, sequenced on the single GPU with PID-file stop between
arms: **stock** (no knobs), **0013** static (`LLAMA_PREFILL_BUDGET=256`), **0016**
dynamic (`LLAMA_MAX_BATCH_TOKENS=512`, and `1024`).
**Metric definitions.** *Arrival window* = `[first send, last send]`. *In-window
ITL* = inter-token gaps whose token lands inside the arrival window = the ITL seen
by already-decoding slots **while new prefills are still arriving** -> the
decode-protection metric (mean/p95/max). *freezes >Ns* = count of in-window gaps
exceeding N seconds (decode stalls caused by a prefill admission). *TTFT* =
first-token latency per newly-arriving request. *decode agg* = total generated tokens /
decode span (a staggered-run aggregate, **not** the saturated kernel ceiling; it
is depressed by the arrival ramp + checkpoint overhead and is not the P1 figure of
merit). *wall* = last token - first send.
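The metric definitions above can be sketched as follows. This is an illustrative reimplementation, not `stagger_agg.py`; the record layout (`send`, and `tokens` as absolute timestamps in seconds) and the nearest-rank p95 are assumptions.

```python
import math


def in_window_itl(requests, freeze_thresholds_s=(1.0, 2.0)):
    """Aggregate inter-token gaps whose later token lands in the arrival window.

    requests: list of {"send": float, "tokens": [float, ...]} (assumed layout),
    all times in seconds on one shared clock.
    Returns None when no token lands inside the window (burst-like runs).
    """
    sends = [r["send"] for r in requests]
    window_start, window_end = min(sends), max(sends)

    gaps = []
    for r in requests:
        ts = r["tokens"]
        for prev, cur in zip(ts, ts[1:]):
            if window_start <= cur <= window_end:
                gaps.append(cur - prev)
    if not gaps:
        return None

    gaps.sort()
    p95 = gaps[math.ceil(0.95 * len(gaps)) - 1]  # nearest-rank percentile
    return {
        "mean_s": sum(gaps) / len(gaps),
        "p95_s": p95,
        "max_s": gaps[-1],
        # "freezes >Ns": count of in-window gaps exceeding each threshold
        "freezes": {t: sum(g > t for g in gaps) for t in freeze_thresholds_s},
    }


def ttft(request):
    """First-token latency: first token timestamp minus send time."""
    return request["tokens"][0] - request["send"]
```

Returning `None` for an empty window mirrors the near-burst case, where no token lands inside the arrival window and the protection metric is undefined.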
### Dense `q36-27b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival (~19 s window) - the discriminating regime
| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 1494 / 2691 / 2693 | 45 / 35 | 26891 / 46083 | 94.1 | 174.4 |
| 0013 (pb256) | 527 / 640 / 650 | 0 / 0 | 44763 / 90338 | 81.2 | 201.8 |
| 0016 (mbt512) | 730 / 897 / 901 | 0 / 0 | 33320 / 66595 | 88.4 | 185.8 |
| 0016 (mbt1024) | 1320 / 2050 / 2051 | 46 / 5 | 33402 / 62636 | 72.4 | 226.8 |
**Read:** stock's in-flight decoders **freeze ~2.7 s** every time a new prefill is
admitted (35 freezes >2 s, in-window p95 2691 ms). Both small-cap budget arms
(0013, mbt512) keep the in-flight ITL **flat and spike-free** (0 freezes >1 s).
`mbt512` beats `0013` on **TTFT** (p95 66.6 s vs 90.3 s, mean 33.3 s vs 44.8 s),
**throughput** (88.4 vs 81.2) and **wall** (186 s vs 202 s) at the same spike-free
protection. `mbt1024` admits bigger prefill chunks, so it reintroduces spikes (5
freezes >2 s) for a marginal TTFT gain -> the per-step prefill-chunk size is the
protection/TTFT dial.
### Dense, light load: 32 reqs, max_tokens 64, 400 ms inter-arrival (~12 s window) - non-saturated control
| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 810 / 2324 / 2324 | 25 / 15 | 10604 / 18872 | 49.0 | 42.3 |
| 0013 (pb256) | 443 / 572 / 607 | 0 / 0 | 18608 / 38347 | 38.0 | 54.7 |
| 0016 (mbt512) | 597 / 858 / 863 | 0 / 0 | 14506 / 28055 | 43.9 | 47.4 |
Same shape with shorter, churning requests: stock 15 freezes >2 s, both budget
arms 0; `mbt512` again beats `0013` on TTFT (p95 28.1 s vs 38.3 s), throughput and
wall at equal protection.
### MoE `q36-35b-a3b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival
| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 706 / 1146 / 1148 | 132 / 0 | 2774 / 5105 | 202.4 | 81.1 |
| 0013 (pb256) | 194 / 273 / 280 | 0 / 0 | 18205 / 36023 | 170.3 | 96.5 |
| 0016 (mbt512) | 275 / 366 / 373 | 0 / 0 | 11940 / 22453 | 191.4 | 85.8 |
MoE decode is ~2x faster (3 B active), so the baseline ITL is ~240 ms and stock's
prefill freezes are shorter (~1.1 s, 132 of them >1 s, none >2 s) but **still
present**; budget arms hold the in-flight ITL near baseline (p95 273-366 ms).
`mbt512` again dominates `0013` (TTFT mean 11.9 s vs 18.2 s, p95 22.5 s vs 36.0 s,
throughput 191 vs 170, wall 86 vs 96). Because MoE prefill is cheap, **stock's
TTFT is far lower** (2.8 s mean) - the TTFT cost of decode protection is most
visible here.
### Near-burst control: dense, 64 reqs, 150 ms inter-arrival (~9.5 s window)
At 150 ms the 64 prompts pile in faster than the ~94-127 tok/s drain, so the run
degenerates into a **burst** (window 9.5 s << per-request TTFT of 240-308 s; no
token lands inside the window, so the in-window protection metric is empty). This
reproduces the prior burst null: TTFT stock 267 s / 0013 291 s / mbt512 279 s /
mbt1024 240 s, decode agg 127 / 102 / 106 / 122, wall 401 / 443 / 432 / 375 s -
budget ~= stock: stock is marginally ahead of 0013 and mbt512 on TTFT, throughput and
wall, while mbt1024 edges stock on TTFT and wall. This is the control, not 0016's target regime.
### Structural note (intellectual honesty)
At `T = 512 = n_ubatch`, `prefill_budget_step = max(n_ubatch, T - D) = 512`
**constant**, so `mbt512` behaves as a *static* 512-token prefill cap - the dynamic
floor binds and the `T - D` term never bites. Its edge over `0013`'s 256 is
therefore mostly "a larger, `n_ubatch`-aligned cap", not the adaptivity per se. The
genuine decode-adaptive `T - D` is exercised only at `T >= 1024` (`mbt1024`:
prefill chunk ~`1024 - D`, auto-shrinking as decode load `D` rises). Across all
settings the per-step prefill-chunk size is a clean, monotonic protection/TTFT
dial: 256 (0013) -> 512 (mbt512) -> ~960 (mbt1024) trades flatter decode for lower
TTFT. The distinctive value of the dynamic budget is the **safety property**: it
lets you set a *high* `T` for low-load TTFT while guaranteeing the per-step token
count auto-shrinks so decode is never starved when load rises - which is precisely
what stock lacks (stock = unbounded prefill chunk = the freezes).
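The budget rule quoted above, written out as a sketch. The formula `max(n_ubatch, T - D)` is from this note; treating `D` as the number of decode tokens in the step, and the function and parameter names, are assumptions rather than the patch's actual code.

```python
def prefill_budget_step(max_batch_tokens: int, decode_tokens: int,
                        n_ubatch: int = 512) -> int:
    """Per-step prefill token budget: max(n_ubatch, T - D)."""
    return max(n_ubatch, max_batch_tokens - decode_tokens)


for t in (512, 1024):
    for d in (0, 64, 128, 600):
        print(f"T={t:4d} D={d:3d} -> prefill budget {prefill_budget_step(t, d)}")
```

At `T=512` the result is 512 for every `D`. At `T=1024` with `D=64` it is 960, the ~960 figure cited above, and it falls to the 512 floor once `D` reaches 512.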
### Verdict (honest)
- **Does 0016 keep the in-flight decoders' ITL low/flat when new prefills arrive,
vs stock's spikes?** **Yes, decisively, on staggered traffic.** Stock's
already-decoding slots freeze on every prefill admission (dense: 35 freezes >2 s,
in-window ITL p95 2.7 s; light: 15 >2 s; MoE: 132 >1 s). Every budget arm
(0013, mbt512) eliminates them (0 freezes >1 s, flat in-window ITL). This is the
real P1 win and it shows **only** under staggered arrival, never under the burst.
- **Does it bound new-request TTFT?** Relative to **0013**, yes (26-38 % lower p95 TTFT
across dense and MoE, 22-34 % lower mean). Relative to **stock**, **no** - stock has the lowest TTFT
precisely because it lets prefill stampede the decoders (that stampede *is* the
freeze). New-req TTFT vs in-flight ITL is a genuine Pareto tradeoff, not a free
lunch; this does not manufacture a TTFT-beats-stock claim.
- **Does the dynamic budget beat BOTH stock AND 0013, or is it ~= 0013 here too?**
It **does not tie 0013 here** (unlike the burst): at `T=512`, 0016 sits at a
strictly better point on the protection/TTFT frontier than 0013-256 (equal
spike-free protection, materially lower TTFT and wall, higher throughput), and it adds a
principled, decode-adaptive, single-`T` way to move along that frontier (one
config across dense and MoE) that 0013's hand-picked 256 cannot. It does **not**
strictly dominate stock: 0016 wins decode smoothness (no multi-second freezes),
stock wins raw TTFT/throughput. Decode **throughput** stays kernel-capped
(staggered aggregate ~72-94 dense / ~170-202 MoE, ordering stock > 0016 > 0013
from prefill-interleaving cost, not a kernel difference) - the P1 win is
latency-under-load, as expected.
**Bottom line:** 0016 **earns its keep over 0013 on staggered traffic** - same
spike-free decode protection at a strictly better TTFT/throughput/wall point, plus
a decode-adaptive knob that holds one config across loads and model types. Against
stock it is a deliberately different operating point that trades roughly 4-9 s of mean
new-request TTFT (at `T=512`) to remove the multi-second in-flight decode freezes stock cannot
avoid. Keep 0016; recommend `LLAMA_MAX_BATCH_TOKENS=512` as the default
protective setting and higher `T` when low-load TTFT matters more than ITL
flatness.


@@ -1,107 +0,0 @@
# Paged-KV: GPU 0007 re-run + shared-prefix throughput benchmark
DGX Spark (NVIDIA GB10, sm_121 / cc 12.1), CUDA 13, dev tree `~/llama-paged-dev`
branch `paged`, base pin `f3e182816421c648188b5eab269853bf1531d950`, full paged
engine (0001-0004, 0006, 0007). All paged behaviour stays gated by
`LLAMA_KV_PAGED`; default-off is byte-identical to stock. Models:
`Qwen3-0.6B-Q8_0.gguf` and `Qwen3-32B-Q4_K_M.gguf`.
## Deliverable 1 - GPU run of the 0007 prefix-engine correctness driver
The committed driver `examples/simple/paged-prefix-engine.cpp` hardcodes
`n_gpu_layers = 0`. For this GPU run it was given a dev-only
`PAGED_NGL` env override (`mp.n_gpu_layers = getenv("PAGED_NGL") ? atoi(...) : 0`),
rebuilt in `build-cuda`, run, then the edit was **reverted** so the committed
driver stays byte-clean (it is dev scaffolding, never shipped in a patch).
Three runs of the same Gate-0 driver, Qwen3-0.6B, `LLAMA_KV_PAGED=1`:
| binary / offload | result |
|------------------------------------------|-------------------------|
| committed `build-cpu` driver | **ALL PASS (failures=0)** |
| `build-cuda`, `PAGED_NGL=99` (all layers)| GATE FAILED (failures=3)|
| `build-cuda`, `PAGED_NGL=0` (same binary)| GATE FAILED (failures=2)|
**The GPU run did NOT print ALL PASS - reported honestly.** But the failures are
narrow and are not a paged-engine bug:
- Every **structural / mechanical** paged invariant PASSES on GPU, in both
scenarios (boundary and mid-block): prefill computed ONLY the suffix (32 prefix
tokens skipped), shared prefix block-aligned, shared-block `ref_cnt == 2` while
both sequences hold it, ref drops `2 -> 1` on freeing one sharer, only the
private (suffix) blocks are returned, and the prefix block returns to the pool
once all sharers free. The cross-request KV reuse mechanism itself is GPU-clean.
- The only failures are the **exact greedy-token byte-identical** assertions
(e.g. boundary `B-shared` vs `B-from-scratch`). They diverge at a single near-tie
token (boundary: 2nd generated token `17971` vs `5671`) and then cascade
autoregressively.
Root cause is **CUDA float-kernel non-determinism, not the paged logic**: the
*same* CUDA binary fails the exact-token assertions even with `PAGED_NGL=0` (zero
layers offloaded), whereas the genuine `build-cpu` binary passes all 16/16. The
CUDA backend (loaded via `ggml_backend_load_all`) uses non-associative reductions
whose result differs between the full-prefill batch shape and the
incremental-suffix batch shape; under greedy decode a single logit near-tie flips
and the sequences cascade apart. This refines the earlier note in
`PAGED_GPU_VERIFY.md` (which framed it as "not GPU-specific" and had no CPU pass
to compare against): the CPU build now passes clean, so the divergence is a strict
test-assertion artefact of CUDA float ordering, not a defect in 0006/0007.
## Deliverable 2 - shared-prefix throughput benchmark (the real-win test)
Dev-only driver `examples/simple/paged-prefix-bench.cpp` (registered in
`examples/simple/CMakeLists.txt`, dev tree only - not in any shipped patch).
Workload: `K` sequences that all share a `P`-token common prefix (a system /
RAG preamble), each with a unique `S`-token suffix; prefill only (`G=0`,
generation is identical compute in both modes so it is excluded from the
headline). GPU, `-ngl 99`, `kv_unified = true`.
- **NO-SHARE (stock):** `LLAMA_KV_PAGED` unset; every sequence prefills the full
`P+S` tokens. Total prefill work `= K*(P+S)`.
- **PAGED-SHARE:** `LLAMA_KV_PAGED=1`; the prefix is computed ONCE on seq 0,
committed via `paged_prefix_api::commit`, then every other seq calls
`paged_prefix_api::share` to physically reuse the ref-counted prefix blocks and
prefills ONLY its suffix. Total prefill work `= P + K*S`.
**`kv_unified` note:** this engine's cross-request share is built around the
*unified* stream-0 pool (ref-counted shared cells), so `kv_unified = true` is what
makes the share engage - the same setting the committed 0007 driver uses. With
`kv_unified = true` the share engaged in every run (evidence below).
### Reuse actually engaged (share mode)
In every share run: `kshare(seq 1) = 1024` (the full block-aligned prefix is
reused, not recomputed), the shared prefix block's `ref_cnt == K` (all sharers
point at one physical copy), and `prefill_tokens_submitted` collapses from
`K*(P+S)` to `P + K*S`.
### Results (P=1024, S=32, prefill-only)
| model | K | mode | prefill tokens | prefill time | raw tok/s | shared ref_cnt |
|--------------|----|-----------|----------------|--------------|-----------|----------------|
| Qwen3-0.6B | 32 | no-share | 33792 | 4.659 s | 7253 | - |
| Qwen3-0.6B | 32 | **share** | 2048 | **0.554 s** | 3695 | 32 |
| Qwen3-32B | 16 | no-share | 16896 | 26.14 s | 647 | - |
| Qwen3-32B | 16 | **share** | 1536 | **3.64 s** | 422 | 16 |
| Qwen3-32B | 32 | no-share | 33792 | 61.91 s | 546 | - |
| Qwen3-32B | 32 | **share** | 2048 | **6.02 s** | 340 | 32 |
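The token counts and speedups in the table follow directly from the two work formulas. A small check, using only the `P`, `S`, `K` and wall times reported above (the function and variable names are illustrative):

```python
def prefill_tokens(k: int, p: int, s: int) -> tuple[int, int]:
    """Return (no_share, share) prefill token totals: K*(P+S) and P + K*S."""
    return k * (p + s), p + k * s


# (model, K) -> (no-share wall s, share wall s), copied from the results table.
WALL_S = {
    ("Qwen3-0.6B", 32): (4.659, 0.554),
    ("Qwen3-32B", 16): (26.14, 3.64),
    ("Qwen3-32B", 32): (61.91, 6.02),
}

for (model, k), (no_share_s, share_s) in WALL_S.items():
    no_share, share = prefill_tokens(k, p=1024, s=32)
    print(f"{model} K={k}: {no_share} -> {share} tokens "
          f"({no_share / share:.1f}x fewer), "
          f"wall {no_share_s / share_s:.1f}x faster")
```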
### Verdict: YES, a real and substantial win, and it grows with K
- Prefill wall-time speedup: **0.6B K=32 -> 8.4x**, **32B K=16 -> 7.2x**,
**32B K=32 -> 10.3x**. The win grows with the number of sharers because
no-share prefix recompute is `O(K)` while the shared prefix is `O(1)` plus
`K` tiny suffixes.
- Note the honest caveat in the raw-throughput column: share mode submits small
32-token suffix batches that are *less* GPU-efficient (32B: 340-422 tok/s; 0.6B: 3695) than the
large no-share batches (32B: 546-647 tok/s; 0.6B: 7253). The win is **not** higher tok/s - it is
computing ~11-16x **fewer** tokens. On a fast GB10 prefill that still nets a
7-10x wall-time reduction because prefill is compute-bound and the shared prefix
dominates the token count.
- This is exactly the many-users-one-system-prompt / RAG-preamble fan-out
scenario, and the paged cross-request prefix cache delivers there.
Scaffolding (`paged-prefix-bench.cpp`, the `PAGED_NGL` driver tweak) stays
dev-tree-only and is not part of any shipped patch.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,81 +0,0 @@
# Paged-KV GPU verification + full backend CUDA build
Verification run on a DGX Spark (NVIDIA GB10, compute capability 12.1 / sm_121),
CUDA 13.0, against pin `f3e182816421c648188b5eab269853bf1531d950`. Models:
`Qwen3-0.6B-Q8_0.gguf` (core gate) and `Qwen3-32B-Q4_K_M.gguf` (sanity).
All paged behaviour stays gated by `LLAMA_KV_PAGED` (env) / the `kv_paged`
server option; default-off is byte-identical to stock.
## Deliverable 1 - GPU-path correctness (all on GPU, `-ngl 99`)
CUDA build of the dev tree configured with
`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`;
all paged drivers (`llama-simple`, `llama-paged-multiseq`,
`llama-paged-prefix`, `llama-paged-prefix-engine`) compiled clean under sm_121.
1. Core token-identical gate - PASS. `llama-simple` greedy, Qwen3-0.6B, `-ngl 99`:
stock (env unset) vs `LLAMA_KV_PAGED=1` output is BYTE-IDENTICAL. The paged
path is genuinely engaged: `LLAMA_KV_PAGED_DEBUG=1` shows the device gather
firing (`[paged-attn] gather n_stream=1 ...`), per-token block placement
(`[paged-alloc] ... grew`), and the stock run uses CUDA Graphs while the paged
run takes the distinct gather path - yet output matches exactly.
2. Multi-stream - PASS. `llama-paged-multiseq -s 4 -ngl 99`, stock vs paged:
all 4 concurrent sequences BYTE-IDENTICAL on GPU (n_seqs=4, CUDA0 compute
buffer matches expectation). Same result reproduced on the CPU build.
3. Prefix recompute-skip (`llama-paged-prefix-engine`, patch 0007) - MIXED, and
this is a dev-scaffolding driver ("not shipped"); it was never built on CPU
(absent from the CPU Gate-0 set), so there is no prior CPU pass to match.
The driver hardcodes `n_gpu_layers = 0`; a reported test-harness-only env
override (`PAGED_NGL`) was added to run it at `-ngl 99` (29/29 layers
offloaded confirmed), then reverted. Results are IDENTICAL on CPU and GPU
(so not a GPU issue):
- PASS: measured recompute-skip (32 prefix tokens skipped, block-aligned),
ref-count == 2 on shared block, ref drop 2->1 on free, only-private-blocks
returned, block returned to pool.
- FAIL: 2 of ~16 greedy-token-equality assertions. `boundary` case diverges
from the from-scratch baseline at the 2nd generated token (`17971` vs
`5671`) and then completely; `mid-block` "A re-shareable after free, output
unchanged" also differs. Driver prints `GATE FAILED (failures=2)`.
This is a divergence in the prefix recompute-skip path (0006/0007), NOT in the
core gather gate, and not GPU-specific. Reported, not fixed (out of scope).
4. 32B GPU sanity - PASS. `LLAMA_KV_PAGED=1 llama-simple -ngl 99 -n 16` on
Qwen3-32B-Q4_K_M (65/65 layers offloaded): coherent output
("The capital of France is Paris..."), no crash, no OOM.
## Deliverable 2 - full backend build with the paged patches
Built in a nested LocalAI tree on the DGX; gRPC v1.59.0 built from source
(LocalAI bundle; the system protobuf ships no CMake CONFIG) in ~26 min.
- (2a) `make llama.cpp LLAMA_PAGED=on` - PASS. All 6 paged patches
(0001,0002,0003,0004,0006,0007) `git apply` cleanly to the pin (EXIT=0). The 8
vendored paged sources land in `llama.cpp/src/` and are BYTE-IDENTICAL to the
dev tree; `grpc-server.cpp` carries the `kv_paged`/`paged_attention` option
(patch 0005); `llama-kv-cache.cpp` has the env-gated hooks.
- (2b) grpc-server under CUDA sm_121 - PASS (with the single-application caveat
below). 89 MB ARM aarch64 executable, build ~139 s, linked against
libcudart.so.13 / libcublas.so.13; binary contains the paged option strings
and `paged_alloc`/`paged_attn`/gather symbols.
- (2c) `make llama.cpp LLAMA_PAGED=off` - PASS. "skipping paged-attention patch
series", EXIT=0, NO `paged-*` sources in the checkout (clean escape hatch).
### Build-flow finding: paged patches are applied TWICE in the on-flow
A plain `make grpc-server LLAMA_PAGED=on` FAILS to compile. The paged series is
applied by BOTH the Makefile `llama.cpp` target (`git apply`) AND `prepare.sh`
(`patch -p1`). On the already-git-applied tree, `prepare.sh` hits "Reversed (or
previously applied) patch detected! Assume -R? [n]", declines, and re-applies the
pure-addition hunks a second time. `llama_kv_cache::get_n_gather` etc. end up
defined twice -> redefinition errors in `llama-kv-cache.cpp` (`.rej`/`.orig`
litter `src/`). Single application (one of the two appliers) compiles clean -
the 2b build above used a single git-apply with `prepare.sh` patching suppressed.
Reported only; the fix (drop one of the two application sites for
`patches/paged/`) is out of scope for this verification.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,111 +0,0 @@
# Paged llama.cpp vs vLLM - apples-to-apples (batched + NVFP4 + prefix cache)
Definitive matched comparison on a DGX Spark (GB10, sm_121). Both engines batched,
both NVFP4-class weights, both with prefix caching on, both eager (no CUDA graphs).
Workload: shared 1024-token system prefix + unique 32-token suffix, generate 64
tokens, K requests fired concurrently (cold fan-out), one client hitting both
OpenAI-compatible servers with identical token-id prompts.
This run fixes the two confounders in the earlier comparison (a *serial* Q4_K dev
driver vs a *batched* FP4 vLLM server). Here both sides are batched and NVFP4.
## Setup
- llama.cpp: `llama-server` built from the paged dev tree (`~/llama-paged-dev`,
branch `paged`, patches 0001-0007), CUDA `build-cuda/` (sm_121).
`LLAMA_KV_PAGED=1`, `-ngl 99 --parallel 32 -c 40960`, model
`q3-32b-nvfp4-dense.gguf` (NVFP4 weights, FP4-MMA kernel). Endpoint: `/completion` (llama-server's native completion route).
- vLLM 0.23.0: `vllm serve q3-32b-nvfp4a16/` (compressed-tensors W4A16 / Marlin),
`--enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.9
--max-num-seqs 64`, APC on (default). OpenAI `/v1/completions`.
## Finding 1 - the paged cross-request prefix cache does NOT engage in llama-server
This is itself a key result. The paged engine has two distinct mechanisms:
1. Physical paged block placement (patches 0002/0004) - runs inside
`llama_kv_cache::find_slot`, gated only by `LLAMA_KV_PAGED`. This DOES engage in
the server: with `LLAMA_KV_PAGED_DEBUG=1`, 2 concurrent shared-prefix requests
produced 14 `[paged-alloc] ... grew` lines, one stream per `seq`.
2. Cross-request prefix recompute-skip (patch 0007) - the actual fan-out win
(`shares N prefix blocks ... prefix NOT recomputed`, ref-counted block sharing).
This is reachable ONLY through `paged_prefix_api::share/commit`
(`src/paged-prefix-api.cpp`), which only the standalone driver calls.
Evidence it does not reach the server:
- Static: `grep -rn "paged_prefix\|share_prefix\|LLAMA_KV_PAGED" tools/server/`
returns nothing; `nm` on the binary finds no `paged_prefix` symbol use from the
server path. Nothing in `llama_decode` or the server calls `share`/`commit`.
- Runtime: the 2-request verify run logged **0** `shares prefix blocks` /
`NOT recomputed` lines. Both `seq=0` and `seq=1` independently grew to 65 blocks,
each allocating and recomputing the full ~972-token prefix separately - no
cross-slot KV block sharing, no `ref_cnt>1`.
So the 0007 recompute-skip, proven in the driver, does **not** yet reach the
server. Closing it needs server-side wiring: when admitting a slot whose prompt
shares a prefix with another live/committed slot, the server would have to call
the `paged_prefix_api::share` / `commit` seam. That is a future patch.
Note: llama-server has its OWN native prefix reuse (the slot prompt cache /
"context checkpoints"). In the K=32 wave the server reused the prefix cached by the
earlier wave, so prefill was only the 32-token suffix (`prompt eval ... / 32
tokens`). But that is a separate mechanism, it only helps prefill, and prefill is
not the bottleneck here (see below), so it does not change the verdict.
## Finding 2 - the matched comparison
Both batched, both NVFP4, both prefix-cache on, both eager. Cold concurrent fan-out,
identical token-id prompts via one client.
| K | engine | wall (s) | aggregate gen tok/s | req/s | vLLM speedup |
|----|----------|----------|---------------------|-------|--------------|
| 16 | llama.cpp| 50.7 | 18.9 | 0.30 | - |
| 16 | vLLM | 8.57 | 119.5 | 1.87 | ~5.9x |
| 32 | llama.cpp| 58.3 | 34.0 | 0.53 | - |
| 32 | vLLM | 8.86 | 231.1 | 3.61 | ~6.6x |
vLLM APC confirmed engaged: prefix cache hit rate 90.9% (K=16), 95.5% (K=32),
enforce_eager (CUDA graphs disabled), `enable_prefix_caching=True`.
### Verdict: not competitive - vLLM ~6x faster, and prefix caching is not why
With every confounder removed (both batched, both NVFP4, both eager, both with
prefix caching on), vLLM is still ~6x faster end-to-end. The gap is decode-bound,
not prefill/cache-bound:
- The G=64 workload is dominated by decode. In the llama K=32 run, decode was
52.98s of the 58.3s wall; prefill was ~3.5s (and only the 32-token suffix, since
the server's native prompt cache already reused the prefix). So even perfect
prefix sharing - paged or native - cannot move the total much.
- llama.cpp batched decode: **~828 ms per decode step** at batch 32
(1.21 tok/s per sequence).
- vLLM batched decode: ~170 tok/s aggregate gen at 32 running reqs ->
**~185 ms per step**, roughly **4-5x faster per decode step**.
- CUDA graphs are NOT the differentiator: both sides are eager (llama
`graphs reused = 0`, vLLM `--enforce-eager`). The win is vLLM's batched-decode
efficiency: PagedAttention + fused W4A16 (Marlin) GEMMs + chunked-prefill
scheduler, versus llama.cpp's per-step eager graph and NVFP4-GGUF decode path on
this Blackwell-class part.
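The per-step figures above can be re-derived with two one-line conversions. This is a back-of-envelope sketch that assumes each decode step emits one token per running sequence (so G=64 means 64 steps); the helper names are illustrative.

```python
def ms_per_step_from_span(decode_span_s: float, n_steps: int) -> float:
    """Step latency when the total decode span and step count are known."""
    return 1000.0 * decode_span_s / n_steps


def ms_per_step_from_aggregate(aggregate_tok_s: float, running_seqs: int) -> float:
    """Step latency when only aggregate tok/s and batch size are known."""
    return 1000.0 * running_seqs / aggregate_tok_s


llama_ms = ms_per_step_from_span(52.98, 64)      # llama.cpp K=32 decode span, G=64
vllm_ms = ms_per_step_from_aggregate(170.0, 32)  # vLLM ~170 tok/s at 32 running reqs

print(f"llama.cpp {llama_ms:.0f} ms/step, vLLM {vllm_ms:.0f} ms/step, "
      f"ratio {llama_ms / vllm_ms:.1f}x, "
      f"llama.cpp per-sequence {1000.0 / llama_ms:.2f} tok/s")
```

This lands on about 828 ms per step for llama.cpp and just under 190 ms for vLLM, consistent with the rounded figures and the 4-5x per-step ratio quoted above.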
Because decode dominates, wiring the paged 0007 recompute-skip into the server
(Finding 1) would mainly remove redundant prefill across slots - a real saving for
short-generation / long-prefix RAG fan-out, but at G=64 it is a few seconds against
a decode floor that is already ~6x slower than vLLM. The fan-out win does not, on
its own, make llama.cpp competitive here; the decode kernel/batching gap is the
load-bearing factor.
## Caveats
- NVFP4-GGUF is double-quant and is speed-representative (it routes onto the
FP4-MMA kernel); output quality is not the subject of this run.
- vLLM side is NVFP4A16 (W4A16 / Marlin) - 4-bit weights, 16-bit activations;
llama side is NVFP4 weights on FP4-MMA. Both are NVFP4-weight class.
- One llama request per run hit an intermittent HTTP 500 ("output does not match
the expected Content-only format" - a Qwen3 thinking-output quirk on
`/completion`), so llama counts were 15/16 and 31/32. The failed request returns
early and reduces batch contention for the rest, so a clean 16/16 / 32/32 llama
run would be marginally slower - i.e. the ~6x gap reported here is conservative
(favorable to llama.cpp).
- Both servers cold-started; numbers are end-to-end wall from the concurrent
client. Disk healthy (~325 GB free), GPU otherwise idle.


@@ -1,165 +0,0 @@
# Paged-attention closing measurements: stock GPU determinism + vLLM comparison
Two closing measurements for the paged-attention series, run on a DGX Spark
(NVIDIA GB10, compute capability 12.1 / sm_121), CUDA 13. Dev tree
`~/llama-paged-dev` branch `paged`, paged engine gated by env `LLAMA_KV_PAGED`
(default-off = stock). Models: `Qwen3-0.6B-Q8_0.gguf` and
`Qwen3-32B-Q4_K_M.gguf` (llama.cpp), `Qwen3-32B` nvfp4a16 / W4A16 HF safetensors
(vLLM 0.23.0). All dev drivers are dev-tree-only and not shipped.
## Deliverable 1: stock GPU determinism across batch shapes (no paging)
Question: is the patch-0007 GPU byte-identity "failure" (a near-tie greedy token
flips on CUDA, e.g. 17971 vs 5671) caused by paging, or is it inherent stock
CUDA non-determinism from running the same tokens in a different batch shape?
Method: a new dev-only driver `llama-paged-batchshape` (paging explicitly OFF:
`unsetenv("LLAMA_KV_PAGED")`). For a prompt `[P+S]` it greedy-decodes two ways,
both stock contiguous KV:
- (a) `full` - prefill the whole `[P+S]` in ONE `llama_decode`.
- (b) `split` - prefill `P` in one `llama_decode`, then `S` in a second.
The two paths write byte-for-byte identical token ids; the only difference is the
batch shape submitted to the kernels (full prefill vs P-then-S), which changes
the float reduction order in the GEMMs and therefore the KV values by tiny
amounts. 5 distinct prompts, suffix S=16.
### Single next token (the literal T_full vs T_split)
Both CPU and CUDA returned the SAME greedy next token for all 5 prompts
(0/5 flips). BUT the top-2 logit gap measurably changes with the batch shape on
CUDA, proving the float order does differ:
```
CUDA, S=8: prompt 1 T_full=1896 (gap 0.07072) T_split=1896 (gap 0.17986)
CUDA, S=8: prompt 4 T_full=49584 (gap 0.93304) T_split=49584 (gap 0.85785)
```
The argmax simply did not flip on the immediate next token for these prompts -
the gaps, while shifting, stayed wide enough.
### Generated stream (what 0007 actually byte-asserts)
0007 asserts byte-identity over a *generated* token stream, where the tiny
prefill-shape KV perturbation accumulates and eventually crosses a near-tie.
Generating G tokens greedily from `full` vs `split` and reporting first
divergence:
| gen length | CPU diverged | CUDA diverged |
|-----------|--------------|---------------|
| G=24 (0007 default) | 1/5 (prompt 0 @ step 5) | 2/5 (prompt 1 @ step 3, prompt 4 @ step 6) |
| G=64 | 2/5 (steps 5, 42) | 3/5 (steps 3, 6, 30) |
Example CUDA divergence, pure stock, zero paging:
`prompt 1: DIVERGES at gen step 3: full=1260 split=576`.
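The comparison reported above reduces to finding the first index at which two greedy token streams disagree, plus the top-2 logit gap that indicates how close a token was to flipping. A minimal sketch, independent of the `llama-paged-batchshape` driver (its source is not shown here); the 0-based step index and the example token streams are assumptions, apart from the `1260` vs `576` pair.

```python
from typing import Optional, Sequence


def first_divergence(full: Sequence[int], split: Sequence[int]) -> Optional[int]:
    """Index of the first generated token that differs, or None if none does."""
    for step, (a, b) in enumerate(zip(full, split)):
        if a != b:
            return step
    return None


def top2_gap(logits: Sequence[float]) -> float:
    """Difference between the largest and second-largest logit."""
    top, second = sorted(logits, reverse=True)[:2]
    return top - second


# Hypothetical streams: identical for three steps, then a flip at step 3.
full = [11, 22, 33, 1260, 44]
split = [11, 22, 33, 576, 55]
print("first divergence at step", first_divergence(full, split))
print("top-2 gap:", top2_gap([0.1, 2.5, 2.4, -1.0]))
```

A small top-2 gap at some step means a tiny KV perturbation is enough to flip the argmax there, after which the two streams diverge autoregressively.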
### Verdict (Deliverable 1): HYPOTHESIS HELD
The 0007 GPU byte-identity failure is **stock batch-shape non-determinism, not a
paged bug**. With paging entirely OFF, stock llama.cpp produces a different
greedy token stream when the same prompt is processed in a full-prefill batch vs
a split (prefix-then-suffix) batch - exactly the shape difference that 0007's
prefix-share path introduces (full B-from-scratch vs prefix-cached + suffix-only).
Refinement (reported honestly): it is **not strictly CUDA-only**. CPU exhibits
the same divergence, just less often and later (1/5 vs 2/5 at G=24, and CPU's
flips land at later generation steps). This is exactly why 0007's small, short
CPU scenarios happened to pass 16/16 while the CUDA run flipped: CUDA's larger
parallel reductions reorder more aggressively, so a near-tie crosses earlier and
more frequently. The phenomenon is floating-point GEMM-batching non-determinism,
inherent to both backends; paging is not the cause.
## Deliverable 2: vLLM vs llama.cpp+paged on a shared-prefix fan-out
Workload: K requests share a 1024-token system prefix, each with a unique
32-token suffix, then generate 64 tokens. Both engines cache the shared prefix
(vLLM automatic prefix caching ON by default; llama.cpp via the paged
cross-request prefix cache, `LLAMA_KV_PAGED=1`).
The quantization is not matched (a realistic apples-to-oranges pairing), reported honestly:
- llama.cpp: Qwen3-32B **Q4_K_M** (GGUF), `-ngl 99`, CUDA dequant kernels.
- vLLM: Qwen3-32B **nvfp4a16 (W4A16)**, served via the **Marlin FP4
weight-only** kernel because GB10 (sm_121) has **no native FP4 compute** -
i.e. vLLM is on a slower-than-ideal kernel path here. vLLM also ran
`enforce_eager=True` (no CUDA graphs / torch.compile; the env lacked a working
inductor/ninja toolchain), so the vLLM numbers are if anything **conservative**.
### vLLM (automatic prefix caching), end-to-end
APC hits confirmed in the engine log: **"Prefix cache hit rate: 97.0%"**,
`prefix_cache_hits 33040/34848` (K=16) and `99344/102432` (K=32).
| K | APC | prefill wall (G=1) | total wall (G=64) | throughput |
|---|-----|--------------------|--------------------|-----------|
| 16 | ON | 0.749 s | 6.63 s | 2.41 req/s |
| 16 | OFF | 20.19 s | 27.21 s | 0.59 req/s |
| 32 | ON | 1.13 s | 7.56 s | 4.23 req/s |
| 32 | OFF | 40.19 s | 48.71 s | 0.66 req/s |
vLLM's APC cuts the fan-out prefill ~27x (K=16) to ~36x (K=32) vs APC-off; the
huge ratio reflects how slow the FP4-emulation prefill is when forced to
recompute all K prefixes.
### llama.cpp + paged prefix cache (prefill phase)
The paged shared-prefix bench (`llama-paged-prefix-bench`, `BENCH_GEN=0`,
`PAGED_NGL=99`). Reuse confirmed: `kshare(seq1)=1024`, shared-block
`ref_cnt = K` (all sequences hold the one prefix), 15360 / 31744 prefix tokens
skipped.
| K | mode | prefill tokens submitted | prefill wall | vs no-share |
|---|------|--------------------------|--------------|-------------|
| 16 | PAGED-SHARE | 1536 | 3.66 s | 7.15x |
| 16 | NO-SHARE | 16896 | 26.17 s | 1.0x |
| 32 | PAGED-SHARE | 2048 | 6.04 s | 10.3x |
| 32 | NO-SHARE | 33792 | 62.17 s | 1.0x |
The paged prefix cache delivers the expected **7.15x (K=16) / 10.3x (K=32)**
prefill wall-time reduction - the headline cross-request prefix-skip win, on a
real 32B model on GPU.
### Head-to-head, both engines caching the shared prefix
Prefill of the cached fan-out (vLLM G=1, ~prefill; llama.cpp G=0, pure prefill):
| K | llama.cpp+paged prefill | vLLM APC prefill | vLLM faster by |
|---|-------------------------|------------------|----------------|
| 16 | 3.66 s | 0.749 s | ~4.9x |
| 32 | 6.04 s | 1.13 s | ~5.3x |
### Verdict (Deliverable 2): competitive in kind, behind in absolute terms
With both engines caching the shared prefix, **llama.cpp+paged is qualitatively
competitive but absolutely behind vLLM on this GB10 box**:
- **Same optimization, same order of magnitude.** llama.cpp's paged prefix cache
reproduces exactly the win vLLM's APC gives - skip the shared-prefix recompute
- and yields a 7-10x prefill reduction vs its own no-share baseline. On the
RAG/system-prompt fan-out the algorithmic gap is closed: llama.cpp no longer
pays K x prefix.
- **vLLM still wins head-to-head by ~5x on the cached prefill** (0.75s vs 3.66s
at K=16; 1.13s vs 6.04s at K=32), and by more end-to-end because it does
**continuous batched decode** (all K sequences decoded in one fused step)
while the llama.cpp paged *dev driver* decodes each sequence serially. That
decode-batching gap is a property of the serving stack, not of the paged
prefix cache. Notably vLLM wins here while handicapped (eager mode, FP4
weight-only emulation with no native FP4 on GB10); a tuned vLLM would lead by
more.
- **Honest caveats / blockers.** (1) Quant differs (Q4_K_M vs nvfp4a16). (2) The
comparison is prefill-vs-prefill plus vLLM end-to-end; a clean llama.cpp
end-to-end on this driver is blocked because its generation phase has a
stale-logits bug (`get_logits_ith` reads seq 0's prefill index after later
sequences' prefills overwrote the logits buffer -> segfault), and even fixed
its decode is serial, so it would not be apples-to-apples vs vLLM's batched
decode. The fair end-to-end llama.cpp number needs the grpc / llama-server
continuous-batching path, not this dev scaffold. (3) vLLM ran eager + FP4
emulation, making its numbers conservative.
Bottom line: paged gives llama.cpp the cross-request prefix-skip that vLLM's APC
provides, which is the categorical win and removes the K x prefix penalty on
RAG/system-prompt fan-out. On absolute wall-time on this hardware vLLM retains a
~5x prefill lead and a larger end-to-end lead from continuous batched decode and
a more optimized serving stack.
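The "K x prefix" penalty can be made concrete as a prefill token count. A minimal sketch, assuming a hypothetical 4096-token shared prefix and a 128-token unique suffix per request (neither length is stated in this report); it counts tokens only and says nothing about wall time on either engine:

```python
def prefill_tokens(k: int, prefix_len: int, suffix_len: int, share_prefix: bool) -> int:
    """Count prompt tokens that must be computed for k requests with a common prefix."""
    if share_prefix:
        # The prefix is computed once and reused; each request only adds its suffix.
        return prefix_len + k * suffix_len
    # No sharing: every request recomputes the full prefix.
    return k * (prefix_len + suffix_len)


PREFIX_LEN = 4096  # hypothetical, not from the benchmark
SUFFIX_LEN = 128   # hypothetical, not from the benchmark

for k in (16, 32):
    no_share = prefill_tokens(k, PREFIX_LEN, SUFFIX_LEN, share_prefix=False)
    shared = prefill_tokens(k, PREFIX_LEN, SUFFIX_LEN, share_prefix=True)
    print(f"K={k}: no-share={no_share} shared={shared} ratio={no_share / shared:.1f}x")
```

The token-count ratio grows with K, which is why the penalty is described as K x prefix; the measured wall-time reduction depends on the real prefix and suffix lengths.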


@@ -1,464 +0,0 @@
# Qwen3.6 NVFP4-vs-NVFP4: llama.cpp vs vLLM on GB10 (DGX Spark)
Apples-to-apples benchmark. Both engines run the **same NVFP4 weights** on the **same box**
(GB10, sm_121, LPDDR5x unified memory ~273 GB/s). The question is not "who wins the HW
lottery" but "at matched NVFP4, on one bandwidth-limited box, does our paged llama.cpp
(patch 0015, expert-density-aware MoE token-tile auto-select, default-on) sit at par with /
ahead of / behind vLLM?"
---
# FINAL shipping benchmark (patch 0023, f32 bit-exact build) — 2026-06-26
This is the **publishable, plot-ready** apples-to-apples result. Both engines at their **best
realistic config** (no handicapping either side), matched NVFP4 weights, one clean GB10 box
(LocalAI service containers stopped for the duration, restored after). Raw rows in
[`final_benchmark.csv`](final_benchmark.csv); per-row checkpoint log in
[`BENCHMARK_PROGRESS.md`](BENCHMARK_PROGRESS.md).
## Build under test (the clean shipping result)
- **llama.cpp** = patch **0023**, dev tree `~/llama-paged-dev` HEAD **`f7409c2`**, git-clean
(the shelved bf16-GDN-state work was reverted; `git diff` empty at HEAD before the
`build-cuda` rebuild). Greedy gate confirmed canonical f32 output on both models. The bf16
GDN-state path is **shelved** (it fails the f32 KL gate); the shipped plateau is the
**95%-bit-exact f32** stack (patches 0018-0023). dense greedy md5 `5951a5b4…`, MoE
`07db32c2…` are the 0023 references (the *transcript* md5 also encodes llama-cli UI chrome,
which has since changed, so the build was verified instead via the clean git tree + full
rebuild + the greedy numerical gate).
## Config (both engines at BEST realistic config)
- **llama-server**: `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
`LLAMA_KV_PAGED=1`, **CUDA graphs ON** (`USE_GRAPHS=1`, default), and the QoS prefill budget
**`LLAMA_MAX_BATCH_TOKENS=512`** (patch 0016 decode-first dynamic budget). 512 is the
`n_ubatch` floor and is the best of the swept budgets: at npl32 it gives 133 s TTFT vs
**394 s for stock** (no budget) — lower budget = stronger decode-first = better burst TTFT,
and decode throughput is budget-independent.
- **vLLM 0.23.0**: its strongest honest decode config — **CUDA graphs ON** (NOT
`--enforce-eager`; `cudagraph_mode=FULL_AND_PIECEWISE`), `--gpu-memory-utilization 0.85
--max-model-len 4096 --max-num-seqs 256 -tp 1`, chunked prefill on, prefix caching off.
- **Client** (`h2h_cli3.py`, identical async harness both sides): 512-token **unique-nonce**
prompt (fresh full prefill every request, defeats all prefix caching), `max_tokens=256`,
`temperature=0`, `ignore_eos=True`, streaming with usage; concurrency npl 8/32/64/128.
- **Precision asymmetry (in llama's disfavour, yet llama still competes)**: llama runs
**f32 GDN recurrent state + q8 activations**; vLLM runs **bf16 GDN state + w4a4**. The
numbers below are llama at *higher* precision.
## DENSE — Qwen3.6-27B NVFP4 (`q36-27b-nvfp4`)
| npl | engine | decode_agg t/s | decode_perseq t/s | prefill t/s | ttft_mean ms | peak_gb | engine_gb |
|----:|--------|---------------:|------------------:|------------:|-------------:|--------:|----------:|
| 8 | llama | **82.5** | 9.57 | 507 | 6 038 | 53.5 | 50.2 |
| 8 | vLLM | 70.4 | 8.76 | 2096 | 1 861 | 110.9 | 107.6 |
| 32 | llama | **192.6** | 4.79 | 115 | 133 552 | 69.6 | 66.3 |
| 32 | vLLM | 211.8 | 6.28 | 2183 | 5 353 | 110.9 | 107.6 |
| 64 | llama | **277.8** | 3.09 | 96 | 321 619 | 84.0 | 80.6 |
| 64 | vLLM | 309.1 | 4.38 | 2089 | 9 512 | 110.9 | 107.6 |
| 128 | llama | **384.6** | 1.86 | 70 | 902 763 | 93.8 | 90.5 |
| 128 | vLLM | 418.8 | 2.79 | 1929 | 18 450 | 111.0 | 107.6 |
**llama decode as % of vLLM (dense):** npl8 **117%**, npl32 **91%**, npl64 **90%**, npl128 **92%**.
## MoE — Qwen3.6-35B-A3B NVFP4 (`q36-35b-a3b-nvfp4`)
| npl | engine | decode_agg t/s | decode_perseq t/s | prefill t/s | ttft_mean ms | peak_gb | engine_gb |
|----:|--------|---------------:|------------------:|------------:|-------------:|--------:|----------:|
| 8 | llama | 211.8 | 24.45 | 1236 | 2 477 | 39.7 | 36.1 |
| 8 | vLLM | 256.5 | 31.84 | 5187 | 769 | 109.6 | 106.3 |
| 32 | llama | 393.0 | 10.02 | 1214 | 8 225 | 47.1 | 43.8 |
| 32 | vLLM | 500.8 | 14.90 | 6223 | 1 830 | 109.6 | 106.4 |
| 64 | llama | 527.0 | 6.15 | 1152 | 15 850 | 57.1 | 53.8 |
| 64 | vLLM | 686.1 | 9.83 | 5927 | 3 224 | 109.6 | 106.4 |
| 128 | llama | 726.4 | 3.73 | 277 | 213 017 | 61.5 | 58.2 |
| 128 | vLLM | 882.2 | 6.05 | 5301 | 6 488 | 109.6 | 106.4 |
**llama decode as % of vLLM (MoE):** npl8 **83%**, npl32 **78%**, npl64 **77%**, npl128 **82%**.
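The two "llama decode as % of vLLM" lines can be re-derived from the `decode_agg` columns. A short sketch, with the values copied from the dense and MoE tables above; the dictionary and function names are illustrative only:

```python
# decode_agg t/s from the FINAL tables above: {npl: (llama, vLLM)}
DENSE = {8: (82.5, 70.4), 32: (192.6, 211.8), 64: (277.8, 309.1), 128: (384.6, 418.8)}
MOE = {8: (211.8, 256.5), 32: (393.0, 500.8), 64: (527.0, 686.1), 128: (726.4, 882.2)}


def pct_of_vllm(rows):
    """Return llama decode throughput as a whole-number percentage of vLLM, per npl."""
    return {npl: round(100.0 * llama / vllm) for npl, (llama, vllm) in rows.items()}


print("dense:", pct_of_vllm(DENSE))
print("MoE:  ", pct_of_vllm(MOE))
```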
## Plots (decode throughput vs concurrency)
Generated from [`final_benchmark.csv`](final_benchmark.csv) (matplotlib); the per-point label is
llama as a share of vLLM decode at that concurrency.
![dense decode vs npl](qwen36_dense_decode_vs_npl.png)
![MoE decode vs npl](qwen36_moe_decode_vs_npl.png)
## The honest public story (let the numbers speak)
1. **Decode throughput — the headline.** On the dense 27B, paged llama.cpp **matches/beats
vLLM**: 117% of vLLM at npl8 and a steady **90-92%** across npl32-128 — at *higher*
precision (f32 GDN state + q8 act vs vLLM bf16 + w4a4). On the MoE 35B-A3B llama lands at
**77-83%** of vLLM decode — close, but vLLM's fused grouped-GEMM MoE keeps a clear edge.
2. **Memory — a decisive llama win.** vLLM's pre-reserved pool is a **flat ~107 GB** at every
concurrency (the `--gpu-memory-utilization 0.85` design). llama's **on-demand paged KV**
uses **50-90 GB (dense)** and **36-58 GB (MoE)**, growing with load: at the operating point
most people actually run (npl≤32) llama uses **~1.5-3× less unified memory**, and even at
npl128 it stays below vLLM. This is the "fits where vLLM OOMs" axis.
3. **TTFT — vLLM's win, llama's disclosed tradeoff.** vLLM's chunked prefill absorbs a
128-way simultaneous burst gracefully (6-18 s). llama's decode-first QoS budget protects
decode throughput by throttling burst-prefill, so TTFT climbs at high concurrency
(dense npl128 **903 s**, MoE npl128 **213 s**). It is *bounded relative to no-budget*
(stock is worse) but high in absolute terms under a synchronized burst. Under realistic
staggered arrival this is far milder; for a synchronized-burst benchmark it is the cost of
the decode-first scheduler. **Decode and memory are unaffected.**
**Bottom line for the GB10 / DGX Spark page:** with matched NVFP4 weights, paged llama.cpp
delivers **90-117% of vLLM dense decode** and **77-83% of vLLM MoE decode** at **equal-or-higher
precision** and **1.5-3× lower memory** (on-demand paged KV vs a fixed 107 GB pool). The
remaining gap is MoE-decode and burst-TTFT, not dense-decode or memory.
## Anomalies / methodology notes (rigour)
- **Paged-pool burst degradation (real, worked around).** After a high-npl burst, a llama
server's *subsequent lower-npl* prefill collapses (npl8 fresh = 507 t/s / 6 s TTFT; the same
npl8 *after* an npl64 burst = 65 t/s / 64 s TTFT). Decode is unaffected. To measure clean
per-config prefill/TTFT, **the llama server is restarted per npl** (cheap vs the prefill
cost). vLLM has no such degradation — verified by an end-of-sweep npl8 re-check that matched
the opening npl8 (dense 70.4→73.4, MoE 256.5→226.4) — so vLLM uses one server per combo.
- **Fresh-prefill discipline.** Every measured request uses a unique nonce so prefill is always
a full fresh compute (the task's "defeat prefix caching" intent); vLLM ran with
`enable_prefix_caching=False`, llama with `cache_prompt:false`. Apples-to-apples.
- **No bimodality observed.** With per-npl restart + a cheap (ptok=8) graph warmup, the early
two-pass checks matched within <0.5% (npl8 486/484 t/s), so the headline uses one stable
measured pass per (model,engine,npl).
- **Clean environment.** The benchmark's peak (dense ~94 GB) plus the idle LocalAI worker's
~30 GB resident model OOM-cycled the service containers on the first attempt and corrupted
one run; the `local-ai`/`local-ai-worker` containers were stopped for the measurement
(baseline ~3.3 GB, ~120 GB free) and **restarted afterwards** to return the host.
- **peak_gb** is the peak absolute unified memory used (`MemTotal-MemAvailable`); `engine_gb` =
peak minus the ~3.3 GB OS baseline (the per-config engine footprint).
- **Internal-consistency check (decode_agg vs perseq×npl).** `decode_agg_tps` is the steady-state
aggregate over the decode window; `decode_perseq_tps` is each sequence's lifetime rate (output
tokens ÷ total request latency, so it *includes* the TTFT queue wait). They coincide when
TTFT ≪ decode-window (vLLM npl8: 70.4 vs 70.1, +0.5%) and diverge exactly as TTFT grows, on
**both** engines (the gap between agg and perseq×npl rises monotonically with `ttft_mean`: vLLM 0.5%→17%,
llama 8%→62% across npl8→128, mirroring its 6 s→903 s TTFT). The relationship is governed by
TTFT, not a measurement artifact, and the FINAL rows are distinct from the historical patch-0015
table (no stale-baseline carry-over).
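The internal-consistency check can be reproduced from the dense table. A minimal sketch, assuming the gap is defined as `decode_agg / (decode_perseq × npl) - 1` (the definition is inferred from the quoted percentages, not taken from the harness):

```python
# (engine, npl, decode_agg t/s, decode_perseq t/s) from the dense FINAL table
ROWS = [
    ("vLLM", 8, 70.4, 8.76),
    ("vLLM", 128, 418.8, 2.79),
    ("llama", 8, 82.5, 9.57),
    ("llama", 128, 384.6, 1.86),
]


def agg_gap_pct(agg: float, perseq: float, npl: int) -> float:
    """Percent by which the aggregate rate exceeds perseq x npl."""
    return 100.0 * (agg / (perseq * npl) - 1.0)


for engine, npl, agg, perseq in ROWS:
    print(f"{engine:5s} npl{npl:<3d} gap = {agg_gap_pct(agg, perseq, npl):.1f}%")
```

Because `decode_perseq` includes the TTFT queue wait, the gap widens as TTFT grows, on both engines.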
---
## Setup (historical — patch 0015 run; FINAL section above is the shipping 0023 result)
- **Box**: GB10 / DGX Spark, sm_121, unified LPDDR5x (~273 GB/s). Memory figures are
unified-memory used GB (`MemTotal-MemAvailable`), so they cover weights + KV + runtime.
- **llama.cpp**: dev tree `~/llama-paged-dev` branch `paged` HEAD `151343b` (patch 0015),
`build-cuda` sm_121, `LLAMA_KV_PAGED=1`, `llama-server -c 131072 --parallel 128 -b 2048
-ub 512 -ngl 99 -fa on`. **NOTE: run WITHOUT `max_prefill_tokens` (patch 0013) - see the
TTFT caveat in the verdict.**
- **vLLM**: 0.23.0, `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096
--max-num-seqs 256 -tp 1`.
- **Client**: identical async client for both engines. Per request: 512-token unique prompt
(unique leading tokens defeat cross-request prefix caching), `max_tokens=256`,
`temperature=0`, `ignore_eos=True`, streaming with usage. Concurrency (npl) swept 8/32/64/128.
- **Metrics** (localmaxxing.com schema): `decode_agg_tps` (aggregate decode tok/s across all
live seqs), `decode_perseq_tps` (mean per-sequence decode), `prefill_tps`, `ttft_mean_ms`,
`PEAK_GB` (unified-memory peak).
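For reference, the `MemTotal-MemAvailable` figure can be read from `/proc/meminfo` on Linux. A minimal sketch, not the harness's actual sampler: the function names are hypothetical, and converting kB to GB by dividing by 1024² is an assumption, since the report does not say whether GB means GiB:

```python
def used_gb(meminfo_text: str) -> float:
    """Return MemTotal - MemAvailable in GB from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # the kernel reports these in kB
    used_kb = fields["MemTotal"] - fields["MemAvailable"]
    return used_kb / (1024 ** 2)  # assumption: GB here means GiB


def engine_gb(peak_gb: float, baseline_gb: float) -> float:
    """Engine footprint = peak used memory minus the idle OS baseline."""
    return peak_gb - baseline_gb


# Synthetic sample for illustration; on a real box, read open("/proc/meminfo").read()
SAMPLE_MEMINFO = """MemTotal:       125829120 kB
MemFree:         1048576 kB
MemAvailable:   62914560 kB
"""

print(f"used: {used_gb(SAMPLE_MEMINFO):.1f} GB")
```

Polling `used_gb` during a run and keeping the maximum gives a peak figure of the same kind as `PEAK_GB`.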
## The 4 models (NVFP4, matched weights)
| Model | llama.cpp GGUF | vLLM checkpoint | Match |
|-------|----------------|-----------------|-------|
| DENSE Qwen3.6-27B (28B dense) | `q36-27b-nvfp4.gguf` (native Blackwell FP4) | `q36-27b-nvfp4-vllm/` (unsloth TRUE W4A4) | clean W4A4 both sides |
| MoE Qwen3.6-35B-A3B (36B total, ~3B active) | `q36-35b-a3b-nvfp4.gguf` (241 NVFP4 tensors, nvidia weights) | `q36-35b-a3b-nvfp4-vllm/` (nvidia modelopt; vLLM picks Marlin NvFp4 MoE + FA2) | NVFP4 weight-only, identical nvidia weights |
---
## Results (decode aggregate tok/s, per-seq, prefill, TTFT, peak GB)
### MoE Qwen3.6-35B-A3B (~3B active)
| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
| 8 | llama | 170.2 | 20.27 | 2813 | 855 | 38.98 |
| 8 | vLLM | 202.0 | 24.92 | 4648 | 799 | 111.49 |
| 32 | llama | 235.4 | 6.77 | 2005 | 4970 | 43.06 |
| 32 | vLLM | 462.0 | 13.59 | 4755 | 2308 | 111.26 |
| 64 | llama | 271.7 | 3.88 | 2389 | 7205 | 52.53 |
| 64 | vLLM | 624.5 | 8.90 | 4784 | 4072 | 111.46 |
| 128 | llama | 292.2 | 2.05 | 657 | 84800 | 61.42 |
| 128 | vLLM | 811.1 | 5.46 | 4263 | 7980 | 111.61 |
llama decode as % of vLLM: **84 / 51 / 44 / 36** at npl 8/32/64/128.
### DENSE Qwen3.6-27B
| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
| 8 | llama | 63.8 | 7.60 | 1117 | 2029 | 51.72 |
| 8 | vLLM | 64.3 | 7.98 | 1514 | 2593 | 112.07 |
| 32 | llama | 108.9 | 3.08 | 752 | 13212 | 61.48 |
| 32 | vLLM | 189.8 | 5.57 | 1555 | 7477 | 112.09 |
| 64 | llama | 126.2 | 1.78 | 465 | 53818 | 74.90 |
| 64 | vLLM | 284.2 | 3.92 | 1526 | 12942 | 112.11 |
| 128 | llama | 134.6 | 0.93 | 125 | 491195 | 94.03 |
| 128 | vLLM | 390.7 | 2.50 | 1420 | 24806 | 112.12 |
llama decode as % of vLLM: **99 / 57 / 44 / 34** at npl 8/32/64/128.
---
## Verdict
**At matched NVFP4 on one GB10 box: llama.cpp is at parity only at low concurrency; vLLM
scales substantially better as concurrency rises.**
1. **npl=8 (low concurrency): near parity.** Dense 99%, MoE 84% of vLLM decode. The MoE's
~3B active shows: per-seq decode 20-25 tok/s (MoE) vs 8 tok/s (dense) on both engines.
2. **npl>=32 (high concurrency): vLLM pulls decisively ahead** - decode ~1.7-2x (npl32) rising to
~2.8-2.9x (npl128) on both models. vLLM scales monotonically (dense 64->391, MoE 202->811);
llama plateaus (dense 64->135, MoE 170->292).
3. **TTFT is the clearest gap, and it is largely self-inflicted here.** llama's TTFT explodes
at high concurrency (dense **491 s**, MoE **85 s** at npl128) while vLLM stays bounded (25 s,
8 s). **This run used llama WITHOUT `max_prefill_tokens` (patch 0013)** - so 128 concurrent
512-token prefills starve each other and the decode. Crucially, that starvation also drags
`decode_agg` down: while many slots are stuck prefilling, fewer are actually decoding, so the
measured aggregate understates llama's steady-state decode. A re-run with `max_prefill_tokens`
(the QoS budget this PR already ships) is expected to bound TTFT AND raise high-concurrency
decode by keeping all slots live.
4. **Memory: llama wins on efficiency.** vLLM pre-reserves the whole pool (~112 GB at
gpu-mem-util 0.85); llama grows on demand (MoE 39->61 GB, dense 52->94 GB). The paged
on-demand KV is materially more memory-efficient / multi-tenant-friendly.
5. **vs the localmaxxing reference (259.5 MoE / 254.8 dense top-speed):** those are single-stream
on fast datacenter HW. GB10 per-seq decode tops out far lower (MoE ~25, dense ~8 tok/s at
npl8) - the LPDDR5x ~273 GB/s bandwidth floor, as expected. The reference is a ceiling, not a
GB10 target.
### Honest bottom line
The "par-or-beat vLLM" goal is **met at low concurrency but NOT at high concurrency** on these
NVFP4 models. vLLM's continuous-batched decode + bounded prefill scheduling scale better on a
bandwidth-limited box. Two of the three gap drivers are addressable on our side: (a) **prefill
starvation** - re-run with `max_prefill_tokens` (patch 0013), which this PR ships; (b) **decode
batching efficiency at high concurrency** - the runtime/scheduler lever (the small/unsaturated
regime). The kernel itself is at parity (npl8). Next step: a fair re-run with the prefill budget
on, plus decode-batch tuning, to get llama's true high-concurrency numbers before concluding the
absolute gap.
---
## Fair re-run (max_prefill_tokens on)
The prior tables ran llama-server **without** the QoS prefill budget (patch 0013). This section
re-runs the same A/B with `LLAMA_PREFILL_BUDGET` set, sweeping the per-step prompt-token cap over
**256 / 512 / 1024**. Everything else is byte-identical to the prior run: dev-tree llama-server
(branch paged, HEAD `151343b`), `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
`LLAMA_KV_PAGED=1`, same workload (512-token unique prompt, `max_tokens=256`, `temperature=0`,
`ignore_eos`), same harness (`h2h_moe_sweep.sh` -> `h2h_cli.py`). vLLM numbers are unchanged
(carried over from the committed dense table, not re-run).
### DENSE Qwen3.6-27B - budget sweep (decode agg tok/s | TTFT mean ms | peak GB)
| npl | metric | stock (no budget) | budget 256 | budget 512 | budget 1024 | vLLM |
|----:|--------|------------------:|-----------:|-----------:|------------:|-----:|
| 8 | decode agg | 63.8 | 63.5 | 63.8 | 63.5 | 64.3 |
| 8 | TTFT ms | 2029 | 4255 | 3756 | 2653 | 2593 |
| 32 | decode agg | 108.9 | 105.7 | 107.7 | 108.8 | 189.8 |
| 32 | TTFT ms | 13212 | 23114 | 18934 | 13912 | 7477 |
| 64 | decode agg | 126.2 | 132.0 | 131.2 | 118.2 | 284.2 |
| 64 | TTFT ms | 53818 | 109455 | 74272 | 92450 | 12942 |
| 128 | decode agg | 134.6 | **161.2** | 146.9 | 128.3 | 390.7 |
| 128 | TTFT ms | 491195 | **305423** | 543448 | 424058 | 24806 |
Peak host GB is budget-independent (on-demand paged KV grows with concurrency): ~51.5 (npl8) ->
~61.5 (npl32) -> ~74.7 (npl64) -> ~93.5 (npl128) for every budget, vs vLLM's flat ~112.1.
### Best budget = 256 (only the saturated npl128 regime benefits)
At the fully-saturated point (npl128), **budget 256 is the clear winner on both axes**:
- **decode_agg: 134.6 -> 161.2 tok/s (+19.8%)** vs the starved stock run.
- **TTFT mean: 491.2 s -> 305.4 s (-37.8%, -186 s)** vs stock.
- llama decode as % of vLLM at npl128: **34.5% -> 41.3%**. TTFT still ~12x vLLM's 24.8 s.
Larger budgets help less at npl128 (512 -> 146.9 tok/s; 1024 -> 128.3, i.e. slightly below stock) because a
looser cap lets a long prefill grab a bigger slice per step and re-introduce decode jitter. So
the tightest cap (256) protects in-flight decode the most when the box is saturated.
### Honest caveat: this bursty workload is the worst case for TTFT
At npl 8 / 32 / 64 the budget **raised** TTFT (e.g. npl8 2029 -> 4255 ms at budget 256) and left
decode_agg roughly flat. Reason: the harness fires all N requests simultaneously, so at t=0 there
is **no in-flight decode to protect** - capping prefill purely defers first tokens. The budget
only pays off once enough slots are decoding that an unbounded prefill would starve them, which on
this box happens only at npl128. Budget 1024 tracks stock closely at light load (npl8 TTFT 2653 ~
stock 2029) because a 512-token prompt fits in one <=1024 step. In a steadier (staggered) arrival
pattern the budget would protect decode jitter without the burst-TTFT penalty; that regime is not
exercised here.
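Why a cap defers first tokens in an all-at-once burst can be shown with a toy model. A sketch under loud assumptions: FIFO admission, no decode work, and time measured in scheduler steps rather than seconds. This is not patch 0013's actual scheduling logic, and `first_token_steps` is a hypothetical name:

```python
def first_token_steps(n_requests: int, prompt_len: int, budget: int) -> list:
    """Per request, the 1-based step on which its prefill completes (toy FIFO model)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    remaining = [prompt_len] * n_requests
    done_step = [0] * n_requests
    step = 0
    while any(r > 0 for r in remaining):
        step += 1
        left = budget  # prompt tokens still admissible in this step
        for i in range(n_requests):
            if left == 0:
                break
            if remaining[i] > 0:
                take = min(remaining[i], left)
                remaining[i] -= take
                left -= take
                if remaining[i] == 0:
                    done_step[i] = step
    return done_step


for budget in (256, 512, 1024):
    print(budget, first_token_steps(8, 512, budget))
```

With a 512-token prompt, a 1024 cap finishes two prompts per step while a 256 cap needs two steps per prompt. Steps under a smaller cap are also cheaper, so step counts do not translate directly into the TTFT values in the table.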
### Bottom line (dense)
The prefill budget is a **real but narrow** lever on this workload: at maximum saturation
(npl128) budget=256 lifts decode_agg ~20% and cuts TTFT ~38% vs the starved run, moving llama
from 34.5% to 41.3% of vLLM decode. It does **not** close the gap - vLLM still decodes ~2.4x
faster and keeps TTFT ~12x lower at npl128, and scales monotonically where llama plateaus. At
light/moderate concurrency the budget is net-negative for TTFT in this all-at-once workload, so it
should be applied selectively (high-concurrency serving), not as an unconditional default.
## MoE 35B-A3B fair re-run (max_prefill_tokens on)
Same build (HEAD 151343b, P0+P1 patch 0015), same flags (`-c 131072 --parallel 128 -b 2048
-ub 512 -ngl 99 -fa on`, `LLAMA_KV_PAGED=1`), same all-at-once harness (512-tok unique prompt,
gen 256, temp 0, ignore_eos). Swept the dense winner budget 256 plus neighbor 512.
### Primary table - budget 256 (decode_agg tok/s | TTFT mean ms | peak host GB)
| npl | stock (no budget) | budget 256 (best) | budget 512 | vLLM |
|----:|------------------:|------------------:|-----------:|-----:|
| 8 | 170.2 / 855 / - | 169.3 / 1655 / 38.95 | 172.1 / 1488 / 38.82 | 202.0 / 799 |
| 32 | 235.4 / 4970 / - | 239.0 / 9034 / 42.93 | 234.7 / 7260 / 42.72 | 462.0 / 2308 |
| 64 | 271.7 / 7205 / - | 277.0 / 16249 / 51.96 | 274.5 / 13660 / 52.53 | 624.5 / 4072 |
| 128 | 292.2 / 84800 / - | **333.5 / 98106 / 61.42** | 300.8 / 92470 / 61.45 | 811.1 / 7980 |
Peak host GB (paged KV, budget-independent): ~38.9 (npl8) -> ~42.8 (npl32) -> ~52 (npl64) ->
~61.4 (npl128). Far below the dense run (94 GB @npl128) - only ~3B params are active, so the KV
plus activations footprint stays light even fully saturated.
### MoE inverts the dense story: the budget buys decode, NOT TTFT
Unlike the dense 27B (where the stock run was prefill-starved to 491 s TTFT @npl128 and the budget
cut it 38%), the MoE stock run was **never prefill-starved**: 3B active params make prefill cheap,
so stock TTFT @npl128 was already only 84.8 s. Capping prefill therefore cannot rescue TTFT - it
can only **defer first tokens to free decode steps**. Result at npl128 with budget 256:
- **decode_agg: 292.2 -> 333.5 tok/s (+14.1%)** vs the stock run.
- **TTFT mean: 84.8 s -> 98.1 s (+15.7%, WORSE)** - the budget costs latency here.
- llama decode as % of vLLM @npl128: **36.0% -> 41.1%**. TTFT now ~12.3x vLLM's 7.98 s.
Budget 512 is the milder trade (decode +3.0% to 300.8, TTFT +9.0% to 92.5 s @npl128). Budget 256
maximizes decode throughput; 512 if you want to bleed less TTFT. At npl 8/32/64 both budgets are
net-negative or flat on decode and clearly raise TTFT (e.g. npl64 7.2 s -> 16.2 s @b256), the same
all-at-once burst artifact seen in the dense run.
### Does the ~3B-active decode scale better now? Yes - the plateau is gone
The headline win is the **decode scaling curve**, not any single point:
| npl step | stock decode_agg | budget-256 decode_agg |
|---------:|-----------------:|----------------------:|
| 8 -> 32 | 170 -> 235 (+38%) | 169 -> 239 (+41%) |
| 32 -> 64 | 235 -> 272 (+16%) | 239 -> 277 (+16%) |
| 64 -> 128| 272 -> 292 (**+7.4%**, plateauing) | 277 -> 333.5 (**+20.4%**, still climbing) |
Stock MoE decode **plateaus** at saturation (+7.4% over the last doubling) because unbounded
prefills keep stealing steps from the many ready decode slots. Budget 256 removes that ceiling -
decode keeps climbing +20% into npl128, so more of the 128 slots actually decode concurrently.
This is the cleanest evidence that patch 0013 protects in-flight decode once enough slots are live.
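The scaling-curve table can be recomputed from the MoE `decode_agg` values. A small sketch; it uses the unrounded table values, so the last digit can differ slightly from the table above, which rounds the inputs first:

```python
# MoE decode_agg tok/s per npl, from the primary table above
STOCK = {8: 170.2, 32: 235.4, 64: 271.7, 128: 292.2}
BUDGET_256 = {8: 169.3, 32: 239.0, 64: 277.0, 128: 333.5}


def step_growth_pct(curve):
    """Percent decode growth between consecutive npl points."""
    npls = sorted(curve)
    return {(a, b): 100.0 * (curve[b] / curve[a] - 1.0) for a, b in zip(npls, npls[1:])}


for name, curve in (("stock", STOCK), ("budget 256", BUDGET_256)):
    growth = step_growth_pct(curve)
    print(name, {f"{a}->{b}": f"{g:+.1f}%" for (a, b), g in growth.items()})
```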
### Bottom line (MoE)
For the A3B MoE the prefill budget is a **decode-throughput lever, paid for in TTFT** - the mirror
image of the dense case. Budget 256 lifts decode_agg +14% @npl128 and, more importantly, restores
monotonic decode scaling (kills the stock plateau), moving llama from 36.0% to 41.1% of vLLM
decode - the same ~41% ceiling the dense run hit. It does **not** close the gap: vLLM still decodes
~2.4x faster (811 vs 333.5) and holds TTFT ~12x lower (8.0 s vs 98.1 s) @npl128, and scales
monotonically and steeply where llama only partially recovers. Net: apply the budget to saturated
MoE serving when decode throughput is the objective and some extra TTFT is acceptable; for
latency-sensitive MoE serving leave it off (stock TTFT was already not the bottleneck here).
---
## Fair re-run verdict
This is the synthesis after patch 0013 (`max_prefill_tokens` / `LLAMA_PREFILL_BUDGET`) was turned
on for both models. It answers three questions: how much of the apparent gap was prefill
starvation, what genuine gap to vLLM remains after that artifact is removed, and where that leaves
the "par-or-beat vLLM" goal.
### 1. How much did patch 0013 close the gap?
The original (stock) tables blamed two things on llama: an exploding TTFT and a flat decode curve
at high concurrency. The budget re-run shows these were **two different problems with two
different root causes**, and only one was prefill starvation.
**Dense 27B - was genuinely prefill-starved.** Dense prefill is expensive (full 28B weights per
token), so 128 simultaneous 512-token prefills truly starved both first-tokens and decode. Budget
256 @npl128:
| metric @npl128 | stock | budget 256 | vLLM | what closed |
|----------------|------:|-----------:|-----:|-------------|
| TTFT mean | 491.2 s | **305.4 s** (-37.8%) | 24.8 s | starvation real; -186 s recovered |
| decode_agg | 134.6 | **161.2** (+19.8%) | 390.7 | freed slots now decode |
| llama as % of vLLM decode | 34.5% | **41.3%** | 100% | +6.8 pts |
Dense llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **99 / 56 / 46 / 41** (was 99/57/44/34).
The fix moved only the saturated tail; npl 8/32 were never starved and are unchanged.
**MoE 35B-A3B - was NOT prefill-starved (the inversion).** Only ~3B active params, so prefill was
already cheap and stock TTFT @npl128 was 84.8 s, not dense's 491 s. There was no starvation to
rescue, so the budget could not cut TTFT - it instead converted deferred prefill into decode
steps. Budget 256 @npl128:
| metric @npl128 | stock | budget 256 | vLLM | direction |
|----------------|------:|-----------:|-----:|-----------|
| TTFT mean | 84.8 s | 98.1 s (+15.7%, WORSE) | 7.98 s | budget costs latency here |
| decode_agg | 292.2 | **333.5** (+14.1%) | 811.1 | plateau removed |
| llama as % of vLLM decode | 36.0% | **41.1%** | 100% | +5.1 pts |
MoE llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **84 / 52 / 44 / 41** (was 84/51/44/36).
The decisive MoE finding is the scaling curve, not the point: stock decode plateaued over the last
doubling (64->128 = +7.4%); budget 256 restored monotonic scaling (+20.4%), proving the stock flat
curve was unbounded prefill stealing steps from ready decode slots, not a kernel ceiling.
**Combined takeaway.** Both models converge to the **same ~41% of vLLM decode at npl128** after the
fix. That convergence is the signal: once prefill starvation is removed, dense and a 12x-cheaper-
prefill MoE land on the identical ceiling, which means the remaining gap is **not** about prefill
at all - it is the decode scheduler.
### 2. The honest remaining gap to vLLM
After patch 0013, the residual gap is the **continuous-batched-decode efficiency** lever, and it is
real, not an artifact:
- vLLM still decodes **~2.4x faster** at npl128 on both models (390.7 vs 161.2 dense; 811.1 vs
333.5 MoE).
- vLLM holds TTFT **~12x lower** at npl128 (24.8 vs 305.4 s dense; 8.0 vs 98.1 s MoE) - and does so
while decoding faster, i.e. no latency/throughput trade.
- **vLLM scales monotonically and steeply** (dense 64->391, MoE 202->811 across npl 8->128); llama,
even with the budget, only **partially** recovers its scaling (dense 64->161, MoE 170->334).
The mechanism: vLLM's scheduler interleaves prefill and decode at token granularity (chunked
prefill + paged continuous batching) every step, keeping the GPU saturated with a near-optimal mix.
Patch 0013 is a coarser tool - a static per-step prefill **cap** - which protects in-flight decode
but does not actively schedule the prefill/decode mix, and on the bursty all-at-once harness it
defers first tokens (the TTFT penalty at npl 8/32/64, and the MoE TTFT regression @npl128). The gap
that remains is the **quality of the step-by-step batching decision**, not raw kernel speed: at
npl8 the kernels are at parity (dense 99%, MoE 84%), so the per-token math is competitive - what
vLLM does better is keeping more sequences productively in-flight every step as concurrency rises.
### 3. Where this leaves "par-or-beat vLLM", and the last lever
**Where llama is competitive today (NVFP4, GB10):**
- **Low concurrency (npl<=8): at parity.** Dense 99%, MoE 84% of vLLM decode, comparable TTFT.
For single-user / few-stream local serving - LocalAI's dominant mode - llama.cpp is already
there on matched NVFP4.
- **Memory efficiency: llama wins outright at every concurrency.** On-demand paged KV (dense
52->94 GB, MoE 39->61 GB) vs vLLM's flat ~112 GB pre-reservation. On a 128 GB unified box this is
the difference between multi-tenant headroom and OOM - a genuine product advantage, not a
consolation.
**Where llama is not competitive:** high-concurrency decode throughput (npl>=32), where vLLM is
~2-2.4x ahead; at npl128 the budget only lifts llama to ~41% of vLLM decode.
**The last lever** is therefore *not* another prefill knob (0013 has extracted what a static cap
can give) and *not* the kernel (at parity @npl8). It is **token-granular continuous-batch
scheduling**: actively interleaving chunked prefill with decode every step rather than capping
prefill, so all live slots decode while new prefills trickle in - exactly what closes vLLM's
monotonic-scaling advantage. A staggered (non-burst) arrival pattern would also let 0013 protect
decode jitter without the burst-TTFT penalty seen here, narrowing the practical gap for real
serving traffic that does not arrive all-at-once.
### Bottom line
Patch 0013 is validated and worth shipping as a **selective, high-concurrency QoS lever**: it
recovers dense TTFT 38% and lifts saturated decode +14-20%, converging both models to ~41% of
vLLM. But it is honestly **not a gap-closer**. The "par-or-beat vLLM" goal is **met at low
concurrency and on memory efficiency, and not met at high-concurrency decode throughput.** The
remaining ~2.4x is a continuous-batched-decode scheduling gap, not a prefill-starvation or kernel
gap - and that is the next (harder) lever, distinct from anything 0013 can touch.


@@ -1,400 +0,0 @@
# RMSNORM_FP4_FOLD.md - ceiling-critic verdict (label ceiling-critic, READ-ONLY, no GPU)
Completeness audit of the post-0022/0023 bit-exact decode surface: is the rms_norm -> fp4
producer-fold the BEST remaining bit-exact decode lever, or is something better being missed?
Source: all paged/*.md verdicts + the 0019/0021/0023 patch diffs (local, read-only). No GPU touched.
## Starting line (post-0023)
- Dense q36-27b-nvfp4: 373.2 t/s @ npl128 = 95.4% of vLLM 391. Dense is UNTOUCHED by 0023.
- MoE q36-35b-a3b: 758 t/s @ npl128 (0023 +1.73%).
- Decode = ONE replayed CUDA graph, single stream, 99.94% GPU-busy, 0.06% idle. Removed/folded
kernel GPU-time cuts wall 1:1, and DISJOINT folds STACK 1:1 (each removes a distinct kernel).
- gated_delta_net recurrence = ~50% of the step, at 84.6% peak BW (past vLLM's 82.4%), PLATEAUED.
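The "cuts wall 1:1, disjoint folds stack 1:1" rule translates into throughput as follows. A minimal sketch, assuming each fold removes a fixed fraction of the current step time and nothing else changes; the two fold fractions in the example are hypothetical, not measured:

```python
def throughput_after_folds(base_tps: float, removed_fractions) -> float:
    """Projected t/s after removing disjoint fractions of the decode step time."""
    removed = sum(removed_fractions)  # disjoint folds add up
    if not 0.0 <= removed < 1.0:
        raise ValueError("removed fractions must sum to a value in [0, 1)")
    # Step time shrinks to (1 - removed) of its old value, so t/s scales by the inverse.
    return base_tps / (1.0 - removed)


BASE_DENSE_TPS = 373.2          # dense @ npl128, from the starting line above
HYPOTHETICAL_FOLDS = [0.020, 0.015]  # illustrative fold sizes only

print(f"{throughput_after_folds(BASE_DENSE_TPS, HYPOTHETICAL_FOLDS):.1f} t/s")
```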
## TIER 0 - confirmed NO bit-exact lever (dead, do not pursue)
(a) GDN recurrence past 84.6% - NO. The 0022 sweep is MONOTONIC toward grid.z=1: 8x4 (grid.z=4,
32 cols/block) = 79.9%, 16x4/8x8 (grid.z=2) = 82.3%, 16x8/32x4 (grid.z=1, all 128 cols in one
block = max in-flight independent state-loads per warp) = 84.6%. grid.z>1 is the WRONG direction
(fewer cols/block = less memory-level parallelism = lower BW), already measured worse. The only
thing past 84.6% is the float4/vectorized load or a different row-partition, BOTH of which
repartition which rows a lane sums into the warp-butterfly = a different reduction grouping =
breaks md5 (the exact f32x4 trap that was explicitly avoided). 84.6% (230.9 of 273 GB/s) is at
the practical LPDDR5x DRAM ceiling AND past vLLM. No bit-exact decomposition exists. FLOOR.
(b) flash_attn_ext_f16 (3.1%) - NO. 48 CTAs = exactly one full wave, no occupancy headroom, no tail.
Every grid knob (split-KV / parallel_blocks / ncols / cols_per_block / KV-retile) changes the
online-softmax running-max/sum RESCALE ORDER across KV blocks = forbidden. FLOOR.
(c) lm_head (nvjet/cublas, 3.1%) - NO. cublas-internal; any algo/kernel swap changes the K-accum
order vs the current f32 reference = breaks md5. Already tuned. No knob. NO lever.
(d) mul_mat_q FP4 GEMM (~24-27%, the biggest bucket) - NO decode lever. P2a (mmq_y=64 / minblocks=2)
is bit-exact (1115/805, md5-identical) but MEASURED FLAT on decode (decode mmq -1.1%, stream_k
fixup +1.7ms = net worse). The -24.7% is a PREFILL large-N asymptotic number; the m=128 decode
GEMM is LPDDR5x-bandwidth-bound and mmq_y is deliberately bandwidth-neutral. FLOOR.
=> Of the four largest buckets (recurrence 50% + GEMM 25% + lm_head 3% + attn 3% = ~81% of the
step), NONE has any bit-exact lever left. All remaining headroom lives in the ~12% of small,
foldable glue/quantize/gather buckets below.
## TIER 1 - the bit-exact-feasible folds, RANKED by ROI (gain / plumbing+risk)
Confirmed bit-exact-foldable buckets from the post-0021/0022 node trace:
- quantize_mmq_nvfp4 ........ 4.5% (dense-foldable ~2.7% ceiling; fold captures ~2-2.5%)
- k_get_rows_float .......... 1.9-2.1% (STILL LIVE post-0021; pure gather)
- pointwise glue ............ ~3.1% (k_bin_bcast 1.7% + silu/sigmoid output-gate 1.4%; ~1.5-2.5% net)
Rank 1 - POINTWISE ACTIVATION FOLD (~1.5-2.5%, MEDIUM plumbing, NO new ABI). Best ROI/risk of the
three. Fold k_bin_bcast residual-adds + gate-muls and the silu/sigmoid output gate into adjacent
kernel epilogues/prologues. Pure elementwise f32, same formula+order standalone or folded =
byte-identical. STRICT EXCLUSION: do NOT re-fold the rms_norm/l2_norm REDUCTIONS (reduction-tree /
eps-placement trap). No frozen ABI, no GEMM surgery. Well-scoped already (NONRECURRENCE Lever #2).
Rank 2 - rms_norm -> fp4 PRODUCER-FOLD (the proposed lever) (~2-2.5% realistic dense, HIGHEST
plumbing). LARGEST single clean dense bucket and HIGHEST-confidence ROI (skip-B measured dense
+2.7% for the whole quantize; the fold removes the f32 round-trip, keeps the quant compute, so
~2-2.5%). BIT-EXACT VERDICT: SOUND, and NOT the f32x4-trap class. The trap changed a REDUCTION
grouping; this fold touches only (i) the sumsq block-reduce, kept BYTE-IDENTICAL, and (ii) the
writeback, where the post-norm normalize-MUL is pointwise (order-independent, identical out_i for
any thread partition) and the NVFP4 quant is per-16-consecutive PER-THREAD with NO cross-thread
shfl (verified in quantize.cu; 0023 already shipped on exactly this property and held the byte
gate). Re-partitioning the writeback to 16-consecutive-per-thread therefore changes only WHO
writes/quantizes each element, not the VALUES or the reduction. md5-safe. BUT it carries the worst
plumbing-to-ROI ratio: 3-op {RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion + a mul_mat_q
prequantized-src1 path + the frozen block_fp4_mmq ABI + a per-call scratch pool. This is the
LAST-MILE lever, not the first.
Rank 3 - GET_ROWS / STATE-GATHER FOLD (~up to 2%, LOW-MEDIUM plumbing, ZERO reduction risk -
but UNDER-SCOPED). k_get_rows_float is STILL 7.29-7.32 ms = ~2.1% of the step post-0021/0022; the
0021 author KEPT the build_rs conv-tap + recurrent-state gathers, explicitly deferring them
("tiny; not one of the eliminated buckets"), NOT proving them unfoldable. A gather is a pure copy
with NO reduction = the SAFEST possible bit-exact fold (the exact property the 0023 dedup
exploited). Folding the residual build_rs gathers into their consuming kernel (read from cache via
ids/block-table instead of a pre-gathered f32 scratch, mirroring 0019's gather-free recurrence) is
bit-exact by construction. Ranked 3 only because the FOLDABLE FRACTION needs a one-pass source
scoping (some of the 2% may be the "tiny" conv-tap part already); the ROI is lower-confidence than
Rank 1/2, but the RISK is the lowest of all. THIS IS THE "SOMETHING BEING MISSED": it is a live
~2% bit-exact bucket that the current plan does not address.
## IS THE fp4 FOLD THE RIGHT NEXT BUILD?
DEFENSIBLE, but NOT unambiguously the best by ROI. It is the largest single well-understood
bit-exact dense bucket and the verdict is sound (no trap). HOWEVER its plumbing is the highest of
the three folds, and the POINTWISE fold matches its realistic gain (~1.5-2.5%) at MEDIUM plumbing
with no new ABI, while the GET_ROWS fold offers ~2% at the lowest risk (pure copy). The fp4 fold has
the worst gain/plumbing ratio of the candidates.
Recommended build order (all bit-exact, all stack 1:1 on the serial single stream):
1. POINTWISE activation fold first (cheapest, no ABI, ~1.5-2.5%).
2. GET_ROWS gather fold second, after a short source-scoping pass (~up to 2%, lowest risk).
3. rms_norm -> fp4 producer-fold LAST (the high-plumbing last mile, ~2-2.5% dense), built only if
the remaining gap to the chosen target still justifies the ABI/graph-fusion surgery.
If the workflow insists on a SINGLE decisive lever and accepts the plumbing, the fp4 fold is the
biggest one and a legitimate choice - but it should be sequenced after the cheap folds, not before.
## HONEST BIT-EXACT CEILING
The three folds remove DISJOINT kernels on a 99.94%-busy serial stream, so they STACK:
~2-2.5% (fp4) + ~1.5-2.5% (pointwise) + ~2% (get_rows) = ~5.5-7% gross on dense.
373 t/s + ~5.5-7% = ~393-399 t/s = ~100-102% of vLLM 391.
=> The bit-exact dense ceiling is vLLM PARITY-to-slightly-ahead (~100%), NOT 95%. Declaring the
ceiling at ~95% would leave ~4-5% of identified, bit-exact-FEASIBLE fold headroom unbuilt.
Realistic SHIPPABLE ceiling (fold inefficiency + the realistic-vs-ceiling haircut + some buckets
resisting clean folding): ~98-100% of vLLM dense. The recurrence (50%) is already past vLLM and
at the DRAM floor; attention/lm_head/mul_mat_q have no bit-exact lever; everything left is the
~6% of small folds above. There is no fourth large bit-exact lever hiding anywhere.
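The stacking arithmetic above can be re-derived with a few lines. This is only a sketch: it treats an X% step saving as an X% throughput uplift, which is how the text quotes it, and the dictionary keys are illustrative labels, not identifiers from the tree.

```python
# Per-fold gain estimates (percent of the decode step), taken from the text above.
FOLDS = {
    "fp4_producer_fold": (2.0, 2.5),
    "pointwise_fold": (1.5, 2.5),
    "get_rows_fold": (2.0, 2.0),
}
BASE_TPS = 373.0   # llama dense decode t/s quoted in the text
VLLM_TPS = 391.0   # vLLM reference t/s quoted in the text

def stacked_gain(folds):
    """Sum the low and high ends; only valid if the folds remove disjoint kernels."""
    lo = sum(r[0] for r in folds.values())
    hi = sum(r[1] for r in folds.values())
    return lo, hi

def projected_tps(base, gain_pct):
    """Assumption: an X% gain is applied as a plain X% throughput uplift."""
    return base * (1.0 + gain_pct / 100.0)

lo, hi = stacked_gain(FOLDS)
tps_lo, tps_hi = projected_tps(BASE_TPS, lo), projected_tps(BASE_TPS, hi)
print(f"gross gain: {lo:.1f}-{hi:.1f}%")
print(f"projected: {tps_lo:.1f}-{tps_hi:.1f} t/s "
      f"({100 * tps_lo / VLLM_TPS:.1f}-{100 * tps_hi / VLLM_TPS:.1f}% of vLLM)")
```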
Caveat that frames the whole result: vLLM 391 is a LOWER-precision reference (w4a4/w4a16 acts vs
llama's q8_1; the recurrence is algebraically reassociated). Bit-exact-vs-vLLM is IMPOSSIBLE; the
only meaningful cross-engine bar is throughput + top-1/KL, and llama at 373 (95%) bit-exact f32 is
already doing strictly MORE precise arithmetic at near-equal throughput. Closing the last ~5% with
the folds reaches throughput parity at higher precision - a strong result, but each fold is a
diminishing 1.5-2.5% at rising plumbing. The bf16-state over-clock (shelved) is the only thing that
goes materially AHEAD, and it is non-bit-exact (KL-gated), out of scope for this gate.
Assisted-by: Claude:opus-4.8 [Claude Code]
====================================================================================================
# RMS_NORM -> NVFP4 PRODUCER-FOLD - PRECISE IMPLEMENTATION DESIGN (label fold-design, READ-ONLY, no GPU)
Design-only, no GPU. Reads: DGX `~/llama-paged-dev` HEAD f7409c2 (patch 0023) + `git stash@{0}`
(trackA1-prequant-nvfp4-fused-rmsnorm) + norm.cu/quantize.cu/mmq.cu/mmq.cuh/ggml-cuda.cu/qwen35.cpp.
## 0. One-line verdict
The fold is bit-exact-FEASIBLE, BUT the Lever-2 stash that exists as the starting point is
(a) almost certainly bit-INEXACT and (b) was measured FLAT. The single mandatory fix is the
reduction block_size dispatch; the single thing that makes it not-flat is de-dup-across-siblings
+ skipping the dead f32 write at the FFN boundary. Build the FFN boundary first, gate on a measured
per-call producer-vs-removed-quantize win before extending. Honest expectation: ~1.5-2.5% dense
best case, real risk of flat (Lever-2 precedent). Lower-risk alternative in Section 7.
## 1. Which graph nodes fuse
Both boundaries already collapse rms_norm+gain into ONE `rms_norm_f32<bs, do_multiply=true>` kernel
(existing fuse, ggml-cuda.cu:3675). That kernel's f32 output is the byte-exact target.
- FFN (STRONGEST), qwen35.cpp:188-192 + build_layer_ffn:478-487:
`attn_post_norm = build_norm(cur, RMS)` feeds EXACTLY `ffn_up` + `ffn_gate` (both NVFP4 MMQ at
m=128). NO non-NVFP4 consumer (residual = pre-norm `cur`; ffn_down eats silu(gate)*up). => the
f32 normed tensor is DEAD once both GEMMs read fp4 -> producer can skip the f32 write. An existing
`{MUL_MAT, MUL_MAT, GLU}` fuse (ggml-cuda.cu:3631) already groups up+gate+GLU -> the natural seam.
- GDN/attn (weaker), qwen35.cpp:161 + build_qkvz:228-243:
`attn_norm = build_norm(inpL, RMS)` feeds `wqkv` + `wqkv_gate` (NVFP4 MMQ, share src1) AND
`ssm_beta` + `ssm_alpha` (small N=n_v_heads -> MMVQ, READ THE f32). => f32 still live, producer
MUST write f32 -> smaller win.
- MoE FFN (qwen35moe.cpp) goes via mul_mat_id, already 0023-deduped -> out of scope. Fold = dense only.
## 2. Byte-exact target (norm.cu rms_norm_f32<bs,true>)
Dispatch (norm.cu:304-380): `bs = (ncols < 1024) ? 256 : 1024`, shmem 32*float.
```
// sketch of the per-row body (one thread block per row, tid = threadIdx.x)
float tmp = 0.0f;
for (int col = tid; col < ncols; col += bs) tmp += x[col] * x[col];                // (R1) strided sumsq grouping
tmp = block_reduce<SUM, bs>(tmp, s_sum);                                           // (R2) tree width depends on bs
const float mean = tmp / ncols; const float scale = rsqrtf(mean + eps);            // (R3) exact eps/div
for (int col = tid; col < ncols; col += bs) dst[col] = scale * x[col] * mul[col];  // (W) per-channel gain, mul_col==col
```
(W) is per-column independent (scale block-uniform) -> writeback may be re-partitioned. (R1/R2/R3)
are the ONLY order-sensitive parts and must stay byte-identical.
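A small host-side emulation can make the split between (R1/R2) and (W) concrete. It is a sketch under stated assumptions: binary32 rounding is emulated with `struct`, the pairwise tree is a stand-in for the real `block_reduce`, FMA contraction is ignored, and `1/sqrt` replaces `rsqrtf`. It only shows that the writeback value depends on the column alone, and prints the two sumsq results so their bits can be compared.

```python
import math
import random
import struct

def f32(v):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def strided_sumsq(x, bs):
    """(R1): 'thread' tid sums x[tid], x[tid+bs], ...; (R2): pairwise tree over bs partials."""
    partial = []
    for tid in range(bs):
        acc = 0.0
        for col in range(tid, len(x), bs):
            acc = f32(acc + f32(x[col] * x[col]))
        partial.append(acc)
    while len(partial) > 1:            # bs is a power of two
        half = len(partial) // 2
        partial = [f32(partial[i] + partial[i + half]) for i in range(half)]
    return partial[0]

def writeback(x, mul, scale, owner_of):
    """(W): visit columns grouped by owning thread; each value depends only on its column."""
    out = [0.0] * len(x)
    for col in sorted(range(len(x)), key=owner_of):
        out[col] = f32(f32(scale * x[col]) * mul[col])
    return out

random.seed(0)
ncols, eps = 4096, 1e-6
x = [f32(random.uniform(-4.0, 4.0)) for _ in range(ncols)]
mul = [f32(random.uniform(0.5, 1.5)) for _ in range(ncols)]

s256, s1024 = strided_sumsq(x, 256), strided_sumsq(x, 1024)
scale = f32(1.0 / math.sqrt(f32(f32(s1024 / ncols) + eps)))
strided = writeback(x, mul, scale, lambda c: c % 1024)   # canonical: col = tid, tid+bs, ...
chunked = writeback(x, mul, scale, lambda c: c // 16)    # fold: 16 consecutive per thread
print("sumsq bs=256 :", s256.hex(), " bs=1024:", s1024.hex())
print("writeback identical across partitions:", strided == chunked)
```

Whether the two sumsq values differ in their last bits depends on the data; the point of the byte gate is that nothing guarantees they match once bs changes.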
## 3. Fused producer kernel (quantize.cu) - deltas vs the stash
Start from stash `rms_norm_mul_quantize_nvfp4_kernel` + the factored `quantize_nvfp4_write_subblock`
(verbatim per-thread NVFP4 quant). Required changes:
1. TEMPLATE on block_size + launch `bs=(ncols<1024)?256:1024` (NOT the stash's hardcoded 256). MANDATORY.
2. Reduction pass VERBATIM (R1/R2/R3): scalar strided sumsq, `block_reduce<SUM,bs>`, `mean=tmp/ncols`,
`scale=rsqrtf(mean+eps)`. Byte-identical once bs matches.
3. Writeback re-partitioned to 16-consecutive-per-thread: `for s=tid; s<n_sub; s+=bs`, col0=s*16,
`v=scale*xr[col]*mul[col]` (col<ncols else 0), amax=max|v|, `quantize_nvfp4_write_subblock(vals,
amax, sub, y+ib)`, `ib=k_block*ne11+row`, n_sub=ncols_padded/16. x is re-read (canonical does too).
4. `template<bool write_f32>`: FALSE at FFN (skip `dr[col]=v` -> drop the producer's f32 store),
TRUE at GDN (beta/alpha read it). THIS is what turns re-bucketing into a real traffic cut.
Buffer ABI frozen: block_fp4_mmq = {uint32_t d4[4]; int8_t qs[128]} = 144B = 9 uint4 = 4*block_q8_1
(mmq.cuh:53). Same layout quantize_mmq_fp4_cuda emits; GEMM stride
s12=ne11*ne10_padded*sizeof(block_fp4_mmq)/(QK_K*sizeof(int)).
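The size identities quoted for the frozen buffer ABI can be checked with `struct`. The `block_fp4_mmq` layout is the one stated above; the `block_q8_1` layout (two fp16 values plus 32 int8) is an assumption made here for the comparison.

```python
import struct

# Packed layouts, little-endian, no padding.
BLOCK_FP4_MMQ = struct.Struct("<4I128b")   # uint32_t d4[4]; int8_t qs[128] (from the text)
BLOCK_Q8_1 = struct.Struct("<2e32b")       # assumed: fp16 d, fp16 s, int8_t qs[32]
UINT4_BYTES = 16                           # CUDA uint4 = 4 x 32-bit words

print("block_fp4_mmq bytes:", BLOCK_FP4_MMQ.size)
print("as uint4 units     :", BLOCK_FP4_MMQ.size // UINT4_BYTES)
print("as block_q8_1 units:", BLOCK_FP4_MMQ.size // BLOCK_Q8_1.size)
```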
## 4. mul_mat_q prequantized-src1 plumbing (mmq.cu/mmq.cuh)
Re-add the stash hook on top of 0023: `ggml_cuda_mul_mat_q(..., const char* src1_prequantized=nullptr)`.
In the NON-ids branch: if non-null, skip quantize_mmq_fp4_cuda + the local pool alloc, point mmq_args
src1_q8_1 at it. GEMM byte-UNTOUCHED (the bit-exactness firewall). 0023 ids-branch untouched (orthogonal).
Sharing across non-adjacent siblings:
- FFN (preferred): extend `{MUL_MAT,MUL_MAT,GLU}` to `{RMS_NORM,MUL,MUL_MAT,MUL_MAT,GLU}` super-fuse;
one producer (write_f32=false) + one pool buf spanning both GEMMs + GLU, all in one handler. Clean.
- GDN/general: a scratch cache keyed by the normed tensor ptr (graph-eval lifetime); defer until FFN wins.
The stash folds only ONE consumer with a stack-scoped qbuf -> the sibling still standalone-quantizes
(a key reason it was flat; nsys showed quantize 12896->10816, not ->0).
## 5. Bit-exactness argument
(1) NVFP4 quant of each 16-elem sub-block = PURE per-thread function, NO cross-thread shfl/reduction
(quantize.cu; the exact property 0023 shipped on). => writeback re-partition cannot change a byte.
(2) v=scale*x[col]*mul[col] byte-identical iff scale identical (R1/R2/R3 preserved via bs dispatch)
AND expression verbatim (left-assoc, scalar). Per-column independent -> partition-invariant.
=> produced block_fp4_mmq bytes == standalone == 0022/0023 baseline; GEMM untouched -> md5 held.
Gate: BATCHED (ne[1]>8) md5 == 5951a5b4 dense + 1115/1115 - NOT just batch=1 (the gate Lever-2 skipped).
## 6. THE TRAP
- block_size trap (the stash's latent bug): canonical = `ncols<1024?256:1024`; qwen35 n_embd is
1024/2560/4096 (qwen35.cpp:30-31) -> canonical is rms_norm_f32<1024> (LEVER2 nsys confirms). Stash
hardcodes 256 -> different strided grouping {tid,tid+256,...} vs {tid,tid+1024,...} AND 8-warp vs
32-warp reduce -> different f32 order -> md5 break. FIX = template+dispatch matching bs.
- f32x4 vectorize trap (recurrence class): do NOT vectorize the sumsq load or align the reduction
partition to the 16-consecutive writeback. Keep sumsq scalar + strided-by-bs.
- eps/assoc: `rsqrtf(mean+eps)`, `mean=tmp/ncols`, `(scale*x)*mul`. Never reassociate.
- GEMM K-reduction / stream-k / tile loads: forbidden (NONRECURRENCE FORBIDDEN list). Fold only
changes WHO writes src1.
## 7. Contrast with Lever-2 + lower-risk alternative
Lever-2 (stash) was FLAT (+0.3% dense) and NET-ADDED GPU-time (+2.3% fused vs -1.1% quantize -0.9%
rms_norm) because it (a) folded only 1 of 2 siblings, (b) always wrote f32, (c) bs=256 (wrong AND
non-canonical). It md5'd only batch=1 (fuse off) -> bit-inexactness never caught. The new fold beats
it ONLY with de-dup-both-siblings + skip-dead-f32-at-FFN; without BOTH, expect flat again.
LOWER-RISK alt (recommend evaluating first): dense quantize DE-DUP, no fold - keep the efficient
standalone quantize, quantize the shared normed activation ONCE, reuse for wqkv+wqkv_gate /
ffn_up+ffn_gate (CSE keyed by src1 ptr, the dense analog of 0023). ZERO reduction risk (rms_norm
untouched), much less plumbing; ceiling ~<=1% (redundant half only), which the fold's de-dup half
captures anyway. The fold's only incremental value is the f32 round-trip read, which Lever-2 showed
is easily eaten by the fused kernel's added work.
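The "quantize ONCE, reuse" idea can be sketched as a cache keyed by the src1 identity. Everything here is hypothetical (the class name, the toy quantizer, the string keys); it models only the call-count saving and deliberately ignores device buffer lifetime, which is the hard part on a real CUDA stream.

```python
class QuantizeOnceCache:
    """Hypothetical helper: quantize each distinct src1 once per graph evaluation."""

    def __init__(self, quantize_fn):
        self._quantize = quantize_fn
        self._cache = {}
        self.calls = 0          # number of real quantize invocations

    def get(self, src1_key, src1):
        if src1_key not in self._cache:
            self._cache[src1_key] = self._quantize(src1)
            self.calls += 1
        return self._cache[src1_key]

    def reset(self):
        """Call at the end of a graph evaluation so stale buffers are never reused."""
        self._cache.clear()

def toy_quantize(values):
    """Stand-in for the standalone quantize kernel (NOT the NVFP4 format)."""
    return tuple(round(v) for v in values)

cache = QuantizeOnceCache(toy_quantize)
normed = [0.4, 1.6, -2.2, 3.9]                    # shared post-norm activation
q_for_up = cache.get("attn_post_norm", normed)    # consumer 1: ffn_up
q_for_gate = cache.get("attn_post_norm", normed)  # consumer 2: ffn_gate
print("quantize calls:", cache.calls)
cache.reset()
```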
## 8. Scope + build order (the gate)
Scope dense qwen35: quantize.cu/.cuh (templated kernel + bs dispatch), mmq.cu/.cuh (src1_prequantized
on non-ids path), ggml-cuda.cu (FFN super-fuse, gate: NVFP4 src0 + Blackwell + ne[1]>MMVQ_MAX_BATCH_SIZE
+ ne2==ne3==1 + per-channel gain; flag LLAMA_FUSE_NVFP4_QUANT).
Build order: (1) FFN super-fuse only (write_f32=false + de-dup); measure per-call producer GPU-time
vs the two removed quantizes (nsys node trace, same-build flag toggle); SHIP only if decode_agg
actually lifts AND batched md5==0022/1115. (2) Only if (1) lifts, add the GDN boundary (write_f32=true,
keyed scratch). Realistic: ~1.5-2.5% dense FFN best case; ceiling +2.7% (skip-ALL) is unreachable
(fold keeps quant compute+write). If step 1 is flat, dense quantize is at its bit-exact floor -> stop.
Assisted-by: Claude:opus-4.8 [Claude Code]
====================================================================================================
# RE-PROFILE TARGET MEASUREMENT (label reprofile-target, THE GPU agent) - post-0023, HEAD f7409c2
Fresh node-level nsys re-profile of the DENSE decode to confirm the fold target size, foldable
fraction, critical-path status, and the realistic recoverable ceiling, BEFORE BuildFold commits.
## Build-dir correction (acted on)
The orchestrator framed `build-cuda-base` as the clean 0023 build. It is NOT: empirically
`build-cuda-base` = stale pre-0021 (336.71 t/s), the real post-0023 build is `build-cuda` (371.81 t/s,
git-clean tree, no mmq.cuh P2a remap). All numbers below are from `build-cuda`. (Dense profiling is
unaffected by the 0023 MoE de-dup knob - dense has no MoE.)
## Confirmed baseline
- llama-batched-bench dense q36-27b-nvfp4 npl128 ntg128: 371.81 t/s, 344 ms/decode-step. CONFIRMS the
~343 ms / ~373 t/s target. (build-cuda-base stale = 336.71 t/s.)
- nsys --cuda-graph-trace=node, 103 steady windowed steps: step span 345.0 ms, mean kernel-busy 99.0%,
sum-of-kernels/span = 98.9% (< 100% => no NET overlap; serial single stream, ~1.1% idle).
## Dense decode decomposition (ms/step)
gated_delta_net 168.06 (49.2%) BINDING | mul_mat_q<NVFP4,128> 93.57 (27.4%) |
**quantize_mmq_nvfp4 17.55 (5.1%)** | nvjet 12.02 (3.5%) | flash_attn_ext 11.64 (3.4%) |
ssm_conv 8.56 (2.5%) | k_get_rows_float 7.32 (2.1%) | silu 5.36 | k_bin_bcast(mul) 4.64 |
stream_k_fixup 3.95 | rms_norm 3.53 (1.0%). TOTAL kernel 341.25.
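The baseline and decomposition figures are mutually consistent, which a short check makes explicit. The numbers are copied from the text; the variable names are illustrative.

```python
NPL = 128            # parallel sequences, one token each per decode step
STEP_MS = 344.0      # ms per decode step from llama-batched-bench
SPAN_MS = 345.0      # nsys steady-state step span
KERNEL_MS = 341.25   # TOTAL kernel time per step

KERNELS_MS = {
    "gated_delta_net": 168.06,
    "mul_mat_q<NVFP4,128>": 93.57,
    "quantize_mmq_nvfp4": 17.55,
    "nvjet": 12.02,
    "flash_attn_ext": 11.64,
    "ssm_conv": 8.56,
    "k_get_rows_float": 7.32,
    "rms_norm": 3.53,
}

tps = NPL / (STEP_MS / 1000.0)          # tokens per second at this step time
busy = KERNEL_MS / SPAN_MS              # sum-of-kernels / span
shares = {k: round(100.0 * v / KERNEL_MS, 1) for k, v in KERNELS_MS.items()}

print(f"{tps:.1f} t/s, kernels/span = {100 * busy:.1f}%")
for name, pct in shares.items():
    print(f"{name:24s} {pct:5.1f}%")
```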
## quantize_mmq_nvfp4 at the dense decode shape (the answer)
- TOTAL: 17.55 ms/step = 5.1% of kernel time = 5.08% of the 345 ms wall. 496 quant calls/step (1 per
NVFP4 GEMM src1). CONFIRMS the verdict's 17.66 ms / ~4.5-5% (the stray "3.7%" reading was wrong).
- Decomposes EXACTLY by input dim K (graph-verified in qwen35.cpp; 64 layers = 48 GDN + 16 attn):
- K=5120 (368/step) FOLDABLE: GDN {wqkv, wqkv_gate, beta, alpha} + attn {wq,wk,wv} + both {ffn_up,
ffn_gate}. All fed by a plain rms_norm+mul (attn_norm or attn_post_norm). beta/alpha CONFIRMED
foldable: they read the same `cur` as wqkv (qwen35.cpp:359/366).
- K=6144 (64/step) UNAVOIDABLE: ssm_out (gated-norm = rms_norm + mul(ssm_norm) + mul(silu(gate)),
two muls break the chain) + wo (attn-gated producer).
- K=17408 (64/step) UNAVOIDABLE: ffn_down (silu(gate)*up producer).
## Foldable portion (measured) - LARGER than the byte-model 2.7%
The quant kernel is NOT byte-proportional: ffn_down (K=17408) measures 3.62 ms but a byte-model
predicts 5.75 ms. Small-K quants are launch/overhead-bound (flat ~21.7 us floor, K=5120 vs 6144
indistinguishable), so the byte model UNDER-counts the numerous small-K (foldable) calls.
- byte-model FOLDABLE = 9.73 ms = 2.82% of step
- flat-split FOLDABLE = 11.90 ms = 3.45% of step (368 small-K quants, the physically correct one)
- => true FOLDABLE raw GPU-time = 9.7 - 11.9 ms = 2.8% - 3.4% of step. UNAVOIDABLE = ssm_out+wo
~2.1 ms + ffn_down 3.62 ms = ~5.7 ms (1.6%).
- Sub-split for the build order: the FFN boundary alone (ffn_up+ffn_gate, f32 DEAD -> cleanest fold)
= 128 quants/step ~4.1 ms; the input-norm boundary (wqkv/wqkv_gate/wq/wk/wv, +beta/alpha keep f32)
= ~7.8 ms raw but lower net efficiency.
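The two foldable estimates can be reproduced to within rounding from the call counts and the two measured times. The exact method behind the quoted figures is not stated, so the formulas below are an assumed reading: the byte model splits the total in proportion to K times call count, and the flat split gives every small-K call the same cost after setting aside the measured ffn_down time.

```python
CALLS = {5120: 368, 6144: 64, 17408: 64}   # quant calls per step, keyed by input dim K
TOTAL_MS = 17.55                            # measured quantize_mmq_nvfp4 per step
FFN_DOWN_MS = 3.62                          # measured K=17408 portion
LARGE_K = 17408

def byte_model(k):
    """Share of TOTAL_MS proportional to elements quantized at this K."""
    total_elems = sum(kk * n for kk, n in CALLS.items())
    return TOTAL_MS * k * CALLS[k] / total_elems

def flat_split(k):
    """Assumes all small-K calls cost the same; only valid for k != LARGE_K."""
    small_calls = sum(n for kk, n in CALLS.items() if kk != LARGE_K)
    return (TOTAL_MS - FFN_DOWN_MS) * CALLS[k] / small_calls

print(f"calls/step          : {sum(CALLS.values())}")
print(f"byte-model ffn_down : {byte_model(LARGE_K):.2f} ms (measured {FFN_DOWN_MS})")
print(f"byte-model foldable : {byte_model(5120):.2f} ms")
print(f"flat-split foldable : {flat_split(5120):.2f} ms")
print(f"flat-split ssm_out+wo: {flat_split(6144):.2f} ms")
```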
## Critical path: YES (1:1)
98.9% kern/span, 99.0% busy, single serial stream, no net overlap. The quant kernels are inline on the
serial decode chain; removing their GPU-time cuts the wall ~1:1. Not a gap-fill (there are no gaps).
## Realistic recoverable - and the honest haircut
RAW foldable removed = 9.7-11.9 ms. NET recoverable is LESS, for reasons the fold-design + ceiling-critic
already flagged and this profile does not overturn:
- the fused producer KEEPS the quant compute + the fp4 write (only the f32 round-trip read is saved,
and the f32 write is droppable ONLY at the FFN boundary where it is dead);
- Lever-2 precedent: the existing stash fold measured FLAT (+0.3% dense) because it folded 1 of 2
siblings, always wrote f32, and used a non-canonical bs=256 reduction;
- TENSION TO FLAG: the critic cites a skip-B probe of only ~+2.7% for the WHOLE quantize, yet the whole
quantize is 5.1% on a 98.9%-serial stream (which predicts ~5.1% if cleanly 1:1). Either these small
kernels are not perfectly 1:1, or the skip probe is unreliable (same class as the NONREC
garbage-routing skip artifact). This caps the realistic NET nearer the conservative end.
=> Realistic NET recoverable: ~1.5 - 2.5% dense (consistent with fold-design Section 8), real risk of
FLAT. Optimistic ceiling if the f32 round-trips fully convert: up to ~3% (371.8 -> ~383 t/s); do not
bank above ~2.5%.
## VERDICT (GPU-measurement view)
- The target is REAL: foldable raw GPU-time 9.7-11.9 ms (2.8-3.4%, slightly LARGER than the prior 2.7%
byte-model floor), squarely on the single-stream critical path (1:1), bit-exact-FEASIBLE (no precision
change), and the largest single clean dense bucket left after the plateaued recurrence.
- BUT the NET recoverable is the contested ~1.5-2.5% with a documented FLAT risk, and this fold has the
HIGHEST plumbing of the three identified folds. Worst gain/plumbing ratio of the candidates.
- RECOMMENDATION: build is DEFENSIBLE but should be SEQUENCED AFTER the cheaper pointwise + get_rows
folds (per ceiling-critic). If built as the single decisive lever, do the FFN boundary FIRST (cleanest
~4.1 ms, f32 dead), gate per-call producer-GPU-time vs the two removed quantizes, and SHIP only if
decode_agg actually lifts AND batched md5 == 5951a5b4 (1115/1115). Kill-switch: if the only bit-exact
construction forces re-partitioning the sumsq reduction (changing accumulation order), abort - not
bit-exact.
Assisted-by: Claude:opus-4.8 [Claude Code]
====================================================================================================
# BUILD VERDICT (label fold-build, THE GPU agent) - post-0023, HEAD f7409c2 = patch 0023
DECISION: NO BUILD. The bit-exact decode ceiling is effectively reached for any lever that justifies
its plumbing. The proposed rms_norm -> fp4 producer-fold is NOT built (it was already built once and
measured FLAT), and the recommended lower-risk alternative (dense quantize de-dup) does NOT have a
clean, contained construction for the portion that matters. Tree left clean at 0023; nothing committed
to the code; this verdict appended only.
I extended the read-only agents' analysis with the two things they could not verify from the .md
verdicts alone: (1) the prior EMPIRICAL fold attempt, and (2) the actual graph/dispatch structure in
the source. Both kill the build.
## 1. The fp4 producer-fold was ALREADY BUILT and measured FLAT (decisive)
LEVER2_PROGRESS.md + stash@{0} (trackA1-prequant-nvfp4-fused-rmsnorm) is exactly this fold. Measured:
- dense q36-27b-nvfp4 npl128: 333.32 -> 334.44 t/s (+0.3%), npl32 -0.5%
- MoE q36-35b-a3b npl128: 690.23 -> 690.89 (+0.1%), npl32 -0.3%
nsys A/B (fusion fires): quantize_mmq_nvfp4 -2080 inst (-1.1%), rms_norm_f32<1024> -2080 (-0.9%),
NEW rms_norm_mul_quantize_nvfp4 +2080 (+2.3%). NET GPU-time = +0.3%. The fused producer ADDS BACK
the GPU-time it removes - it RELOCATES work, it does not remove it. The +0.3% wall is exactly
consistent with strict 1:1 wall scaling on the single serial stream (reprofile's own model). So the
fold is not a victim of a bad implementation that a rewrite fixes - it is structurally flat: the
producer must still read x, compute sumsq, normalize, quantize and WRITE the fp4 blocks; the only
recoverable traffic is the f32 round-trip, which the fused kernel's extra work eats (Lever-2 proved
this empirically; fold-design Section 7 and reprofile both predicted it). The design's two "fixes"
(de-dup both siblings + skip dead f32 at FFN) do not change this: the skip-f32 saves one f32 write at
the FFN boundary only (~0.5% of step), and the de-dup-both-siblings is item 2 below.
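The "relocates, does not remove" claim is just the sum of the three nsys deltas, compared against the measured wall change. A minimal check using the figures quoted above:

```python
# GPU-time deltas from the nsys A/B, as percent of total GPU time.
DELTA_PCT = {
    "quantize_mmq_nvfp4": -1.1,
    "rms_norm_f32<1024>": -0.9,
    "rms_norm_mul_quantize_nvfp4": +2.3,   # the new fused producer
}
BEFORE_TPS, AFTER_TPS = 333.32, 334.44     # dense npl128, fold off vs on

net_gpu_pct = sum(DELTA_PCT.values())
wall_pct = 100.0 * (AFTER_TPS / BEFORE_TPS - 1.0)

print(f"removed : {DELTA_PCT['quantize_mmq_nvfp4'] + DELTA_PCT['rms_norm_f32<1024>']:+.1f}%")
print(f"added   : {DELTA_PCT['rms_norm_mul_quantize_nvfp4']:+.1f}%")
print(f"net GPU : {net_gpu_pct:+.1f}%   wall t/s: {wall_pct:+.2f}%")
```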
## 2. The dense quantize de-dup is NOT a clean analog of 0023 (the meaningful part is infeasible)
This is the critical finding the read-only agents missed. 0023's MoE de-dup lifted +1.73% because the
redundancy is INTRA-CALL: inside ONE mul_mat_id, the broadcast (ne11==1) up/gate quantize repeats the
SAME token n_expert_used times, all within a single ggml_cuda_mul_mat_q call, so de-dup is a contained
quantize-once + gather with a stack-scoped buffer. NO precedent issue, NO cross-node lifetime.
The DENSE redundancy is INTER-NODE and that is a different, much harder problem:
- The shared-src1 GEMMs are SEPARATE graph nodes. build_qkvz (qwen35.cpp:228-243) emits wqkv MM,
reshape, wqkv_gate MM; then ssm_beta MM, reshape, sigmoid; ssm_alpha MM, reshape, add, softplus,
mul. The four src1-sharing MMs are INTERSPERSED with reshape/sigmoid/softplus/add/mul - they are
NOT consecutive graph nodes, so ggml's consecutive-op fusion framework cannot match them. A
contained, single-handler de-dup (the only kind with safe buffer lifetime, like 0023) is impossible
for the qkvz bucket.
- De-duping them therefore requires graph-level CSE: recognize 2-4 non-adjacent MUL_MAT nodes share
src1, quantize once, and keep that pool buffer alive across the intervening nodes until the last
sibling GEMM consumes it - under CUDA-graph CAPTURE (buffer addresses baked at capture, the pool
must not recycle the buffer between siblings). This is the SAME high-plumbing scratch-pool +
src1_prequantized path the fold needs, with real implementation risk (graph-capture
non-determinism / crashes), and NO precedent in the tree. fold-design's "much less plumbing"
framing for this alternative is optimistic - the hard part (inter-node buffer sharing under graphs)
is common to both.
- The qkvz bucket (the big one, ~192 redundant quants/step ~= 1.4%) is exactly the inter-node case.
- The ONLY contained, tractable dense de-dup is the FFN {MUL_MAT,MUL_MAT,GLU} (consecutive; build_ffn
LLM_FFN_PAR). But that existing fusion executes ONLY via ggml_cuda_mul_mat_vec (gated on batch<=8;
ggml_cuda_should_fuse_mul_mat_vec_q). At npl128 (m=128) it falls through to two separate MMQ nodes.
Adding an MMQ-path branch to quantize src1 once captures only the FFN redundancy = ~64 quants/step
~= 0.5% of the step - below the +-0.3-0.5% bench noise the runs already show, not worth a new
fusion code path + the risk to the byte gate.
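The consecutive-versus-interspersed distinction can be illustrated with a toy pattern matcher. This is not ggml's fusion code; the op lists are a flattened paraphrase of the node order described above, and the FFN list in particular is an assumed ordering for illustration.

```python
def find_consecutive(ops, pattern):
    """Return start indices where `pattern` occurs as a contiguous run in `ops`."""
    n = len(pattern)
    return [i for i in range(len(ops) - n + 1) if ops[i:i + n] == pattern]

# Assumed FFN node order: norm, gain, up GEMM, gate GEMM, GLU, down GEMM.
FFN = ["RMS_NORM", "MUL", "MUL_MAT", "MUL_MAT", "GLU", "MUL_MAT"]

# build_qkvz order as described in the text: four src1-sharing GEMMs with glue between them.
QKVZ = ["MUL_MAT", "RESHAPE", "MUL_MAT",            # wqkv, reshape, wqkv_gate
        "MUL_MAT", "RESHAPE", "SIGMOID",            # ssm_beta, reshape, sigmoid
        "MUL_MAT", "RESHAPE", "ADD", "SOFTPLUS", "MUL"]  # ssm_alpha, reshape, add, softplus, mul

ffn_hits = find_consecutive(FFN, ["MUL_MAT", "MUL_MAT", "GLU"])
qkvz_hits = find_consecutive(QKVZ, ["MUL_MAT"] * 4)

print("FFN {MUL_MAT,MUL_MAT,GLU} matches at:", ffn_hits)
print("qkvz four-sibling run matches at    :", qkvz_hits)
print("qkvz GEMM positions                 :", [i for i, op in enumerate(QKVZ) if op == "MUL_MAT"])
```

A matcher that only sees contiguous runs finds the FFN seam but never sees the four qkvz siblings as one unit, which is why that case needs graph-level bookkeeping instead.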
## 3. The pointwise + get_rows folds are not clean wins either
- Pointwise: the cheap ops are ALREADY fused in the tree - {RMS_NORM,MUL(,ADD)} -> rms_norm_fused
(ggml-cuda.cu:4194/4199), {SSM_CONV,(ADD),SILU} -> ssm_conv (4204/4209), {UNARY(silu/sigmoid/
softplus),MUL} -> unary_mul (4216). The residual silu 5.36 + k_bin_bcast 4.64 ms is the un-fusable
remainder inside the GDN gating chain feeding the 50% binding gated_delta_net kernel; GAP_PROGRESS
measured the whole gating-glue ceiling at only 3.35% and folding further means surgery on the binding
kernel. Lower-confidence, needs a GPU node-scoping pass - not a clean lever.
- get_rows: 0019 already folded the main recurrent-state gathers; the residual ~2% is an unquantified
mix of the conv-tap (already deferred as "tiny") and leftovers - under-scoped, not a confirmed win.
## 4. Tree state / gates
- Dev tree clean at HEAD f7409c2 (git diff empty; mmq.cuh/mmq.cu/quantize.cu no uncommitted diff -
no P2a remap to revert). build-cuda = the clean 0023 build (371.81 t/s dense @npl128, per reprofile).
- No code change made -> no md5 gate needed (baseline 27b = 5951a5b4, 35b = 07db32c2 unchanged).
- No GPU build/bench launched (no buildable candidate clears the ROI bar; re-confirming the baseline
the reprofile already measured would waste the GPU window).
## 5. FINAL BIT-EXACT CEILING
Dense q36-27b-nvfp4: 371.81 t/s @npl128 = 95.1% of vLLM 391. MoE q36-35b-a3b: 758.1 @npl128 (0023).
This is the bit-exact f32 decode plateau and there is no single decisive bit-exact lever left:
- gated_delta_net recurrence (~50%) is at 84.6% peak LPDDR5x BW, PAST vLLM (82.4%) - DRAM floor.
- mul_mat_q NVFP4 GEMM (~27%), flash_attn (~3.4%), lm_head nvjet (~3.5%) have NO bit-exact lever
(any knob changes a K-/softmax-reduction order vs the f32 reference).
- The remaining ~5% of small foldable buckets is real GPU-time on the critical path, but the largest
piece (the fp4 fold, ~1.5-2.5%) is EMPIRICALLY FLAT, the next (dense qkvz quant de-dup, ~1.4%) has
no clean inter-node construction and shares the fold's flat-risk, and the contained remainder is
each <=0.5% (FFN de-dup) or entangled in the binding kernel (pointwise) - none clears the
plumbing/risk bar for a 1:1 single-stream gain that the bench noise floor (~0.3-0.5%) can swallow.
FRAME: vLLM 391 is a LOWER-precision (w4a4) reference; bit-exact-vs-vLLM is impossible. llama at 371.81
bit-exact f32 is doing strictly MORE precise arithmetic at ~95% of vLLM's throughput. The only thing
that goes materially further is bf16 state (precision change, KL-gated, out of scope, shelved).
RECOMMENDATION: ship the 0023 plateau as the bit-exact decode result. Do not build the fp4 fold (flat).
If a future agent insists on the dense qkvz de-dup, it must first build the inter-node graph-CSE
scratch-pool/CUDA-graph-lifetime plumbing and prove on a same-build flag toggle that decode_agg lifts
above the +-0.5% noise AND batched md5 == 5951a5b4 - and should expect the Lever-2 flat outcome.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,138 +0,0 @@
# GB10 same-day head-to-head server sweep: llama-server (paged) vs vLLM
Date: 2026-06-23. Hardware: GB10 / DGX Spark (sm_121, 128 GB LPDDR5x unified, ~273 GB/s
weight-read floor). GPU otherwise idle (sibling vLLM had exited; LocalAI docker workers
stopped for the run).
This sweep **replaces** the stale carried "~75-80% of vLLM" figure (commit 07985ba4,
pre-co-batching, single-point). It measures *real serving* steady-state aggregate decode
throughput across the full concurrency curve, for three model classes, with one identical
client driving both engines.
## Method
- **llama**: `llama-server` from the paged dev tree (`~/llama-paged-dev/build-cuda`, HEAD =
patch 0013 / commit 17d97cb), `LLAMA_KV_PAGED=1`, `-fa on -ngl 999 --parallel 128 -c 65536`.
- **vLLM**: 0.23.0, `vllm serve --enforce-eager --enable-prefix-caching --max-num-seqs >=128
--max-model-len 4096` (APC on, eager per the GB10 no-CUDA-graphs edge).
- **Client** (`sweep_client2.py`): N concurrent **non-streaming** `/v1/completions`, short
shared prompt, `max_tokens=min_tokens=256`, `ignore_eos=true`. Aggregate decode tok/s =
total generated tokens / wall. Non-streaming keeps the Python client off the critical path
(one JSON parse per request, not per token), so the **server** is the bottleneck. Validated:
vLLM pushed 4227 tok/s through the exact same client where llama topped out at 2087, so the
client is not the cap. Both engines use the identical client + prompt -> apples-to-apples.
- npl (concurrency) sweep: 8 / 32 / 64 / 128.
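The measurement shape (N concurrent non-streaming requests, aggregate = total generated tokens / wall) can be sketched as follows. This is not `sweep_client2.py`: `fake_request` is a hypothetical stand-in that sleeps instead of calling `/v1/completions`, so the printed numbers are meaningless except as a demonstration of the accounting.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def aggregate_decode_tps(send_request, npl, max_tokens=256):
    """Fire `npl` concurrent requests; return total generated tokens / wall seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=npl) as pool:
        generated = list(pool.map(lambda _: send_request(max_tokens), range(npl)))
    wall = time.perf_counter() - start
    return sum(generated) / wall

def fake_request(max_tokens):
    """Hypothetical stand-in for one non-streaming completion; returns tokens generated."""
    time.sleep(0.05)          # pretend the server took 50 ms
    return max_tokens         # min_tokens == max_tokens, so every request returns the full count

for npl in (8, 32, 64, 128):
    print(npl, round(aggregate_decode_tps(fake_request, npl)))
```

One result per request, parsed once, is what keeps the client off the critical path in this design; a streaming client would do per-token work instead.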
Quant parity:
- Dense: llama **NVFP4-dense GGUF** (weight-only FP4, 16-bit compute) vs vLLM **NVFP4A16**
(weight FP4, 16-bit activation) -> matched precision class.
- Small: llama **Q8_0** vs vLLM **bf16** (closest loadable form).
- MoE: llama **mxfp4** GGUF. **vLLM could not serve this MoE on GB10 at all** (see below), so
there is no vLLM MoE column.
## Results: aggregate decode tok/s (higher is better)
### Dense 32B (llama NVFP4-dense vs vLLM NVFP4A16)
| npl | llama (NVFP4) | vLLM (NVFP4A16) | llama % of vLLM |
|----:|--------------:|----------------:|----------------:|
| 8 | 83.2 | 85.9 | **96.9%** |
| 32 | 228.9 | 301.3 | 76.0% |
| 64 | 367.1 | 507.8 | 72.3% |
| 128 | 520.6 | 604.0 | 86.2% |
Plateau: neither has plateaued at 128 (both still climbing, weight-read bound). llama is at
**parity at batch-8** (97%), dips to ~72% mid-curve (npl 32-64), recovers to 86% at 128.
### Small Qwen3-0.6B (llama Q8_0 vs vLLM bf16)
| npl | llama (Q8_0) | vLLM (bf16) | llama % of vLLM |
|----:|-------------:|------------:|----------------:|
| 8 | 911.3 | 923.0 | **98.7%** |
| 32 | 1701.6 | 2531.4 | 67.2% |
| 64 | 1911.7 | 3497.1 | 54.7% |
| 128 | 2087.6 | 4227.6 | 49.4% |
Plateau: **llama plateaus hard** at ~2.0-2.1k by npl 64-128 (+9% from 64->128). vLLM keeps
scaling (3497 -> 4227). For a tiny runtime-bound model, vLLM's scheduler/batching amortizes
better; llama-server's per-token host cost (sampling, detok, slot mgmt) caps it. This is the
worst llama-vs-vLLM ratio in the sweep (down to 49%).
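The ratio columns and the 64 -> 128 scaling figures follow directly from the two tables above. A quick recomputation, with the table values copied verbatim:

```python
# npl -> (llama tok/s, vLLM tok/s)
DENSE = {8: (83.2, 85.9), 32: (228.9, 301.3), 64: (367.1, 507.8), 128: (520.6, 604.0)}
SMALL = {8: (911.3, 923.0), 32: (1701.6, 2531.4), 64: (1911.7, 3497.1), 128: (2087.6, 4227.6)}

def pct_of_vllm(table):
    """llama throughput as a percentage of vLLM, per concurrency level."""
    return {npl: round(100.0 * llama / vllm, 1) for npl, (llama, vllm) in table.items()}

def scaling(table, lo, hi, col):
    """Throughput ratio between two concurrency levels; col 0 = llama, 1 = vLLM."""
    return table[hi][col] / table[lo][col]

print("dense :", pct_of_vllm(DENSE))
print("small :", pct_of_vllm(SMALL))
print(f"small 64->128: llama x{scaling(SMALL, 64, 128, 0):.2f}, "
      f"vLLM x{scaling(SMALL, 64, 128, 1):.2f}")
```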
### MoE Qwen3-Coder-30B-A3B (llama mxfp4; vLLM = NOT SERVABLE on GB10)
| npl | llama (mxfp4) | vLLM |
|----:|--------------:|-----:|
| 8 | 290.0 | n/a |
| 32 | 582.5 | n/a |
| 64 | 931.8 | n/a |
| 128 | 1041.3 | n/a |
llama-server scales cleanly to **1041 tok/s** at npl 128 with **no npl-128 expert-activation
cliff** (unlike the prior `llama-batched-bench` MoE numbers 253/505/830/620 that peaked at 64;
short-prompt continuous batching in the server avoids it).
**vLLM could not serve this MoE on GB10 (two independent failures):**
1. **bf16** (`Qwen/Qwen3-Coder-30B-A3B-Instruct`, the only HF form on the box): loads the
56.9 GB of weights, then **hangs at the MoE warmup** (`Using MoEPrepareAndFinalize
NoDPEPModular` -> `Model loading took ...`), GPU 0% util, and **takes the whole box down
(hard reboot)**. Reproduced twice. With tight `--gpu-memory-utilization` it still hangs at
the same step before the API server ever comes up.
2. **mxfp4 GGUF** (same weights llama uses): vLLM 0.23.0's GGUF loader **cannot map the fused
qwen3moe expert tensors** (`RuntimeError: Failed to map GGUF parameters (48):
['model.layers.N.mlp.experts.gate_up_proj', ...]`). Engine init fails outright.
So on GB10, llama.cpp is the *only* engine of the two that serves this 30B-A3B MoE at all -
an availability win, independent of throughput.
## Batch-8 anomaly triage (dense NVFP4) -- RESOLVED
The prior mixed-load run reported llama batch-8 steady decode at **471 ms/step (~19% of vLLM
aggregate, ~17 tok/s)**. This sweep does **not** reproduce it. Clean isolated batch-8 decode:
- `llama-server` batch-8 dense paged = **83.2 tok/s** aggregate = ~96 ms/step = **96.9% of
vLLM's 85.9** (parity, both at the LPDDR5x weight-read floor).
`llama-batched-bench` cross-check, dense NVFP4, `-npp 16 -ntg 128 -npl 1,8`, the three
hypotheses isolated (S_TG = decode tok/s aggregate at batch 8):
| config | batch-8 S_TG t/s | ms/decode-step |
|-----------------------|-----------------:|---------------:|
| paged, ctx 65536 | 90.32 | 88.6 |
| stock, ctx 65536 | 88.39 | 90.5 |
| paged, ctx 163840 | 89.33 | 89.6 |
| stock, ctx 163840 | 87.72 | 91.2 |
Conclusion: clean batch-8 dense decode is **~88-90 tok/s (~89 ms/step) regardless of all three
suspects**:
- **Paged overhead?** No -- paged is within ~2% of stock and is the *faster* of the two at both
context sizes (90.3 vs 88.4 at ctx 65536, 89.3 vs 87.7 at ctx 163840). The decode path is not
paying a paged penalty at batch-8.
- **The 163840-token ctx allocation?** No -- ctx 163840 is within ~1% of ctx 65536 (89.3 vs 90.3
paged). The large allocation does not slow steady-state decode.
- **NVFP4 decode cost?** This *is* the cost -- ~89 ms/step is the GB10 weight-read floor for a
32B model at batch-8 (it matches vLLM's 86 tok/s server figure and exceeds it at the kernel
level, 90 vs 86, in `llama-batched-bench`). It is the hardware ceiling, not a bug.
The 471 ms/step is ~5.3x slower than this clean floor and is explained by none of the three.
It was a **mixed-load artifact**: the 8 decoders were time-sharing the GPU with a concurrent
prefill (a large prompt / chunked prefill landing on the same steps). That decode-vs-prefill
contention is exactly the stall **patch 0013 (`LLAMA_PREFILL_BUDGET`)** bounds. In steady-state
isolated decode, batch-8 dense is at **parity with vLLM (97%)**, not 19%.
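The ms/step column and the anomaly ratio are simple conversions between aggregate tok/s and step time at a fixed batch. A short check using the table values above:

```python
BATCH = 8

# config -> batch-8 S_TG in tok/s, copied from the table
S_TG = {
    "paged, ctx 65536": 90.32,
    "stock, ctx 65536": 88.39,
    "paged, ctx 163840": 89.33,
    "stock, ctx 163840": 87.72,
}

def ms_per_step(batch, tps):
    """One decode step emits `batch` tokens, so step time = batch / tps."""
    return 1000.0 * batch / tps

def tps_from_step(batch, step_ms):
    return batch / (step_ms / 1000.0)

for config, tps in S_TG.items():
    print(f"{config:20s} {ms_per_step(BATCH, tps):5.1f} ms/step")

anomaly_tps = tps_from_step(BATCH, 471.0)   # the earlier mixed-load reading
slowdown = 471.0 / 89.0                     # vs the ~89 ms clean floor
print(f"471 ms/step = {anomaly_tps:.1f} tok/s, {slowdown:.1f}x slower than the clean floor")
```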
## Aggregate map (replaces the carried 75-80%)
llama-server (paged) as a fraction of vLLM, by regime:
- **Low concurrency (batch-8): parity, 97-99%** on both measurable classes. Both engines sit on
the LPDDR5x weight-read floor; there is nothing to win.
- **Dense 32B, mid-to-high concurrency: 72-86%.** Dips to ~72% at npl 32-64, recovers to 86% at
128. Both still climbing (weight-bound), neither plateaus by 128.
- **Small 0.6B, mid-to-high concurrency: 49-67%.** llama plateaus ~2.0k; vLLM scales to 4.2k.
Runtime/scheduler-bound regime -- vLLM's batching wins; this is llama's weakest ratio.
- **MoE 30B-A3B: llama-only.** vLLM cannot serve it on GB10 (bf16 reboots the box at MoE
warmup; GGUF expert tensors unmappable). llama serves it at 290 -> 1041 tok/s, scaling
cleanly with no npl-128 cliff.
Net: the single "75-80%" number is replaced by a curve. It is *roughly* right only for the
dense mid-band; it is too optimistic for the small model at high concurrency (49%) and moot for
MoE (where llama is the only option). The headline is parity at low concurrency and a hardware
(not engine) ceiling on dense decode.
