mirror of https://github.com/mudler/LocalAI.git synced 2026-06-27 09:57:14 -04:00

Ettore Di Giacinto 08b754f910 chore(paged): keep patches/ patch-only; README to backend root, docs to docs/

The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.

Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
  final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)

Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)

Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.

The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-27 13:20:05 +00:00

Paged bit-exactness gate - per path (canonical references)

TL;DR

The greedy decode of the paged path does not byte-match the non-paged path for the MoE model. This is a benign FP-accumulation-order difference of the paged attention reduction, KL-validated against the f16 reference. It is not a bug. The bit-exactness gate is therefore per path:

| path | model | canonical md5 |
|---|---|---|
| non-paged | MoE q36-35b-a3b-nvfp4 | `07db32c2bcb78d17a43ed18bc22705cd` |
| paged | MoE q36-35b-a3b-nvfp4 | `8cb0ce23777bf55f92f63d0292c756b0` |
| non-paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` |
| paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) |

Gate command (chat-template / conversation path):

llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
                 -n 48 --temp 0 --seed 1
# paged: prefix with  LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1

Note: use the default chat-template path (do not pass -no-cnv; raw completion lands in a different md5 namespace).

Future paged-MoE regression checks compare against the paged reference 8cb0ce23, not the non-paged 07db32c2. Dense is bit-exact across paths, so it uses the single reference 5951a5b4.
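The per-path lookup can be sketched as a small check. This is only an illustration: the function name `gate_passes` and the table layout are hypothetical, and it assumes the md5 is taken over the raw bytes of the captured gate output, which this note does not spell out.

```python
import hashlib

# Canonical references copied from the table above, keyed by (path, model).
CANONICAL_MD5 = {
    ("non-paged", "q36-35b-a3b-nvfp4"): "07db32c2bcb78d17a43ed18bc22705cd",
    ("paged", "q36-35b-a3b-nvfp4"): "8cb0ce23777bf55f92f63d0292c756b0",
    ("non-paged", "q36-27b-nvfp4"): "5951a5b4d624ce891e22ab5fca9bc439",
    ("paged", "q36-27b-nvfp4"): "5951a5b4d624ce891e22ab5fca9bc439",
}


def gate_passes(output: bytes, path: str, model: str) -> bool:
    """Compare the md5 of a captured gate output to the reference for its own path.

    Assumption: the digest is computed over the raw output bytes.
    """
    return hashlib.md5(output).hexdigest() == CANONICAL_MD5[(path, model)]
```

Keying on the pair (path, model) keeps a paged MoE run from ever being compared to the non-paged digest, while the dense entries simply share one value.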

Why dense is bit-exact but MoE is not

Dense paged decode reproduces the non-paged reduction order exactly, so dense greedy md5 is identical across paths. The MoE path runs additional kernels (the NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs between the paged and non-paged attention layouts. Over a long greedy decode this flips a small number of near-tied argmaxes, changing the byte stream. The same divergence is present on the 0028 baseline, with LLAMA_MOE_FORCE_GRAPHS on or off, and with the patch-0029 block-table cache on or off - it is a property of the paged attention path, not of any one lever.
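A toy illustration of the mechanism, not the actual kernels: floating-point addition is not associative, so summing the same terms in a different grouping can change the last bit, and a greedy argmax over a near-tie can then pick a different token. The values below are double-precision stand-ins chosen for clarity; the real accumulation happens in the GPU kernels at lower precision.

```python
def argmax(values):
    # Return the first index holding the maximum, as a greedy pick would.
    return max(range(len(values)), key=lambda i: values[i])


# Same three terms, two accumulation orders.
order_a = (0.1 + 0.2) + 0.3
order_b = 0.1 + (0.2 + 0.3)

# A competing logit that is a near-tie with the accumulated one.
rival = 0.6

pick_a = argmax([rival, order_a])  # accumulated logit edges ahead of the rival
pick_b = argmax([rival, order_b])  # exact tie, so the first candidate wins

print(order_a == order_b, pick_a, pick_b)
```

Once one greedy pick differs, every later token is conditioned on a different prefix, which is why a handful of flips is enough to change the md5 of the whole decode.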

KL evidence that the paged path is sound (the load-bearing check)

Measured with llama-perplexity --kl-divergence on q36-35b-a3b-nvfp4.gguf (16 chunks, -c 512 -ngl 99 --seed 1), with base logits taken from the f16 reference (darwin_36b_opus/f16.gguf, PPL 7.3734):

| comparison | PPL(Q) | KL divergence | Same top p | Cor |
|---|---|---|---|---|
| f16 reference | 7.3734 | - | - | - |
| non-paged vs f16 | 7.3896 | 0.136597 +/- 0.003157 | 84.314% | 97.68% |
| paged vs f16 | 7.4009 | 0.136000 +/- 0.003285 | 84.828% | 97.58% |
| paged vs non-paged (direct) | 7.4009 (base 7.3818) | 0.050011 +/- 0.001653 | 89.044% | 99.04% |

Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%.
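For readers unfamiliar with the two headline columns, here is a minimal sketch of how a per-token KL divergence and a same-top-token check can be computed from two logit vectors. It is an assumption-laden illustration, not the llama-perplexity source: the helper names are hypothetical, and the tool's exact averaging and numerical handling may differ.

```python
import math


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(base_logits, test_logits):
    # KL(base || test) for one token position, in nats.
    p = softmax(base_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def same_top(base_logits, test_logits):
    # True when both distributions rank the same token first.
    top = lambda xs: max(range(len(xs)), key=lambda i: xs[i])
    return top(base_logits) == top(test_logits)


base = [2.0, 1.0, 0.5]
test = [1.9, 1.2, 0.4]
print(kl_divergence(base, test), same_top(base, test))
```

Averaging `kl_divergence` over token positions gives a mean KL of the kind reported above, and the fraction of positions where `same_top` holds corresponds to the same-top-p column.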

Verdict: BENIGN

Paged does not diverge from the f16 ground truth more than non-paged does. KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) = 7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29 error bars). A real paged-MoE correctness bug would push paged measurably further from f16; it does not (it is marginally closer).
Paged and non-paged cluster together. They agree with each other (KLD 0.050, 89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, 84.3-84.8% same-top-p), with essentially zero probability bias. That is the signature of two equivalent FP-reorderings of the same quantized model, both equally approximating the f16 ground truth - not a quality regression.
The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model, logit near-ties are abundant, so a different-but-equivalent reduction order flips ~11% of argmaxes with no quality cost (as shown by the equal KLD-to-f16 and the near-zero Delta-p bias).
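The arithmetic behind these three points can be re-derived from the table values. The snippet below is a sketch that only restates the reported numbers; the variable names are illustrative and nothing here is measured anew.

```python
# Values copied from the KL table above.
kld_nonpaged_f16 = 0.136597
kld_paged_f16 = 0.136000
kld_paged_nonpaged = 0.050011
ppl_nonpaged = 7.3896
ppl_paged = 7.4009
same_top_direct_pct = 89.044

# Point 1: paged is no further from f16 than non-paged is.
paged_not_further = kld_paged_f16 <= kld_nonpaged_f16
ppl_gap = ppl_paged - ppl_nonpaged

# Point 2: the two paths are closer to each other than either is to f16.
cluster_together = kld_paged_nonpaged < min(kld_paged_f16, kld_nonpaged_f16)

# Point 3: share of positions where the top token differs between paths.
flipped_pct = 100.0 - same_top_direct_pct

print(paged_not_further, cluster_together, round(ppl_gap, 4), round(flipped_pct, 3))
```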

Therefore the canonical gate is per path, and 8cb0ce23 is the validated paged reference for the MoE deployment path.
