Files
LocalAI/backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md
Ettore Di Giacinto 08b754f910 chore(paged): keep patches/ patch-only; README to backend root, docs to docs/
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.

Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
  final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)

Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)

Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.

The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 13:20:05 +00:00

3.8 KiB

Paged bit-exactness gate - per path (canonical references)

TL;DR

The greedy decode of the paged path does not byte-match the non-paged path for the MoE model. This is a benign FP-accumulation-order difference of the paged attention reduction, KL-validated against the f16 reference. It is not a bug. The bit-exactness gate is therefore per path:

path model canonical md5
non-paged MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd
paged MoE q36-35b-a3b-nvfp4 8cb0ce23777bf55f92f63d0292c756b0
non-paged dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439
paged dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 (bit-exact to non-paged)

Gate command (chat-template / conversation path):

llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
                 -n 48 --temp 0 --seed 1
# paged: prefix with  LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1

Note: use the default chat-template path (do not pass -no-cnv; raw completion lands in a different md5 namespace).

Future paged-MoE regressions compare to the PAGED reference 8cb0ce23, not to the non-paged 07db32c2. Dense is bit-exact across paths, so dense uses the single reference 5951a5b4.

Why dense is bit-exact but MoE is not

Dense paged decode reproduces the non-paged reduction order exactly, so dense greedy md5 is identical across paths. The MoE path runs additional kernels (the NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs between the paged and non-paged attention layouts. Over a long greedy decode this flips a small number of near-tied argmaxes, changing the byte stream. The same divergence is present on the 0028 baseline, with LLAMA_MOE_FORCE_GRAPHS on or off, and with the patch-0029 block-table cache on or off - it is a property of the paged attention path, not of any one lever.

KL evidence that the paged path is sound (the load-bearing check)

llama-perplexity --kl-divergence on q36-35b-a3b-nvfp4.gguf, 16 chunks, -c 512 -ngl 99 --seed 1, base logits from the f16 reference (darwin_36b_opus/f16.gguf, PPL 7.3734):

comparison PPL(Q) KL divergence Same top p Cor
f16 reference 7.3734 - - -
non-paged vs f16 7.3896 0.136597 +/- 0.003157 84.314% 97.68%
paged vs f16 7.4009 0.136000 +/- 0.003285 84.828% 97.58%
paged vs non-paged (direct) 7.4009 (base 7.3818) 0.050011 +/- 0.001653 89.044% 99.04%

Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%.

Verdict: BENIGN

  • Paged does not diverge from the f16 ground truth more than non-paged does. KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) = 7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29 error bars). A real paged-MoE correctness bug would push paged measurably further from f16; it does not (it is marginally closer).
  • Paged and non-paged cluster together. They agree with each other (KLD 0.050, 89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, ~84% same-top-p), with essentially zero probability bias. That is the signature of two equivalent FP-reorderings of the same quantized model, both equally approximating the f16 ground truth - not a quality regression.
  • The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model logit near-ties are abundant, so a different-but-equivalent reduction order flips ~11% of argmaxes with no quality cost (proven by the equal KLD-to-f16 and zero Delta-p bias).

Therefore the canonical gate is per path, and 8cb0ce23 is the validated paged reference for the MoE deployment path.