The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.
Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)
Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)
Repoint every reference to the moved files:
- README internal links (docs/ + the .github links drop from 5x ../ to 3x ../)
- .agents/llama-cpp-localai-paged-backend.md
- .github/scripts/paged-canary-apply.sh
- .github/workflows/llama-cpp-paged-canary.yml
- the wrapper Makefile
- backend/cpp/llama-cpp/grpc-server.cpp
- backend/index.yaml
- docs/content/features/backends.md
- gallery/index.yaml
The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Paged bit-exactness gate - per path (canonical references)
TL;DR
The greedy decode of the paged path does not byte-match the non-paged path for the MoE model. This is a benign FP-accumulation-order difference of the paged attention reduction, KL-validated against the f16 reference. It is not a bug. The bit-exactness gate is therefore per path:
| path | model | canonical md5 |
|---|---|---|
| non-paged | MoE q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd |
| paged | MoE q36-35b-a3b-nvfp4 | 8cb0ce23777bf55f92f63d0292c756b0 |
| non-paged | dense q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 |
| paged | dense q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 (bit-exact to non-paged) |
Gate command (chat-template / conversation path):
    llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
        -n 48 --temp 0 --seed 1
    # paged: prefix with LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
Note: use the default chat-template path. Do not pass -no-cnv: raw completion
produces a different byte stream, so its md5 is not comparable to the references above.
Future paged-MoE regressions compare to the PAGED reference 8cb0ce23, not to
the non-paged 07db32c2. Dense is bit-exact across paths, so dense uses the
single reference 5951a5b4.
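
The per-path lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's gate script: the names `REFERENCE_MD5`, `output_md5` and `gate_passes` are hypothetical, and exactly which bytes of the `llama-completion` output get hashed is an assumption, since this note does not specify it.

```python
import hashlib

# Canonical per-path references, copied from the table above.
# Keys are (path, model family); the short family names are illustrative.
REFERENCE_MD5 = {
    ("non-paged", "moe"): "07db32c2bcb78d17a43ed18bc22705cd",
    ("paged", "moe"): "8cb0ce23777bf55f92f63d0292c756b0",
    ("non-paged", "dense"): "5951a5b4d624ce891e22ab5fca9bc439",
    ("paged", "dense"): "5951a5b4d624ce891e22ab5fca9bc439",
}


def output_md5(output: bytes) -> str:
    """Return the hex md5 of the captured decode output."""
    return hashlib.md5(output).hexdigest()


def gate_passes(path: str, model: str, output: bytes) -> bool:
    """Compare the output against the reference for its own path.

    Assumption: `output` is the exact byte stream the references were
    computed from (for example the captured stdout of the gate command).
    """
    return output_md5(output) == REFERENCE_MD5[(path, model)]
```

The point of keying on the path is that a paged MoE run is never compared to 07db32c2, while both dense entries share 5951a5b4.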
Why dense is bit-exact but MoE is not
Dense paged decode reproduces the non-paged reduction order exactly, so dense
greedy md5 is identical across paths. The MoE path runs additional kernels (the
NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs
between the paged and non-paged attention layouts. Over a long greedy decode this
flips a small number of near-tied argmaxes, changing the byte stream. The same
divergence is present on the 0028 baseline, with LLAMA_MOE_FORCE_GRAPHS on or
off, and with the patch-0029 block-table cache on or off - it is a property of
the paged attention path, not of any one lever.
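
The mechanism is easiest to see on a toy scale. The sketch below is a plain-Python illustration with made-up values, not the CUDA kernels: it sums the same three terms in two different orders, gets results that differ in the last bit, and shows that this is enough to change a greedy argmax when a competing logit is near-tied.

```python
from functools import reduce


def reduce_sum(terms):
    """Left-to-right floating-point sum, standing in for one reduction order."""
    return reduce(lambda acc, x: acc + x, terms)


def argmax(values):
    """Index of the largest value; ties resolve to the lowest index."""
    return max(range(len(values)), key=values.__getitem__)


terms = [0.1, 0.2, 0.3]

# Same terms, two accumulation orders (illustrative values only).
logit_order_a = reduce_sum(terms)                  # (0.1 + 0.2) + 0.3
logit_order_b = reduce_sum(list(reversed(terms)))  # (0.3 + 0.2) + 0.1

# A competing token whose logit is near-tied with the reduced one.
competitor = 0.6

token_a = argmax([competitor, logit_order_a])
token_b = argmax([competitor, logit_order_b])

print(logit_order_a == logit_order_b)  # False: the sums differ in the last bit
print(token_a, token_b)                # the greedy pick differs between orders
```

Neither order is more correct than the other. Both are valid roundings of the same exact sum, which is why the byte stream can change without any loss of quality.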
KL evidence that the paged path is sound (the load-bearing check)
llama-perplexity --kl-divergence on q36-35b-a3b-nvfp4.gguf, 16 chunks,
-c 512 -ngl 99 --seed 1, base logits from the f16 reference
(darwin_36b_opus/f16.gguf, PPL 7.3734):
| comparison | PPL(Q) | KL divergence | Same top p | Cor |
|---|---|---|---|---|
| f16 reference | 7.3734 | - | - | - |
| non-paged vs f16 | 7.3896 | 0.136597 +/- 0.003157 | 84.314% | 97.68% |
| paged vs f16 | 7.4009 | 0.136000 +/- 0.003285 | 84.828% | 97.58% |
| paged vs non-paged (direct) | 7.4009 (base 7.3818) | 0.050011 +/- 0.001653 | 89.044% | 99.04% |
Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%.
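
For readers unfamiliar with the columns, the sketch below computes a per-token KL divergence and a same-top-token check from two logit vectors. It assumes the conventional definition, with the base model as the reference distribution, KL(base || test) = sum_i p_i * (log p_i - log q_i). The helper names are hypothetical, and this is not the `llama-perplexity` implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(base_logits, test_logits):
    """KL(base || test) for one token position, in nats."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def same_top(base_logits, test_logits):
    """True when both models pick the same greedy token."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)
    return argmax(base_logits) == argmax(test_logits)


# Toy logits (made up): a near-tie flips the top token at a tiny KL cost.
base = [2.00, 2.01, -1.0]
test = [2.01, 2.00, -1.0]
print(kl_divergence(base, test), same_top(base, test))
```

Averaging `kl_divergence` over all evaluated positions gives a mean KL of the kind reported in the table, and the fraction of positions where `same_top` holds corresponds to the "Same top p" column.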
Verdict: BENIGN
- Paged does not diverge from the f16 ground truth more than non-paged does. KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) = 7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29 error bars). A real paged-MoE correctness bug would push paged measurably further from f16; it does not (it is marginally closer).
- Paged and non-paged cluster together. They agree with each other (KLD 0.050, 89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, ~84% same-top-p), with essentially zero probability bias. That is the signature of two equivalent FP-reorderings of the same quantized model, both equally approximating the f16 ground truth - not a quality regression.
- The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model, logit near-ties are abundant, so a different-but-equivalent reduction order flips ~11% of argmaxes with no quality cost (as shown by the equal KLD-to-f16 and the near-zero Delta-p bias).
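
The three bullets above can be restated as explicit checks over the numbers in the KL table. This is only a sketch: the constants are copied from this note, but the pass criteria (for example, using the non-paged error bar as the tolerance) are assumptions chosen for illustration, not a threshold the project defines.

```python
# (mean, +/- error) pairs copied from the KL table in this note.
KLD_NONPAGED_F16 = (0.136597, 0.003157)
KLD_PAGED_F16 = (0.136000, 0.003285)
KLD_PAGED_NONPAGED = (0.050011, 0.001653)

PPL_NONPAGED = 7.3896
PPL_PAGED = 7.4009
PPL_ERROR_BAR = 0.29          # +/- error bar quoted in the verdict

SAME_TOP_DIRECT_PCT = 89.044  # paged vs non-paged, same top token


def is_benign() -> bool:
    """Illustrative criteria; the tolerances here are assumptions."""
    # 1. Paged is no further from f16 than non-paged, within the error bar.
    not_further = KLD_PAGED_F16[0] <= KLD_NONPAGED_F16[0] + KLD_NONPAGED_F16[1]
    # 2. The perplexities agree well inside the quoted error bar.
    ppl_close = abs(PPL_PAGED - PPL_NONPAGED) < PPL_ERROR_BAR
    # 3. The two paths are closer to each other than either is to f16.
    cluster = KLD_PAGED_NONPAGED[0] < min(KLD_PAGED_F16[0], KLD_NONPAGED_F16[0])
    return not_further and ppl_close and cluster


# Share of positions where the greedy token differs between the two paths.
flipped_pct = 100.0 - SAME_TOP_DIRECT_PCT

print(is_benign(), round(flipped_pct, 3))
```

The flipped share works out to roughly 11%, which matches the figure quoted in the last bullet.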
Therefore the canonical gate is per path, and 8cb0ce23 is the validated paged
reference for the MoE deployment path.