# Paged bit-exactness gate - per path (canonical references) ## TL;DR The greedy decode of the **paged** path does not byte-match the **non-paged** path for the MoE model. This is a **benign FP-accumulation-order difference of the paged attention reduction**, KL-validated against the f16 reference. It is **not a bug**. The bit-exactness gate is therefore **per path**: | path | model | canonical md5 | |------|-------|---------------| | non-paged | MoE q36-35b-a3b-nvfp4 | `07db32c2bcb78d17a43ed18bc22705cd` | | paged | MoE q36-35b-a3b-nvfp4 | `8cb0ce23777bf55f92f63d0292c756b0` | | non-paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` | | paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) | Gate command (chat-template / conversation path): ``` llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \ -n 48 --temp 0 --seed 1 # paged: prefix with LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 ``` Note: use the default chat-template path (do **not** pass `-no-cnv`; raw completion lands in a different md5 namespace). **Future paged-MoE regressions compare to the PAGED reference `8cb0ce23`, not to the non-paged `07db32c2`.** Dense is bit-exact across paths, so dense uses the single reference `5951a5b4`. ## Why dense is bit-exact but MoE is not Dense paged decode reproduces the non-paged reduction order exactly, so dense greedy md5 is identical across paths. The MoE path runs additional kernels (the NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs between the paged and non-paged attention layouts. Over a long greedy decode this flips a small number of near-tied argmaxes, changing the byte stream. The same divergence is present on the 0028 baseline, with `LLAMA_MOE_FORCE_GRAPHS` on or off, and with the patch-0029 block-table cache on or off - it is a property of the paged attention path, not of any one lever. ## KL evidence that the paged path is sound (the load-bearing check) `llama-perplexity --kl-divergence` on `q36-35b-a3b-nvfp4.gguf`, 16 chunks, `-c 512 -ngl 99 --seed 1`, base logits from the f16 reference (`darwin_36b_opus/f16.gguf`, PPL 7.3734): | comparison | PPL(Q) | KL divergence | Same top p | Cor | |------------|-------:|--------------:|-----------:|----:| | f16 reference | 7.3734 | - | - | - | | **non-paged** vs f16 | 7.3896 | 0.136597 +/- 0.003157 | 84.314% | 97.68% | | **paged** vs f16 | 7.4009 | 0.136000 +/- 0.003285 | 84.828% | 97.58% | | paged vs non-paged (direct) | 7.4009 (base 7.3818) | 0.050011 +/- 0.001653 | 89.044% | 99.04% | Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%. ### Verdict: BENIGN - **Paged does not diverge from the f16 ground truth more than non-paged does.** KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) = 7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29 error bars). A real paged-MoE correctness bug would push paged measurably *further* from f16; it does not (it is marginally closer). - **Paged and non-paged cluster together.** They agree with each other (KLD 0.050, 89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, ~84% same-top-p), with essentially zero probability bias. That is the signature of two equivalent FP-reorderings of the same quantized model, both equally approximating the f16 ground truth - not a quality regression. - The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model logit near-ties are abundant, so a different-but-equivalent reduction order flips ~11% of argmaxes with no quality cost (proven by the equal KLD-to-f16 and zero Delta-p bias). Therefore the canonical gate is per path, and `8cb0ce23` is the validated paged reference for the MoE deployment path.