Mirror of paged-dev commit e2acb3b (lever 5). get_block_table() is recomputed
once per full-attention layer per decode step, but the KV cell layout is fixed
for the whole step (it only changes in apply()). This caches the table the first
time it is built in a step and memcpy-reuses the identical bytes for the rest,
invalidating in apply(). Bit-exact; toggle off with LLAMA_PAGED_NO_BT_CACHE=1.
Host-side get_block_table time (llama-batched-bench, npp128 ntg128 npl128,
cache OFF -> ON): MoE 112.94 -> 14.82 ms (-87%), dense 193.78 -> 16.90 ms (-91%).
Dense decode is partly host-bound and gains (TG 364.8 -> 374.7 t/s, ~96% of the
vLLM 391 t/s @npl128 reference); MoE decode is compute-bound (FP4 GEMM) so the
saved host time is off the critical path and MoE TG is flat. Details in
LEVER5_HOSTPIPE_RESULTS.md.
Also records the per-path bit-exactness gate (PAGED_BITEXACT_NOTE.md): the
paged-MoE greedy md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a
benign FP-accumulation-order difference of the paged attention reduction, not a
bug. KL-validated vs the f16 reference (16 chunks, c512): KLD(paged||f16) =
0.13600 <= KLD(nonpaged||f16) = 0.13660, PPL(paged) = 7.4009 ~ PPL(nonpaged) =
7.3896 (within +/- 0.29). Canonical references are now per path: non-paged MoE
07db32c2 and paged MoE 8cb0ce23; dense is bit-exact across paths (5951a5b4).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>