LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-27 09:57:14 -04:00


Ettore Di Giacinto db6ebc53b2 feat(paged): block-table within-step host cache (patch 0029)

Mirror of paged-dev commit e2acb3b (lever 5). get_block_table() is recomputed
once per full-attention layer per decode step, but the KV cell layout is fixed
for the whole step (it only changes in apply()). This caches the table the first
time it is built in a step and memcpy-reuses the identical bytes for the rest,
invalidating in apply(). Bit-exact; toggle off with LLAMA_PAGED_NO_BT_CACHE=1.
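
The caching pattern described above can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the code in patch 0029: the `BlockTableCache` struct, its fields, the cell-to-block arithmetic, and the `bt_cache_enabled_from_env()` helper are all hypothetical. Only the names `get_block_table()`, `apply()`, and `LLAMA_PAGED_NO_BT_CACHE` come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for the paged KV cache; not the llama.cpp types.
struct BlockTableCache {
    std::vector<int32_t> cells;   // first cell of each logical block; changes only in apply()
    int32_t block_size;
    bool enabled;                 // false mimics LLAMA_PAGED_NO_BT_CACHE=1
    std::vector<int32_t> cached;  // table built the first time in the current step
    bool valid = false;
    int builds = 0;               // counts full rebuilds, for illustration only

    BlockTableCache(std::vector<int32_t> c, int32_t bs, bool on)
        : cells(std::move(c)), block_size(bs), enabled(on) {}

    // Called once per full-attention layer per decode step.
    void get_block_table(int32_t * dst, std::size_t n) {
        assert(n == cells.size());
        if (!(enabled && valid)) {
            // Slow path: recompute the table from the current cell layout.
            cached.resize(cells.size());
            for (std::size_t i = 0; i < cells.size(); ++i) {
                cached[i] = cells[i] / block_size;
            }
            ++builds;
            valid = enabled;  // stays false when the cache is toggled off
        }
        // Fast path and slow path both hand out the same bytes.
        std::memcpy(dst, cached.data(), n * sizeof(int32_t));
    }

    // The only place the layout changes, so the only place to invalidate.
    void apply(std::vector<int32_t> new_cells) {
        cells = std::move(new_cells);
        valid = false;
    }
};

// Assumption: any set value of the variable disables the cache in this sketch.
inline bool bt_cache_enabled_from_env() {
    return std::getenv("LLAMA_PAGED_NO_BT_CACHE") == nullptr;
}
```

Because every caller receives a byte-for-byte copy of the table that a rebuild would have produced, a cache of this shape cannot change numerical results; it only removes repeated host work within a step.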

Host-side get_block_table time (llama-batched-bench, npp128 ntg128 npl128,
cache OFF -> ON): MoE 112.94 -> 14.82 ms (-87%), dense 193.78 -> 16.90 ms (-91%).
Dense decode is partly host-bound, so it gains: TG 364.8 -> 374.7 t/s, ~96% of
the vLLM 391 t/s @npl128 reference. MoE decode is compute-bound (FP4 GEMM), so
the saved host time is off the critical path and MoE TG is flat. Details in
LEVER5_HOSTPIPE_RESULTS.md.

Also records the per-path bit-exactness gate (PAGED_BITEXACT_NOTE.md): the
paged-MoE greedy md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a
benign FP-accumulation-order difference of the paged attention reduction, not a
bug. KL-validated vs the f16 reference (16 chunks, c512): KLD(paged||f16) =
0.13600 <= KLD(nonpaged||f16) = 0.13660, PPL(paged) = 7.4009 ~ PPL(nonpaged) =
7.3896 (within +/- 0.29). Canonical references are now per path: non-paged MoE
07db32c2 and paged MoE 8cb0ce23; dense is bit-exact across paths (5951a5b4).
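
The KL comparison above can be illustrated with a small C++ sketch. It assumes the gate is the plain inequality quoted in the message (the candidate path must be no further from the f16 reference than the baseline path) and it treats the first argument of `KLD(a||b)` as the distribution being weighted. The exact definition and averaging used in PAGED_BITEXACT_NOTE.md are not shown here, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// KL(p || q) = sum_i p_i * log(p_i / q_i), in nats.
// Terms with p_i == 0 contribute nothing to the sum.
inline double kl_divergence(const std::vector<double> & p,
                            const std::vector<double> & q) {
    assert(p.size() == q.size());
    double kl = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        if (p[i] > 0.0) {
            kl += p[i] * std::log(p[i] / q[i]);
        }
    }
    return kl;
}

// Hypothetical gate: the candidate path may not diverge from the
// reference by more than the baseline path does.
inline bool kl_gate_passes(double kld_candidate, double kld_baseline) {
    return kld_candidate <= kld_baseline;
}
```

With the figures reported above, `kl_gate_passes(0.13600, 0.13660)` holds, which is the condition the message records for the paged MoE path.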

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-27 01:47:08 +00:00

0001-vendor-paged-kv-manager.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-gather-read-plan.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-paged-gather-read-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch

feat(llama-cpp/paged): cross-request prefix caching patch 0006

2026-06-22 10:14:27 +00:00

0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch

feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)

2026-06-22 10:47:10 +00:00

0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch

feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

2026-06-26 14:12:36 +00:00

0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch

paged: in-kernel decode read patch 0009 (kill the gather regression)

2026-06-22 18:04:09 +00:00

0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch

feat(paged): tile in-kernel decode read + dispatch guard (patch 0010)

2026-06-22 20:37:12 +00:00

0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch

feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011)

2026-06-22 22:38:28 +00:00

0012-paged-mask-pad-invariant-assert.patch

feat(paged): assert mask-pad invariant for the paged tile route (patch 0012)

2026-06-23 09:13:08 +00:00

0013-paged-decoupled-prefill-token-budget.patch

feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

2026-06-26 14:12:36 +00:00

0014-paged-expert-aware-moe-token-tile-cap.patch

feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap

2026-06-23 13:49:15 +00:00

0015-paged-expert-density-aware-moe-token-tile-auto-select.patch

feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

2026-06-26 14:12:36 +00:00

0016-paged-dynamic-prefill-budget-continuous-batch.patch

feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

2026-06-26 14:12:36 +00:00

0017-fp4-gemm-decode-tile-tune.patch

docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate results (patch 0017)

2026-06-24 17:58:00 +00:00

0018-qwen35-ssm-decode-inplace-state.patch

feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (patch 0018)

2026-06-24 22:45:49 +00:00

0019-qwen35-ssm-decode-fused-gather.patch

feat(paged): qwen35 SSM decode fused recurrent-state gather (patch 0019)

2026-06-24 23:47:51 +00:00

0020-qwen35-gdn-oproj-mmq-reshape.patch

feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020)

2026-06-25 10:41:38 +00:00

0021-qwen35-conv-state-inplace-fusion.patch

feat(paged): qwen35 decode conv-state in-place fusion (patch 0021)

2026-06-25 16:56:35 +00:00

0022-qwen35-gdn-recurrence-occupancy-retune.patch

feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune (patch 0022)

2026-06-25 18:34:17 +00:00

0023-qwen35moe-nvfp4-quant-dedup.patch

feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)

2026-06-25 21:49:15 +00:00

0024-paged-pool-burst-reclaim.patch

feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024)

2026-06-26 10:44:33 +00:00

0025-qwen35moe-nvfp4-moe-decode-regraph.patch

docs(paged): MoE decode re-graph lever (patch 0025) + speedup-hunt B findings

2026-06-26 14:53:14 +00:00

0026-qwen35-hybrid-perhead-ssm-state.patch

feat(paged): wire ssm_bf16_tau model option for hybrid SSM-state fast mode

2026-06-26 19:51:00 +00:00

0028-qwen35-recurrent-state-gather-fusion.patch

Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention

2026-06-26 21:38:56 +00:00

0029-qwen35-blocktable-within-step-cache.patch

feat(paged): block-table within-step host cache (patch 0029)

2026-06-27 01:47:08 +00:00

A2_CUDAGRAPH_DECODE.md

docs(paged): A.2 final synthesis - CUDA-graph decode verdict

2026-06-24 21:45:42 +00:00

A_HYBRID_PROGRESS.md

feat(paged): qwen35 hybrid per-head f32/bf16 SSM state - carry fix + gate sweep (patch 0026)

2026-06-26 17:44:05 +00:00

A_HYBRID_SSM_RESULTS.md

feat(paged): qwen35 hybrid per-head f32/bf16 SSM state - carry fix + gate sweep (patch 0026)

2026-06-26 17:44:05 +00:00

ADDITIVE_DESIGN.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

B_MOE_PROGRESS.md

docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceiling ~85% of vLLM

2026-06-26 19:10:24 +00:00

B_MOE_RESULTS.md

docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceiling ~85% of vLLM

2026-06-26 19:10:24 +00:00

BENCHMARK_PROGRESS.md

bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs vLLM 0.23.0, GB10)

2026-06-26 03:47:24 +00:00

BF16_SSM_STATE_PLAN.md

docs(paged): bf16 SSM-state build plan (PART C synthesis: edits, KL gate, bench, risks)

2026-06-25 16:46:59 +00:00

BF16_SSM_STATE_PROGRESS.md

docs(paged): bf16 SSM-state NO-SHIP - fails f32 KL gate (= vLLM's own precision)

2026-06-26 00:49:49 +00:00

BF16_SSM_STATE_RESULTS.md

docs(paged): correct vLLM recurrent-state precision (f32, not bf16)

2026-06-26 06:22:08 +00:00

BITEXACT_VS_VLLM.md

docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correction

2026-06-25 16:55:25 +00:00

BYTEGATE_PROGRESS.md

docs(paged): GDN recurrence byte-gate SETTLED - re-stream ~1.0x, build bf16 state not fused kernel

2026-06-25 15:24:49 +00:00

CONTINUOUS_BATCH_SCHEDULER_SCOPE.md

docs(paged): adversarial review of the continuous-batch scheduler scope

2026-06-23 22:48:31 +00:00

CONV_STATE_FUSION_RESULTS.md

feat(paged): qwen35 decode conv-state in-place fusion (patch 0021)

2026-06-25 16:56:35 +00:00

CRITICALPATH_GAP_ANALYSIS.md

docs(paged): SYNTHESIS - validated decode-parity picture, ranked plan, verdict

2026-06-25 15:03:18 +00:00

DECODE_GAP_STUDY.md

docs(paged): decode-step gap study vs vLLM on GB10

2026-06-22 15:44:24 +00:00

DECODE_PARITY_EXPLORE.md

docs(paged): synthesize decode-parity exploration - the o_proj MMVQ lever

2026-06-25 09:06:50 +00:00

F16_DENSE_RESIDUAL_PROBE.md

docs(paged): finalize f16 glue probe - cost analysis + build verdict

2026-06-26 09:12:55 +00:00

final_benchmark.csv

docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

2026-06-26 03:51:35 +00:00

FP4_GEMM_SCOPE_B.md

docs(paged): adversarial review of track-B FP4-GEMM parity go/no-go

2026-06-24 14:31:35 +00:00

FUTURE_LEVERS.md

docs(paged): correct vLLM recurrent-state precision (f32, not bf16)

2026-06-26 06:22:08 +00:00

GDN_DECODE_VERIFY.md

docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.4x lever

2026-06-24 11:21:44 +00:00

GDN_RECURRENCE_BYTE_GATE.md

docs(paged): FINAL DECISION - NO-BUILD fused recurrence, BUILD conv fusion + bf16 state

2026-06-25 15:27:04 +00:00

LEVER1_GATHER_PROGRESS.md

docs(paged): lever1 gather-fusion bench landed - checkpoint + attribution (patch 0028)

2026-06-26 21:41:45 +00:00

LEVER1_GATHER_RESULTS.md

Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention

2026-06-26 21:38:56 +00:00

LEVER1_OPROJ_MMQ_RESULTS.md

feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020)

2026-06-25 10:41:38 +00:00

LEVER4_PROJNVFP4_RESULTS.md

docs(paged): lever-4 KL-gate FAIL - NVFP4 MoE projections cost ~6% PPL, no-ship

2026-06-26 23:36:38 +00:00

LEVER5_HOSTPIPE_RESULTS.md

feat(paged): block-table within-step host cache (patch 0029)

2026-06-27 01:47:08 +00:00

LOCALAI_LLAMACPP_BACKEND_PLAN.md

feat(gallery): -paged suffix rename + qwopus NVFP4-MTP paged variants

2026-06-26 21:26:14 +00:00

MOE_DENSITY_AUTO_TILE.md

feat(paged): mirror MoE token-tile density-aware auto-select (patch 0015)

2026-06-23 19:04:55 +00:00

MOE_GAP_PROGRESS.md

docs(paged): both-engine MoE decode decomposition - the 15% is NOT the Marlin GEMM

2026-06-26 20:11:40 +00:00

MOE_GAP_VS_VLLM.md

docs(paged): residual-assess FINAL - MoE at bit-exact ceiling, hunt DONE

2026-06-27 01:05:06 +00:00

MOE_GROUPED_GEMM_SCOPE.md

docs(paged): scope durable grouped FP4-MMA MoE GEMM port for GB10

2026-06-23 13:17:03 +00:00

MOE_QUANT_DEDUP_RESULTS.md

feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)

2026-06-25 21:49:15 +00:00

MOE_TOKEN_TILE_CAP.md

feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap

2026-06-23 13:49:15 +00:00

NONRECURRENCE_BITEXACT.md

feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)

2026-06-25 21:49:15 +00:00

OCCUPANCY_RETUNE_RESULTS.md

feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune (patch 0022)

2026-06-25 18:34:17 +00:00

OTHER_PATHS_INVESTIGATION.md

docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target

2026-06-26 09:42:55 +00:00

P1_DYNAMIC_BUDGET_RESULTS.md

docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget

2026-06-24 10:56:13 +00:00

PAGED_BENCH.md

docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results

2026-06-22 12:59:09 +00:00

PAGED_BITEXACT_NOTE.md

feat(paged): block-table within-step host cache (patch 0029)

2026-06-27 01:47:08 +00:00

PAGED_GPU_VERIFY.md

docs(paged): record GPU correctness + CUDA backend-build verification

2026-06-22 11:50:01 +00:00

PAGED_POOL_BURST_FIX.md

feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024)

2026-06-26 10:44:33 +00:00

PAGED_VLLM_APPLES.md

docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache)

2026-06-22 14:16:52 +00:00

PAGED_VLLM_COMPARE.md

docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison

2026-06-22 13:48:01 +00:00

paged-burst-bench.cpp

feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024)

2026-06-26 10:44:33 +00:00

paged-reclaim-unit.cpp

feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024)

2026-06-26 10:44:33 +00:00

PIN_SYNC_9d5d882d.md

feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

2026-06-26 14:12:36 +00:00

qwen36_dense_decode_vs_npl.png

docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

2026-06-26 03:51:35 +00:00

qwen36_moe_decode_vs_npl.png

docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

2026-06-26 03:51:35 +00:00

QWEN36_NVFP4_BENCH.md

docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

2026-06-26 03:51:35 +00:00

RMSNORM_FP4_FOLD.md

docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling at 95% of vLLM

2026-06-25 22:42:08 +00:00

SERVER_SWEEP.md

docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM)

2026-06-23 12:22:15 +00:00

SPEEDUP_HUNT.md

docs(paged): speedup-hunt C section + final RANK + PLAN synthesis

2026-06-26 14:56:53 +00:00

SSM_DECODE_FIX_RESULTS.md

feat(paged): qwen35 SSM decode fused recurrent-state gather (patch 0019)

2026-06-24 23:47:51 +00:00

THROUGHPUT_B_P1_RESULTS.md

docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate results (patch 0017)

2026-06-24 17:58:00 +00:00

VLLM_DECODE_GROUNDING.md

docs(paged): ground vLLM 0.23.0 eager-decode architecture vs llama.cpp

2026-06-24 07:44:07 +00:00