LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-27 18:06:58 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	fb2dc33d52	docs(paged): consolidate the dev-trail docs into one canonical README The paged-attention patch directory had accumulated ~55 scattered dev docs (results, progress, scope, lever, and gap-analysis notes). Consolidate the durable content of all of them into one canonical backend/cpp/llama-cpp/patches/paged/README.md covering: what the patchset is, the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM decode path, NVFP4 FP4-MMA, the decode-first scheduler), the full 0001-0030 patch series table with bit-exact status, the GB10 benchmarks (patched-vs-stock-vs-vLLM + the Apple M4 architectural note), the dev notes (bit-exact methodology, the per-path gate, the MoE-parity conclusion, the rejected/flat levers, the opt-in bf16-SSM mode), arch+quant generality, the pin + canary maintenance policy, and the published NVFP4 gallery models. Delete the consolidated-away dev trail. Keep the three operational docs the README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md (per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the ship-as-own-backend design-of-record), plus the benchmark plots + csv. The .patch files and the unit/bench .cpp are untouched. Repoint every external reference to a deleted doc at the new README: grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a patch-internal path matcher, not a repo-doc link) is left intact. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 09:25:47 +00:00
Ettore Di Giacinto	d9d846e04b	feat(paged): patch 0003 gather-read - Gate 0 green, token-identical, additive Implements the paged-attention gather-read (the real engine compute): attention reads ONLY a sequence's used cells by gathering K, V and the kq_mask by the non-empty-cell index list before build_attn_mha. Verified token-identical to stock greedy generation, 9/9 across 3 prompts x {32,96,128} tokens on Qwen3-0.6B, with n_gather=71 < n_kv=256 confirming real compaction (not an identity no-op). Built in the additive "hook, don't edit" form: all logic in new src/paged-attn.{h,cpp} (an llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line in build_attn + two thin accessors on llama_kv_cache_context + one CMake line. No edit to llm_graph_input_attn_kv or llama-graph.h. 216 insertions; default-off behind LLAMA_KV_PAGED so stock path stays byte-identical. Key correctness finding: get_gather_idxs emits cells sorted by token position. CPU flash-attn's online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's scattered placement alone (full-window read) diverges from stock past the first block; the position-sorted gather reproduces stock's exact reduction order -> bit-identical. So 0003 is what makes paged placement token-identical under flash-attn. Verified on a dev tree at the pin (0001+0002+0003 on branch paged); not pushed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 08:26:46 +00:00
Ettore Di Giacinto	c4b4f3a3e4	docs(paged): series status 0001/0002 done+verified; honest parity note Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:05:14 +00:00
Ettore Di Giacinto	ba3fa5a633	build(paged): stacking patch-series scaffolding for llama.cpp paged attention Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one small, independently-buildable patch so the work rebases cleanly across llama.cpp bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix caching) + the regen workflow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:53:20 +00:00

4 Commits
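
The correctness finding in commit d9d846e04b (a position-sorted gather reproduces the stock reduction order under an online softmax) can be illustrated with a small standalone sketch. This is not the code from src/paged-attn.{h,cpp}: the `Cell` struct and the functions `gather_idxs_sorted` and `online_softmax_attn` are hypothetical names invented for this illustration, each cell carries a single scalar logit and a one-dimensional value instead of real K/V rows, and empty cells are assumed to be marked with a negative position.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one KV-cache slot (not a llama.cpp type).
struct Cell {
    int32_t pos;    // token position, or -1 if the cell is empty
    float   score;  // stand-in for the q.k logit of this cell
    float   value;  // stand-in for a one-dimensional V row
};

// Indices of the non-empty cells, ordered by token position.
// Assumption: this mirrors the idea of a position-sorted gather list,
// not the actual get_gather_idxs implementation.
std::vector<int32_t> gather_idxs_sorted(const std::vector<Cell>& cells) {
    std::vector<int32_t> idxs;
    for (int32_t i = 0; i < static_cast<int32_t>(cells.size()); ++i) {
        if (cells[i].pos >= 0) {
            idxs.push_back(i);
        }
    }
    std::stable_sort(idxs.begin(), idxs.end(),
                     [&cells](int32_t a, int32_t b) {
                         return cells[a].pos < cells[b].pos;
                     });
    return idxs;
}

// Single-pass (online) softmax-weighted sum, visiting cells in `order`.
// The running max and running sums are rescaled at every step, so the
// floating-point result depends on the visiting order.
float online_softmax_attn(const std::vector<Cell>& cells,
                          const std::vector<int32_t>& order) {
    assert(!order.empty());
    float running_max = -INFINITY;
    float denom = 0.0f;  // running sum of exp(score - running_max)
    float acc   = 0.0f;  // running sum of exp(score - running_max) * value
    for (int32_t i : order) {
        const Cell& c = cells[i];
        const float new_max = std::max(running_max, c.score);
        const float rescale = std::exp(running_max - new_max);  // 0 on the first step
        const float p       = std::exp(c.score - new_max);
        denom = denom * rescale + p;
        acc   = acc * rescale + p * c.value;
        running_max = new_max;
    }
    return acc / denom;
}
```

If the same four cells are stored once contiguously and once scattered across a larger array with empty slots, feeding both layouts through `gather_idxs_sorted` visits the cells in the same sequence, so `online_softmax_attn` performs the same floating-point operations in the same order and returns the same value for both. Reading the scattered layout in physical-array order would instead accumulate in a different order, which is the divergence the commit message attributes to scattered placement alone.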