LocalAI

mirror/LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Commit Graph

Author	SHA1	Message	Date

Ettore Di Giacinto

d9d846e04b

feat(paged): patch 0003 gather-read - Gate 0 green, token-identical, additive

Implements the paged-attention gather-read (the real engine compute): attention
reads ONLY a sequence's used cells by gathering K, V and the kq_mask by the
non-empty-cell index list before build_attn_mha. Verified token-identical to stock
greedy generation, 9/9 across 3 prompts x {32,96,128} tokens on Qwen3-0.6B, with
n_gather=71 < n_kv=256 confirming real compaction (not an identity no-op).
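A minimal Python sketch of the gather-read idea described in this message, for illustration only. It is not the patch's code (which the message places in C++ under src/paged-attn.{h,cpp}); the helper names `softmax_attend` and `gather`, the toy cache of 8 cells, and all values are assumptions. It shows that attending over only the gathered non-empty cells gives the same result as attending over the full window with empty cells masked out.

```python
import math

def softmax_attend(q, keys, values, mask):
    """Single-query attention: softmax(q.k + mask) applied to the value rows."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) + m for k, m in zip(keys, mask)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # exp(-inf) is 0.0 for masked cells
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def gather(rows, idxs):
    """Pick only the rows named by the index list, in that order."""
    return [rows[i] for i in idxs]

# Toy cache (hypothetical): 8 cells, only cells 1, 4 and 6 hold tokens.
n_kv = 8
used = [1, 4, 6]
K = [[0.0, 0.0] for _ in range(n_kv)]
V = [[0.0, 0.0] for _ in range(n_kv)]
K[1], K[4], K[6] = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
V[1], V[4], V[6] = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
mask = [0.0 if i in used else -math.inf for i in range(n_kv)]
q = [0.5, -0.25]

# Full-window read: all n_kv cells, empty ones masked to -inf.
full = softmax_attend(q, K, V, mask)
# Gather-read: K, V and the mask are compacted to the used cells first.
compact = softmax_attend(q, gather(K, used), gather(V, used), gather(mask, used))
n_gather = len(used)
```

Here `n_gather` is 3 against `n_kv` of 8, which mirrors the n_gather < n_kv check in the message: the compacted read does less work while producing the same attention output.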

Built in the additive "hook, don't edit" form: all logic in new src/paged-attn.{h,cpp}
(an llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line
in build_attn + two thin accessors on llama_kv_cache_context + one CMake line. No edit
to llm_graph_input_attn_kv or llama-graph.h. 216 insertions; default-off behind
LLAMA_KV_PAGED so stock path stays byte-identical.

Key correctness finding: get_gather_idxs emits cells sorted by token position. CPU
flash-attn's online softmax reduces cells in physical-array order and is FP-order-
sensitive, so 0002's scattered placement alone (full-window read) diverges from stock
past the first block; the position-sorted gather reproduces stock's exact reduction
order -> bit-identical. So 0003 is what makes paged placement token-identical under
flash-attn.
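The order sensitivity described above can be reproduced in a few lines of Python. This is a toy sketch, not the flash-attn kernel: the `cells` layout, the weights, and the helper `reduce_in_order` are hypothetical, and a plain left-to-right sum stands in for the online-softmax reduction. It shows that reducing the same values in physical-cell order can differ from the stock order, while a position-sorted gather restores the stock result exactly.

```python
def reduce_in_order(weights):
    """Left-to-right floating-point accumulation (order matters)."""
    acc = 0.0
    for w in weights:
        acc += w
    return acc

# Stock layout (assumed): cells are contiguous, so cell order == position order.
stock = reduce_in_order([0.1, 0.2, 0.3])

# Hypothetical scattered placement: physical cell index -> (token position, weight).
cells = {0: (2, 0.3), 3: (1, 0.2), 5: (0, 0.1)}

# Full-window style read: walk cells in physical-array order.
physical = reduce_in_order([cells[i][1] for i in sorted(cells)])

# Position-sorted gather: visit cells in token-position order instead.
gather_idxs = sorted(cells, key=lambda i: cells[i][0])
gathered = reduce_in_order([cells[i][1] for i in gather_idxs])

print(physical == stock)  # False: same values, different reduction order
print(gathered == stock)  # True: same order as stock, so the same bits
```

Floating-point addition is not associative, so only the gather that replays the stock ordering can be expected to match bit for bit.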

Verified on a dev tree at the pin (0001+0002+0003 on branch paged); not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-22 08:26:46 +00:00

Ettore Di Giacinto

c4b4f3a3e4

docs(paged): series status 0001/0002 done+verified; honest parity note

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-19 23:05:14 +00:00

Ettore Di Giacinto

ba3fa5a633

build(paged): stacking patch-series scaffolding for llama.cpp paged attention

Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against
the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one
small, independently-buildable patch so the work rebases cleanly across llama.cpp
bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix
caching) + the regen workflow.
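A rough sketch of what "applied in order" could look like, assuming the patches carry a numeric prefix such as 0001. The message says the real hook lives in the build's llama.cpp: target; this Python version is only an illustration, and the file names, the `llama.cpp` checkout path, and the helpers `ordered_patches` and `apply_commands` are hypothetical. The commands are built but not executed.

```python
import tempfile
from pathlib import Path

def ordered_patches(patch_dir):
    """Return *.patch files sorted by their leading number (0001, 0002, ...)."""
    return sorted(Path(patch_dir).glob("*.patch"),
                  key=lambda p: int(p.name.split("-", 1)[0]))

def apply_commands(patch_dir, src_dir):
    """Build one `git apply` command per patch, in series order (not executed)."""
    return [["git", "-C", src_dir, "apply", str(p)] for p in ordered_patches(patch_dir)]

# Demo with placeholder patch files created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0002-b.patch", "0010-c.patch", "0001-a.patch"):
        (Path(tmp) / name).touch()
    names = [p.name for p in ordered_patches(tmp)]
    cmds = apply_commands(tmp, "llama.cpp")

print(names)
```

Sorting on the parsed number rather than the raw string keeps the order stable even if a prefix is not zero-padded.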

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-19 22:53:20 +00:00

3 Commits