LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Ettore Di Giacinto 5a5d3df8c8 feat(paged): Phase 2 core - attention over paged KV matches reference

Retire the central numeric risk in the design: gathering the KV of a
sequence whose blocks are non-contiguous in the shared pool (blocks [2,1,5])
into a scratch buffer and feeding it to ggml's standard attention ops
produces correct attention.
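
To make the gather-to-scratch idea concrete, here is a minimal host-side sketch in plain C++. It is not the ggml get_rows path used by the test. The function name gather_to_scratch, the pool layout [n_blocks][block_size][d], and the parameter names are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the LocalAI or llama.cpp sources.
// Copies the rows of one sequence's blocks out of a shared pool into a single
// contiguous scratch buffer, in block-table order.
// Assumed pool layout: [n_blocks][block_size][d], row-major floats.
std::vector<float> gather_to_scratch(const std::vector<float>& pool,
                                     const std::vector<int>& block_table,
                                     std::size_t block_size,
                                     std::size_t d) {
    const std::size_t block_elems = block_size * d;
    std::vector<float> scratch;
    scratch.reserve(block_table.size() * block_elems);
    for (int b : block_table) {
        assert(b >= 0);
        const std::size_t begin = static_cast<std::size_t>(b) * block_elems;
        assert(begin + block_elems <= pool.size());
        const auto first = pool.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last  = first + static_cast<std::ptrdiff_t>(block_elems);
        // Blocks may sit anywhere in the pool; the scratch buffer is contiguous.
        scratch.insert(scratch.end(), first, last);
    }
    return scratch;
}
```

With a block table of {2, 1, 5}, the scratch buffer holds block 2's rows first, then block 1's, then block 5's, so downstream attention code sees an ordinary contiguous K or V tensor.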

Path under test: set_rows write -> get_rows gather (K and V) ->
mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared
against an independent host-computed softmax attention over the same K/V/Q.
Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4).
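
The host-side comparison described above can be sketched roughly as follows. This is an illustrative reference, not the code in test_ggml_paged_attn.cpp: the names reference_attention and max_abs_diff, the row-major [n][d] layouts, and the explicit scale parameter are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical reference: out[q] = sum_t softmax_t(scale * dot(Q[q], K[t])) * V[t].
// Q is [n_q][d], K and V are [n_kv][d], all row-major. Accumulates in double.
std::vector<float> reference_attention(const std::vector<float>& Q,
                                       const std::vector<float>& K,
                                       const std::vector<float>& V,
                                       std::size_t n_q, std::size_t n_kv,
                                       std::size_t d, float scale) {
    assert(n_kv > 0);
    assert(Q.size() == n_q * d && K.size() == n_kv * d && V.size() == n_kv * d);
    std::vector<float> out(n_q * d, 0.0f);
    std::vector<double> w(n_kv);
    for (std::size_t q = 0; q < n_q; ++q) {
        double max_score = -std::numeric_limits<double>::infinity();
        for (std::size_t t = 0; t < n_kv; ++t) {
            double dot = 0.0;
            for (std::size_t j = 0; j < d; ++j) {
                dot += static_cast<double>(Q[q * d + j]) * K[t * d + j];
            }
            w[t] = scale * dot;
            max_score = std::max(max_score, w[t]);
        }
        // Subtract the row maximum before exponentiating for numerical stability.
        double sum = 0.0;
        for (std::size_t t = 0; t < n_kv; ++t) {
            w[t] = std::exp(w[t] - max_score);
            sum += w[t];
        }
        for (std::size_t j = 0; j < d; ++j) {
            double acc = 0.0;
            for (std::size_t t = 0; t < n_kv; ++t) {
                acc += w[t] * V[t * d + j];
            }
            out[q * d + j] = static_cast<float>(acc / sum);
        }
    }
    return out;
}

// Largest element-wise absolute difference between two equally sized buffers.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        m = std::max(m, std::fabs(a[i] - b[i]));
    }
    return m;
}
```

A test in this style would pass the gathered K and V to reference_attention, read the graph's output back to the host, and check that max_abs_diff stays below a small tolerance.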

This shows the paged read path is numerically sound on CPU without any new
ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate
Gate 0 (token-identical greedy generation in a real model).

Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-19 08:35:35 +00:00

test_block_pool.cpp

feat(paged): vLLM-parity KV block manager (Phase 0, CPU-first prototype)

2026-06-19 08:26:31 +00:00

test_free_block_queue.cpp

feat(paged): vLLM-parity KV block manager (Phase 0, CPU-first prototype)

2026-06-19 08:26:31 +00:00

test_ggml_paged_attn.cpp

feat(paged): Phase 2 core - attention over paged KV matches reference

2026-06-19 08:35:35 +00:00

test_ggml_paged_rw.cpp

feat(paged): Phase 1 - ggml paged write/gather mechanism (CPU)