LocalAI

mirror/LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Commit Graph

Author

SHA1

Message

Date

Ettore Di Giacinto

edb1a11abc

feat(paged): vLLM-parity KV block manager (Phase 0, CPU-first prototype)

Host-side paged-attention block manager ported faithfully from vLLM V1
(block_pool.py, kv_cache_utils.py, single_type_kv_cache_manager.py):

- KVCacheBlock + intrusive LRU FreeBlockQueue (O(1) middle removal)
- BlockPool: get_new_blocks / touch / free_blocks eviction ordering /
  cache_full_blocks / lazy eviction on reuse
- PagedKVManager: on-demand allocate, block_table, slot arithmetic
  (slot = block_id*block_size + offset), free
- Prefix caching: chained block hashing + find_longest_cache_hit
  (first-miss stop), enabling automatic cross-tenant prefix sharing
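
The slot arithmetic and the prefix-caching lookup listed above can be sketched as follows. This is a minimal illustration, not the code from this commit: the function signatures, the `CachedBlocks` map type, and the FNV-1a-style mixing function are all assumptions chosen for the example (the message itself notes that parity is on behavior, not hash bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical helpers for illustration; names and signatures are assumed.

// Flat slot index of the token at `offset` inside block `block_id`:
// slot = block_id * block_size + offset.
inline std::int64_t slot_index(std::int64_t block_id, std::int64_t block_size,
                               std::int64_t offset) {
    assert(offset >= 0 && offset < block_size);
    return block_id * block_size + offset;
}

// Mix the parent block's hash with one block of token ids.
// FNV-1a-style mixing per token, used here only as a stand-in hash.
inline std::uint64_t hash_block(std::uint64_t parent, const std::int32_t* tokens,
                                std::size_t n) {
    std::uint64_t h = 0xcbf29ce484222325ULL ^ parent;
    for (std::size_t i = 0; i < n; ++i) {
        h ^= static_cast<std::uint32_t>(tokens[i]);
        h *= 0x100000001b3ULL;
    }
    return h;
}

// One hash per *full* block. Each hash chains in the previous block's hash,
// so a block's hash identifies the whole prefix up to and including it.
inline std::vector<std::uint64_t> chained_block_hashes(
    const std::vector<std::int32_t>& tokens, std::size_t block_size) {
    assert(block_size > 0);
    std::vector<std::uint64_t> hashes;
    std::uint64_t parent = 0;
    for (std::size_t start = 0; start + block_size <= tokens.size();
         start += block_size) {
        parent = hash_block(parent, tokens.data() + start, block_size);
        hashes.push_back(parent);
    }
    return hashes;
}

// Assumed cache shape: block hash -> id of the cached block.
using CachedBlocks = std::unordered_map<std::uint64_t, int>;

// Walk the hashes in order and stop at the first miss; the result is the
// longest cached prefix, expressed as block ids.
inline std::vector<int> find_longest_cache_hit(
    const std::vector<std::uint64_t>& hashes, const CachedBlocks& cache) {
    std::vector<int> hit;
    for (std::uint64_t h : hashes) {
        auto it = cache.find(h);
        if (it == cache.end()) break;  // first-miss stop
        hit.push_back(it->second);
    }
    return hit;
}
```

Because every hash folds in its parent, two requests only share a block when all earlier blocks match as well, which is why the lookup can stop at the first miss without checking later blocks.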

Pure C++17, zero ggml/llama.cpp dependency, unit-tested to vLLM behavioral
parity (4/4 suites green). Parity is on algorithm/behavior, not hash bytes.
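
As an illustration of the intrusive FreeBlockQueue named in the list above, here is a dependency-free C++17 sketch. The field names, the sentinel head/tail nodes, and the method names are assumptions made for this example and may differ from the committed code. The point it shows is that storing the links inside each block lets a block be unlinked from the middle of the free list in O(1), for example when a cached block is reused before it is evicted.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout: the free-list links live inside the block itself
// (intrusive list), so no separate node allocation or search is needed.
struct KVCacheBlock {
    int block_id = -1;
    int ref_cnt = 0;
    KVCacheBlock* prev_free = nullptr;
    KVCacheBlock* next_free = nullptr;
};

class FreeBlockQueue {
public:
    FreeBlockQueue() {
        head_.next_free = &tail_;
        tail_.prev_free = &head_;
    }
    // The sentinels are members, so copying would leave dangling links.
    FreeBlockQueue(const FreeBlockQueue&) = delete;
    FreeBlockQueue& operator=(const FreeBlockQueue&) = delete;

    // Append at the back: the most recently freed block is evicted last.
    void append(KVCacheBlock* b) {
        KVCacheBlock* last = tail_.prev_free;
        last->next_free = b;
        b->prev_free = last;
        b->next_free = &tail_;
        tail_.prev_free = b;
        ++size_;
    }

    // Unlink a block from anywhere in the queue in O(1).
    void remove(KVCacheBlock* b) {
        assert(b->prev_free != nullptr && b->next_free != nullptr);
        b->prev_free->next_free = b->next_free;
        b->next_free->prev_free = b->prev_free;
        b->prev_free = nullptr;
        b->next_free = nullptr;
        --size_;
    }

    // Pop the front block (the eviction candidate), or nullptr if empty.
    KVCacheBlock* popleft() {
        if (size_ == 0) return nullptr;
        KVCacheBlock* b = head_.next_free;
        remove(b);
        return b;
    }

    std::size_t size() const { return size_; }

private:
    KVCacheBlock head_;  // sentinel before the first free block
    KVCacheBlock tail_;  // sentinel after the last free block
    std::size_t size_ = 0;
};
```

The sentinels remove the empty-list and end-of-list special cases from `append` and `remove`; that is a design choice of this sketch rather than something the commit message states.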

Phase 0 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Phases 1-5 (ggml storage, gather-to-scratch read path, Gate 0 correctness,
benchmark wins, prefix-share serving) follow.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-19 08:26:31 +00:00
