Commit Graph

4 Commits

Author SHA1 Message Date
Ettore Di Giacinto
ddace5fb6a feat(paged): paged-bench - measure capacity & prefix-sharing wins
Quantify the two multi-tenant wins that are properties of the host-side
block model (vLLM-parity), independent of the in-model compute path:

  WIN 1 concurrency capacity @ 512-block budget
    contiguous (reserve n_ctx/seq): 4 sequences
    paged (on-demand blocks):       37 sequences
    --> 9.2x more concurrent sequences

  WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix)
    prefix-cache OFF: 2176 physical blocks
    prefix-cache ON:  192 physical blocks
    --> 11.3x less KV memory

WIN 2 (throughput) is deliberately reported as PENDING: it requires the
paged gather-read path wired into llama-graph.cpp (Gate 0) and is not
measurable at the allocation layer. The win-1 baseline is per-sequence
n_ctx reservation (stream mode); llama.cpp's unified cache already shares
one pool, so the honest win there is on-demand sizing + prefix dedup.

Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:44:41 +00:00
Ettore Di Giacinto
5a5d3df8c8 feat(paged): Phase 2 core - attention over paged KV matches reference
Retire the central numeric risk from the design: feeding gather-to-scratch
KV (a sequence whose blocks are non-contiguous in the shared pool, [2,1,5])
into ggml's standard attention ops produces correct attention.

Path under test: set_rows write -> get_rows gather (K and V) ->
mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared
against an independent host-computed softmax attention over the same K/V/Q.
Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4).

This proves the paged read path is numerically sound on CPU with no new
ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate
Gate 0 (token-identical greedy generation in a real model).

Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:35:35 +00:00
Ettore Di Giacinto
c6698dd4bf feat(paged): Phase 1 - ggml paged write/gather mechanism (CPU)
Validate the paged KV read/write path at the ggml-op level, driven by
PagedKVManager:

- write: ggml_set_rows(pool, k_src, slot_mapping)  scatter K rows by slot
- read:  ggml_get_rows(pool, gather_idx)           gather a seq's slots into
         contiguous scratch (the tensor an attention kernel consumes)

The test forces a non-contiguous, out-of-order physical block layout
(allocate seqA+seqB, free seqA, reallocate seqC -> blocks [2,1,5]) and
proves gather(write(x)) == x plus cross-sequence isolation in the shared
pool. This de-risks the central question (does slot-addressed paged storage
round-trip correctly through ggml) before the llama-graph integration.

Pool is statically allocated via ggml_backend_alloc_ctx_tensors, mirroring
how llama.cpp allocates its KV cache. CPU backend, no new ggml op.
Built against ggml from the vendored llama.cpp checkout.

Phase 1 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:33:26 +00:00
Ettore Di Giacinto
edb1a11abc feat(paged): vLLM-parity KV block manager (Phase 0, CPU-first prototype)
Host-side paged-attention block manager ported faithfully from vLLM V1
(block_pool.py, kv_cache_utils.py, single_type_kv_cache_manager.py):

- KVCacheBlock + intrusive LRU FreeBlockQueue (O(1) middle removal)
- BlockPool: get_new_blocks / touch / free_blocks eviction ordering /
  cache_full_blocks / lazy eviction on reuse
- PagedKVManager: on-demand allocate, block_table, slot arithmetic
  (slot = block_id*block_size + offset), free
- Prefix caching: chained block hashing + find_longest_cache_hit
  (first-miss stop), enabling automatic cross-tenant prefix sharing

Pure C++17, zero ggml/llama.cpp dependency, unit-tested to vLLM behavioral
parity (4/4 suites green). Parity is on algorithm/behavior, not hash bytes.

Phase 0 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Phases 1-5 (ggml storage, gather-to-scratch read path, Gate 0 correctness,
benchmark wins, prefix-share serving) follow.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:26:31 +00:00