feat(paged): paged-bench - measure capacity & prefix-sharing wins · ddace5fb6a - LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

feat(paged): paged-bench - measure capacity & prefix-sharing wins

Quantify the two multi-tenant wins that are properties of the host-side
block model (vLLM-parity), independent of the in-model compute path:

  WIN 1 concurrency capacity @ 512-block budget
    contiguous (reserve n_ctx/seq): 4 sequences
    paged (on-demand blocks):       37 sequences
    --> 9.2x more concurrent sequences

  WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix)
    prefix-cache OFF: 2176 physical blocks
    prefix-cache ON:  192 physical blocks
    --> 11.3x less KV memory
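
The WIN 3 block counts can be reproduced with plain block arithmetic. The sketch below is illustrative and not taken from paged-bench: the helper names are hypothetical, and the 16-token block size and 64-token private suffix per tenant are assumptions, chosen only because they reproduce the reported 2176 and 192. It also assumes that only full prefix blocks can be shared.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; not the paged-bench implementation.

// Blocks needed to hold n_tokens (ceiling division).
constexpr std::uint64_t blocks_for(std::uint64_t n_tokens, std::uint64_t block_size) {
    return (n_tokens + block_size - 1) / block_size;
}

// Prefix cache OFF: every tenant stores its own copy of prefix + suffix.
constexpr std::uint64_t blocks_without_sharing(std::uint64_t tenants,
                                               std::uint64_t prefix_tokens,
                                               std::uint64_t suffix_tokens,
                                               std::uint64_t block_size) {
    return tenants * blocks_for(prefix_tokens + suffix_tokens, block_size);
}

// Prefix cache ON: full prefix blocks are stored once; the rest stays private.
constexpr std::uint64_t blocks_with_sharing(std::uint64_t tenants,
                                            std::uint64_t prefix_tokens,
                                            std::uint64_t suffix_tokens,
                                            std::uint64_t block_size) {
    return tenants == 0
        ? 0
        : (prefix_tokens / block_size) +
          tenants * (blocks_for(prefix_tokens + suffix_tokens, block_size) -
                     prefix_tokens / block_size);
}

// Ratio of physical blocks without sharing to blocks with sharing.
inline double reduction_factor(std::uint64_t without, std::uint64_t with) {
    assert(with > 0 && "shared block count must be non-zero");
    return static_cast<double>(without) / static_cast<double>(with);
}

// Assumed parameters: 32 tenants, 1024-token prefix (from the commit),
// 64-token suffix and 16-token blocks (assumptions, see lead-in).
static_assert(blocks_without_sharing(32, 1024, 64, 16) == 2176, "prefix-cache OFF");
static_assert(blocks_with_sharing(32, 1024, 64, 16) == 192, "prefix-cache ON");
```

Under these assumptions each tenant needs 68 blocks on its own, of which 64 hold the shared prefix, so sharing leaves 64 + 32 * 4 = 192 blocks.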

WIN 2 (throughput) is deliberately reported as PENDING: it requires the
paged gather-read path to be wired into llama-graph.cpp (Gate 0) and
cannot be measured at the allocation layer. The WIN 1 baseline is
per-sequence n_ctx reservation (stream mode). llama.cpp's unified cache
already shares one pool, so against that cache the honest win is
on-demand sizing plus prefix dedup.
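
The contiguous baseline for WIN 1 follows from dividing the block budget by the per-sequence reservation. This is a minimal sketch with hypothetical helper names, not code from paged-bench: the 16-token block size and n_ctx = 2048 are assumptions, chosen only because 512 / (2048 / 16) gives the reported 4 sequences. The commit does not state the sequence lengths behind the paged figure, so the paged helper takes them as input and makes no attempt to reproduce 37.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration; not the paged-bench implementation.

// Blocks needed to hold n_tokens (ceiling division).
constexpr std::uint64_t blocks_for_tokens(std::uint64_t n_tokens, std::uint64_t block_size) {
    return (n_tokens + block_size - 1) / block_size;
}

// Contiguous: every sequence reserves blocks for the full n_ctx up front.
// n_ctx and block_size must be non-zero.
constexpr std::uint64_t contiguous_capacity(std::uint64_t budget_blocks,
                                            std::uint64_t n_ctx,
                                            std::uint64_t block_size) {
    return budget_blocks / blocks_for_tokens(n_ctx, block_size);
}

// Paged: a sequence only takes blocks for the tokens it currently holds.
// Sequences are admitted in order until the next one no longer fits.
inline std::size_t paged_capacity(std::uint64_t budget_blocks,
                                  const std::vector<std::uint64_t>& seq_tokens,
                                  std::uint64_t block_size) {
    std::uint64_t used = 0;
    std::size_t admitted = 0;
    for (std::uint64_t n : seq_tokens) {
        const std::uint64_t need = blocks_for_tokens(n, block_size);
        if (used + need > budget_blocks) {
            break;  // budget exhausted
        }
        used += need;
        ++admitted;
    }
    return admitted;
}

// 512-block budget from the commit; n_ctx and block size are assumptions.
static_assert(contiguous_capacity(512, 2048, 16) == 4, "contiguous baseline");
```

Since 37 sequences fit in the 512-block budget, the benchmark's sequences must average fewer than 14 blocks each (512 / 37 is about 13.8), far below a full n_ctx reservation.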

Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Ettore Di Giacinto

2026-06-19 08:44:41 +00:00

parent 5a5d3df8c8

commit ddace5fb6a

3 changed files with 137 additions and 1 deletions

backend/cpp/llama-cpp/paged/.gitignore (vendored, 1 line changed)

@@ -4,3 +4,4 @@ tests/test_paged_kv_manager
 tests/test_prefix_cache
 tests/test_ggml_paged_rw
 tests/test_ggml_paged_attn
+paged-bench
