Files
LocalAI/backend/cpp/llama-cpp/paged/Makefile
Ettore Di Giacinto ddace5fb6a feat(paged): paged-bench - measure capacity & prefix-sharing wins
Quantify the two multi-tenant wins that are properties of the host-side
block model (vLLM-parity), independent of the in-model compute path:

  WIN 1 concurrency capacity @ 512-block budget
    contiguous (reserve n_ctx/seq): 4 sequences
    paged (on-demand blocks):       37 sequences
    --> 9.2x more concurrent sequences

  WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix)
    prefix-cache OFF: 2176 physical blocks
    prefix-cache ON:  192 physical blocks
    --> 11.3x less KV memory

WIN 2 (throughput) is deliberately reported as PENDING: it requires the
paged gather-read path wired into llama-graph.cpp (Gate 0) and is not
measurable at the allocation layer. The win-1 baseline is per-sequence
n_ctx reservation (stream mode); llama.cpp's unified cache already shares
one pool, so the honest win there is on-demand sizing + prefix dedup.

Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:44:41 +00:00

42 lines
1.4 KiB
Makefile

CXX ?= g++
CXXFLAGS ?= -std=c++17 -O2 -Wall -Wextra -I.
TESTS = test_free_block_queue test_block_pool test_paged_kv_manager test_prefix_cache
BINS = $(addprefix tests/,$(TESTS))
all: $(BINS)
tests/%: tests/%.cpp paged_kv_manager.cpp paged_kv_manager.h
$(CXX) $(CXXFLAGS) -o $@ $< paged_kv_manager.cpp
check: all
@for t in $(BINS); do echo "== $$t =="; ./$$t || exit 1; done
paged-bench: paged-bench.cpp paged_kv_manager.cpp paged_kv_manager.h
$(CXX) $(CXXFLAGS) -o $@ paged-bench.cpp paged_kv_manager.cpp
bench: paged-bench
./paged-bench
# --- Optional ggml integration test (Phase 1: paged write/gather mechanism) ---
# Requires a built ggml. Override these to point at your checkout / build:
# make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build>
GGML_SRC ?= ../../llama-cpp-fallback-build/llama.cpp/ggml
GGML_BUILD ?= /tmp/ggml-build
GGML_LIBDIR = $(GGML_BUILD)/src
GGML_TESTS = test_ggml_paged_rw test_ggml_paged_attn
GGML_BINS = $(addprefix tests/,$(GGML_TESTS))
tests/test_ggml_%: tests/test_ggml_%.cpp paged_kv_manager.cpp paged_kv_manager.h
$(CXX) $(CXXFLAGS) -I$(GGML_SRC)/include -o $@ $< paged_kv_manager.cpp \
-L$(GGML_LIBDIR) -lggml -lggml-base -lggml-cpu -Wl,-rpath,$(GGML_LIBDIR)
ggml-check: $(GGML_BINS)
@for t in $(GGML_BINS); do echo "== $$t =="; ./$$t || exit 1; done
clean:
rm -f $(BINS) $(GGML_BINS) paged-bench
.PHONY: all check ggml-check clean