LocalAI

mirror/LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-24 08:38:51 -04:00

Commit Graph

Author

SHA1

Message

Date

Ettore Di Giacinto

931793aa24

feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection

Deliverables for pushing paged KV toward the real target (2xH200), since GB10 is
only the test box and its "no win" result is a low-bandwidth artifact:

1. Correctness verified. test-paged-kv-e2e is greedy-equivalent to the contiguous
   reference (top-5 argmax ref=paged=3743, overlap 5/5). Found + fixed the blocking
   bug: common_fit_paged_kv_blocks over-reports free VRAM on GB10's unified device
   and tried 245GB of KV on a 119GB box, OOM-aborting context creation. Patch in
   patches/0002; durable fix (clamp to free_vram, honor --fit off) noted.

2. paged-loadgen.cpp: a dynamic-load benchmark that actually exercises where paging
   wins - variable prompt/gen lengths, continuous arrival, shared prefix - and
   reports the capacity ratio (contiguous reserve / paged peak KV). The stock tools
   run fixed-length all-at-once load, which is why they never show a paged win.

3. Projection to 2xH200, grounded in measured GB10 plateaus. Decode is bandwidth-
   bound, so the ceiling (~16k t/s for 32B) needs ~3,800 concurrent seqs, but
   contiguous KV fits only ~490 in HBM at 2k ctx - so KV memory IS the binding
   constraint on the target (unlike GB10), and paged KV's ~5-10x capacity (no
   over-reservation + prefix sharing) is what reaches the ceiling. The thesis holds
   on the target; remaining work is hardening/finishing the paged op (PR22569 was
   12-13% slower and lacks prefix sharing).
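
The capacity ratio from item 2 and the arithmetic behind item 3 can be sketched as follows. This is a minimal illustration, not the code in paged-loadgen.cpp: the function names, the block size of 16, and the sample sequence lengths are all assumptions, and the ratio is taken at a single snapshot rather than at the paged peak over a run.

```python
import math


def contiguous_reserve_cells(num_seqs, max_ctx):
    """Contiguous KV reserves the full context window for every sequence."""
    return num_seqs * max_ctx


def paged_cells(seq_lens, block_size, shared_prefix=0):
    """Cells held by paged KV at one instant (hypothetical accounting).

    Each sequence holds whole blocks for its actual length, and blocks fully
    covered by a common prefix are counted once instead of once per sequence.
    """
    if not seq_lens:
        return 0
    assert all(n >= shared_prefix for n in seq_lens)
    shared_blocks = shared_prefix // block_size  # only whole blocks are shared
    total_blocks = shared_blocks
    for n in seq_lens:
        total_blocks += math.ceil(n / block_size) - shared_blocks
    return total_blocks * block_size


def capacity_ratio(seq_lens, max_ctx, block_size, shared_prefix=0):
    """Contiguous reserve divided by paged usage for the same set of sequences."""
    reserve = contiguous_reserve_cells(len(seq_lens), max_ctx)
    return reserve / paged_cells(seq_lens, block_size, shared_prefix)


def required_capacity_multiplier(needed_seqs, fits_contiguous):
    """How many times more sequences must fit to reach the bandwidth ceiling."""
    return needed_seqs / fits_contiguous


# Illustrative mixed-length load at 2k ctx (lengths are made up).
lens = [300, 520, 1100, 2048]
print(capacity_ratio(lens, max_ctx=2048, block_size=16))
print(capacity_ratio(lens, max_ctx=2048, block_size=16, shared_prefix=256))

# Figures from item 3: ~3,800 seqs needed, ~490 fit contiguously.
print(required_capacity_multiplier(3800, 490))
```

With the figures quoted in item 3, the required multiplier comes out at roughly 7.8x, which sits inside the ~5-10x capacity range claimed for paged KV.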

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 23:16:28 +00:00

Ettore Di Giacinto

bbc84a9889

feat(paged): Gate 0 in-model - token-identical generation with paged KV placement

Wire paged, non-contiguous fixed-size BLOCK placement into the real
llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate
Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to
the contiguous cache while its KV is physically scattered across permuted
blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via
LLAMA_KV_PAGED_DEBUG, not a silent fallback.
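
As a rough illustration of the placement described above, a per-sequence block table can map a logical token position to a physical cell. This is a sketch under assumptions, not the patched find_slot: the block size of 16 is inferred from "cells 0-15", the table [0, 9, 2] is read off the quoted cell ranges, and physical_cell is a hypothetical helper.

```python
BLOCK_SIZE = 16  # assumption: inferred from the "cells 0-15" range in the message


def physical_cell(block_table, pos, block_size=BLOCK_SIZE):
    """Map a logical token position to a physical KV cell index.

    block_table[i] is the physical block that holds logical block i.
    """
    logical_block, offset = divmod(pos, block_size)
    return block_table[logical_block] * block_size + offset


# Physical blocks 0, 9, 2 correspond to cells 0-15, 144-159, 32-47.
block_table = [0, 9, 2]
cells = [physical_cell(block_table, pos) for pos in range(3 * BLOCK_SIZE)]

print(cells[0], cells[15])   # first block
print(cells[16], cells[31])  # second block lands far from the first
print(cells[32], cells[47])  # third block
```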

This retires the correctness premise of paged attention IN THE MODEL (not
just at the ggml-op level): attention is invariant to physical KV placement,
because reads use per-cell pos/seq metadata for masking. The patch lives at
patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f).
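
The invariance argument can be checked on a toy single-head attention in pure Python. This is an illustrative sketch only, not llama.cpp code: the cell layout, sizes, and the attend helper are assumptions, and seq metadata is omitted because there is a single sequence. Outputs are compared with a tolerance because floating-point summation order can differ between layouts, whereas the commit reports token-identical generation on the real model.

```python
import math
import random


def attend(query, q_pos, cells):
    """Causal single-head attention over KV cells in physical order.

    Each cell is (pos, key, value), or None when empty. Masking uses the
    stored pos, never the physical index of the cell.
    """
    scores, values = [], []
    for cell in cells:
        if cell is None:
            continue
        pos, key, value = cell
        if pos > q_pos:  # causal mask driven by per-cell metadata
            continue
        dot = sum(q * k for q, k in zip(query, key))
        scores.append(dot / math.sqrt(len(query)))
        values.append(value)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(dim)]


rng = random.Random(0)
DIM, N, BLOCK = 4, 48, 16
kv = [([rng.uniform(-1, 1) for _ in range(DIM)],
       [rng.uniform(-1, 1) for _ in range(DIM)]) for _ in range(N)]
query = [rng.uniform(-1, 1) for _ in range(DIM)]

# Contiguous layout: the cell index equals the token position.
contiguous = [(pos, k, v) for pos, (k, v) in enumerate(kv)]

# Paged layout: the same KV scattered across physical blocks 0, 9, 2.
block_table = [0, 9, 2]
paged = [None] * 160
for pos, (k, v) in enumerate(kv):
    logical_block, offset = divmod(pos, BLOCK)
    paged[block_table[logical_block] * BLOCK + offset] = (pos, k, v)

out_contiguous = attend(query, N - 1, contiguous)
out_paged = attend(query, N - 1, paged)
print(all(math.isclose(a, b, abs_tol=1e-9)
          for a, b in zip(out_contiguous, out_paged)))
```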

Scope: storage/placement layer, single sequence. Remaining (P4): the
gather-read compute path (attend only a seq's own blocks) for the throughput
win, and the multi-sequence driver. README updated with repro + status.
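
One possible shape for the remaining gather-read step is sketched below: instead of scanning every cell in the cache and masking, the reader visits only the blocks listed in a sequence's block table. This is a hypothetical illustration of the idea, not the planned P4 implementation; gather_cells, the block size of 16, and the cache layout are assumptions.

```python
def gather_cells(cache, block_table, seq_len, block_size=16):
    """Collect one sequence's KV entries in logical order from its own blocks.

    cache is a flat list of physical cells, and block_table[i] is the physical
    block holding logical block i. Only seq_len entries are returned, so a
    partially filled last block is trimmed.
    """
    gathered = []
    for logical_block, physical_block in enumerate(block_table):
        start = logical_block * block_size
        if start >= seq_len:
            break
        count = min(block_size, seq_len - start)
        base = physical_block * block_size
        gathered.extend(cache[base:base + count])
    return gathered


# Toy cache: each occupied cell stores its logical position, others stay None.
block_table = [0, 9, 2]
cache = [None] * 160
for pos in range(40):
    logical_block, offset = divmod(pos, 16)
    cache[block_table[logical_block] * 16 + offset] = pos

gathered = gather_cells(cache, block_table, seq_len=40)
print(len(gathered), gathered[16], gathered[39])
```

In this toy layout the reader touches 40 cells rather than all 160, which is the kind of saving the gather path is meant to deliver.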

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-19 08:51:42 +00:00

2 Commits