LocalAI

mirror/LocalAI

Fork 0

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Commit Graph

Author	SHA1	Message	Date
Ettore Di Giacinto	931793aa24	feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection Deliverables for pushing paged KV toward the real target (2xH200), since GB10 is only the test box and its "no win" result is a low-bandwidth artifact: 1. Correctness verified. test-paged-kv-e2e is greedy-equivalent to the contiguous reference (top-5 argmax ref=paged=3743, overlap 5/5). Found + fixed the blocking bug: common_fit_paged_kv_blocks over-reports free VRAM on GB10's unified device and tried 245GB of KV on a 119GB box, OOM-aborting context creation. Patch in patches/0002; durable fix (clamp to free_vram, honor --fit off) noted. 2. paged-loadgen.cpp: a dynamic-load benchmark that actually exercises where paging wins - variable prompt/gen lengths, continuous arrival, shared prefix - and reports the capacity ratio (contiguous reserve / paged peak KV). The stock tools run fixed-length all-at-once load, which is why they never show a paged win. 3. Projection to 2xH200, grounded in measured GB10 plateaus. Decode is bandwidth- bound, so the ceiling (~16k t/s for 32B) needs ~3,800 concurrent seqs, but contiguous KV fits only ~490 in HBM at 2k ctx - so KV memory IS the binding constraint on the target (unlike GB10), and paged KV's ~5-10x capacity (no over-reservation + prefix sharing) is what reaches the ceiling. The thesis holds on the target; remaining work is hardening/finishing the paged op (PR22569 was 12-13% slower and lacks prefix sharing). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 23:16:28 +00:00

Author

SHA1

Message

Date

Ettore Di Giacinto

931793aa24

feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection

Deliverables for pushing paged KV toward the real target (2xH200), since GB10 is
only the test box and its "no win" result is a low-bandwidth artifact:

1. Correctness verified. test-paged-kv-e2e is greedy-equivalent to the contiguous
   reference (top-5 argmax ref=paged=3743, overlap 5/5). Found + fixed the blocking
   bug: common_fit_paged_kv_blocks over-reports free VRAM on GB10's unified device
   and tried 245GB of KV on a 119GB box, OOM-aborting context creation. Patch in
   patches/0002; durable fix (clamp to free_vram, honor --fit off) noted.

2. paged-loadgen.cpp: a dynamic-load benchmark that actually exercises where paging
   wins - variable prompt/gen lengths, continuous arrival, shared prefix - and
   reports the capacity ratio (contiguous reserve / paged peak KV). The stock tools
   run fixed-length all-at-once load, which is why they never show a paged win.

3. Projection to 2xH200, grounded in measured GB10 plateaus. Decode is bandwidth-
   bound, so the ceiling (~16k t/s for 32B) needs ~3,800 concurrent seqs, but
   contiguous KV fits only ~490 in HBM at 2k ctx - so KV memory IS the binding
   constraint on the target (unlike GB10), and paged KV's ~5-10x capacity (no
   over-reservation + prefix sharing) is what reaches the ceiling. The thesis holds
   on the target; remaining work is hardening/finishing the paged op (PR22569 was
   12-13% slower and lacks prefix sharing).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 23:16:28 +00:00

1 Commits