Ettore Di Giacinto
|
76cc0b6abc
|
docs(paged): phased plan to make llama.cpp a viable vLLM alternative
Phase 1 (config, PR #10411, DONE): VRAM-scaled n_parallel + Blackwell batch.
Phase 2: paged KV (PR #22569, ~9.5x concurrency). Phase 3: chunked prefill +
n_batch/ubatch split. Phase 4: batched-GEMM kernel tuning. Phase 5: backend
sampling. Cross-cutting: spec-dec for dense.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
2026-06-20 09:35:53 +00:00 |
|