LocalAI/backend/cpp/llama-cpp/patches/paged at a3abd60ae06732f4ff583ace06f8ec2b062fc1f1

mirror/LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-24 00:28:55 -04:00

Ettore Di Giacinto a3abd60ae0 docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM)

Same-day steady-state aggregate-decode sweep at npl 8/32/64/128 for three
model classes, replacing the stale carried-over figure of ~75-80% of vLLM
with a full concurrency curve.

Findings:
- Dense 32B (NVFP4 vs NVFP4A16): parity at batch-8 (97% of vLLM), 72-86%
  at mid/high concurrency.
- Small 0.6B: parity at batch-8 (99%), 49-67% at high concurrency
  (llama plateaus at ~2.0k, vLLM scales to 4.2k; runtime/scheduler-bound).
- MoE 30B-A3B: llama-only at 290-1041 tok/s. vLLM cannot serve it on GB10
  (bf16 hangs at MoE warmup and reboots the box, twice; mxfp4 GGUF expert
  tensors are unmappable by vLLM 0.23.0).

Batch-8 anomaly resolved: clean isolated dense batch-8 decode is ~88-90
tok/s (~89 ms/step) for both paged and stock (within 2%, paged slightly
faster) and for ctx 65536 and 163840 (within 1%). The prior 471 ms/step
was a mixed-load decode/prefill contention artifact, not paged overhead,
ctx allocation, or NVFP4 cost. This is the case that patch 0013
(LLAMA_PREFILL_BUDGET) bounds.
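
The tok/s and ms/step figures for batch-8 can be cross-checked with a short
sketch. It assumes that each decode step emits exactly one token per active
sequence, so aggregate throughput is batch size divided by step time; the
helper name is hypothetical and is not part of the patch set. The value
computed for 471 ms/step is derived here, not a measured result.

```python
def aggregate_tok_per_s(batch_size: int, step_ms: float) -> float:
    """Aggregate decode throughput, assuming one token per sequence per step."""
    return batch_size * 1000.0 / step_ms


# Step times quoted in the commit message for dense batch-8 decode.
clean = aggregate_tok_per_s(8, 89.0)   # clean isolated run
mixed = aggregate_tok_per_s(8, 471.0)  # earlier mixed-load run (derived only)

print(f"clean: {clean:.1f} tok/s at 89 ms/step")
print(f"mixed: {mixed:.1f} tok/s at 471 ms/step")
```

Under that assumption, 89 ms/step lands inside the quoted ~88-90 tok/s
range, which is what makes the two figures mutually consistent.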

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-23 12:22:15 +00:00

0001-vendor-paged-kv-manager.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-gather-read-plan.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-paged-gather-read-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch

feat(llama-cpp/paged): cross-request prefix caching patch 0006

2026-06-22 10:14:27 +00:00

0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch

feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)

2026-06-22 10:47:10 +00:00

0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch

feat(paged): wire cross-request prefix share into llama-server (patch 0008)

2026-06-22 15:03:16 +00:00

0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch

paged: in-kernel decode read patch 0009 (kill the gather regression)

2026-06-22 18:04:09 +00:00

0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch

feat(paged): tile in-kernel decode read + dispatch guard (patch 0010)

2026-06-22 20:37:12 +00:00

0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch

feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011)

2026-06-22 22:38:28 +00:00

0012-paged-mask-pad-invariant-assert.patch

feat(paged): assert mask-pad invariant for the paged tile route (patch 0012)

2026-06-23 09:13:08 +00:00

0013-paged-decoupled-prefill-token-budget.patch

feat(paged): add patch 0013 decoupled per-step prefill-token budget

2026-06-23 09:55:32 +00:00

ADDITIVE_DESIGN.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

DECODE_GAP_STUDY.md

docs(paged): decode-step gap study vs vLLM on GB10

2026-06-22 15:44:24 +00:00

PAGED_BENCH.md

docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results

2026-06-22 12:59:09 +00:00

PAGED_GPU_VERIFY.md

docs(paged): record GPU correctness + CUDA backend-build verification

2026-06-22 11:50:01 +00:00

PAGED_VLLM_APPLES.md

docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache)

2026-06-22 14:16:52 +00:00

PAGED_VLLM_COMPARE.md

docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison

2026-06-22 13:48:01 +00:00

SERVER_SWEEP.md

docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM)

2026-06-23 12:22:15 +00:00