LocalAI/backend/cpp/llama-cpp/patches/paged at 2dd5d68e6de4e1613dc95c4e0f0c5e5828e8c961

mirror/LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-25 09:09:07 -04:00

Ettore Di Giacinto 2dd5d68e6d docs(paged): A.2 Phase 2 - locate the real decode lever (gated-DeltaNet SSM path)

Phase 1 ruled out CUDA graphs as the paged-decode lever (GPU 99.4% busy,
decode_agg flat with graphs on vs. off) and attributed the 2.6x gap to vLLM to
the per-step GPU kernel work (FP4 GEMM + attention at batch 128). Phase 2
decomposes that kernel work directly on the Phase-1 nsys reports and corrects
the attribution.

Findings (q36-27b-nvfp4 = gguf arch qwen35, a 48:16 hybrid gated-DeltaNet
linear-attention + full-attention model; DGX GB10 sm_121, fusion off):
- Graphs re-confirmed not the lever: fresh paged graphs-ON 146.03 vs OFF 144.90
  t/s (+0.78%, noise); the captured rep is 99.5% busy with the same ~3267ms
  memcpy (graphs capture memcpy nodes too).
- The 99.4% busy is real but ~19% of it is D2D memcpy, not compute: an
  overlap-correct interval-union sweep gives kernels-only 80.2% busy, the gap
  filled by 1584 D2D copies/run (~80/step, ~230MB each = the gated-DeltaNet
  recurrent state). Phase 1's cuda_gpu_trace lumped this into compute.
- Decode GPU-time decomposition (% of kernel+memcpy busy): gated_delta_net 23.4%,
  get_rows 21.9%, D2D state copy 18.9%, FP4 GEMV 15.5%, FP4 GEMM 10.4%,
  full attention 0.4%. Grouped: SSM/gated-DeltaNet machinery ~67%, FP4 matmul
  ~28%, full attention (the only part paged attention optimizes) ~0.4%.
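
The overlap-correct interval-union measurement described above can be sketched
as follows. This is a minimal illustration, not the script used for the study:
the function name and the toy intervals are hypothetical, and real input would
be (start, end) timestamps for kernels and memcpys taken from the nsys report.

```python
from typing import Iterable, Tuple

Interval = Tuple[float, float]  # (start, end) in any consistent time unit


def union_busy(intervals: Iterable[Interval]) -> float:
    """Total time covered by at least one interval (overlaps counted once)."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged run.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current merged run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Toy data (hypothetical): two overlapping kernels, one later kernel, one memcpy.
kernels = [(0.0, 4.0), (3.0, 6.0), (8.0, 9.0)]
memcpys = [(6.0, 8.0)]
window = 10.0

kernels_only = union_busy(kernels) / window             # 0.7
kernels_and_copy = union_busy(kernels + memcpys) / window  # 0.9
print(kernels_only, kernels_and_copy)
```

Summing durations naively would count the 3.0-4.0 overlap twice; the union
does not, which is what makes the kernels-only and kernels+memcpy busy
fractions comparable.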

Verdict: not graphs, not the host loop, not primarily FP4 GEMM, not attention.
Paged attention touches ~0.4% of decode on this model, so no paged/graph/
block-table change can move decode_agg. The lever is the ggml qwen35
gated-DeltaNet decode: kill the per-layer recurrent-state D2D copy and fuse the
get_rows gather into the recurrence (vLLM's fused_recurrent_gated_delta_rule
keeps state in place). Ceiling: removing the copy, ~146 -> ~180 t/s; removing
the copy and the gather, ~146 -> ~247 t/s.
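
The ceilings above are consistent with a simple Amdahl-style estimate. The
sketch below assumes decode is GPU-bound (as the ~99% busy figure suggests) and
that the removed GPU time is not replaced by other work; `ceiling_tps` is a
hypothetical helper, and the fractions are the decomposition percentages quoted
in the findings.

```python
def ceiling_tps(current_tps: float, removed_fraction: float) -> float:
    """Upper-bound throughput if `removed_fraction` of GPU busy time vanishes.

    Assumes throughput scales inversely with the remaining GPU busy time.
    """
    if not 0.0 <= removed_fraction < 1.0:
        raise ValueError("removed_fraction must be in [0, 1)")
    return current_tps / (1.0 - removed_fraction)


BASE_TPS = 146.0   # measured paged decode throughput (t/s)
D2D_COPY = 0.189   # D2D state copy share of kernel+memcpy busy time
GET_ROWS = 0.219   # get_rows gather share of kernel+memcpy busy time

print(f"{ceiling_tps(BASE_TPS, D2D_COPY):.0f}")             # 180
print(f"{ceiling_tps(BASE_TPS, D2D_COPY + GET_ROWS):.0f}")  # 247
```

These are ceilings, not predictions: a fused recurrence still has to read and
update the state, so the real gain would be smaller.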

No code patch (the lever is an SSM-path rewrite, orthogonal to paged attention);
patches/paged/0018 stays free.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-24 21:44:22 +00:00

0001-vendor-paged-kv-manager.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-gather-read-plan.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-paged-gather-read-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch

feat(llama-cpp/paged): cross-request prefix caching patch 0006

2026-06-22 10:14:27 +00:00

0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch

feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)

2026-06-22 10:47:10 +00:00

0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch

feat(paged): wire cross-request prefix share into llama-server (patch 0008)

2026-06-22 15:03:16 +00:00

0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch

paged: in-kernel decode read patch 0009 (kill the gather regression)

2026-06-22 18:04:09 +00:00

0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch

feat(paged): tile in-kernel decode read + dispatch guard (patch 0010)

2026-06-22 20:37:12 +00:00

0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch

feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011)

2026-06-22 22:38:28 +00:00

0012-paged-mask-pad-invariant-assert.patch

feat(paged): assert mask-pad invariant for the paged tile route (patch 0012)

2026-06-23 09:13:08 +00:00

0013-paged-decoupled-prefill-token-budget.patch

feat(paged): add patch 0013 decoupled per-step prefill-token budget

2026-06-23 09:55:32 +00:00

0014-paged-expert-aware-moe-token-tile-cap.patch

feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap

2026-06-23 13:49:15 +00:00

0015-paged-expert-density-aware-moe-token-tile-auto-select.patch

feat(paged): mirror MoE token-tile density-aware auto-select (patch 0015)

2026-06-23 19:04:55 +00:00

0016-paged-dynamic-prefill-budget-continuous-batch.patch

feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 0016, continuous-batch P1)

2026-06-24 07:48:20 +00:00

0017-fp4-gemm-decode-tile-tune.patch

docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate results (patch 0017)

2026-06-24 17:58:00 +00:00

A2_CUDAGRAPH_DECODE.md

docs(paged): A.2 Phase 2 - locate the real decode lever (gated-DeltaNet SSM path)

2026-06-24 21:44:22 +00:00

ADDITIVE_DESIGN.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

CONTINUOUS_BATCH_SCHEDULER_SCOPE.md

docs(paged): adversarial review of the continuous-batch scheduler scope

2026-06-23 22:48:31 +00:00

DECODE_GAP_STUDY.md

docs(paged): decode-step gap study vs vLLM on GB10

2026-06-22 15:44:24 +00:00

FP4_GEMM_SCOPE_B.md

docs(paged): adversarial review of track-B FP4-GEMM parity go/no-go

2026-06-24 14:31:35 +00:00

GDN_DECODE_VERIFY.md

docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.4x lever

2026-06-24 11:21:44 +00:00

MOE_DENSITY_AUTO_TILE.md

feat(paged): mirror MoE token-tile density-aware auto-select (patch 0015)

2026-06-23 19:04:55 +00:00

MOE_GROUPED_GEMM_SCOPE.md

docs(paged): scope durable grouped FP4-MMA MoE GEMM port for GB10

2026-06-23 13:17:03 +00:00

MOE_TOKEN_TILE_CAP.md

feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap

2026-06-23 13:49:15 +00:00

P1_DYNAMIC_BUDGET_RESULTS.md

docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget

2026-06-24 10:56:13 +00:00

PAGED_BENCH.md

docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results

2026-06-22 12:59:09 +00:00

PAGED_GPU_VERIFY.md

docs(paged): record GPU correctness + CUDA backend-build verification

2026-06-22 11:50:01 +00:00

PAGED_VLLM_APPLES.md

docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache)

2026-06-22 14:16:52 +00:00

PAGED_VLLM_COMPARE.md

docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison

2026-06-22 13:48:01 +00:00

QWEN36_NVFP4_BENCH.md

docs(paged): fair re-run verdict - synthesize NVFP4 llama vs vLLM scorecard

2026-06-23 21:39:22 +00:00

SERVER_SWEEP.md

docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM)

2026-06-23 12:22:15 +00:00

THROUGHPUT_B_P1_RESULTS.md

docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate results (patch 0017)

2026-06-24 17:58:00 +00:00

VLLM_DECODE_GROUNDING.md

docs(paged): ground vLLM 0.23.0 eager-decode architecture vs llama.cpp

2026-06-24 07:44:07 +00:00