Commit Graph

7 Commits

Author SHA1 Message Date
Ettore Di Giacinto
c51ff4cec9 docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm)
Add ACCELERATOR_PORTING_SCOPE.md, the umbrella scope for taking the paged
backend's accelerator-portable wins off the CUDA family. It builds on (does
not duplicate) UPSTREAM_LAYER2_SCOPE.md, which stays the GDN/SSM-fusion
detail (benefit #1), and adds:

- Benefit #2 (paged KV in-kernel block-table flash-attn read, 0009-0011):
  new per-backend feasibility from source analysis of the Metal/SYCL/Vulkan
  flash-attn kernels. SYCL EASY (near line-for-line CUDA mirror), Metal
  EASY-MEDIUM (decode already routes to the vec kernel), Vulkan MEDIUM (the
  fast coopmat2 NVIDIA decode path cannot do the indexed read; push-constants
  are full). Universal constraint: only the vec/scalar decode kernel admits
  the per-cell indexed read, so route block-table ops onto vec (as CUDA's
  0009-0010 dispatch guard already does) and leave the fast MM/coopmat2 path
  contiguous-only. This is the lever that flips paged KV from
  neutral-to-slightly-negative to non-negative off CUDA.
- Benefit #3 (decode-first scheduler, 0013/0016): confirmed a free portable
  win - host-side update_slots() policy, zero kernel work, runs on any
  accelerator as-is.
- Benefit #4 (NVFP4 FP4-MMA, 0017/0023/0025): out of scope (Blackwell only);
  flags the backend-agnostic analogues of the act-quant dedup and the
  graph-coverage lever without over-claiming a port.
- A ROCm note: ROCm rides the CUDA/HIP path (validate, don't re-port);
  FP4-MMA stays Blackwell-only.

Benefits #1 and #2 share the port shape and rank Metal->SYCL->Vulkan, so they
bundle into one per-backend PR behind a shared ops-first PR. Cross-link added
from UPSTREAM_LAYER2_SCOPE.md. All gates are test-backend-ops on-target (no
Metal/SYCL/Vulkan/ROCm hardware here).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 08:34:32 +00:00
Ettore Di Giacinto
0b84fda496 docs(paged): add the bf16-tau opt-in line to the decode plots
Per request, the plots now show all four series: llama.cpp (standard), vLLM,
LocalAI's llama.cpp patches (bit-exact hero), and LocalAI's patches + bf16-tau
(opt-in ceiling, +3% to +17% over the patches, ahead of vLLM at every dense width
and MoE npl>=32). Subtitle flags bf16-tau as opt-in / not bit-exact.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:25:02 +00:00
Ettore Di Giacinto
1431f72b92 docs(paged): regenerate decode plots (3-way) from re-measured data + overview
Rebuild the two committed decode plots from the re-measured CSV and add a combined
overview. Three series per the comparison that matters: llama.cpp (standard) vs
vLLM vs LocalAI's llama.cpp patches; x-over-standard called out at npl128. bf16-tau
stays out of the plot (it remains in the CSV + the README table as the opt-in row).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:20:12 +00:00
Ettore Di Giacinto
3466094c68 docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau)
Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE
q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and
README section 4 carry a single consistent set of llama numbers with all three
configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin
  9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce
  stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact,
  ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both
models at all four widths (the prior CSV had no stock and no bf16-tau rows).
peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and
the bench does not print it, so per-run peak could not be captured this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in
bf16-tau adds a further +3% to +17% on top of patched (growing with width).
vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:05:59 +00:00
Ettore Di Giacinto
ed5eb705c7 docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7
The paged backend's llama.cpp pin was reverted from c299a92c back to
9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the
reverted sync) is dead weight. The pin-sync PROCESS stays documented in
the three live places: the Makefile comment, README section 7 (Pin +
maintenance policy), and .agents/llama-cpp-localai-paged-backend.md.

Delete the doc and repoint every reference to it (Makefile, README,
.agents, canary script + workflow) at README section 7. No functional
paths change: the canary's patches-dir glob (patches/paged/0*.patch)
is untouched.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 21:34:10 +00:00
Ettore Di Giacinto
53f66a6f03 fix(paged): revert pin to 9d5d882d (== stock); c299a92c broke grpc-server link
The c299a92c bump diverged 23 commits ahead of the stock llama-cpp pin.
grpc-server.cpp is SHARED with the stock backend and tracks the stock pin;
c299a92c's upstream server-API refactor pulled stream_* helpers into the headers
grpc-server.cpp includes, whose definitions the stock-aligned build does not
compile -> every paged variant failed to LINK (undefined reference to
stream_aware_should_stop / stream_pipe_producer::cleanup /
stream_session_attach_pipe). The bump was greedy-md5 bit-exact, but the bit-exact
gate never exercises the full grpc-server build, so it slipped through.

Revert LLAMA_VERSION to 9d5d882d (== stock pin, where the patches are bit-exact
AND grpc-server links - the original DGX-proven baseline). Document the hard
constraint in the Makefile, README, PIN_SYNC record, and the .agents guide: the
paged pin must track the stock pin, and a pin-sync must pass the full CI
grpc-server build, not only the bit-exact gate.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 20:28:28 +00:00
Ettore Di Giacinto
08b754f910 chore(paged): keep patches/ patch-only; README to backend root, docs to docs/
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.

Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
  final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)

Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)

Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.

The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 13:20:05 +00:00