Add ACCELERATOR_PORTING_SCOPE.md, the umbrella scope for taking the paged
backend's accelerator-portable wins off the CUDA family. It builds on (does
not duplicate) UPSTREAM_LAYER2_SCOPE.md, which stays the GDN/SSM-fusion
detail (benefit #1), and adds:
- Benefit #2 (paged KV in-kernel block-table flash-attn read, 0009-0011):
new per-backend feasibility from source analysis of the Metal/SYCL/Vulkan
flash-attn kernels. SYCL EASY (near line-for-line CUDA mirror), Metal
EASY-MEDIUM (decode already routes to the vec kernel), Vulkan MEDIUM (the
fast coopmat2 NVIDIA decode path cannot do the indexed read; push-constants
are full). Universal constraint: only the vec/scalar decode kernel admits
the per-cell indexed read, so route block-table ops onto vec (as CUDA's
0009-0010 dispatch guard already does) and leave the fast MM/coopmat2 path
contiguous-only. This is the lever that flips paged KV from
neutral-to-slightly-negative to non-negative off CUDA.
- Benefit #3 (decode-first scheduler, 0013/0016): confirmed a free portable
win - host-side update_slots() policy, zero kernel work, runs on any
accelerator as-is.
- Benefit #4 (NVFP4 FP4-MMA, 0017/0023/0025): out of scope (Blackwell only);
flags the backend-agnostic analogues of the act-quant dedup and the
graph-coverage lever without over-claiming a port.
- A ROCm note: ROCm rides the CUDA/HIP path (validate, don't re-port);
FP4-MMA stays Blackwell-only.
Benefits #1 and #2 share the port shape and rank Metal->SYCL->Vulkan, so they
bundle into one per-backend PR behind a shared ops-first PR. Cross-link added
from UPSTREAM_LAYER2_SCOPE.md. All gates are test-backend-ops on-target (no
Metal/SYCL/Vulkan/ROCm hardware here).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.
Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)
Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)
Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.
The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>