docs(paged): Mac stock-vs-patched bench + Vulkan note + cross-backend learnings

Section 4(c): real Apple M4/Metal numbers (Qwen3-8B Q4_K_M, stock vs patched) - patchset is neutral-to-slightly-negative on Metal (the in-kernel block-table read is CUDA-only; NVFP4/GDN-fusions inert), so prefer stock llama-cpp on Apple Silicon.

Vulkan: same picture, worse (no upstream GDN op).

Section 6: cross-backend learnings + upstream candidates (the GDN decode-plumbing fusions are the portable, bit-exact, CPU-mirrored win worth upstreaming).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:57:14 -04:00 · 2026-06-27 11:05:37 +00:00
parent 78fac9a28f
commit 4a9a1dd247
1 changed file with 45 additions and 9 deletions
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/README.md
@@ -197,18 +197,37 @@ groundtruth config**. Memory: llama uses **1.5-3x lower** memory than vLLM.
 stock on the same harness. Dense is **parity-to-ahead of vLLM**; MoE trails - the
 remaining gap is structural (see section 5).

-### (c) Apple M4 (16GB) - for curiosity only
+### (c) Apple Silicon (M4, 16GB Metal) - does the patchset help here?

-No t/s: the 24GB NVFP4 GGUF did not finish downloading and would not fit in 16GB
-RAM (= SSD paging). Architectural findings:
+Short answer: **no - the wins are CUDA/Blackwell-specific.** Two facts first: the
+24GB NVFP4 GGUF doesn't fit in a 16GB M4 (it would page to SSD), and on Metal `supports_op`
+**excludes NVFP4** from `MUL_MAT`/`MUL_MAT_ID`/`GET_ROWS` (FP4 matmuls fall back to
+CPU - no Apple FP4-MMA). So NVFP4 Qwen3.6 is not a Mac fit; a Metal-native Q4_K is.

- Metal `supports_op` **excludes NVFP4** from `MUL_MAT` / `MUL_MAT_ID` /
-  `GET_ROWS`, so the FP4 matmuls **fall back to CPU** - there is no Apple
-  FP4-MMA.
- `GATED_DELTA_NET` and `SSM_CONV` / `SSM_SCAN` **do** have Metal kernels.
+Measured **stock vs patched** (same pin `c299a92c`, both built `-DGGML_METAL=ON`;
+the 28-patch series **compiles clean on Metal** - the CUDA code is `#if`-guarded),
+on **Qwen3-8B Q4_K_M** (a dense GQA model that fits 16GB and exercises the *live*
+Metal features; no Qwen3.6 hybrid GGUF fits 16GB, and the GDN fusions gate off on
+Metal anyway), `llama-bench` pp512/tg128 t/s:

-Verdict: NVFP4 Qwen3.6 needs **Blackwell FP4-MMA + >24GB RAM**; a 16GB M4 is not
-a fit. A Metal-native Q4_K Qwen3.6 would be a different artifact.
+| config | pp512 | tg128 |
+|---|---:|---:|
+| stock | 226.7 | 20.4 |
+| patched, paged **off** | 226.7 | 20.3 (= stock) |
+| patched, paged **on** | 222.6 | 19.8 (~0.97x) |
+
+Concurrency (`batched-bench`) scales identically to stock (S_TG ~20 -> ~137 at
+npl32, from llama.cpp's existing batching). **Verdict: neutral-to-slightly-negative
+on Metal.** Patched-paged-off equals stock; turning paged on is ~0-3% slower
+decode / ~2-8% slower prefill, because the in-kernel block-table flash-attn read
+that *recovers* the gather cost is CUDA-only (`fattn-*.cuh`) - on Metal the paged
+path falls back to a host-side gather, pure overhead over stock's contiguous read.
+Everything Blackwell-specific (NVFP4, GDN fusions via 0030, occupancy) is inert.
+So **on Apple Silicon, prefer the stock `llama-cpp` backend.**
+
+**Vulkan** (source analysis, no box to measure): same picture, worse - the
+CUDA-only levers are inert AND the gated-DeltaNet op has *no Vulkan kernel
+upstream*, so the Qwen3.6 hybrid models assert/fall back and don't run there.

 ---

@@ -279,6 +298,23 @@ in a recommended/gallery config.
  correctness on other GPUs. Patch 0030 makes the fused-op emission CUDA-family +
  CPU only, so a non-CUDA paged build routes to the safe upstream non-fused path.

+- **What generalizes beyond this backend (upstream candidates).** The *speedups*
+  are CUDA/Blackwell-specific (which is why Metal/Vulkan don't benefit - section
+  4c), but several *findings and ops* are portable and worth upstreaming:
+  - The headline is hardware-independent: on hybrid gated-DeltaNet models, decode
+    is bottlenecked by the recurrent-state **plumbing** (memcpy + gathers, ~67% of
+    the step), not the weight GEMM. The fusions for it (in-place state 0018, gather
+    0019/0028, conv 0021) are bit-exact and already have CPU reference kernels, so
+    they would speed up Qwen3.6 / Qwen3-Next / any hybrid-SSM decode on **every**
+    backend once the ggml ops gain the respective (Metal/Vulkan) kernels - the
+    highest-value upstream contribution.
+  - The o_proj GEMV->MMQ reshape (0020) is a model-graph fix (batch the projection
+    to hit the GEMM path) - arch-agnostic in principle, trivial to upstream.
+  - The paged KV + cross-request prefix sharing + decode-first scheduler align with
+    llama.cpp's own in-progress KV / chunked-prefill work and could inform it.
+  - The per-path bit-exact md5 gate + the weekly upstream-drift canary form a reusable
+    maintenance pattern for any vendored-patch backend.
+
 ---

 ## 7. Pin + maintenance policy