The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.
Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)
Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)
Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.
The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL
Source-only analysis (no GPU, no build) of what it would take to give the gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute backends, so the patch-series decode win extends past CUDA-family hardware.
In our changeset (patches 0018-0030) these fusions ship with CUDA native kernels + CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan / SYCL (a CPU-fallback fused op would regress via the device round-trip, and a backend that ran the plain op on the discriminated node would silently miscompute). "Layer 2" is the upstream work that adds the missing native kernels.
This doc was written against the ggml backend trees in
backend/cpp/llama-cpp-paged-dev (upstream base #24732, one commit OLDER than the
series pin c299a92c #25045, with only the two paged-KV patches applied - neither
touches GDN/SSM). So every "kernel already exists" statement below is a
conservative lower bound: the pin has at least these kernels.
0. Headline finding (correct a stale assumption first)
The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there." That is now stale. All three backends already carry the BASE compute ops:
| op | Metal | Vulkan | SYCL |
|---|---|---|---|
| GGML_OP_GATED_DELTA_NET | kernel_gated_delta_net_impl (f32, NSG 1/2/4) | gated_delta_net.comp (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | gated_delta_net.cpp (launch_gated_delta_net<KDA,keep_rs>) |
| GGML_OP_SSM_CONV | kernel_ssm_conv_f32_f32 (+ _4, + batched) | ssm_conv.comp (+ APPLY_BIAS, APPLY_SILU specialization consts) | ssm_conv.cpp (kernel_ssm_conv) |
| GGML_OP_SSM_SCAN | yes | ssm_scan.comp (mamba2) | ssm_scan.cpp (mamba2) |
Verified: Vulkan gated_delta_net.comp was last touched at the upstream base
commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on
Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today
(via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is
the decode SPEEDUP from the fusions, NOT enabling the model to run at all.
Consequence: the GDN-compute op being "partly there" is true on every backend, not just Metal. What is still missing per backend is only the FUSION plumbing (in-place write-back target, the ids gather read, and the conv-update kernel) - a materially smaller scope than "port GDN from scratch."
1. Per-op semantics (the four fusions to port)
All four reuse an existing GGML_OP enum with extra src[] slots as a
discriminator; none adds a new enum value. f32 throughout. The arithmetic core
is IDENTICAL to the upstream non-fused op; only the read source and/or the write
target are redirected. That single fact drives the whole bit-exactness story
(section 3).
OP A - ggml_gated_delta_net_inplace (patch 0018)
- Enum GGML_OP_GATED_DELTA_NET, discriminated by a non-null src[6] = state_dst (a contiguous [S_v*S_v*H, n_seqs] view into the recurrent-state cache at kv_head). K == 1 only.
- Semantics: run the standard GDN recurrence, but write the FINAL recurrent state directly into state_dst instead of appending it to the op output. The op output then carries only the attention scores. Removes the per-layer per-step ~full-state D2D copy-back (the 0018 win).
- Race (in-place read == write): each (seq, head) block owns a disjoint cache slot. The kernel loads the whole prior state s0 into per-thread registers (s_shard on CUDA, ls[NSG] on Metal, the column shard on Vulkan/SYCL) BEFORE the ring write, so reading and writing the same slot is safe.
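The snapshot-before-write rule can be illustrated on the CPU. This is a minimal sketch under loud assumptions: `toy_column_update` is a hypothetical name, and its update rule is a made-up stand-in (not the GDN recurrence) chosen only so that each output element depends on a neighbouring input element, which is what makes aliasing hazardous.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for one (seq, head) block. NOT the real GDN math.
// `src` and `dst` may point at the same cache slot (the in-place case).
inline void toy_column_update(const float * src, float * dst, size_t n, float decay) {
    assert(n > 0);

    // 1. Snapshot the whole prior state before any write
    //    (the analogue of loading s0 into per-thread registers).
    std::vector<float> s0(src, src + n);

    // 2. Compute from the snapshot only and write the final state.
    //    Element i also reads element i+1, so writing dst[0] before reading
    //    src[n-1]'s neighbour would corrupt the result without the snapshot.
    for (size_t i = 0; i < n; ++i) {
        dst[i] = decay * s0[i] + s0[(i + 1) % n];
    }
}
```

Calling it once with a separate destination and once with `dst == src` should produce the same values, which is the property OP A relies on.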
OP B - ggml_gated_delta_net_inplace_ids (patch 0019)
- Adds src[5] = FULL state cache [S_v,S_v,H,n_rs_slots], src[7] = ids (I32, per-seq source slot == the recurrent-state s_copy), op_param[1] = rs_head (destination base slot). Still has the OP-A src[6] in-place target.
- Semantics: read each sequence's prior state directly from cache[ids[seq]] (mirrors ggml_ssm_scan's ids source), eliminating the ggml_get_rows materialization. Combined with OP A the op now reads AND writes the cache in place.
- Race: identity sequences (ids[s] == rs_head + s, the steady AR-decode case) read s0 in place from the destination slot (safe via the register snapshot above). Non-identity sequences (reorder / rs_zero remap) are first copied by a TINY separate gather kernel (gdn_gather_nonident, one block/seq) into a DISJOINT scratch that the recurrence then reads, so the recurrence never reads a slot another block is writing. Value-preserving memcpy -> bit-identical to the get_rows path.
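The identity / non-identity split can be sketched serially on the CPU. Everything below is illustrative: `toy_recurrence`, `step_inplace_ids` and `step_get_rows_ref` are hypothetical names, the "recurrence" is a trivial placeholder, and the two loops stand in for the separate gather kernel and the main kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Placeholder recurrence (NOT the GDN math): new = 2 * old + 1.
inline void toy_recurrence(const float * s0, float * dst, size_t n) {
    std::vector<float> snap(s0, s0 + n);          // register-snapshot analogue
    for (size_t i = 0; i < n; ++i) {
        dst[i] = 2.0f * snap[i] + 1.0f;
    }
}

// Fused-style step: sequence s reads slot ids[s] and writes slot rs_head + s.
inline void step_inplace_ids(std::vector<float> & cache, const std::vector<int32_t> & ids,
                             int rs_head, size_t slot_size) {
    const size_t n_seqs = ids.size();
    assert(((size_t) rs_head + n_seqs) * slot_size <= cache.size());
    std::vector<float> scratch(n_seqs * slot_size);

    // Pass 1 (the tiny gather): copy ONLY non-identity sources into scratch.
    for (size_t s = 0; s < n_seqs; ++s) {
        if (ids[s] != rs_head + (int32_t) s) {
            std::memcpy(&scratch[s * slot_size], &cache[(size_t) ids[s] * slot_size],
                        slot_size * sizeof(float));
        }
    }
    // Pass 2 (the recurrence): identity reads in place, the rest read scratch.
    for (size_t s = 0; s < n_seqs; ++s) {
        float * dst = &cache[((size_t) rs_head + s) * slot_size];
        const bool identity = ids[s] == rs_head + (int32_t) s;
        const float * src = identity ? dst : &scratch[s * slot_size];
        toy_recurrence(src, dst, slot_size);
    }
}

// Reference: materialize every source first (get_rows analogue), then compute.
inline void step_get_rows_ref(std::vector<float> & cache, const std::vector<int32_t> & ids,
                              int rs_head, size_t slot_size) {
    const size_t n_seqs = ids.size();
    std::vector<float> gathered(n_seqs * slot_size);
    for (size_t s = 0; s < n_seqs; ++s) {
        std::memcpy(&gathered[s * slot_size], &cache[(size_t) ids[s] * slot_size],
                    slot_size * sizeof(float));
    }
    for (size_t s = 0; s < n_seqs; ++s) {
        toy_recurrence(&gathered[s * slot_size],
                       &cache[((size_t) rs_head + s) * slot_size], slot_size);
    }
}
```

With `ids = {1, 1}` and `rs_head = 1`, sequence 0 is an identity read of slot 1 while sequence 1 also sources slot 1 but writes slot 2. The gather pass is what lets sequence 1 see the pre-update contents of slot 1, so both functions leave the cache in the same state.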
OP C - ggml_ssm_conv_update_inplace (patch 0021)
- Enum GGML_OP_SSM_CONV, discriminated by a non-null src[3] = conv_state_dst ([(K-1)*channels, n_seqs] in-place ring view). src[0] = conv_states [K-1, channels, n_seqs], src[1] = conv_kernel [K, channels], src[2] = x_cur [channels, 1, n_seqs]. op_param[0] = fuse_silu.
- Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the width-K conv window in registers from the K-1 cached taps + the current token, compute the depthwise conv with the SAME ascending-tap FMA order as plain ssm_conv (tap0*w0 + ... + xc*w_{K-1}, then +0.0f to match plain conv's sumf += b with b==0), optionally fold SiLU, write the conv output [channels,1,n_seqs], and write the 1-token-shifted ring state back in place. Replaces the 4-op decode conv chain (transpose + concat + conv + silu + ring cpy).
- Race: read source (gathered taps) and write target (cache view) are disjoint buffers -> race-free by construction, no ids/identity logic.
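The OP C semantics above can be written as a small CPU loop. This is a sketch, not the patch's reference kernel: `conv_update_ref` is a hypothetical name, the flat index arithmetic assumes contiguous tensors with the first listed dimension innermost, and plain multiply-add is used where a real kernel may use an explicit FMA.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed contiguous layouts (innermost dimension first):
//   conv_state : [K-1, channels, n_seqs]  prior taps (read source)
//   conv_w     : [K,   channels]          depthwise kernel
//   x_cur      : [channels, n_seqs]       current token
//   state_dst  : [K-1, channels, n_seqs]  ring state (may alias conv_state)
//   out        : [channels, n_seqs]
inline void conv_update_ref(const float * conv_state, const float * conv_w,
                            const float * x_cur, float * state_dst, float * out,
                            int K, int channels, int n_seqs, bool fuse_silu) {
    assert(K >= 2 && channels > 0 && n_seqs > 0);
    std::vector<float> window((size_t) K);

    for (int s = 0; s < n_seqs; ++s) {
        for (int c = 0; c < channels; ++c) {
            const size_t cell = (size_t) s * channels + c;
            const size_t base = cell * (size_t) (K - 1);

            // Assemble the width-K window locally BEFORE any write.
            for (int t = 0; t < K - 1; ++t) {
                window[t] = conv_state[base + t];
            }
            window[K - 1] = x_cur[cell];

            // Ascending-tap accumulation, then the zero "bias" add.
            float sumf = 0.0f;
            for (int t = 0; t < K; ++t) {
                sumf += window[t] * conv_w[(size_t) c * K + t];
            }
            sumf += 0.0f;

            if (fuse_silu) {
                sumf = sumf / (1.0f + std::exp(-sumf));
            }
            out[cell] = sumf;

            // Shift the ring by one token: drop the oldest tap, append x_cur.
            for (int t = 0; t < K - 1; ++t) {
                state_dst[base + t] = window[t + 1];
            }
        }
    }
}
```

Because the window is copied to a local array first, the same function stays correct when `state_dst` aliases `conv_state`, which is the situation OP D's identity case creates.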
OP D - ggml_ssm_conv_update_inplace_ids (patch 0028)
- Same enum, discriminated by a non-null src[4] = ids; src[0] becomes the FULL conv cache [K-1, channels, n_cells]; op_param[1] = rs_head.
- Semantics: gather-free conv-update - read each sequence's prior taps from cache[ids[s]] in-kernel (no get_rows). Identity reads in place from conv_state_dst; non-identity gathered into a disjoint scratch first by a tiny ssm_conv_gather_nonident kernel. The window is copied to a local array BEFORE the (possibly aliasing) ring write so the identity read==write slot is correct. Bit-identical to get_rows + OP C.
Net new kernels vs reuse, per op
- OP A: NOT a new compute kernel - a write-target redirection of the EXISTING GDN kernel + 1 buffer binding + a supports_op/op-handler branch.
- OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) + 1 ids binding + rs_head param + 1 tiny gather kernel.
- OP C: a GENUINELY NEW kernel on each backend. The existing ssm_conv computes a windowed reduction over a PRE-concatenated input; it does not assemble the window from cached taps + the current token, fold silu, or write the shifted ring state. This is the largest net-new piece.
- OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1 tiny conv gather kernel.
The ggml.h / ggml.c builders, the CPU reference kernels, the model-graph
emission (delta-net-base.cpp, qwen35*), and the test-backend-ops cases are
SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend
work is the kernel(s) + the backend wiring.
2. Per-backend: authoring model, effort, gotchas, wiring
2.1 Metal (MSL)
Authoring model: .metal MSL source (ggml-metal.metal), function-constant
specialization (e.g. FC_GATED_DELTA_NET), kernels templated on NSG; host
glue split across ggml-metal-ops.cpp (ggml_metal_op_* encode), the pipeline
lookup in ggml-metal-device.cpp/.m, the kargs struct in ggml-metal-impl.h,
and supports_op in ggml-metal-device.m. Threadgroup model; Apple GPU
simdgroup width is a FIXED 32, simd_sum for the per-column reduce.
Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist
and are ergonomic to extend. OP A is a write-base redirect of the existing
kernel_gated_delta_net_impl (its tail already does
dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j] after
loading ls[] into registers - just point dst_state at the state_dst buffer
and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu
ssm_conv variant today - only plain + _4 + batched - so the silu-fold and ring
write are both new). Host glue spans 3-4 files.
Gotchas:
- In-place race: the existing kernel ALREADY snapshots the state column into ls[NSG] registers before writing, so OP A/B are safe with no barrier; OP C/D must mirror the float window[K] local-copy-before-write that CPU/CUDA use.
- Discriminated SSM_CONV: supports_op for GGML_OP_SSM_CONV currently returns has_simdgroup_reduction with NO check of src[3]/src[4]; GDN returns has_simdgroup_reduction && src[2]->ne[0] % 32 == 0 with NO check of src[6]/src[7]. Both must be tightened (accept the discriminated variant only once the kernel exists) AND ggml_metal_op_ssm_conv / ggml_metal_op_gated_delta_net must branch on the extra src to pick the kernel.
- Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three - the fused variant only redirects addresses, so it is bit-identical to Metal's own non-fused path by construction (the conv per-channel FMA needs the exact ascending order + the +0.0f).
- The kargs struct grows by the state_dst / ids / rs_head fields; a new pipeline name (or a function-constant branch) distinguishes the variants.
2.2 Vulkan (GLSL .comp -> SPIR-V)
Authoring model: GLSL .comp in vulkan-shaders/, compiled at build time by
vulkan-shaders-gen into embedded SPIR-V byte arrays (gated_delta_net_f32_data
etc.); pipeline creation in ggml-vulkan.cpp declares the binding count +
push-constant size; a push-constant struct per op; host dispatch ggml_vk_*
binds subbuffers; supports_op in the device support function. Subgroup size
VARIES by vendor (NVIDIA 32, AMD 64, Intel 8/16/32).
Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as
Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a
new ggml_vk_create_pipeline registration with an explicit binding count and
push-constant size, a new/extended push-constant struct (add rs_head), and
GROWING the descriptor binding set from the current 7 (src[0..5] + dst) to 8-9
(state_dst, ids). The GDN host dispatch hardcodes a 6-src bind loop and the
pipeline is created with "main", 7, ... - both must change.
Gotchas:
- Subgroup variance interacts with the EXISTING variant matrix: the GDN comp already ships shmem / cluster / nocluster variants keyed on subgroup size and relies on S_V % COLS_PER_WG == 0. The OP-A/B read/write redirect must be applied across ALL of those variants, and re-validated per vendor.
- In-place race: GLSL must read the full column shard into local registers before the ring write (same pattern); confirm the SPIR-V memory model is not relied on for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)). OP C/D need the explicit window-to-local copy.
- Discriminated SSM_CONV: supports_op returns op->src[0]->type == F32 with NO discriminator check; GDN loops src[0..5] F32 with NO src[6]/src[7] check. Both must be tightened. This is the backend where the 0030 hazard is most concrete (a present plain-conv kernel + a permissive supports_op = silent miscompute) - Vulkan is the exact case 0030 was written for.
- conv-update is per-channel (one invocation per channel) so it is subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden.
- Vulkan's ssm_conv.comp ALREADY has APPLY_SILU + APPLY_BIAS specialization constants, so the silu-fold half of OP C is partly precedented here (unlike Metal); the ring write-back + tap-window assembly are still new.
2.3 SYCL (single-source DPC++)
Authoring model: plain C++ .cpp/.hpp per op (gated_delta_net.cpp,
ssm_conv.cpp); a SYCL queue.parallel_for over an nd_range with
reqd_sub_group_size(WARP_SIZE); sub-group reductions (warp_reduce_sum);
supports_op in ggml-sycl.cpp. NO separate shader-compile step (single
source).
Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are
near-VERBATIM mirrors of the CUDA ones (launch_gated_delta_net<KDA,keep_rs>,
s_shard, curr_state, state = dst + attn_score_elems, warp_reduce_sum) -
a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port
almost line-for-line: add the state_dst param, the ids/rs_head params, the
read-base select, the two tiny gather kernels, and the new conv-update kernel.
No pipeline/push-constant/binding bookkeeping.
Gotchas:
- In-place race: the s_shard[] / window arrays are per-work-item private, so the register-snapshot-before-write pattern carries over directly. Safe.
- Discriminated SSM_CONV: supports_op checks src[0]/src[1] F32 with NO discriminator check; GDN returns a BARE true (the MOST permissive, so the hazard is worst here). Both must be tightened, and ggml_sycl_op_ssm_conv / ggml_sycl_op_gated_delta_net must branch on the extra src.
- Bit-exactness: WARP_SIZE is compile-fixed (Intel sub-group 8/16/32), same situation as CUDA; the fused variant matches SYCL's own non-fused path by construction. conv-update is per-channel -> subgroup-agnostic.
2.4 Common wiring (all three) + the 0030 emission-gate change
Per backend, four wiring touch-points beyond the kernel body:
- supports_op: tighten the GGML_OP_SSM_CONV and GGML_OP_GATED_DELTA_NET entries so the discriminated/extra-src node is reported supported ONLY when the new kernel handles it (and rejected otherwise, instead of today's silently-true-for-the-plain-kernel).
- op handler: branch on src[3]/src[4] (conv) and src[6]/src[7] (GDN) to dispatch the fused kernel.
- pipeline/kernel registration (Vulkan: + push-constant struct + descriptor bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions).
- The patch-0030 gate in src/llama-context.cpp.
The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg name is not "CUDA"/"ROCm"/"MUSA" forces fused_gdn_ar = fused_gdn_ch = auto_fgdn = false. As each backend gains kernels this must become capability-driven, in one of two ways:
- minimal: add the backend's reg name (e.g. "Metal") to the allow-list once its kernels + tightened supports_op ship; OR
- clean (recommended upstream form): DELETE the name allow-list and make supports_op authoritative - have the auto_fgdn resolution probe ggml_backend_dev_supports_op on a representative node that carries the discriminated src[] slots. Then routing falls out of the normal scheduler fallback and no backend name is ever hard-coded. This also fixes 0030's stated weakness that the upstream auto_fgdn check only inspects GATED_DELTA_NET nodes and covered the discriminated SSM_CONV only incidentally.
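The difference between the two gate shapes can be modelled without any ggml types. The sketch below is purely illustrative: `ToyNode`, `ToyBackend` and both `fused_allowed_*` functions are hypothetical stand-ins, not ggml or llama.cpp APIs, and the CPU backend (which the real gate also exempts) is left out for brevity.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins; none of these are ggml types.
struct ToyNode {
    bool has_fused_srcs;   // true when the extra discriminator src[] slots are set
};

struct ToyBackend {
    std::string reg_name;
    bool        has_fused_kernel;

    // "Tightened" supports_op: a discriminated node is accepted only when
    // a fused kernel exists; plain nodes are always accepted in this toy.
    bool supports_op(const ToyNode & node) const {
        return !node.has_fused_srcs || has_fused_kernel;
    }
};

// Shape of today's gate: a hard-coded name allow-list.
inline bool fused_allowed_by_name(const ToyBackend & b) {
    return b.reg_name == "CUDA" || b.reg_name == "ROCm" || b.reg_name == "MUSA";
}

// Capability-driven shape: probe a representative discriminated node.
inline bool fused_allowed_by_probe(const ToyBackend & b) {
    const ToyNode probe{ /*has_fused_srcs=*/ true };
    return b.supports_op(probe);
}
```

In this model a backend that later gains fused kernels is picked up by the probe with no edit to the gate, while the name list would still reject it until someone adds the string.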
3. Bit-exactness per backend (the md5 gate question)
Feasible on ALL THREE, and not actually constraining, because of how the gate is scoped:
- The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has its own f32 reduction order (Metal simd_sum, Vulkan subgroup reduce, SYCL warp_reduce_sum) that differs from CUDA's and from CPU's. There is no cross-backend md5 and none is expected.
- The relevant per-backend invariant is: the FUSED variant must equal that backend's OWN non-fused path. The fusions change only the read source (gather -> indexed read; the gather is a value-preserving memcpy) and the write target (appended output -> in-place cache slot). They do NOT touch the per-column FMA/reduce order. So the fused op is bit-identical to the non-fused op on the same backend BY CONSTRUCTION.
- Two arithmetic details each port MUST preserve exactly: (a) the conv ascending-tap order plus the +0.0f that matches plain ssm_conv's sumf += b with b==0; (b) the existing GDN per-column subgroup reduce (do not re-order it). Get those right and test-backend-ops (backendX-vs-CPU, already registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET) is the per-backend gate.
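Why the accumulation order and the trailing +0.0f count as "arithmetic" while an address redirect does not can be shown with plain IEEE-754 f32 behaviour. The helpers below are hypothetical illustrations, assuming default floating-point compilation (no fast-math style flags); whether a given kernel's accumulator can actually reach -0.0 depends on that kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstring>
#include <vector>

// Sum front-to-back.
inline float sum_ascending(const std::vector<float> & v) {
    float s = 0.0f;
    for (size_t i = 0; i < v.size(); ++i) {
        s += v[i];
    }
    return s;
}

// Sum back-to-front: same values, different rounding sequence.
inline float sum_descending(const std::vector<float> & v) {
    float s = 0.0f;
    for (size_t i = v.size(); i > 0; --i) {
        s += v[i - 1];
    }
    return s;
}

// Bitwise comparison, which is what an md5-style gate effectively checks.
inline bool same_bits(float a, float b) {
    return std::memcmp(&a, &b, sizeof(float)) == 0;
}

// The zero "bias" add: not a no-op at the bit level, since -0.0f + 0.0f is +0.0f.
inline float add_zero_bias(float sumf) {
    return sumf + 0.0f;
}

// A redirect only moves values: a copy preserves every bit.
inline float copy_through_scratch(float x) {
    float scratch;
    std::memcpy(&scratch, &x, sizeof(float));
    return scratch;
}
```

For the input `{1.0f, 1e8f, -1e8f}` the two sums differ (the 1.0f is absorbed when added to 1e8f first), which is why a port must keep the existing reduce order, whereas `copy_through_scratch` always returns the same bits it was given.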
4. Upstream path and ranked recommendation
Ops-first, then one PR per backend (NOT one big PR)
Recommended sequence:
- PR #1 - OPS (already essentially done, upstreamable as-is): the ggml.h / ggml.c builders, the CPU reference kernels, the CUDA kernels, the test-backend-ops cases, and the capability-driven gate (the clean supports_op-authoritative version of 0030). This is independently mergeable and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed that way).
- PR #2 - Metal kernels + wiring.
- PR #3 - SYCL kernels + wiring.
- PR #4 - Vulkan kernels + wiring.
Do NOT bundle the backends: each needs its own hardware to validate
test-backend-ops, reviewers are backend-specialized, and a regression in one
must not block the others.
Value x effort ranking (which backend first)
| backend | user base / value | author effort | bit-exact difficulty | net rank |
|---|---|---|---|---|
| Metal | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32 simdgroup) | 1st |
| SYCL | LOW-MED (Intel GPU) | LOWEST (near-verbatim CUDA mirror) | LOW | 2nd |
| Vulkan | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | 3rd |
Recommendation: Metal first. It banks the biggest user-facing decode win at medium effort, the base GDN + conv kernels already exist, and Apple's fixed simdgroup width makes bit-exactness the simplest. SYCL second as a cheap, nearly mechanical follow-on (the port is a line-for-line CUDA mirror, so it is low-cost insurance even though the Intel-GPU audience is smaller). Vulkan last as the high-effort / high-breadth capstone - it reaches the widest hardware (AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the existing variant matrix, the subgroup-width variance, and the per-vendor validation burden make it the right capstone once the pattern is proven on Metal + SYCL.
A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR (both are register-snapshot ports with no shader-gen step) and treat Vulkan as a separate later effort.
5. Summary
- GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL (the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all three today via the non-fused path; Layer-2 is about the decode SPEEDUP.
- Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids read (OP B) to the existing GDN kernel, write ONE new conv-update kernel (OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten supports_op + the op-handler branch + (Vulkan) the pipeline/push-constant/descriptor wiring. The builders, CPU refs, model graph and tests are shared and already done.
- Bit-exactness is feasible everywhere and per-backend by construction (the fusions redirect addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU) is the gate.
- Sequence: ops-first PR (incl. the capability-driven replacement for 0030's name allow-list), then Metal, then SYCL, then Vulkan.