# Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL Source-only analysis (no GPU, no build) of what it would take to give the gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute backends, so the patch-series decode win extends past CUDA-family hardware. In our changeset (patches 0018-0030) these fusions ship with CUDA native kernels + CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan / SYCL (a CPU-fallback fused op would regress via the device round-trip, and a backend that ran the plain op on the discriminated node would silently miscompute). "Layer 2" is the upstream work that adds the missing native kernels. This doc was written against the ggml backend trees in `backend/cpp/llama-cpp-paged-dev` (upstream base #24732, one commit OLDER than the series pin `c299a92c` #25045, with only the two paged-KV patches applied - neither touches GDN/SSM). So every "kernel already exists" statement below is a conservative lower bound: the pin has at least these kernels. -------------------------------------------------------------------------------- ## 0. Headline finding (correct a stale assumption first) The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there." **That is now stale.** All three backends already carry the BASE compute ops: | op | Metal | Vulkan | SYCL | |------------------------|------------------------------------|------------------------------------------|---------------------------------| | GGML_OP_GATED_DELTA_NET| `kernel_gated_delta_net_impl` (f32, NSG 1/2/4) | `gated_delta_net.comp` (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | `gated_delta_net.cpp` (`launch_gated_delta_net`) | | GGML_OP_SSM_CONV | `kernel_ssm_conv_f32_f32` (+ `_4`, + batched) | `ssm_conv.comp` (+ APPLY_BIAS, APPLY_SILU specialization consts) | `ssm_conv.cpp` (`kernel_ssm_conv`) | | GGML_OP_SSM_SCAN | yes | `ssm_scan.comp` (mamba2) | `ssm_scan.cpp` (mamba2) | Verified: Vulkan `gated_delta_net.comp` was last touched at the upstream base commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today (via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is the decode SPEEDUP from the fusions, NOT enabling the model to run at all. Consequence: the GDN-compute op being "partly there" is true on every backend, not just Metal. What is still missing per backend is only the FUSION plumbing (in-place write-back target, the ids gather read, and the conv-update kernel) - a materially smaller scope than "port GDN from scratch." -------------------------------------------------------------------------------- ## 1. Per-op semantics (the four fusions to port) All four reuse an existing GGML_OP enum with extra `src[]` slots as a discriminator; none adds a new enum value. f32 throughout. The arithmetic core is IDENTICAL to the upstream non-fused op; only the read source and/or the write target are redirected. That single fact drives the whole bit-exactness story (section 3). ### OP A - `ggml_gated_delta_net_inplace` (patch 0018) - Enum `GGML_OP_GATED_DELTA_NET`, discriminated by a non-null `src[6]` = `state_dst` (a contiguous `[S_v*S_v*H, n_seqs]` view into the recurrent-state cache at `kv_head`). K == 1 only. - Semantics: run the standard GDN recurrence, but write the FINAL recurrent state directly into `state_dst` instead of appending it to the op output. The op output then carries only the attention scores. Removes the per-layer per-step ~full-state D2D copy-back (the 0018 win). - Race (in-place read == write): each (seq, head) block owns a disjoint cache slot. The kernel loads the whole prior state `s0` into per-thread registers (`s_shard` on CUDA, `ls[NSG]` on Metal, the column shard on Vulkan/SYCL) BEFORE the ring write, so reading and writing the same slot is safe. ### OP B - `ggml_gated_delta_net_inplace_ids` (patch 0019) - Adds `src[5]` = FULL state cache `[S_v,S_v,H,n_rs_slots]`, `src[7]` = `ids` (I32, per-seq source slot == the recurrent-state `s_copy`), `op_param[1]` = `rs_head` (destination base slot). Still has the OP-A `src[6]` in-place target. - Semantics: read each sequence's prior state directly from `cache[ids[seq]]` (mirrors `ggml_ssm_scan`'s ids source), eliminating the `ggml_get_rows` materialization. Combined with OP A the op now reads AND writes the cache in place. - Race: identity sequences (`ids[s] == rs_head + s`, the steady AR-decode case) read s0 in place from the destination slot (safe via the register snapshot above). Non-identity sequences (reorder / rs_zero remap) are first copied by a TINY separate gather kernel (`gdn_gather_nonident`, one block/seq) into a DISJOINT scratch that the recurrence then reads, so the recurrence never reads a slot another block is writing. Value-preserving memcpy -> bit-identical to the get_rows path. ### OP C - `ggml_ssm_conv_update_inplace` (patch 0021) - Enum `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]` = `conv_state_dst` (`[(K-1)*channels, n_seqs]` in-place ring view). `src[0]` = conv_states `[K-1, channels, n_seqs]`, `src[1]` = conv_kernel `[K, channels]`, `src[2]` = x_cur `[channels, 1, n_seqs]`. `op_param[0]` = fuse_silu. - Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the width-K conv window in registers from the K-1 cached taps + the current token, compute the depthwise conv with the SAME ascending-tap FMA order as plain `ssm_conv` (`tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` to match plain conv's `sumf += b` with b==0), optionally fold SiLU, write the conv output `[channels,1,n_seqs]`, and write the 1-token-shifted ring state back in place. Replaces the 4-op decode conv chain (transpose + concat + conv + silu + ring cpy). - Race: read source (gathered taps) and write target (cache view) are disjoint buffers -> race-free by construction, no ids/identity logic. ### OP D - `ggml_ssm_conv_update_inplace_ids` (patch 0028) - Same enum, discriminated by a non-null `src[4]` = `ids`; `src[0]` becomes the FULL conv cache `[K-1, channels, n_cells]`; `op_param[1]` = rs_head. - Semantics: gather-free conv-update - read each sequence's prior taps from `cache[ids[s]]` in-kernel (no get_rows). Identity reads in place from `conv_state_dst`; non-identity gathered into a disjoint scratch first by a tiny `ssm_conv_gather_nonident` kernel. The window is copied to a local array BEFORE the (possibly aliasing) ring write so the identity read==write slot is correct. Bit-identical to get_rows + OP C. ### Net new kernels vs reuse, per op - OP A: NOT a new compute kernel - a write-target redirection of the EXISTING GDN kernel + 1 buffer binding + a supports_op/op-handler branch. - OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) + 1 ids binding + rs_head param + 1 tiny gather kernel. - OP C: a GENUINELY NEW kernel on each backend. The existing `ssm_conv` computes a windowed reduction over a PRE-concatenated input; it does not assemble the window from cached taps + the current token, fold silu, or write the shifted ring state. This is the largest net-new piece. - OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1 tiny conv gather kernel. The `ggml.h` / `ggml.c` builders, the CPU reference kernels, the model-graph emission (`delta-net-base.cpp`, qwen35*), and the `test-backend-ops` cases are SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend work is the kernel(s) + the backend wiring. -------------------------------------------------------------------------------- ## 2. Per-backend: authoring model, effort, gotchas, wiring ### 2.1 Metal (MSL) Authoring model: `.metal` MSL source (`ggml-metal.metal`), function-constant specialization (e.g. `FC_GATED_DELTA_NET`), kernels templated on `NSG`; host glue split across `ggml-metal-ops.cpp` (`ggml_metal_op_*` encode), the pipeline lookup in `ggml-metal-device.cpp`/`.m`, the kargs struct in `ggml-metal-impl.h`, and `supports_op` in `ggml-metal-device.m`. Threadgroup model; Apple GPU simdgroup width is a FIXED 32, `simd_sum` for the per-column reduce. Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist and are ergonomic to extend. OP A is a write-base redirect of the existing `kernel_gated_delta_net_impl` (its tail already does `dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j]` after loading `ls[]` into registers - just point `dst_state` at the `state_dst` buffer and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu ssm_conv variant today - only plain + `_4` + batched - so the silu-fold and ring write are both new). Host glue spans 3-4 files. Gotchas: - In-place race: the existing kernel ALREADY snapshots the state column into `ls[NSG]` registers before writing, so OP A/B are safe with no barrier; OP C/D must mirror the `float window[K]` local-copy-before-write that CPU/CUDA use. - Discriminated SSM_CONV: `supports_op` for `GGML_OP_SSM_CONV` currently returns `has_simdgroup_reduction` with NO check of `src[3]`/`src[4]`; GDN returns `has_simdgroup_reduction && src[2]->ne[0] % 32 == 0` with NO check of `src[6]`/`src[7]`. Both must be tightened (accept the discriminated variant only once the kernel exists) AND `ggml_metal_op_ssm_conv` / `ggml_metal_op_gated_delta_net` must branch on the extra src to pick the kernel. - Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three - the fused variant only redirects addresses, so it is bit-identical to Metal's own non-fused path by construction (the conv per-channel FMA needs the exact ascending order + the `+0.0f`). - The kargs struct grows by the `state_dst` / `ids` / `rs_head` fields; a new pipeline name (or a function-constant branch) distinguishes the variants. ### 2.2 Vulkan (GLSL .comp -> SPIR-V) Authoring model: GLSL `.comp` in `vulkan-shaders/`, compiled at build time by `vulkan-shaders-gen` into embedded SPIR-V byte arrays (`gated_delta_net_f32_data` etc.); pipeline creation in `ggml-vulkan.cpp` declares the binding count + push-constant size; a push-constant struct per op; host dispatch `ggml_vk_*` binds subbuffers; `supports_op` in the device support function. Subgroup size VARIES by vendor (NVIDIA 32, AMD 64, Intel 8/16/32). Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a new `ggml_vk_create_pipeline` registration with an explicit binding count and push-constant size, a new/extended push-constant struct (add `rs_head`), and GROWING the descriptor binding set from the current 7 (`src[0..5]` + dst) to 8-9 (`state_dst`, `ids`). The GDN host dispatch hardcodes a 6-src bind loop and the pipeline is created with `"main", 7, ...` - both must change. Gotchas: - Subgroup variance interacts with the EXISTING variant matrix: the GDN comp already ships shmem / cluster / nocluster variants keyed on subgroup size and relies on `S_V % COLS_PER_WG == 0`. The OP-A/B read/write redirect must be applied across ALL of those variants, and re-validated per vendor. - In-place race: GLSL must read the full column shard into local registers before the ring write (same pattern); confirm the SPIR-V memory model is not relied on for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)). OP C/D need the explicit window-to-local copy. - Discriminated SSM_CONV: `supports_op` returns `op->src[0]->type == F32` with NO discriminator check; GDN loops `src[0..5]` F32 with NO `src[6]`/`src[7]` check. Both must be tightened. This is the backend where the 0030 hazard is most concrete (a present plain-conv kernel + a permissive supports_op = silent miscompute) - Vulkan is the exact case 0030 was written for. - conv-update is per-channel (one invocation per channel) so it is subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden. - Vulkan's `ssm_conv.comp` ALREADY has APPLY_SILU + APPLY_BIAS specialization constants, so the silu-fold half of OP C is partly precedented here (unlike Metal); the ring write-back + tap-window assembly are still new. ### 2.3 SYCL (single-source DPC++) Authoring model: plain C++ `.cpp`/`.hpp` per op (`gated_delta_net.cpp`, `ssm_conv.cpp`); a SYCL `queue.parallel_for` over an `nd_range` with `reqd_sub_group_size(WARP_SIZE)`; sub-group reductions (`warp_reduce_sum`); `supports_op` in `ggml-sycl.cpp`. NO separate shader-compile step (single source). Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are near-VERBATIM mirrors of the CUDA ones (`launch_gated_delta_net`, `s_shard`, `curr_state`, `state = dst + attn_score_elems`, `warp_reduce_sum`) - a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port almost line-for-line: add the `state_dst` param, the `ids`/`rs_head` params, the read-base select, the two tiny gather kernels, and the new conv-update kernel. No pipeline/push-constant/binding bookkeeping. Gotchas: - In-place race: the `s_shard[]` / window arrays are per-work-item private, so the register-snapshot-before-write pattern carries over directly. Safe. - Discriminated SSM_CONV: `supports_op` checks `src[0]`/`src[1]` F32 with NO discriminator check; GDN returns a BARE `true` (the MOST permissive, so the hazard is worst here). Both must be tightened, and `ggml_sycl_op_ssm_conv` / `ggml_sycl_op_gated_delta_net` must branch on the extra src. - Bit-exactness: `WARP_SIZE` is compile-fixed (Intel sub-group 8/16/32), same situation as CUDA; the fused variant matches SYCL's own non-fused path by construction. conv-update is per-channel -> subgroup-agnostic. ### 2.4 Common wiring (all three) + the 0030 emission-gate change Per backend, four wiring touch-points beyond the kernel body: 1. `supports_op`: tighten the `GGML_OP_SSM_CONV` and `GGML_OP_GATED_DELTA_NET` entries so the discriminated/extra-src node is reported supported ONLY when the new kernel handles it (and rejected otherwise, instead of today's silently-true-for-the-plain-kernel). 2. op handler: branch on `src[3]`/`src[4]` (conv) and `src[6]`/`src[7]` (GDN) to dispatch the fused kernel. 3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions). 4. The patch-0030 gate in `src/llama-context.cpp`. The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg name is not `"CUDA"`/`"ROCm"`/`"MUSA"` forces `fused_gdn_ar = fused_gdn_ch = auto_fgdn = false`. As each backend gains kernels this must become capability- driven, in one of two ways: - minimal: add the backend's reg name (e.g. `"Metal"`) to the allow-list once its kernels + tightened supports_op ship; OR - clean (recommended upstream form): DELETE the name allow-list and make `supports_op` authoritative - have the `auto_fgdn` resolution probe `ggml_backend_dev_supports_op` on a representative node that carries the discriminated `src[]` slots. Then routing falls out of the normal scheduler fallback and no backend name is ever hard-coded. This also fixes 0030's stated weakness that the upstream `auto_fgdn` check only inspects GATED_DELTA_NET nodes and covered the discriminated SSM_CONV only incidentally. -------------------------------------------------------------------------------- ## 3. Bit-exactness per backend (the md5 gate question) Feasible on ALL THREE, and not actually constraining, because of how the gate is scoped: - The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has its own f32 reduction order (Metal `simd_sum`, Vulkan subgroup reduce, SYCL `warp_reduce_sum`) that differs from CUDA's and from CPU's. There is no cross-backend md5 and none is expected. - The relevant per-backend invariant is: the FUSED variant must equal that backend's OWN non-fused path. The fusions change only the read source (gather -> indexed read; the gather is a value-preserving memcpy) and the write target (appended output -> in-place cache slot). They do NOT touch the per-column FMA/reduce order. So the fused op is bit-identical to the non-fused op on the same backend BY CONSTRUCTION. - Two arithmetic details each port MUST preserve exactly: (a) the conv ascending-tap order plus the `+0.0f` that matches plain `ssm_conv`'s `sumf += b` with b==0; (b) the existing GDN per-column subgroup reduce (do not re-order it). Get those right and `test-backend-ops` (backendX-vs-CPU, already registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET) is the per-backend gate. -------------------------------------------------------------------------------- ## 4. Upstream path and ranked recommendation ### Ops-first, then one PR per backend (NOT one big PR) Recommended sequence: 1. PR #1 - OPS (already essentially done, upstreamable as-is): the `ggml.h`/ `ggml.c` builders, the CPU reference kernels, the CUDA kernels, the `test-backend-ops` cases, and the capability-driven gate (the clean `supports_op`-authoritative version of 0030). This is independently mergeable and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed that way). 2. PR #2 - Metal kernels + wiring. 3. PR #3 - SYCL kernels + wiring. 4. PR #4 - Vulkan kernels + wiring. Do NOT bundle the backends: each needs its own hardware to validate `test-backend-ops`, reviewers are backend-specialized, and a regression in one must not block the others. ### Value x effort ranking (which backend first) | backend | user base / value | author effort | bit-exact difficulty | net rank | |---------|----------------------------|---------------|----------------------|----------| | Metal | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32 simdgroup) | **1st** | | SYCL | LOW-MED (Intel GPU) | LOWEST (near-verbatim CUDA mirror) | LOW | **2nd** | | Vulkan | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | **3rd** | Recommendation: **Metal first.** It banks the biggest user-facing decode win at medium effort, the base GDN + conv kernels already exist, and Apple's fixed simdgroup width makes bit-exactness the simplest. **SYCL second** as a cheap, nearly mechanical follow-on (the port is a line-for-line CUDA mirror, so it is low-cost insurance even though the Intel-GPU audience is smaller). **Vulkan last** as the high-effort / high-breadth capstone - it reaches the widest hardware (AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the existing variant matrix, the subgroup-width variance, and the per-vendor validation burden make it the right capstone once the pattern is proven on Metal + SYCL. A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR (both are register-snapshot ports with no shader-gen step) and treat Vulkan as a separate later effort. -------------------------------------------------------------------------------- ## 5. Summary - GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL (the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all three today via the non-fused path; Layer-2 is about the decode SPEEDUP. - Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids read (OP B) to the existing GDN kernel, write ONE new conv-update kernel (OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten supports_op + the op-handler branch + (Vulkan) the pipeline/push-constant/ descriptor wiring. The builders, CPU refs, model graph and tests are shared and already done. - Bit-exactness is feasible everywhere and per-backend by construction (the fusions redirect addresses, not the f32 reduction order); `test-backend-ops` (backendX-vs-CPU) is the gate. - Sequence: ops-first PR (incl. the capability-driven replacement for 0030's name allow-list), then Metal, then SYCL, then Vulkan.