docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Metal/Vulkan/SYCL)

Source-only analysis of what it would take to give the gated-DeltaNet decode
fusions (0018 in-place state write-back, 0019 fused recurrent-state gather,
0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on
the non-CUDA compute backends, so the patch-series decode win extends past
CUDA-family hardware.

Key findings:
- The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist
  upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is
  stale). The Qwen3.6 hybrids run on all three today via the non-fused path;
  Layer-2 is the decode SPEEDUP, not enabling the model to run.
- Per backend the new work is only the FUSION plumbing: redirect the GDN state
  write (in-place), add the ids read, write one new conv-update kernel + its
  ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan)
  pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and
  test-backend-ops cases are shared and already done.
- Bit-exactness is feasible per backend by construction (the fusions redirect
  addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU)
  is the gate.
- The 0030 name allow-list should become capability-driven (make supports_op
  authoritative for the discriminated src slots).
- Ranked: ops-first PR, then Metal (best value for effort, fixed simdgroup =
  simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to
  author), then Vulkan (widest hardware reach but the shader-gen + variant
  matrix + subgroup variance make it the capstone).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-27 12:11:24 +00:00
parent 4a9a1dd247
commit 984c8fcbea


@@ -0,0 +1,337 @@
# Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL
Source-only analysis (no GPU, no build) of what it would take to give the
gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute
backends, so the patch-series decode win extends past CUDA-family hardware.
In our changeset (patches 0018-0030) these fusions ship with CUDA native kernels
+ CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan /
SYCL (a CPU-fallback fused op would regress via the device round-trip, and a
backend that ran the plain op on the discriminated node would silently
miscompute). "Layer 2" is the upstream work that adds the missing native kernels.
This doc was written against the ggml backend trees in
`backend/cpp/llama-cpp-paged-dev` (upstream base #24732, one commit OLDER than the
series pin `c299a92c` #25045, with only the two paged-KV patches applied - neither
touches GDN/SSM). So every "kernel already exists" statement below is a
conservative lower bound: the pin has at least these kernels.
--------------------------------------------------------------------------------
## 0. Headline finding (correct a stale assumption first)
The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel
upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there."
**That is now stale.** All three backends already carry the BASE compute ops:
| op | Metal | Vulkan | SYCL |
|------------------------|------------------------------------|------------------------------------------|---------------------------------|
| GGML_OP_GATED_DELTA_NET| `kernel_gated_delta_net_impl` (f32, NSG 1/2/4) | `gated_delta_net.comp` (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | `gated_delta_net.cpp` (`launch_gated_delta_net<KDA,keep_rs>`) |
| GGML_OP_SSM_CONV | `kernel_ssm_conv_f32_f32` (+ `_4`, + batched) | `ssm_conv.comp` (+ APPLY_BIAS, APPLY_SILU specialization consts) | `ssm_conv.cpp` (`kernel_ssm_conv`) |
| GGML_OP_SSM_SCAN | yes | `ssm_scan.comp` (mamba2) | `ssm_scan.cpp` (mamba2) |
Verified: Vulkan `gated_delta_net.comp` was last touched at the upstream base
commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on
Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today
(via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is
the decode SPEEDUP from the fusions, NOT enabling the model to run at all.
Consequence: the "partly there" status of GDN support holds on every backend,
not just Metal. What is still missing per backend is only the FUSION plumbing
(in-place write-back target, the ids gather read, and the conv-update kernel),
a materially smaller scope than "port GDN from scratch."
--------------------------------------------------------------------------------
## 1. Per-op semantics (the four fusions to port)
All four reuse an existing GGML_OP enum with extra `src[]` slots as a
discriminator; none adds a new enum value. f32 throughout. The arithmetic core
is IDENTICAL to the upstream non-fused op; only the read source and/or the write
target are redirected. That single fact drives the whole bit-exactness story
(section 3).
### OP A - `ggml_gated_delta_net_inplace` (patch 0018)
- Enum `GGML_OP_GATED_DELTA_NET`, discriminated by a non-null `src[6]` =
`state_dst` (a contiguous `[S_v*S_v*H, n_seqs]` view into the recurrent-state
cache at `kv_head`). K == 1 only.
- Semantics: run the standard GDN recurrence, but write the FINAL recurrent state
directly into `state_dst` instead of appending it to the op output. The op
output then carries only the attention scores. Removes the per-layer per-step
~full-state D2D copy-back (the 0018 win).
- Race (in-place read == write): each (seq, head) block owns a disjoint cache
slot. The kernel loads the whole prior state `s0` into per-thread registers
(`s_shard` on CUDA, `ls[NSG]` on Metal, the column shard on Vulkan/SYCL)
BEFORE the ring write, so reading and writing the same slot is safe.
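
The snapshot-before-write argument can be checked on the host with a toy model. This is a minimal sketch, not the kernel: `S_V`, `update_block` and the decay-plus-rank-1 update rule are hypothetical stand-ins chosen for illustration, and only the access pattern (read the whole column into a local array, then write) mirrors what the text describes.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

constexpr int S_V = 4; // hypothetical state width, for illustration only

// One (seq, head) block. The update rule (decay plus a rank-1 add) is a
// placeholder and NOT the real GDN arithmetic; only the access pattern matters.
void update_block(const float * s_prev, float * s_out,
                  const float * k, const float * v, float g) {
    for (int col = 0; col < S_V; ++col) {
        float ls[S_V]; // stands in for the per-thread register shard
        for (int i = 0; i < S_V; ++i) {
            ls[i] = s_prev[col * S_V + i]; // every read of this column first
        }
        for (int i = 0; i < S_V; ++i) {
            ls[i] = g * ls[i] + k[i] * v[col];
        }
        for (int i = 0; i < S_V; ++i) {
            s_out[col * S_V + i] = ls[i]; // may alias s_prev (in-place case)
        }
    }
}

// True when writing into the cache slot itself gives the same bytes as
// writing into a separate output buffer.
bool inplace_matches_out_of_place() {
    std::vector<float> cache(S_V * S_V), appended(S_V * S_V);
    for (int i = 0; i < S_V * S_V; ++i) {
        cache[i] = 0.25f * float(i);
    }
    // Inputs chosen so every product and sum is exact in f32.
    const float k[S_V] = {1.0f, -2.0f, 0.5f, 3.0f};
    const float v[S_V] = {0.5f, 0.25f, 2.0f, 1.0f};

    update_block(cache.data(), appended.data(), k, v, 0.5f); // separate output
    update_block(cache.data(), cache.data(),    k, v, 0.5f); // in place
    return std::memcmp(cache.data(), appended.data(),
                       cache.size() * sizeof(float)) == 0;
}
```

Calling `inplace_matches_out_of_place()` compares the two write targets byte for byte; a port that wrote a column before finishing its reads would fail the same kind of check.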
### OP B - `ggml_gated_delta_net_inplace_ids` (patch 0019)
- Adds `src[5]` = FULL state cache `[S_v,S_v,H,n_rs_slots]`, `src[7]` = `ids`
(I32, per-seq source slot == the recurrent-state `s_copy`), `op_param[1]` =
`rs_head` (destination base slot). Still has the OP-A `src[6]` in-place target.
- Semantics: read each sequence's prior state directly from `cache[ids[seq]]`
(mirrors `ggml_ssm_scan`'s ids source), eliminating the `ggml_get_rows`
materialization. Combined with OP A the op now reads AND writes the cache in
place.
- Race: identity sequences (`ids[s] == rs_head + s`, the steady AR-decode case)
read s0 in place from the destination slot (safe via the register snapshot
above). Non-identity sequences (reorder / rs_zero remap) are first copied by a
TINY separate gather kernel (`gdn_gather_nonident`, one block/seq) into a
DISJOINT scratch that the recurrence then reads, so the recurrence never reads
a slot another block is writing. Value-preserving memcpy -> bit-identical to
the get_rows path.
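
The identity / non-identity split above can be sketched on the host as two steps: a gather of the remapped sequences into scratch, then a per-sequence read-base select. The flat slot layout, the function names and the parameter list are assumptions made for this illustration; they are not the signatures used by the patches.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed flat layout: slot i occupies cache[i*slot_elems, (i+1)*slot_elems).

// Step 1 (stand-in for the tiny gather kernel): copy only the NON-identity
// sequences' source slots into a disjoint scratch buffer.
void gather_nonident(const float * cache, const int32_t * ids, int n_seqs,
                     int rs_head, int slot_elems, float * scratch) {
    for (int s = 0; s < n_seqs; ++s) {
        if (ids[s] != rs_head + s) {
            std::memcpy(scratch + s * slot_elems,
                        cache + ids[s] * slot_elems,
                        slot_elems * sizeof(float));
        }
    }
}

// Step 2: the per-sequence read-base select used by the recurrence.
const float * read_base(const float * cache, const float * scratch,
                        const int32_t * ids, int s, int rs_head,
                        int slot_elems) {
    const bool identity = ids[s] == rs_head + s;
    return identity ? cache + (rs_head + s) * slot_elems // read in place
                    : scratch + s * slot_elems;          // read gathered copy
}

// True when every selected read base holds the same bytes that a
// get_rows-style materialization of cache[ids[s]] would produce.
bool ids_read_matches_get_rows() {
    const int slot_elems = 3, n_slots = 5, n_seqs = 3, rs_head = 1;
    std::vector<float> cache(n_slots * slot_elems);
    for (int i = 0; i < n_slots * slot_elems; ++i) {
        cache[i] = float(i);
    }
    const int32_t ids[3] = {1, 4, 3}; // seq 0 and 2 identity, seq 1 remapped
    std::vector<float> scratch(n_seqs * slot_elems, -1.0f);
    gather_nonident(cache.data(), ids, n_seqs, rs_head, slot_elems,
                    scratch.data());
    for (int s = 0; s < n_seqs; ++s) {
        const float * got  = read_base(cache.data(), scratch.data(), ids, s,
                                       rs_head, slot_elems);
        const float * want = cache.data() + ids[s] * slot_elems;
        if (std::memcmp(got, want, slot_elems * sizeof(float)) != 0) {
            return false;
        }
    }
    return true;
}
```

The gather runs to completion before any recurrence block writes, which is what keeps the remapped reads disjoint from the destination slots.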
### OP C - `ggml_ssm_conv_update_inplace` (patch 0021)
- Enum `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]` =
`conv_state_dst` (`[(K-1)*channels, n_seqs]` in-place ring view).
`src[0]` = conv_states `[K-1, channels, n_seqs]`, `src[1]` = conv_kernel
`[K, channels]`, `src[2]` = x_cur `[channels, 1, n_seqs]`. `op_param[0]` =
fuse_silu.
- Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the
width-K conv window in registers from the K-1 cached taps + the current token,
compute the depthwise conv with the SAME ascending-tap FMA order as plain
`ssm_conv` (`tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` to match plain conv's
`sumf += b` with b==0), optionally fold SiLU, write the conv output
`[channels,1,n_seqs]`, and write the 1-token-shifted ring state back in place.
Replaces the decode conv chain (transpose + concat + conv + silu + ring cpy)
with a single op.
- Race: read source (gathered taps) and write target (cache view) are disjoint
buffers -> race-free by construction, no ids/identity logic.
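
A host-side reference of the OP-C semantics, as described above, might look like the following. It is a sketch under stated assumptions: the function name, `MAX_K` and the flat indexing (tap index fastest, matching the `[K-1, channels, n_seqs]` shape) are illustrative, and the accumulation follows the `tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` description rather than any particular backend's code.

```cpp
#include <cassert>
#include <cmath>

constexpr int MAX_K = 8; // hypothetical upper bound on the conv width

// conv_states: [K-1, channels, n_seqs]   conv_kernel: [K, channels]
// x_cur:       [channels, 1, n_seqs]     out:         [channels, 1, n_seqs]
// state_dst:   [(K-1)*channels, n_seqs]  (same flat indexing as conv_states)
void ssm_conv_update_ref(const float * conv_states, const float * conv_kernel,
                         const float * x_cur, float * out, float * state_dst,
                         int K, int channels, int n_seqs, bool fuse_silu) {
    assert(K >= 1 && K <= MAX_K);
    for (int s = 0; s < n_seqs; ++s) {
        for (int c = 0; c < channels; ++c) {
            const int base = (s * channels + c) * (K - 1);

            // Assemble the width-K window locally: K-1 cached taps + x_cur.
            float window[MAX_K];
            for (int t = 0; t < K - 1; ++t) {
                window[t] = conv_states[base + t];
            }
            window[K - 1] = x_cur[s * channels + c];

            // Depthwise conv in ascending tap order, then the +0.0f.
            float sumf = window[0] * conv_kernel[c * K];
            for (int t = 1; t < K; ++t) {
                sumf += window[t] * conv_kernel[c * K + t];
            }
            sumf += 0.0f;
            if (fuse_silu) {
                sumf = sumf / (1.0f + std::exp(-sumf)); // x * sigmoid(x)
            }
            out[s * channels + c] = sumf;

            // Write the 1-token-shifted ring state: drop tap 0, append x_cur.
            for (int t = 0; t < K - 1; ++t) {
                state_dst[base + t] = window[t + 1];
            }
        }
    }
}

// Single-channel check: 1*0.5 + 2*0.25 + 3*2 = 7, new ring state = {2, 3}.
bool conv_update_demo() {
    const float states[2] = {1.0f, 2.0f};
    const float kernel[3] = {0.5f, 0.25f, 2.0f};
    const float x[1]      = {3.0f};
    float out[1]          = {0.0f};
    float new_state[2]    = {0.0f, 0.0f};
    ssm_conv_update_ref(states, kernel, x, out, new_state, 3, 1, 1, false);
    return out[0] == 7.0f && new_state[0] == 2.0f && new_state[1] == 3.0f;
}
```

Because the window is held in a local array, the ring write at the end never affects the taps used by the conv, whether or not `state_dst` aliases `conv_states`.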
### OP D - `ggml_ssm_conv_update_inplace_ids` (patch 0028)
- Same enum, discriminated by a non-null `src[4]` = `ids`; `src[0]` becomes the
FULL conv cache `[K-1, channels, n_cells]`; `op_param[1]` = rs_head.
- Semantics: gather-free conv-update - read each sequence's prior taps from
`cache[ids[s]]` in-kernel (no get_rows). Identity reads in place from
`conv_state_dst`; non-identity gathered into a disjoint scratch first by a tiny
`ssm_conv_gather_nonident` kernel. The window is copied to a local array
BEFORE the (possibly aliasing) ring write so the identity read==write slot is
correct. Bit-identical to get_rows + OP C.
### Net new kernels vs reuse, per op
- OP A: NOT a new compute kernel - a write-target redirection of the EXISTING
GDN kernel + 1 buffer binding + a supports_op/op-handler branch.
- OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) +
1 ids binding + rs_head param + 1 tiny gather kernel.
- OP C: a GENUINELY NEW kernel on each backend. The existing `ssm_conv` computes
a windowed reduction over a PRE-concatenated input; it does not assemble the
window from cached taps + the current token, fold silu, or write the shifted
ring state. This is the largest net-new piece.
- OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1
tiny conv gather kernel.
The `ggml.h` / `ggml.c` builders, the CPU reference kernels, the model-graph
emission (`delta-net-base.cpp`, qwen35*), and the `test-backend-ops` cases are
SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend
work is the kernel(s) + the backend wiring.
--------------------------------------------------------------------------------
## 2. Per-backend: authoring model, effort, gotchas, wiring
### 2.1 Metal (MSL)
Authoring model: `.metal` MSL source (`ggml-metal.metal`), function-constant
specialization (e.g. `FC_GATED_DELTA_NET`), kernels templated on `NSG`; host
glue split across `ggml-metal-ops.cpp` (`ggml_metal_op_*` encode), the pipeline
lookup in `ggml-metal-device.cpp`/`.m`, the kargs struct in `ggml-metal-impl.h`,
and `supports_op` in `ggml-metal-device.m`. Threadgroup model; Apple GPU
simdgroup width is a FIXED 32, `simd_sum` for the per-column reduce.
Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist
and are ergonomic to extend. OP A is a write-base redirect of the existing
`kernel_gated_delta_net_impl` (its tail already does
`dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j]` after
loading `ls[]` into registers - just point `dst_state` at the `state_dst` buffer
and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu
ssm_conv variant today - only plain + `_4` + batched - so the silu-fold and ring
write are both new). Host glue spans 3-4 files.
Gotchas:
- In-place race: the existing kernel ALREADY snapshots the state column into
`ls[NSG]` registers before writing, so OP A/B are safe with no barrier; OP C/D
must mirror the `float window[K]` local-copy-before-write that CPU/CUDA use.
- Discriminated SSM_CONV: `supports_op` for `GGML_OP_SSM_CONV` currently returns
`has_simdgroup_reduction` with NO check of `src[3]`/`src[4]`; GDN returns
`has_simdgroup_reduction && src[2]->ne[0] % 32 == 0` with NO check of
`src[6]`/`src[7]`. Both must be tightened (accept the discriminated variant
only once the kernel exists) AND `ggml_metal_op_ssm_conv` /
`ggml_metal_op_gated_delta_net` must branch on the extra src to pick the kernel.
- Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three -
the fused variant only redirects addresses, so it is bit-identical to Metal's
own non-fused path by construction (the conv per-channel FMA needs the exact
ascending order + the `+0.0f`).
- The kargs struct grows by the `state_dst` / `ids` / `rs_head` fields; a new
pipeline name (or a function-constant branch) distinguishes the variants.
### 2.2 Vulkan (GLSL .comp -> SPIR-V)
Authoring model: GLSL `.comp` in `vulkan-shaders/`, compiled at build time by
`vulkan-shaders-gen` into embedded SPIR-V byte arrays (`gated_delta_net_f32_data`
etc.); pipeline creation in `ggml-vulkan.cpp` declares the binding count +
push-constant size; a push-constant struct per op; host dispatch `ggml_vk_*`
binds subbuffers; `supports_op` in the device support function. Subgroup size
VARIES by vendor (NVIDIA 32, AMD 32/64, Intel 8/16/32).
Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as
Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a
new `ggml_vk_create_pipeline` registration with an explicit binding count and
push-constant size, a new/extended push-constant struct (add `rs_head`), and
GROWING the descriptor binding set from the current 7 (`src[0..5]` + dst) to 8-9
(`state_dst`, `ids`). The GDN host dispatch hardcodes a 6-src bind loop and the
pipeline is created with `"main", 7, ...` - both must change.
Gotchas:
- Subgroup variance interacts with the EXISTING variant matrix: the GDN comp
already ships shmem / cluster / nocluster variants keyed on subgroup size and
relies on `S_V % COLS_PER_WG == 0`. The OP-A/B read/write redirect must be
applied across ALL of those variants, and re-validated per vendor.
- In-place race: GLSL must read the full column shard into local registers before
the ring write (same pattern); confirm the SPIR-V memory model is not relied on
for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)).
OP C/D need the explicit window-to-local copy.
- Discriminated SSM_CONV: `supports_op` returns `op->src[0]->type == F32` with NO
discriminator check; GDN loops `src[0..5]` F32 with NO `src[6]`/`src[7]` check.
Both must be tightened. This is the backend where the 0030 hazard is most
concrete (a present plain-conv kernel + a permissive supports_op = silent
miscompute) - Vulkan is the exact case 0030 was written for.
- conv-update is per-channel (one invocation per channel) so it is
subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden.
- Vulkan's `ssm_conv.comp` ALREADY has APPLY_SILU + APPLY_BIAS specialization
constants, so the silu-fold half of OP C is partly precedented here (unlike
Metal); the ring write-back + tap-window assembly are still new.
### 2.3 SYCL (single-source DPC++)
Authoring model: plain C++ `.cpp`/`.hpp` per op (`gated_delta_net.cpp`,
`ssm_conv.cpp`); a SYCL `queue.parallel_for` over an `nd_range` with
`reqd_sub_group_size(WARP_SIZE)`; sub-group reductions (`warp_reduce_sum`);
`supports_op` in `ggml-sycl.cpp`. NO separate shader-compile step (single
source).
Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are
near-VERBATIM mirrors of the CUDA ones (`launch_gated_delta_net<KDA,keep_rs>`,
`s_shard`, `curr_state`, `state = dst + attn_score_elems`, `warp_reduce_sum`) -
a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port
almost line-for-line: add the `state_dst` param, the `ids`/`rs_head` params, the
read-base select, the two tiny gather kernels, and the new conv-update kernel.
No pipeline/push-constant/binding bookkeeping.
Gotchas:
- In-place race: the `s_shard[]` / window arrays are per-work-item private, so
the register-snapshot-before-write pattern carries over directly. Safe.
- Discriminated SSM_CONV: `supports_op` checks `src[0]`/`src[1]` F32 with NO
discriminator check; GDN returns a BARE `true` (the MOST permissive, so the
hazard is worst here). Both must be tightened, and `ggml_sycl_op_ssm_conv` /
`ggml_sycl_op_gated_delta_net` must branch on the extra src.
- Bit-exactness: `WARP_SIZE` is compile-fixed (Intel sub-group 8/16/32), same
situation as CUDA; the fused variant matches SYCL's own non-fused path by
construction. conv-update is per-channel -> subgroup-agnostic.
### 2.4 Common wiring (all three) + the 0030 emission-gate change
Per backend, four wiring touch-points beyond the kernel body:
1. `supports_op`: tighten the `GGML_OP_SSM_CONV` and `GGML_OP_GATED_DELTA_NET`
entries so the discriminated/extra-src node is reported supported ONLY when
the new kernel handles it (and rejected otherwise, instead of today's
silently-true-for-the-plain-kernel).
2. op handler: branch on `src[3]`/`src[4]` (conv) and `src[6]`/`src[7]` (GDN) to
dispatch the fused kernel.
3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor
bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions).
4. The patch-0030 gate in `src/llama-context.cpp`.
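
Touch-points 1 and 2 can be sketched together: classify the node by its extra `src[]` slots, then let `supports_op` accept a discriminated variant only when the backend has a kernel for it. `Node`, `Caps`, `classify` and this `supports_op` are hypothetical stand-ins for illustration; the real `ggml_tensor` and the per-backend support functions carry many more checks.

```cpp
#include <cassert>

enum class Op { SSM_CONV, GATED_DELTA_NET };
enum class Variant { PLAIN, INPLACE, INPLACE_IDS };

// Minimal stand-in for a graph node; nullptr means "slot not set".
struct Node {
    Op op;
    const void * src[8];
};

// Discriminator: src[3]/src[4] for the conv op, src[6]/src[7] for GDN.
// The ids variants also carry the in-place target, so test ids first.
Variant classify(const Node & n) {
    const int dst_slot = n.op == Op::SSM_CONV ? 3 : 6;
    const int ids_slot = n.op == Op::SSM_CONV ? 4 : 7;
    if (n.src[ids_slot] != nullptr) return Variant::INPLACE_IDS;
    if (n.src[dst_slot] != nullptr) return Variant::INPLACE;
    return Variant::PLAIN;
}

// Which fused kernels a backend has actually implemented.
struct Caps {
    bool conv_inplace, conv_inplace_ids, gdn_inplace, gdn_inplace_ids;
};

// Tightened check: a discriminated node is supported only if its kernel exists.
bool supports_op(const Caps & caps, const Node & n) {
    const bool conv = n.op == Op::SSM_CONV;
    switch (classify(n)) {
        case Variant::PLAIN:       return true; // existing checks elided
        case Variant::INPLACE:     return conv ? caps.conv_inplace
                                               : caps.gdn_inplace;
        case Variant::INPLACE_IDS: return conv ? caps.conv_inplace_ids
                                               : caps.gdn_inplace_ids;
    }
    return false;
}

bool dispatch_demo() {
    static const int marker = 0; // any non-null address
    Node conv_c = {Op::SSM_CONV, {}};
    conv_c.src[3] = &marker;                       // OP C shape
    Node gdn_b = {Op::GATED_DELTA_NET, {}};
    gdn_b.src[6] = &marker;
    gdn_b.src[7] = &marker;                        // OP B shape
    const Caps none      = {false, false, false, false};
    const Caps conv_only = {true,  false, false, false};
    return classify(conv_c) == Variant::INPLACE
        && classify(gdn_b) == Variant::INPLACE_IDS
        && !supports_op(none, conv_c)
        && supports_op(conv_only, conv_c)
        && !supports_op(conv_only, gdn_b);
}
```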
The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg
name is not `"CUDA"`/`"ROCm"`/`"MUSA"` forces `fused_gdn_ar = fused_gdn_ch =
auto_fgdn = false`. As each backend gains kernels this must become capability-
driven, in one of two ways:
- minimal: add the backend's reg name (e.g. `"Metal"`) to the allow-list once its
kernels + tightened supports_op ship; OR
- clean (recommended upstream form): DELETE the name allow-list and make
`supports_op` authoritative - have the `auto_fgdn` resolution probe
`ggml_backend_dev_supports_op` on a representative node that carries the
discriminated `src[]` slots. Then routing falls out of the normal scheduler
fallback and no backend name is ever hard-coded. This also fixes 0030's stated
weakness that the upstream `auto_fgdn` check only inspects GATED_DELTA_NET
nodes and covers the discriminated SSM_CONV only incidentally.
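
The difference between the two gate forms can be modelled in a few lines. This is a sketch with mock types only: `Device`, `gate_by_name` and `gate_by_capability` are hypothetical, and the `fused_supported` flag stands in for the result of probing `ggml_backend_dev_supports_op` on a representative discriminated node, which is not called here.

```cpp
#include <cassert>
#include <string>

// Mock device record; fused_supported stands in for the supports_op probe.
struct Device {
    std::string reg_name;
    bool is_cpu;
    bool fused_supported;
};

// Today's form: a hard allow-list keyed on the backend's registry name.
bool gate_by_name(const Device & d) {
    if (d.is_cpu) return true;
    return d.reg_name == "CUDA" || d.reg_name == "ROCm" || d.reg_name == "MUSA";
}

// Capability-driven form: no names, only what the device reports it can run.
bool gate_by_capability(const Device & d) {
    return d.is_cpu || d.fused_supported;
}

bool gate_demo() {
    const Device cuda   = {"CUDA",   false, true};
    const Device metal  = {"Metal",  false, true};  // kernels landed (assumed)
    const Device vulkan = {"Vulkan", false, false}; // kernels not yet present
    return gate_by_name(cuda) && gate_by_capability(cuda)
        && !gate_by_name(metal) && gate_by_capability(metal)
        && !gate_by_name(vulkan) && !gate_by_capability(vulkan);
}
```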
--------------------------------------------------------------------------------
## 3. Bit-exactness per backend (the md5 gate question)
Feasible on ALL THREE, and not actually constraining, because of how the gate is
scoped:
- The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has
its own f32 reduction order (Metal `simd_sum`, Vulkan subgroup reduce, SYCL
`warp_reduce_sum`) that differs from CUDA's and from CPU's. There is no
cross-backend md5 and none is expected.
- The relevant per-backend invariant is: the FUSED variant must equal that
backend's OWN non-fused path. The fusions change only the read source
(gather -> indexed read; the gather is a value-preserving memcpy) and the write
target (appended output -> in-place cache slot). They do NOT touch the
per-column FMA/reduce order. So the fused op is bit-identical to the
non-fused op on the same backend BY CONSTRUCTION.
- Two arithmetic details each port MUST preserve exactly: (a) the conv
ascending-tap order plus the `+0.0f` that matches plain `ssm_conv`'s
`sumf += b` with b==0; (b) the existing GDN per-column subgroup reduce (do not
re-order it). Get those right and `test-backend-ops` (backendX-vs-CPU, already
registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS /
GATED_DELTA_NET) is the per-backend gate.
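
Both arithmetic details are observable at the bit level, which is why a port cannot treat them as cosmetic. The sketch below uses hypothetical helper names and shows two general IEEE-754 facts under the default rounding mode: a trailing `+0.0f` turns a negative-zero sum into positive zero, and re-associating an f32 sum can change the result. It makes no claim about which inputs occur in practice.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw bit pattern of an f32 value.
uint32_t bits(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// (a) The trailing +0.0f: (-1 * 0) is -0.0f, and -0.0f + 0.0f is +0.0f.
uint32_t sum_bits(bool add_zero) {
    volatile float tap = -1.0f, w = 0.0f; // volatile: no compile-time folding
    float sumf = tap * w;                 // negative zero
    if (add_zero) {
        sumf += 0.0f;                     // becomes positive zero
    }
    return bits(sumf);
}

// (b) Reduction order: the same three addends, two association orders.
bool order_changes_bits() {
    volatile float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    volatile float ab = a + b;   // exactly 0
    volatile float bc = b + c;   // rounds back to -1e8 (ulp there is 8)
    const float left  = ab + c;  // 1
    const float right = a + bc;  // 0
    return bits(left) != bits(right);
}
```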
--------------------------------------------------------------------------------
## 4. Upstream path and ranked recommendation
### Ops-first, then one PR per backend (NOT one big PR)
Recommended sequence:
1. PR #1 - OPS (already essentially done, upstreamable as-is): the `ggml.h`/
`ggml.c` builders, the CPU reference kernels, the CUDA kernels, the
`test-backend-ops` cases, and the capability-driven gate (the clean
`supports_op`-authoritative version of 0030). This is independently mergeable
and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed
that way).
2. PR #2 - Metal kernels + wiring.
3. PR #3 - SYCL kernels + wiring.
4. PR #4 - Vulkan kernels + wiring.
Do NOT bundle the backends: each needs its own hardware to validate
`test-backend-ops`, reviewers are backend-specialized, and a regression in one
must not block the others.
### Value x effort ranking (which backend first)
| backend | user base / value | author effort | bit-exact difficulty | net rank |
|---------|----------------------------|---------------|----------------------|----------|
| Metal | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32 simdgroup) | **1st** |
| SYCL | LOW-MED (Intel GPU) | LOWEST (near-verbatim CUDA mirror) | LOW | **2nd** |
| Vulkan | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | **3rd** |
Recommendation: **Metal first.** It banks the biggest user-facing decode win at
medium effort, the base GDN + conv kernels already exist, and Apple's fixed
simdgroup width makes bit-exactness the simplest. **SYCL second** as a cheap,
nearly mechanical follow-on (the port is a near line-for-line CUDA mirror, so it
is low-cost insurance even though the Intel-GPU audience is smaller). **Vulkan
last** as the high-effort / high-breadth capstone: it reaches the widest hardware
(AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the
existing variant matrix, the subgroup-width variance, and the per-vendor
validation burden are best taken on once the pattern is proven on Metal + SYCL.
A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR
(both are register-snapshot ports with no shader-gen step) and treat Vulkan as a
separate later effort.
--------------------------------------------------------------------------------
## 5. Summary
- GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL
(the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all
three today via the non-fused path; Layer-2 is about the decode SPEEDUP.
- Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids
read (OP B) to the existing GDN kernel, write ONE new conv-update kernel
(OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten
supports_op + the op-handler branch + (Vulkan) the pipeline/push-constant/
descriptor wiring. The builders, CPU refs, model graph and tests are shared and
already done.
- Bit-exactness is feasible everywhere and per-backend by construction (the
fusions redirect addresses, not the f32 reduction order); `test-backend-ops`
(backendX-vs-CPU) is the gate.
- Sequence: ops-first PR (incl. the capability-driven replacement for 0030's
name allow-list), then Metal, then SYCL, then Vulkan.