docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Metal/Vulkan/SYCL)

Source-only analysis of what it would take to give the gated-DeltaNet decode fusions (0018 in-place state write-back, 0019 fused recurrent-state gather, 0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on the non-CUDA compute backends, so the patch-series decode win extends past CUDA-family hardware.

Key findings:
- The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is stale). The Qwen3.6 hybrids run on all three today via the non-fused path; Layer-2 is the decode SPEEDUP, not enabling the model to run.
- Per backend the new work is only the FUSION plumbing: redirect the GDN state write (in-place), add the ids read, write one new conv-update kernel + its ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan) pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and test-backend-ops cases are shared and already done.
- Bit-exactness is feasible per backend by construction (the fusions redirect addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU) is the gate.
- The 0030 name allow-list should become capability-driven (make supports_op authoritative for the discriminated src slots).
- Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup = simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to author), then Vulkan (widest hardware reach but the shader-gen + variant matrix + subgroup variance make it the capstone).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:57:14 -04:00 · 2026-06-27 12:11:24 +00:00
parent 4a9a1dd247
commit 984c8fcbea
1 changed files with 337 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/UPSTREAM_LAYER2_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/UPSTREAM_LAYER2_SCOPE.md
@@ -0,0 +1,337 @@
+# Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL
+
+Source-only analysis (no GPU, no build) of what it would take to give the
+gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute
+backends, so the patch-series decode win extends past CUDA-family hardware.
+
+In our changeset (patches 0018-0030) these fusions ship with native CUDA kernels
+and CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan /
+SYCL (a CPU-fallback fused op would regress via the device round-trip, and a
+backend that ran the plain op on the discriminated node would silently
+miscompute). "Layer 2" is the upstream work that adds the missing native kernels.
+
+This doc was written against the ggml backend trees in
+`backend/cpp/llama-cpp-paged-dev` (upstream base #24732, one commit OLDER than the
+series pin `c299a92c` #25045, with only the two paged-KV patches applied - neither
+touches GDN/SSM). So every "kernel already exists" statement below is a
+conservative lower bound: the pin has at least these kernels.
+
+--------------------------------------------------------------------------------
+## 0. Headline finding (correct a stale assumption first)
+
+The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel
+upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there."
+**That is now stale.** All three backends already carry the BASE compute ops:
+
+| op                     | Metal                              | Vulkan                                   | SYCL                            |
+|------------------------|------------------------------------|------------------------------------------|---------------------------------|
+| GGML_OP_GATED_DELTA_NET| `kernel_gated_delta_net_impl` (f32, NSG 1/2/4) | `gated_delta_net.comp` (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | `gated_delta_net.cpp` (`launch_gated_delta_net<KDA,keep_rs>`) |
+| GGML_OP_SSM_CONV       | `kernel_ssm_conv_f32_f32` (+ `_4`, + batched) | `ssm_conv.comp` (+ APPLY_BIAS, APPLY_SILU specialization consts) | `ssm_conv.cpp` (`kernel_ssm_conv`) |
+| GGML_OP_SSM_SCAN       | yes                                | `ssm_scan.comp` (mamba2)                 | `ssm_scan.cpp` (mamba2)         |
+
+Verified: Vulkan `gated_delta_net.comp` was last touched at the upstream base
+commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on
+Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today
+(via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is
+the decode SPEEDUP from the fusions, NOT enabling the model to run at all.
+
+Consequence: the GDN compute op is already "partly there" on every backend,
+not just Metal. What is still missing per backend is only the FUSION plumbing
+(in-place write-back target, the ids gather read, and the conv-update kernel) -
+a materially smaller scope than "port GDN from scratch."
+
+--------------------------------------------------------------------------------
+## 1. Per-op semantics (the four fusions to port)
+
+All four reuse an existing GGML_OP enum with extra `src[]` slots as a
+discriminator; none adds a new enum value. f32 throughout. The arithmetic core
+is IDENTICAL to the upstream non-fused op; only the read source and/or the write
+target are redirected. That single fact drives the whole bit-exactness story
+(section 3).
+
+### OP A - `ggml_gated_delta_net_inplace` (patch 0018)
+- Enum `GGML_OP_GATED_DELTA_NET`, discriminated by a non-null `src[6]` =
+  `state_dst` (a contiguous `[S_v*S_v*H, n_seqs]` view into the recurrent-state
+  cache at `kv_head`). K == 1 only.
+- Semantics: run the standard GDN recurrence, but write the FINAL recurrent state
+  directly into `state_dst` instead of appending it to the op output. The op
+  output then carries only the attention scores. Removes the per-layer per-step
+  ~full-state D2D copy-back (the 0018 win).
+- Race (in-place read == write): each (seq, head) block owns a disjoint cache
+  slot. The kernel loads the whole prior state `s0` into per-thread registers
+  (`s_shard` on CUDA, `ls[NSG]` on Metal, the column shard on Vulkan/SYCL)
+  BEFORE the ring write, so reading and writing the same slot is safe.
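The snapshot-before-write argument for OP A can be sketched on the CPU. This is a minimal illustration only: `gdn_block_step` and its update formula are hypothetical stand-ins for one (seq, head) block, not the upstream GDN arithmetic. What it shows is the access pattern: load the whole prior state into private storage, compute from that copy, then write, so the source and destination may be the same cache slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for one (seq, head) block of a gated delta-rule step.
// S is an n x n state, row-major (S[i*n + j]). The arithmetic is illustrative;
// only the load-everything-then-write pattern is the point.
// `src` and `dst` may point at the same slot (the in-place case).
void gdn_block_step(const float * src, float * dst, std::size_t n,
                    const float * k, const float * v, float g, float beta) {
    assert(n > 0);
    // 1) Snapshot the whole prior state into private storage ("registers").
    std::vector<float> s(src, src + n * n);

    // 2) Compute purely from the snapshot, 3) write the final state to dst.
    for (std::size_t j = 0; j < n; ++j) {
        float dot = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            dot += s[i * n + j] * k[i];  // fixed ascending reduction order
        }
        const float delta = beta * (v[j] - g * dot);
        for (std::size_t i = 0; i < n; ++i) {
            dst[i * n + j] = g * s[i * n + j] + k[i] * delta;
        }
    }
}
```

Calling it once with a separate output buffer and once with `dst == src` should give the same bits, which is the property the in-place variant relies on.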
+
+### OP B - `ggml_gated_delta_net_inplace_ids` (patch 0019)
+- Adds `src[5]` = FULL state cache `[S_v,S_v,H,n_rs_slots]`, `src[7]` = `ids`
+  (I32, per-seq source slot == the recurrent-state `s_copy`), `op_param[1]` =
+  `rs_head` (destination base slot). Still has the OP-A `src[6]` in-place target.
+- Semantics: read each sequence's prior state directly from `cache[ids[seq]]`
+  (mirrors `ggml_ssm_scan`'s ids source), eliminating the `ggml_get_rows`
+  materialization. Combined with OP A the op now reads AND writes the cache in
+  place.
+- Race: identity sequences (`ids[s] == rs_head + s`, the steady AR-decode case)
+  read s0 in place from the destination slot (safe via the register snapshot
+  above). Non-identity sequences (reorder / rs_zero remap) are first copied by a
+  TINY separate gather kernel (`gdn_gather_nonident`, one block/seq) into a
+  DISJOINT scratch that the recurrence then reads, so the recurrence never reads
+  a slot another block is writing. Value-preserving memcpy -> bit-identical to
+  the get_rows path.
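The identity / non-identity split described for OP B can be modelled on the host. The sketch below is an assumption-laden illustration: `select_read_bases` is a hypothetical name, the cache is a flat `std::vector<float>`, and the "tiny gather kernel" is an ordinary `memcpy` loop. In a real kernel the gather would have to complete before any block starts writing its destination slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side model of the OP-B read path. `cache` holds states of
// `slot_elems` floats each; `scratch` has room for one state per sequence.
// Returns, per sequence, the pointer the recurrence should read s0 from.
std::vector<const float *> select_read_bases(const std::vector<float> & cache,
                                             std::vector<float> & scratch,
                                             const std::vector<std::int32_t> & ids,
                                             std::int32_t rs_head,
                                             std::size_t slot_elems) {
    const std::size_t n_seqs = ids.size();
    assert(scratch.size() >= n_seqs * slot_elems);

    std::vector<const float *> bases(n_seqs);
    for (std::size_t s = 0; s < n_seqs; ++s) {
        assert(ids[s] >= 0);
        const float * src = cache.data() + static_cast<std::size_t>(ids[s]) * slot_elems;
        const bool identity = ids[s] == rs_head + static_cast<std::int32_t>(s);
        if (identity) {
            // Source slot == destination slot: read in place (safe only because
            // the recurrence snapshots the state before writing).
            bases[s] = src;
        } else {
            // Reorder / remap: copy to a disjoint scratch so the recurrence never
            // reads a slot that another block may be writing.
            float * dst = scratch.data() + s * slot_elems;
            std::memcpy(dst, src, slot_elems * sizeof(float));
            bases[s] = dst;
        }
    }
    return bases;
}
```

Because the non-identity path is a plain copy, the values the recurrence sees are the same bits a `get_rows`-style materialization would have produced.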
+
+### OP C - `ggml_ssm_conv_update_inplace` (patch 0021)
+- Enum `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]` =
+  `conv_state_dst` (`[(K-1)*channels, n_seqs]` in-place ring view).
+  `src[0]` = conv_states `[K-1, channels, n_seqs]`, `src[1]` = conv_kernel
+  `[K, channels]`, `src[2]` = x_cur `[channels, 1, n_seqs]`. `op_param[0]` =
+  fuse_silu.
+- Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the
+  width-K conv window in registers from the K-1 cached taps + the current token,
+  compute the depthwise conv with the SAME ascending-tap FMA order as plain
+  `ssm_conv` (`tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` to match plain conv's
+  `sumf += b` with b==0), optionally fold SiLU, write the conv output
+  `[channels,1,n_seqs]`, and write the 1-token-shifted ring state back in place.
+  Replaces the decode conv chain (transpose + concat + conv + silu + ring
+  cpy) with a single op.
+- Race: read source (gathered taps) and write target (cache view) are disjoint
+  buffers -> race-free by construction, no ids/identity logic.
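The OP C semantics can be written out as a small CPU routine. This is a sketch under stated assumptions, not the reference kernel from the patch: the function name `ssm_conv_update_sketch` is hypothetical, and the flat indexing is inferred from the shapes quoted above (fastest dimension first). It copies the window to local storage before writing, so it also tolerates `state_dst` aliasing `states`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU sketch of a decode conv-update (n_seq_tokens == 1). Assumed layouts:
//   states    [K-1, channels, n_seqs]  taps of (c, s) start at (s*channels + c)*(K-1)
//   weights   [K, channels]            taps of channel c start at c*K
//   x_cur/out [channels, 1, n_seqs]    element (c, s) at s*channels + c
//   state_dst same flat layout as states; may alias it.
void ssm_conv_update_sketch(const float * states, const float * weights,
                            const float * x_cur, float * out, float * state_dst,
                            std::size_t K, std::size_t channels, std::size_t n_seqs,
                            bool fuse_silu) {
    assert(K >= 2);
    std::vector<float> window(K);
    for (std::size_t s = 0; s < n_seqs; ++s) {
        for (std::size_t c = 0; c < channels; ++c) {
            const std::size_t base = (s * channels + c) * (K - 1);

            // Assemble the width-K window locally BEFORE any write.
            for (std::size_t t = 0; t + 1 < K; ++t) {
                window[t] = states[base + t];
            }
            window[K - 1] = x_cur[s * channels + c];

            // Depthwise conv, ascending tap order, then the bias-free "+0.0f".
            float sumf = 0.0f;
            for (std::size_t t = 0; t < K; ++t) {
                sumf += window[t] * weights[c * K + t];
            }
            sumf += 0.0f;
            if (fuse_silu) {
                sumf = sumf / (1.0f + std::exp(-sumf));  // x * sigmoid(x)
            }
            out[s * channels + c] = sumf;

            // Shift the ring by one token: drop the oldest tap, append x_cur.
            for (std::size_t t = 0; t + 1 < K; ++t) {
                state_dst[base + t] = window[t + 1];
            }
        }
    }
}
```

With K = 4, cached taps {1, 2, 3}, current token 4 and unit weights, the output is 10 and the stored taps become {2, 3, 4}.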
+
+### OP D - `ggml_ssm_conv_update_inplace_ids` (patch 0028)
+- Same enum, discriminated by a non-null `src[4]` = `ids`; `src[0]` becomes the
+  FULL conv cache `[K-1, channels, n_cells]`; `op_param[1]` = rs_head.
+- Semantics: gather-free conv-update - read each sequence's prior taps from
+  `cache[ids[s]]` in-kernel (no get_rows). Identity reads in place from
+  `conv_state_dst`; non-identity gathered into a disjoint scratch first by a tiny
+  `ssm_conv_gather_nonident` kernel. The window is copied to a local array
+  BEFORE the (possibly aliasing) ring write so the identity read==write slot is
+  correct. Bit-identical to get_rows + OP C.
+
+### Net new kernels vs reuse, per op
+- OP A: NOT a new compute kernel - a write-target redirection of the EXISTING
+  GDN kernel + 1 buffer binding + a supports_op/op-handler branch.
+- OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) +
+  1 ids binding + rs_head param + 1 tiny gather kernel.
+- OP C: a GENUINELY NEW kernel on each backend. The existing `ssm_conv` computes
+  a windowed reduction over a PRE-concatenated input; it does not assemble the
+  window from cached taps + the current token, fold silu, or write the shifted
+  ring state. This is the largest net-new piece.
+- OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1
+  tiny conv gather kernel.
+
+The `ggml.h` / `ggml.c` builders, the CPU reference kernels, the model-graph
+emission (`delta-net-base.cpp`, qwen35*), and the `test-backend-ops` cases are
+SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend
+work is the kernel(s) + the backend wiring.
+
+--------------------------------------------------------------------------------
+## 2. Per-backend: authoring model, effort, gotchas, wiring
+
+### 2.1 Metal (MSL)
+
+Authoring model: `.metal` MSL source (`ggml-metal.metal`), function-constant
+specialization (e.g. `FC_GATED_DELTA_NET`), kernels templated on `NSG`; host
+glue split across `ggml-metal-ops.cpp` (`ggml_metal_op_*` encode), the pipeline
+lookup in `ggml-metal-device.cpp`/`.m`, the kargs struct in `ggml-metal-impl.h`,
+and `supports_op` in `ggml-metal-device.m`. Threadgroup model; Apple GPU
+simdgroup width is a FIXED 32, `simd_sum` for the per-column reduce.
+
+Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist
+and are ergonomic to extend. OP A is a write-base redirect of the existing
+`kernel_gated_delta_net_impl` (its tail already does
+`dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j]` after
+loading `ls[]` into registers - just point `dst_state` at the `state_dst` buffer
+and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu
+ssm_conv variant today - only plain + `_4` + batched - so the silu-fold and ring
+write are both new). Host glue spans 3-4 files.
+
+Gotchas:
+- In-place race: the existing kernel ALREADY snapshots the state column into
+  `ls[NSG]` registers before writing, so OP A/B are safe with no barrier; OP C/D
+  must mirror the `float window[K]` local-copy-before-write that CPU/CUDA use.
+- Discriminated SSM_CONV: `supports_op` for `GGML_OP_SSM_CONV` currently returns
+  `has_simdgroup_reduction` with NO check of `src[3]`/`src[4]`; GDN returns
+  `has_simdgroup_reduction && src[2]->ne[0] % 32 == 0` with NO check of
+  `src[6]`/`src[7]`. Both must be tightened (accept the discriminated variant
+  only once the kernel exists) AND `ggml_metal_op_ssm_conv` /
+  `ggml_metal_op_gated_delta_net` must branch on the extra src to pick the kernel.
+- Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three -
+  the fused variant only redirects addresses, so it is bit-identical to Metal's
+  own non-fused path by construction (the conv per-channel FMA needs the exact
+  ascending order + the `+0.0f`).
+- The kargs struct grows by the `state_dst` / `ids` / `rs_head` fields; a new
+  pipeline name (or a function-constant branch) distinguishes the variants.
+
+### 2.2 Vulkan (GLSL .comp -> SPIR-V)
+
+Authoring model: GLSL `.comp` in `vulkan-shaders/`, compiled at build time by
+`vulkan-shaders-gen` into embedded SPIR-V byte arrays (`gated_delta_net_f32_data`
+etc.); pipeline creation in `ggml-vulkan.cpp` declares the binding count +
+push-constant size; a push-constant struct per op; host dispatch `ggml_vk_*`
+binds subbuffers; `supports_op` in the device support function. Subgroup size
+VARIES by vendor (NVIDIA 32, AMD 32/64, Intel 8/16/32).
+
+Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as
+Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a
+new `ggml_vk_create_pipeline` registration with an explicit binding count and
+push-constant size, a new/extended push-constant struct (add `rs_head`), and
+GROWING the descriptor binding set from the current 7 (`src[0..5]` + dst) to 8-9
+(`state_dst`, `ids`). The GDN host dispatch hardcodes a 6-src bind loop and the
+pipeline is created with `"main", 7, ...` - both must change.
+
+Gotchas:
+- Subgroup variance interacts with the EXISTING variant matrix: the GDN comp
+  already ships shmem / cluster / nocluster variants keyed on subgroup size and
+  relies on `S_V % COLS_PER_WG == 0`. The OP-A/B read/write redirect must be
+  applied across ALL of those variants, and re-validated per vendor.
+- In-place race: GLSL must read the full column shard into local registers before
+  the ring write (same pattern); confirm the SPIR-V memory model is not relied on
+  for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)).
+  OP C/D need the explicit window-to-local copy.
+- Discriminated SSM_CONV: `supports_op` returns `op->src[0]->type == F32` with NO
+  discriminator check; GDN loops `src[0..5]` F32 with NO `src[6]`/`src[7]` check.
+  Both must be tightened. This is the backend where the 0030 hazard is most
+  concrete (a present plain-conv kernel + a permissive supports_op = silent
+  miscompute) - Vulkan is the exact case 0030 was written for.
+- conv-update is per-channel (one invocation per channel) so it is
+  subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden.
+- Vulkan's `ssm_conv.comp` ALREADY has APPLY_SILU + APPLY_BIAS specialization
+  constants, so the silu-fold half of OP C is partly precedented here (unlike
+  Metal); the ring write-back + tap-window assembly are still new.
+
+### 2.3 SYCL (single-source DPC++)
+
+Authoring model: plain C++ `.cpp`/`.hpp` per op (`gated_delta_net.cpp`,
+`ssm_conv.cpp`); a SYCL `queue.parallel_for` over an `nd_range` with
+`reqd_sub_group_size(WARP_SIZE)`; sub-group reductions (`warp_reduce_sum`);
+`supports_op` in `ggml-sycl.cpp`. NO separate shader-compile step (single
+source).
+
+Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are
+near-VERBATIM mirrors of the CUDA ones (`launch_gated_delta_net<KDA,keep_rs>`,
+`s_shard`, `curr_state`, `state = dst + attn_score_elems`, `warp_reduce_sum`) -
+a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port
+almost line-for-line: add the `state_dst` param, the `ids`/`rs_head` params, the
+read-base select, the two tiny gather kernels, and the new conv-update kernel.
+No pipeline/push-constant/binding bookkeeping.
+
+Gotchas:
+- In-place race: the `s_shard[]` / window arrays are per-work-item private, so
+  the register-snapshot-before-write pattern carries over directly. Safe.
+- Discriminated SSM_CONV: `supports_op` checks `src[0]`/`src[1]` F32 with NO
+  discriminator check; GDN returns a BARE `true` (the MOST permissive, so the
+  hazard is worst here). Both must be tightened, and `ggml_sycl_op_ssm_conv` /
+  `ggml_sycl_op_gated_delta_net` must branch on the extra src.
+- Bit-exactness: `WARP_SIZE` is compile-fixed (Intel sub-group 8/16/32), same
+  situation as CUDA; the fused variant matches SYCL's own non-fused path by
+  construction. conv-update is per-channel -> subgroup-agnostic.
+
+### 2.4 Common wiring (all three) + the 0030 emission-gate change
+
+Per backend, four wiring touch-points beyond the kernel body:
+1. `supports_op`: tighten the `GGML_OP_SSM_CONV` and `GGML_OP_GATED_DELTA_NET`
+   entries so the discriminated/extra-src node is reported supported ONLY when
+   the new kernel handles it (and rejected otherwise, instead of today's
+   silently-true-for-the-plain-kernel).
+2. op handler: branch on `src[3]`/`src[4]` (conv) and `src[6]`/`src[7]` (GDN) to
+   dispatch the fused kernel.
+3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor
+   bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions).
+4. The patch-0030 gate in `src/llama-context.cpp`.
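Touch-points 1 and 2 can be illustrated with a toy model. Everything here is hypothetical scaffolding (`Node`, `BackendCaps`, `Kernel` are not ggml types); only the slot numbers come from the text above. The sketch shows the intended contract: a node carrying a discriminator slot is reported supported only when the fused kernel exists, and the handler branches on the same test.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-ins; these are NOT the ggml types.
enum class Op { SsmConv, GatedDeltaNet };
enum class Kernel { Plain, Fused };

struct Node {
    Op op;
    std::array<const void *, 8> src{};  // nullptr = slot unused
};

struct BackendCaps {
    bool has_plain_kernels;  // what the backends ship today
    bool has_fused_kernels;  // what Layer 2 would add
};

// True when the node carries one of the discriminator slots named above.
bool is_discriminated(const Node & n) {
    switch (n.op) {
        case Op::SsmConv:       return n.src[3] != nullptr || n.src[4] != nullptr;
        case Op::GatedDeltaNet: return n.src[6] != nullptr || n.src[7] != nullptr;
    }
    return false;
}

// Tightened supports_op: never answer "yes" for a fused node on the strength
// of the plain kernel alone.
bool supports_op(const BackendCaps & caps, const Node & n) {
    return is_discriminated(n) ? caps.has_fused_kernels : caps.has_plain_kernels;
}

// Op-handler branch: pick the kernel from the same discriminator.
Kernel pick_kernel(const BackendCaps & caps, const Node & n) {
    assert(supports_op(caps, n));  // unsupported nodes should never be dispatched
    return is_discriminated(n) ? Kernel::Fused : Kernel::Plain;
}
```

Keeping `supports_op` and the handler keyed on one shared predicate is a design choice of this sketch; it avoids the two drifting apart.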
+
+The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg
+name is not `"CUDA"`/`"ROCm"`/`"MUSA"` forces `fused_gdn_ar = fused_gdn_ch =
+auto_fgdn = false`. As each backend gains kernels this must become capability-
+driven, in one of two ways:
+- minimal: add the backend's reg name (e.g. `"Metal"`) to the allow-list once its
+  kernels + tightened supports_op ship; OR
+- clean (recommended upstream form): DELETE the name allow-list and make
+  `supports_op` authoritative - have the `auto_fgdn` resolution probe
+  `ggml_backend_dev_supports_op` on a representative node that carries the
+  discriminated `src[]` slots. Then routing falls out of the normal scheduler
+  fallback and no backend name is ever hard-coded. This also fixes 0030's stated
+  weakness that the upstream `auto_fgdn` check only inspects GATED_DELTA_NET
+  nodes and covers the discriminated SSM_CONV only incidentally.
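The difference between the two gate styles can be sketched as follows. This is an illustration with hypothetical types: `Device` and its `supports_fused_node` callback stand in for probing a backend's `supports_op` on a representative node with the discriminated `src[]` slots set. It is not the code in `src/llama-context.cpp`.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical device record. `supports_fused_node` models a capability probe
// on a node that carries the discriminated src slots.
struct Device {
    std::string reg_name;
    bool is_cpu;
    std::function<bool()> supports_fused_node;
};

// Allow-list style: any non-CPU backend outside the named set disables fusion.
bool fused_allowed_by_name(const std::vector<Device> & devs) {
    for (const Device & d : devs) {
        if (d.is_cpu) continue;
        if (d.reg_name != "CUDA" && d.reg_name != "ROCm" && d.reg_name != "MUSA") {
            return false;
        }
    }
    return true;
}

// Capability-driven style: no backend name appears; each device answers for itself.
bool fused_allowed_by_capability(const std::vector<Device> & devs) {
    for (const Device & d : devs) {
        if (d.is_cpu) continue;
        assert(static_cast<bool>(d.supports_fused_node));
        if (!d.supports_fused_node()) return false;
    }
    return true;
}
```

Under the capability form, a backend that ships the fused kernels and a tightened `supports_op` is picked up without touching the gate.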
+
+--------------------------------------------------------------------------------
+## 3. Bit-exactness per backend (the md5 gate question)
+
+Bit-exactness is feasible on ALL THREE backends, and the md5 gate is not
+actually a constraint, because of how that gate is scoped:
+
+- The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has
+  its own f32 reduction order (Metal `simd_sum`, Vulkan subgroup reduce, SYCL
+  `warp_reduce_sum`) that differs from CUDA's and from CPU's. There is no
+  cross-backend md5 and none is expected.
+- The relevant per-backend invariant is: the FUSED variant must equal that
+  backend's OWN non-fused path. The fusions change only the read source
+  (gather -> indexed read; the gather is a value-preserving memcpy) and the write
+  target (appended output -> in-place cache slot). They do NOT touch the
+  per-column FMA/reduce order. So the fused op is bit-identical to the
+  non-fused op on the same backend BY CONSTRUCTION.
+- Two arithmetic details each port MUST preserve exactly: (a) the conv
+  ascending-tap order plus the `+0.0f` that matches plain `ssm_conv`'s
+  `sumf += b` with b==0; (b) the existing GDN per-column subgroup reduce (do not
+  re-order it). Get those right and `test-backend-ops` (backendX-vs-CPU, already
+  registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS /
+  GATED_DELTA_NET) is the per-backend gate.
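Both halves of this argument can be checked with a few lines of host code. The sketch assumes IEEE-754 binary32 arithmetic without fast-math reassociation; all function names are illustrative. It shows that changing the f32 reduction order can change the result, while reading through a gathered copy instead of an indexed slot cannot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Reference order: strictly ascending accumulation.
float sum_ascending(const float * x, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += x[i];
    return acc;
}

// A tree-shaped order, standing in for another backend's subgroup reduce.
float sum_pairwise(const float * x, std::size_t n) {
    assert(n > 0);
    if (n == 1) return x[0];
    const std::size_t half = n / 2;
    return sum_pairwise(x, half) + sum_pairwise(x + half, n - half);
}

// Fused-style read: index straight into the cache slot.
float reduce_indexed(const std::vector<float> & cache, std::size_t slot, std::size_t n) {
    return sum_ascending(cache.data() + slot * n, n);
}

// Non-fused-style read: gather the slot into a temporary first.
float reduce_gathered(const std::vector<float> & cache, std::size_t slot, std::size_t n) {
    std::vector<float> tmp(n);
    std::memcpy(tmp.data(), cache.data() + slot * n, n * sizeof(float));
    return sum_ascending(tmp.data(), n);
}

// Bitwise comparison, stricter than operator== (distinguishes +0.0 from -0.0).
bool same_bits(float a, float b) {
    return std::memcmp(&a, &b, sizeof(float)) == 0;
}
```

For the slot `{1e8f, 1.0f, -1e8f, 1.0f}` the ascending sum is 1.0f and the pairwise sum is 0.0f, yet `reduce_indexed` and `reduce_gathered` agree bit for bit, because only the address changed and not the order of operations.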
+
+--------------------------------------------------------------------------------
+## 4. Upstream path and ranked recommendation
+
+### Ops-first, then one PR per backend (NOT one big PR)
+
+Recommended sequence:
+
+1. PR #1 - OPS (already essentially done, upstreamable as-is): the `ggml.h`/
+   `ggml.c` builders, the CPU reference kernels, the CUDA kernels, the
+   `test-backend-ops` cases, and the capability-driven gate (the clean
+   `supports_op`-authoritative version of 0030). This is independently mergeable
+   and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed
+   that way).
+2. PR #2 - Metal kernels + wiring.
+3. PR #3 - SYCL kernels + wiring.
+4. PR #4 - Vulkan kernels + wiring.
+
+Do NOT bundle the backends: each needs its own hardware to validate
+`test-backend-ops`, reviewers are backend-specialized, and a regression in one
+must not block the others.
+
+### Value x effort ranking (which backend first)
+
+| backend | user base / value          | author effort | bit-exact difficulty | net rank |
+|---------|----------------------------|---------------|----------------------|----------|
+| Metal   | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32 simdgroup) | **1st** |
+| SYCL    | LOW-MED (Intel GPU)        | LOWEST (near-verbatim CUDA mirror) | LOW   | **2nd** |
+| Vulkan  | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | **3rd** |
+
+Recommendation: **Metal first.** It banks the biggest user-facing decode win at
+medium effort, the base GDN + conv kernels already exist, and Apple's fixed
+simdgroup width makes bit-exactness the simplest. **SYCL second** as a cheap,
+nearly mechanical follow-on (the port is a line-for-line CUDA mirror, so it is
+low-cost insurance even though the Intel-GPU audience is smaller). **Vulkan last**
+as the high-effort / high-breadth capstone: it reaches the widest hardware
+(AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the
+existing variant matrix, the subgroup-width variance, and the per-vendor
+validation burden mean it is best attempted once the pattern is proven on
+Metal + SYCL.
+
+A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR
+(both are register-snapshot ports with no shader-gen step) and treat Vulkan as a
+separate later effort.
+
+--------------------------------------------------------------------------------
+## 5. Summary
+
+- GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL
+  (the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all
+  three today via the non-fused path; Layer-2 is about the decode SPEEDUP.
+- Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids
+  read (OP B) to the existing GDN kernel, write ONE new conv-update kernel
+  (OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten
+  supports_op + the op-handler branch + (Vulkan) the pipeline/push-constant/
+  descriptor wiring. The builders, CPU refs, model graph and tests are shared and
+  already done.
+- Bit-exactness is feasible everywhere and per-backend by construction (the
+  fusions redirect addresses, not the f32 reduction order); `test-backend-ops`
+  (backendX-vs-CPU) is the gate.
+- Sequence: ops-first PR (incl. the capability-driven replacement for 0030's
+  name allow-list), then Metal, then SYCL, then Vulkan.