From 984c8fcbeaeda4315e6eda0cbe0fd01420344515 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Sat, 27 Jun 2026 12:11:24 +0000
Subject: [PATCH] docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Metal/Vulkan/SYCL)

Source-only analysis of what it would take to give the gated-DeltaNet decode
fusions (0018 in-place state write-back, 0019 fused recurrent-state gather,
0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on
the non-CUDA compute backends, so the patch-series decode win extends past
CUDA-family hardware.

Key findings:
- The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist
  upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is
  stale). The Qwen3.6 hybrids run on all three today via the non-fused path;
  Layer-2 is the decode SPEEDUP, not enabling the model to run.
- Per backend the new work is only the FUSION plumbing: redirect the GDN state
  write (in-place), add the ids read, write one new conv-update kernel + its
  ids variant, two tiny gather kernels, plus supports_op + op-handler +
  (Vulkan) pipeline/push-constant/descriptor wiring. Builders, CPU refs, model
  graph and test-backend-ops cases are shared and already done.
- Bit-exactness is feasible per backend by construction (the fusions redirect
  addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU)
  is the gate.
- The 0030 name allow-list should become capability-driven (make supports_op
  authoritative for the discriminated src slots).
- Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup =
  simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to
  author), then Vulkan (widest hardware reach but the shader-gen + variant
  matrix + subgroup variance make it the capstone).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../patches/paged/UPSTREAM_LAYER2_SCOPE.md | 337 ++++++++++++++++++
 1 file changed, 337 insertions(+)
 create mode 100644 backend/cpp/llama-cpp-localai-paged/patches/paged/UPSTREAM_LAYER2_SCOPE.md

diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/UPSTREAM_LAYER2_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/patches/paged/UPSTREAM_LAYER2_SCOPE.md
new file mode 100644
index 000000000..080e1f327
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/UPSTREAM_LAYER2_SCOPE.md
@@ -0,0 +1,337 @@
+# Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL
+
+Source-only analysis (no GPU, no build) of what it would take to give the
+gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute
+backends, so the patch-series decode win extends past CUDA-family hardware.
+
+In our changeset (patches 0018-0030) these fusions ship with CUDA native kernels
++ CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan /
+SYCL (a CPU-fallback fused op would regress via the device round-trip, and a
+backend that ran the plain op on the discriminated node would silently
+miscompute). "Layer 2" is the upstream work that adds the missing native kernels.
+
+This doc was written against the ggml backend trees in
+`backend/cpp/llama-cpp-paged-dev` (upstream base #24732, one commit OLDER than the
+series pin `c299a92c` #25045, with only the two paged-KV patches applied - neither
+touches GDN/SSM). So every "kernel already exists" statement below is a
+conservative lower bound: the pin has at least these kernels.
+
+--------------------------------------------------------------------------------
+## 0. Headline finding (correct a stale assumption first)
+
+The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel
+upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there."
+**That is now stale.** All three backends already carry the BASE compute ops:
+
+| op | Metal | Vulkan | SYCL |
+|------------------------|------------------------------------|------------------------------------------|---------------------------------|
+| GGML_OP_GATED_DELTA_NET| `kernel_gated_delta_net_impl` (f32, NSG 1/2/4) | `gated_delta_net.comp` (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | `gated_delta_net.cpp` (`launch_gated_delta_net`) |
+| GGML_OP_SSM_CONV | `kernel_ssm_conv_f32_f32` (+ `_4`, + batched) | `ssm_conv.comp` (+ APPLY_BIAS, APPLY_SILU specialization consts) | `ssm_conv.cpp` (`kernel_ssm_conv`) |
+| GGML_OP_SSM_SCAN | yes | `ssm_scan.comp` (mamba2) | `ssm_scan.cpp` (mamba2) |
+
+Verified: Vulkan `gated_delta_net.comp` was last touched at the upstream base
+commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on
+Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today
+(via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is
+the decode SPEEDUP from the fusions, NOT enabling the model to run at all.
+
+Consequence: the GDN-compute op being "partly there" is true on every backend,
+not just Metal. What is still missing per backend is only the FUSION plumbing
+(in-place write-back target, the ids gather read, and the conv-update kernel) -
+a materially smaller scope than "port GDN from scratch."
+
+--------------------------------------------------------------------------------
+## 1. Per-op semantics (the four fusions to port)
+
+All four reuse an existing GGML_OP enum with extra `src[]` slots as a
+discriminator; none adds a new enum value. f32 throughout. The arithmetic core
+is IDENTICAL to the upstream non-fused op; only the read source and/or the write
+target are redirected. That single fact drives the whole bit-exactness story
+(section 3).
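Reviewer sketch (not part of the patch): the "extra `src[]` slot as discriminator" idea can be illustrated with a small host-side classifier. This is a minimal sketch under stated assumptions: `Node`, `Op`, `Variant`, `classify` and `supports` are hypothetical names invented for illustration and do not mirror the real `ggml_tensor` layout or any backend's `supports_op`. The slot indices are the ones this document gives in the OP A-D subsections (GDN: `src[6]`, `src[7]`; conv: `src[3]`, `src[4]`).

```cpp
#include <array>

// Hypothetical stand-in for a graph node: only the op kind and which src
// slots are populated matter here. This is NOT the real ggml_tensor layout.
enum class Op { GatedDeltaNet, SsmConv };

struct Node {
    Op op;
    std::array<bool, 8> has_src{}; // has_src[i] == true  <=>  src[i] != nullptr
};

enum class Variant { Plain, InPlace, InPlaceIds };

// Classify a node by its discriminator slots (indices as listed in this doc):
//   GDN : src[6] = state_dst,      src[7] = ids
//   conv: src[3] = conv_state_dst, src[4] = ids
inline Variant classify(const Node & n) {
    const int dst_slot = (n.op == Op::GatedDeltaNet) ? 6 : 3;
    const int ids_slot = (n.op == Op::GatedDeltaNet) ? 7 : 4;
    if (n.has_src[ids_slot]) return Variant::InPlaceIds;
    if (n.has_src[dst_slot]) return Variant::InPlace;
    return Variant::Plain;
}

// A backend that only ships the plain kernel should reject the fused variants
// instead of running the plain kernel on a discriminated node.
inline bool supports(const Node & n, bool has_fused_kernels) {
    return classify(n) == Variant::Plain || has_fused_kernels;
}
```

The point of the second function is the hazard this document attributes to patch 0030: a permissive capability check plus a present plain kernel yields a silent miscompute, so the check has to look at the discriminator slots.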
+
+### OP A - `ggml_gated_delta_net_inplace` (patch 0018)
+- Enum `GGML_OP_GATED_DELTA_NET`, discriminated by a non-null `src[6]` =
+  `state_dst` (a contiguous `[S_v*S_v*H, n_seqs]` view into the recurrent-state
+  cache at `kv_head`). K == 1 only.
+- Semantics: run the standard GDN recurrence, but write the FINAL recurrent state
+  directly into `state_dst` instead of appending it to the op output. The op
+  output then carries only the attention scores. Removes the per-layer per-step
+  ~full-state D2D copy-back (the 0018 win).
+- Race (in-place read == write): each (seq, head) block owns a disjoint cache
+  slot. The kernel loads the whole prior state `s0` into per-thread registers
+  (`s_shard` on CUDA, `ls[NSG]` on Metal, the column shard on Vulkan/SYCL)
+  BEFORE the ring write, so reading and writing the same slot is safe.
+
+### OP B - `ggml_gated_delta_net_inplace_ids` (patch 0019)
+- Adds `src[5]` = FULL state cache `[S_v,S_v,H,n_rs_slots]`, `src[7]` = `ids`
+  (I32, per-seq source slot == the recurrent-state `s_copy`), `op_param[1]` =
+  `rs_head` (destination base slot). Still has the OP-A `src[6]` in-place target.
+- Semantics: read each sequence's prior state directly from `cache[ids[seq]]`
+  (mirrors `ggml_ssm_scan`'s ids source), eliminating the `ggml_get_rows`
+  materialization. Combined with OP A the op now reads AND writes the cache in
+  place.
+- Race: identity sequences (`ids[s] == rs_head + s`, the steady AR-decode case)
+  read s0 in place from the destination slot (safe via the register snapshot
+  above). Non-identity sequences (reorder / rs_zero remap) are first copied by a
+  TINY separate gather kernel (`gdn_gather_nonident`, one block/seq) into a
+  DISJOINT scratch that the recurrence then reads, so the recurrence never reads
+  a slot another block is writing. Value-preserving memcpy -> bit-identical to
+  the get_rows path.
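Reviewer sketch (not part of the patch): the OP B read-base select can be modelled on the CPU to see why the two-phase scheme matches the `get_rows` path. This is a minimal sketch under stated assumptions: `step`, `update_reference` and `update_fused` are hypothetical names, the cache is a flat `std::vector<float>` of `state_elems`-sized slots, and `step` is a placeholder for the real GDN recurrence, which is not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder for the per-(seq, head) GDN recurrence. The real update is NOT
// reproduced; any pure function of the prior state illustrates the aliasing.
inline float step(float s_prev) { return 0.5f * s_prev + 1.0f; }

// Reference path: gather every prior state first (the get_rows
// materialization), then update, then write to slots rs_head + s.
inline void update_reference(std::vector<float> & cache, std::size_t state_elems,
                             const std::vector<int32_t> & ids, int32_t rs_head) {
    const std::size_t n_seqs = ids.size();
    std::vector<float> gathered(n_seqs * state_elems);
    for (std::size_t s = 0; s < n_seqs; ++s)
        for (std::size_t i = 0; i < state_elems; ++i)
            gathered[s * state_elems + i] = cache[std::size_t(ids[s]) * state_elems + i];
    for (std::size_t s = 0; s < n_seqs; ++s)
        for (std::size_t i = 0; i < state_elems; ++i)
            cache[(std::size_t(rs_head) + s) * state_elems + i] = step(gathered[s * state_elems + i]);
}

// Fused path: identity sequences read in place; non-identity sequences are
// copied into a disjoint scratch BEFORE any destination slot is written.
inline void update_fused(std::vector<float> & cache, std::size_t state_elems,
                         const std::vector<int32_t> & ids, int32_t rs_head) {
    const std::size_t n_seqs = ids.size();
    std::vector<float> scratch(n_seqs * state_elems);
    // Phase 1: the "tiny gather" for non-identity sequences only.
    for (std::size_t s = 0; s < n_seqs; ++s) {
        if (ids[s] == rs_head + int32_t(s)) continue;
        for (std::size_t i = 0; i < state_elems; ++i)
            scratch[s * state_elems + i] = cache[std::size_t(ids[s]) * state_elems + i];
    }
    // Phase 2: recurrence with a per-sequence read-base select.
    for (std::size_t s = 0; s < n_seqs; ++s) {
        const std::size_t dst = (std::size_t(rs_head) + s) * state_elems;
        const bool identity = ids[s] == rs_head + int32_t(s);
        const float * src = identity ? &cache[dst] : &scratch[s * state_elems];
        // Snapshot the prior state before the (possibly aliasing) write,
        // standing in for the per-thread register shard.
        const std::vector<float> snapshot(src, src + state_elems);
        for (std::size_t i = 0; i < state_elems; ++i)
            cache[dst + i] = step(snapshot[i]);
    }
}
```

Running both functions on the same cache with a reordering such as `ids = {2, 1, 3}` and `rs_head = 1` exercises the case the Race bullet describes: sequence 1 reads the slot that sequence 0 writes, and the scratch copy is what keeps the result equal to the reference.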
+
+### OP C - `ggml_ssm_conv_update_inplace` (patch 0021)
+- Enum `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]` =
+  `conv_state_dst` (`[(K-1)*channels, n_seqs]` in-place ring view).
+  `src[0]` = conv_states `[K-1, channels, n_seqs]`, `src[1]` = conv_kernel
+  `[K, channels]`, `src[2]` = x_cur `[channels, 1, n_seqs]`. `op_param[0]` =
+  fuse_silu.
+- Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the
+  width-K conv window in registers from the K-1 cached taps + the current token,
+  compute the depthwise conv with the SAME ascending-tap FMA order as plain
+  `ssm_conv` (`tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` to match plain conv's
+  `sumf += b` with b==0), optionally fold SiLU, write the conv output
+  `[channels,1,n_seqs]`, and write the 1-token-shifted ring state back in place.
+  Replaces the 4-op decode conv chain (transpose + concat + conv + silu + ring
+  cpy).
+- Race: read source (gathered taps) and write target (cache view) are disjoint
+  buffers -> race-free by construction, no ids/identity logic.
+
+### OP D - `ggml_ssm_conv_update_inplace_ids` (patch 0028)
+- Same enum, discriminated by a non-null `src[4]` = `ids`; `src[0]` becomes the
+  FULL conv cache `[K-1, channels, n_cells]`; `op_param[1]` = rs_head.
+- Semantics: gather-free conv-update - read each sequence's prior taps from
+  `cache[ids[s]]` in-kernel (no get_rows). Identity reads in place from
+  `conv_state_dst`; non-identity gathered into a disjoint scratch first by a tiny
+  `ssm_conv_gather_nonident` kernel. The window is copied to a local array
+  BEFORE the (possibly aliasing) ring write so the identity read==write slot is
+  correct. Bit-identical to get_rows + OP C.
+
+### Net new kernels vs reuse, per op
+- OP A: NOT a new compute kernel - a write-target redirection of the EXISTING
+  GDN kernel + 1 buffer binding + a supports_op/op-handler branch.
+- OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) +
+  1 ids binding + rs_head param + 1 tiny gather kernel.
+- OP C: a GENUINELY NEW kernel on each backend. The existing `ssm_conv` computes
+  a windowed reduction over a PRE-concatenated input; it does not assemble the
+  window from cached taps + the current token, fold silu, or write the shifted
+  ring state. This is the largest net-new piece.
+- OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1
+  tiny conv gather kernel.
+
+The `ggml.h` / `ggml.c` builders, the CPU reference kernels, the model-graph
+emission (`delta-net-base.cpp`, qwen35*), and the `test-backend-ops` cases are
+SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend
+work is the kernel(s) + the backend wiring.
+
+--------------------------------------------------------------------------------
+## 2. Per-backend: authoring model, effort, gotchas, wiring
+
+### 2.1 Metal (MSL)
+
+Authoring model: `.metal` MSL source (`ggml-metal.metal`), function-constant
+specialization (e.g. `FC_GATED_DELTA_NET`), kernels templated on `NSG`; host
+glue split across `ggml-metal-ops.cpp` (`ggml_metal_op_*` encode), the pipeline
+lookup in `ggml-metal-device.cpp`/`.m`, the kargs struct in `ggml-metal-impl.h`,
+and `supports_op` in `ggml-metal-device.m`. Threadgroup model; Apple GPU
+simdgroup width is a FIXED 32, `simd_sum` for the per-column reduce.
+
+Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist
+and are ergonomic to extend. OP A is a write-base redirect of the existing
+`kernel_gated_delta_net_impl` (its tail already does
+`dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j]` after
+loading `ls[]` into registers - just point `dst_state` at the `state_dst` buffer
+and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu
+ssm_conv variant today - only plain + `_4` + batched - so the silu-fold and ring
+write are both new). Host glue spans 3-4 files.
+
+Gotchas:
+- In-place race: the existing kernel ALREADY snapshots the state column into
+  `ls[NSG]` registers before writing, so OP A/B are safe with no barrier; OP C/D
+  must mirror the `float window[K]` local-copy-before-write that CPU/CUDA use.
+- Discriminated SSM_CONV: `supports_op` for `GGML_OP_SSM_CONV` currently returns
+  `has_simdgroup_reduction` with NO check of `src[3]`/`src[4]`; GDN returns
+  `has_simdgroup_reduction && src[2]->ne[0] % 32 == 0` with NO check of
+  `src[6]`/`src[7]`. Both must be tightened (accept the discriminated variant
+  only once the kernel exists) AND `ggml_metal_op_ssm_conv` /
+  `ggml_metal_op_gated_delta_net` must branch on the extra src to pick the kernel.
+- Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three -
+  the fused variant only redirects addresses, so it is bit-identical to Metal's
+  own non-fused path by construction (the conv per-channel FMA needs the exact
+  ascending order + the `+0.0f`).
+- The kargs struct grows by the `state_dst` / `ids` / `rs_head` fields; a new
+  pipeline name (or a function-constant branch) distinguishes the variants.
+
+### 2.2 Vulkan (GLSL .comp -> SPIR-V)
+
+Authoring model: GLSL `.comp` in `vulkan-shaders/`, compiled at build time by
+`vulkan-shaders-gen` into embedded SPIR-V byte arrays (`gated_delta_net_f32_data`
+etc.); pipeline creation in `ggml-vulkan.cpp` declares the binding count +
+push-constant size; a push-constant struct per op; host dispatch `ggml_vk_*`
+binds subbuffers; `supports_op` in the device support function. Subgroup size
+VARIES by vendor (NVIDIA 32, AMD 64, Intel 8/16/32).
+
+Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as
+Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a
+new `ggml_vk_create_pipeline` registration with an explicit binding count and
+push-constant size, a new/extended push-constant struct (add `rs_head`), and
+GROWING the descriptor binding set from the current 7 (`src[0..5]` + dst) to 8-9
+(`state_dst`, `ids`). The GDN host dispatch hardcodes a 6-src bind loop and the
+pipeline is created with `"main", 7, ...` - both must change.
+
+Gotchas:
+- Subgroup variance interacts with the EXISTING variant matrix: the GDN comp
+  already ships shmem / cluster / nocluster variants keyed on subgroup size and
+  relies on `S_V % COLS_PER_WG == 0`. The OP-A/B read/write redirect must be
+  applied across ALL of those variants, and re-validated per vendor.
+- In-place race: GLSL must read the full column shard into local registers before
+  the ring write (same pattern); confirm the SPIR-V memory model is not relied on
+  for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)).
+  OP C/D need the explicit window-to-local copy.
+- Discriminated SSM_CONV: `supports_op` returns `op->src[0]->type == F32` with NO
+  discriminator check; GDN loops `src[0..5]` F32 with NO `src[6]`/`src[7]` check.
+  Both must be tightened. This is the backend where the 0030 hazard is most
+  concrete (a present plain-conv kernel + a permissive supports_op = silent
+  miscompute) - Vulkan is the exact case 0030 was written for.
+- conv-update is per-channel (one invocation per channel) so it is
+  subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden.
+- Vulkan's `ssm_conv.comp` ALREADY has APPLY_SILU + APPLY_BIAS specialization
+  constants, so the silu-fold half of OP C is partly precedented here (unlike
+  Metal); the ring write-back + tap-window assembly are still new.
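Reviewer sketch (not part of the patch): the tap-window assembly and ring write-back that the last gotcha calls out as still new can be written as a per-(channel, sequence) CPU loop. This is a minimal sketch under stated assumptions: `conv_update` is a hypothetical name, the flat buffer layout is inferred from the shapes quoted for OP C (`[K-1, channels, n_seqs]` with the tap index fastest-varying), and the loop shows the ascending-tap order plus the trailing `+0.0f` without claiming to match the exact FMA grouping of any real backend kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// taps_in   : [K-1, channels, n_seqs]  cached taps (tap index fastest)
// w         : [K, channels]            depthwise conv kernel
// x_cur     : [channels, 1, n_seqs]    current token
// out       : [channels, 1, n_seqs]    conv output
// state_dst : [(K-1)*channels, n_seqs] ring state, may alias taps_in
inline void conv_update(const std::vector<float> & taps_in, const std::vector<float> & w,
                        const std::vector<float> & x_cur, std::vector<float> & out,
                        std::vector<float> & state_dst,
                        std::size_t K, std::size_t channels, std::size_t n_seqs,
                        bool fuse_silu) {
    assert(K >= 2);
    std::vector<float> window(K);
    for (std::size_t s = 0; s < n_seqs; ++s) {
        for (std::size_t c = 0; c < channels; ++c) {
            const std::size_t base = (s * channels + c) * (K - 1);
            // 1. Copy the window to local storage BEFORE any write, so an
            //    aliasing state_dst (the identity case) stays correct.
            for (std::size_t t = 0; t + 1 < K; ++t) window[t] = taps_in[base + t];
            window[K - 1] = x_cur[s * channels + c];
            // 2. Ascending-tap accumulation, then the +0.0f that stands in
            //    for plain conv's `sumf += b` with b == 0.
            float sumf = 0.0f;
            for (std::size_t k = 0; k < K; ++k) sumf += window[k] * w[c * K + k];
            sumf += 0.0f;
            // 3. Optional SiLU fold: x * sigmoid(x).
            if (fuse_silu) sumf = sumf / (1.0f + std::exp(-sumf));
            out[s * channels + c] = sumf;
            // 4. Write the 1-token-shifted ring state back.
            for (std::size_t t = 0; t + 1 < K; ++t) state_dst[base + t] = window[t + 1];
        }
    }
}
```

Because each (channel, sequence) pair touches only its own `K-1` taps, the loop body needs no cross-lane reduction, which is the sense in which the document calls conv-update subgroup-agnostic. Passing the same vector as `taps_in` and `state_dst` models the identity read==write slot of OP D.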
+
+### 2.3 SYCL (single-source DPC++)
+
+Authoring model: plain C++ `.cpp`/`.hpp` per op (`gated_delta_net.cpp`,
+`ssm_conv.cpp`); a SYCL `queue.parallel_for` over an `nd_range` with
+`reqd_sub_group_size(WARP_SIZE)`; sub-group reductions (`warp_reduce_sum`);
+`supports_op` in `ggml-sycl.cpp`. NO separate shader-compile step (single
+source).
+
+Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are
+near-VERBATIM mirrors of the CUDA ones (`launch_gated_delta_net`,
+`s_shard`, `curr_state`, `state = dst + attn_score_elems`, `warp_reduce_sum`) -
+a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port
+almost line-for-line: add the `state_dst` param, the `ids`/`rs_head` params, the
+read-base select, the two tiny gather kernels, and the new conv-update kernel.
+No pipeline/push-constant/binding bookkeeping.
+
+Gotchas:
+- In-place race: the `s_shard[]` / window arrays are per-work-item private, so
+  the register-snapshot-before-write pattern carries over directly. Safe.
+- Discriminated SSM_CONV: `supports_op` checks `src[0]`/`src[1]` F32 with NO
+  discriminator check; GDN returns a BARE `true` (the MOST permissive, so the
+  hazard is worst here). Both must be tightened, and `ggml_sycl_op_ssm_conv` /
+  `ggml_sycl_op_gated_delta_net` must branch on the extra src.
+- Bit-exactness: `WARP_SIZE` is compile-fixed (Intel sub-group 8/16/32), same
+  situation as CUDA; the fused variant matches SYCL's own non-fused path by
+  construction. conv-update is per-channel -> subgroup-agnostic.
+
+### 2.4 Common wiring (all three) + the 0030 emission-gate change
+
+Per backend, four wiring touch-points beyond the kernel body:
+1. `supports_op`: tighten the `GGML_OP_SSM_CONV` and `GGML_OP_GATED_DELTA_NET`
+   entries so the discriminated/extra-src node is reported supported ONLY when
+   the new kernel handles it (and rejected otherwise, instead of today's
+   silently-true-for-the-plain-kernel).
+2. op handler: branch on `src[3]`/`src[4]` (conv) and `src[6]`/`src[7]` (GDN) to
+   dispatch the fused kernel.
+3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor
+   bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions).
+4. The patch-0030 gate in `src/llama-context.cpp`.
+
+The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg
+name is not `"CUDA"`/`"ROCm"`/`"MUSA"` forces `fused_gdn_ar = fused_gdn_ch =
+auto_fgdn = false`. As each backend gains kernels this must become
+capability-driven, in one of two ways:
+- minimal: add the backend's reg name (e.g. `"Metal"`) to the allow-list once its
+  kernels + tightened supports_op ship; OR
+- clean (recommended upstream form): DELETE the name allow-list and make
+  `supports_op` authoritative - have the `auto_fgdn` resolution probe
+  `ggml_backend_dev_supports_op` on a representative node that carries the
+  discriminated `src[]` slots. Then routing falls out of the normal scheduler
+  fallback and no backend name is ever hard-coded. This also fixes 0030's stated
+  weakness that the upstream `auto_fgdn` check only inspects GATED_DELTA_NET
+  nodes and covered the discriminated SSM_CONV only incidentally.
+
+--------------------------------------------------------------------------------
+## 3. Bit-exactness per backend (the md5 gate question)
+
+Feasible on ALL THREE, and not actually constraining, because of how the gate is
+scoped:
+
+- The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has
+  its own f32 reduction order (Metal `simd_sum`, Vulkan subgroup reduce, SYCL
+  `warp_reduce_sum`) that differs from CUDA's and from CPU's. There is no
+  cross-backend md5 and none is expected.
+- The relevant per-backend invariant is: the FUSED variant must equal that
+  backend's OWN non-fused path. The fusions change only the read source
+  (gather -> indexed read; the gather is a value-preserving memcpy) and the write
+  target (appended output -> in-place cache slot). They do NOT touch the
+  per-column FMA/reduce order. So the fused op is bit-identical to the
+  non-fused op on the same backend BY CONSTRUCTION.
+- Two arithmetic details each port MUST preserve exactly: (a) the conv
+  ascending-tap order plus the `+0.0f` that matches plain `ssm_conv`'s
+  `sumf += b` with b==0; (b) the existing GDN per-column subgroup reduce (do not
+  re-order it). Get those right and `test-backend-ops` (backendX-vs-CPU, already
+  registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS /
+  GATED_DELTA_NET) is the per-backend gate.
+
+--------------------------------------------------------------------------------
+## 4. Upstream path and ranked recommendation
+
+### Ops-first, then one PR per backend (NOT one big PR)
+
+Recommended sequence:
+
+1. PR #1 - OPS (already essentially done, upstreamable as-is): the
+   `ggml.h`/`ggml.c` builders, the CPU reference kernels, the CUDA kernels, the
+   `test-backend-ops` cases, and the capability-driven gate (the clean
+   `supports_op`-authoritative version of 0030). This is independently mergeable
+   and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed
+   that way).
+2. PR #2 - Metal kernels + wiring.
+3. PR #3 - SYCL kernels + wiring.
+4. PR #4 - Vulkan kernels + wiring.
+
+Do NOT bundle the backends: each needs its own hardware to validate
+`test-backend-ops`, reviewers are backend-specialized, and a regression in one
+must not block the others.
+
+### Value x effort ranking (which backend first)
+
+| backend | user base / value | author effort | bit-exact difficulty | net rank |
+|---------|----------------------------|---------------|----------------------|----------|
+| Metal | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32 simdgroup) | **1st** |
+| SYCL | LOW-MED (Intel GPU) | LOWEST (near-verbatim CUDA mirror) | LOW | **2nd** |
+| Vulkan | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | **3rd** |
+
+Recommendation: **Metal first.** It banks the biggest user-facing decode win at
+medium effort, the base GDN + conv kernels already exist, and Apple's fixed
+simdgroup width makes bit-exactness the simplest. **SYCL second** as a cheap,
+nearly mechanical follow-on (the port is a line-for-line CUDA mirror, so it is
+low-cost insurance even though the Intel-GPU audience is smaller). **Vulkan last**
+as the high-effort / high-breadth capstone - it reaches the widest hardware
+(AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the
+existing variant matrix, the subgroup-width variance, and the per-vendor
+validation burden make it the right capstone once the pattern is proven on
+Metal + SYCL.
+
+A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR
+(both are register-snapshot ports with no shader-gen step) and treat Vulkan as a
+separate later effort.
+
+--------------------------------------------------------------------------------
+## 5. Summary
+
+- GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL
+  (the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all
+  three today via the non-fused path; Layer-2 is about the decode SPEEDUP.
+- Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids
+  read (OP B) to the existing GDN kernel, write ONE new conv-update kernel
+  (OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten
+  supports_op + the op-handler branch + (Vulkan) the
+  pipeline/push-constant/descriptor wiring. The builders, CPU refs, model graph
+  and tests are shared and already done.
+- Bit-exactness is feasible everywhere and per-backend by construction (the
+  fusions redirect addresses, not the f32 reduction order); `test-backend-ops`
+  (backendX-vs-CPU) is the gate.
+- Sequence: ops-first PR (incl. the capability-driven replacement for 0030's
+  name allow-list), then Metal, then SYCL, then Vulkan.