mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Metal/Vulkan/SYCL)
Source-only analysis of what it would take to give the gated-DeltaNet decode fusions (0018 in-place state write-back, 0019 fused recurrent-state gather, 0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on the non-CUDA compute backends, so the patch-series decode win extends past CUDA-family hardware.

Key findings:
- The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is stale). The Qwen3.6 hybrids run on all three today via the non-fused path; Layer-2 is the decode SPEEDUP, not enabling the model to run.
- Per backend the new work is only the FUSION plumbing: redirect the GDN state write (in-place), add the ids read, write one new conv-update kernel + its ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan) pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and test-backend-ops cases are shared and already done.
- Bit-exactness is feasible per backend by construction (the fusions redirect addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU) is the gate.
- The 0030 name allow-list should become capability-driven (make supports_op authoritative for the discriminated src slots).
- Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup = simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to author), then Vulkan (widest hardware reach but the shader-gen + variant matrix + subgroup variance make it the capstone).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -0,0 +1,337 @@
# Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL

Source-only analysis (no GPU, no build) of what it would take to give the
gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute
backends, so the patch-series decode win extends past CUDA-family hardware.

In our changeset (patches 0018-0030) these fusions ship with native CUDA kernels
and CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan /
SYCL (a CPU-fallback fused op would regress via the device round-trip, and a
backend that ran the plain op on the discriminated node would silently
miscompute). "Layer 2" is the upstream work that adds the missing native kernels.

This doc was written against the ggml backend trees in
`backend/cpp/llama-cpp-paged-dev` (upstream base #24732, one commit OLDER than the
series pin `c299a92c` #25045, with only the two paged-KV patches applied - neither
touches GDN/SSM). So every "kernel already exists" statement below is a
conservative lower bound: the pin has at least these kernels.

--------------------------------------------------------------------------------
## 0. Headline finding (correct a stale assumption first)

The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel
upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there."
**That is now stale.** All three backends already carry the BASE compute ops:

| op | Metal | Vulkan | SYCL |
|----|-------|--------|------|
| GGML_OP_GATED_DELTA_NET | `kernel_gated_delta_net_impl` (f32, NSG 1/2/4) | `gated_delta_net.comp` (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | `gated_delta_net.cpp` (`launch_gated_delta_net<KDA,keep_rs>`) |
| GGML_OP_SSM_CONV | `kernel_ssm_conv_f32_f32` (+ `_4`, + batched) | `ssm_conv.comp` (+ APPLY_BIAS, APPLY_SILU specialization consts) | `ssm_conv.cpp` (`kernel_ssm_conv`) |
| GGML_OP_SSM_SCAN | yes | `ssm_scan.comp` (mamba2) | `ssm_scan.cpp` (mamba2) |

Verified: Vulkan `gated_delta_net.comp` was last touched at the upstream base
commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on
Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today
(via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is
the decode SPEEDUP from the fusions, NOT enabling the model to run at all.

Consequence: the GDN-compute op being "partly there" is true on every backend,
not just Metal. What is still missing per backend is only the FUSION plumbing
(in-place write-back target, the ids gather read, and the conv-update kernel) -
a materially smaller scope than "port GDN from scratch."

--------------------------------------------------------------------------------
## 1. Per-op semantics (the four fusions to port)

All four reuse an existing GGML_OP enum with extra `src[]` slots as a
discriminator; none adds a new enum value. f32 throughout. The arithmetic core
is IDENTICAL to the upstream non-fused op; only the read source and/or the write
target are redirected. That single fact drives the whole bit-exactness story
(section 3).
### OP A - `ggml_gated_delta_net_inplace` (patch 0018)

- Enum `GGML_OP_GATED_DELTA_NET`, discriminated by a non-null `src[6]` =
  `state_dst` (a contiguous `[S_v*S_v*H, n_seqs]` view into the recurrent-state
  cache at `kv_head`). K == 1 only.
- Semantics: run the standard GDN recurrence, but write the FINAL recurrent state
  directly into `state_dst` instead of appending it to the op output. The op
  output then carries only the attention scores. Removes the per-layer per-step
  ~full-state D2D copy-back (the 0018 win).
- Race (in-place read == write): each (seq, head) block owns a disjoint cache
  slot. The kernel loads the whole prior state `s0` into per-thread registers
  (`s_shard` on CUDA, `ls[NSG]` on Metal, the column shard on Vulkan/SYCL)
  BEFORE the state write-back, so reading and writing the same slot is safe.
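The snapshot-before-write argument can be illustrated on the host with a toy update. This is a minimal sketch, not the GDN recurrence: `toy_update`, `step_copy_back` and `step_inplace` are hypothetical names, and the toy math is chosen only so that each output element depends on a neighbouring element of the prior state (which is what makes an unsnapshotted in-place write wrong).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the per-slot recurrence (NOT the GDN math):
// out[i] depends on s0[i] and on its right neighbour, so overwriting s0
// while it is still being read changes the result.
static void toy_update(const float * s0, float * out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = 0.5f * s0[i] + s0[(i + 1) % n];
    }
}

// Non-fused shape: compute into a temporary, then copy the state back.
static void step_copy_back(float * slot, size_t n) {
    std::vector<float> tmp(n);
    toy_update(slot, tmp.data(), n);
    std::memcpy(slot, tmp.data(), n * sizeof(float));
}

// Fused shape: snapshot the prior state first (the role s_shard / ls[]
// play in the real kernels), then write straight into the cache slot.
static void step_inplace(float * slot, size_t n) {
    std::vector<float> snap(slot, slot + n);
    toy_update(snap.data(), slot, n);
}
```

With `{1, 2, 3, 4}` as the slot, both `step_copy_back` and `step_inplace` leave `{2.5, 4, 5.5, 3}`, while calling `toy_update(slot, slot, 4)` without a snapshot yields `4.5` in the last element because it reads the already-overwritten first element.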
### OP B - `ggml_gated_delta_net_inplace_ids` (patch 0019)

- Adds `src[5]` = FULL state cache `[S_v,S_v,H,n_rs_slots]`, `src[7]` = `ids`
  (I32, per-seq source slot == the recurrent-state `s_copy`), `op_param[1]` =
  `rs_head` (destination base slot). Still has the OP-A `src[6]` in-place target.
- Semantics: read each sequence's prior state directly from `cache[ids[seq]]`
  (mirrors `ggml_ssm_scan`'s ids source), eliminating the `ggml_get_rows`
  materialization. Combined with OP A the op now reads AND writes the cache in
  place.
- Race: identity sequences (`ids[s] == rs_head + s`, the steady AR-decode case)
  read s0 in place from the destination slot (safe via the register snapshot
  above). Non-identity sequences (reorder / rs_zero remap) are first copied by a
  TINY separate gather kernel (`gdn_gather_nonident`, one block/seq) into a
  DISJOINT scratch that the recurrence then reads, so the recurrence never reads
  a slot another block is writing. Value-preserving memcpy -> bit-identical to
  the get_rows path.
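The identity / non-identity read-base select described above can be sketched on the host as follows. This is an illustration under assumptions, not the patch code: `gather_rows` and `select_read_bases` are hypothetical helpers, the cache is a flat `float` array of `slot_elems`-sized slots, and the copy loop stands in for `gdn_gather_nonident`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reference path: materialize every sequence's prior state, which is what a
// get_rows-style gather does.
static std::vector<float> gather_rows(const std::vector<float> & cache,
                                      const std::vector<int32_t> & ids,
                                      size_t slot_elems) {
    std::vector<float> out(ids.size() * slot_elems);
    for (size_t s = 0; s < ids.size(); ++s) {
        std::memcpy(out.data() + s * slot_elems,
                    cache.data() + (size_t) ids[s] * slot_elems,
                    slot_elems * sizeof(float));
    }
    return out;
}

// Fused path: choose a read base per sequence. Identity sequences read the
// destination slot in place; all others are copied to a disjoint scratch
// first (hypothetical host analogue of the tiny gather kernel).
static std::vector<const float *> select_read_bases(const std::vector<float> & cache,
                                                    const std::vector<int32_t> & ids,
                                                    int32_t rs_head,
                                                    size_t slot_elems,
                                                    std::vector<float> & scratch) {
    scratch.assign(ids.size() * slot_elems, 0.0f);
    std::vector<const float *> base(ids.size());
    for (size_t s = 0; s < ids.size(); ++s) {
        const float * src = cache.data() + (size_t) ids[s] * slot_elems;
        if (ids[s] == rs_head + (int32_t) s) {
            base[s] = src;                       // identity: no copy at all
        } else {
            float * dst = scratch.data() + s * slot_elems;
            std::memcpy(dst, src, slot_elems * sizeof(float));
            base[s] = dst;                       // non-identity: read the scratch
        }
    }
    return base;
}
```

Because the non-identity branch is a plain byte copy, the values behind each read base are the same bytes the reference gather would have produced; only the address differs.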
### OP C - `ggml_ssm_conv_update_inplace` (patch 0021)

- Enum `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]` =
  `conv_state_dst` (`[(K-1)*channels, n_seqs]` in-place ring view).
  `src[0]` = conv_states `[K-1, channels, n_seqs]`, `src[1]` = conv_kernel
  `[K, channels]`, `src[2]` = x_cur `[channels, 1, n_seqs]`. `op_param[0]` =
  fuse_silu.
- Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the
  width-K conv window in registers from the K-1 cached taps + the current token,
  compute the depthwise conv with the SAME ascending-tap FMA order as plain
  `ssm_conv` (`tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` to match plain conv's
  `sumf += b` with b==0), optionally fold SiLU, write the conv output
  `[channels,1,n_seqs]`, and write the 1-token-shifted ring state back in place.
  Replaces the 5-op decode conv chain (transpose + concat + conv + silu + ring
  cpy).
- Race: read source (gathered taps) and write target (cache view) are disjoint
  buffers -> race-free by construction, no ids/identity logic.
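The OP C semantics map onto a short scalar reference. The sketch below is an assumption-laden illustration rather than the CPU reference from patch 0021: `conv_update_ref` is a hypothetical name, the flat indexing assumes the fastest-varying dimension comes first in each shape listed above, and plain multiply-add is used (whether a compiler contracts it into an FMA is backend-specific).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

static float silu(float x) { return x / (1.0f + std::exp(-x)); }

// conv_states : [K-1, channels, n_seqs]   conv_kernel : [K, channels]
// x_cur       : [channels, 1, n_seqs]     out         : [channels, 1, n_seqs]
// state_dst   : [(K-1)*channels, n_seqs]  (disjoint from conv_states in OP C)
static void conv_update_ref(const float * conv_states, const float * conv_kernel,
                            const float * x_cur, float * out, float * state_dst,
                            int K, int channels, int n_seqs, bool fuse_silu) {
    std::vector<float> window(K);
    for (int s = 0; s < n_seqs; ++s) {
        for (int c = 0; c < channels; ++c) {
            const size_t cs   = (size_t) c + (size_t) channels * s;
            const size_t base = (size_t) (K - 1) * cs;
            // Assemble the width-K window: K-1 cached taps, then the new token.
            for (int t = 0; t < K - 1; ++t) window[t] = conv_states[base + t];
            window[K - 1] = x_cur[cs];
            // Ascending-tap accumulation, starting from the first product.
            const float * w = conv_kernel + (size_t) K * c;
            float sum = window[0] * w[0];
            for (int t = 1; t < K; ++t) sum += window[t] * w[t];
            sum += 0.0f;  // mirrors plain conv's `sumf += b` with b == 0
            out[cs] = fuse_silu ? silu(sum) : sum;
            // Write the ring state shifted by one token.
            for (int t = 0; t < K - 1; ++t) state_dst[base + t] = window[t + 1];
        }
    }
}
```

One observable effect of the trailing `+0.0f` in this sketch: if every product is `-0.0f`, the running sum is `-0.0f` and the final add turns it into `+0.0f`, so dropping the add would change the sign bit of the output.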
### OP D - `ggml_ssm_conv_update_inplace_ids` (patch 0028)

- Same enum, discriminated by a non-null `src[4]` = `ids`; `src[0]` becomes the
  FULL conv cache `[K-1, channels, n_cells]`; `op_param[1]` = rs_head.
- Semantics: gather-free conv-update - read each sequence's prior taps from
  `cache[ids[s]]` in-kernel (no get_rows). Identity reads in place from
  `conv_state_dst`; non-identity gathered into a disjoint scratch first by a tiny
  `ssm_conv_gather_nonident` kernel. The window is copied to a local array
  BEFORE the (possibly aliasing) ring write so the identity read==write slot is
  correct. Bit-identical to get_rows + OP C.
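Combining the ids read with the in-place ring write can be sketched as a two-pass host loop. Everything here is illustrative: `conv_update_ids` is a hypothetical name, the cache is reduced to a single channel (`[K-1, n_cells]`, K >= 2) for brevity, and pass 1 plays the role the source assigns to `ssm_conv_gather_nonident`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// cache : [K-1, n_cells], one channel only (simplifying assumption)
// w     : K kernel taps, x : one new token per sequence
static void conv_update_ids(std::vector<float> & cache,
                            const std::vector<int32_t> & ids, int32_t rs_head,
                            const std::vector<float> & w,
                            const std::vector<float> & x,
                            std::vector<float> & out) {
    const size_t K = w.size(), taps = K - 1, n_seqs = ids.size();

    // Pass 1: copy every non-identity source slot to a disjoint scratch
    // BEFORE any destination slot is written.
    std::vector<float> scratch(n_seqs * taps);
    for (size_t s = 0; s < n_seqs; ++s) {
        if (ids[s] != rs_head + (int32_t) s) {
            for (size_t t = 0; t < taps; ++t) {
                scratch[s * taps + t] = cache[(size_t) ids[s] * taps + t];
            }
        }
    }

    // Pass 2: conv-update, reading in place (identity) or from the scratch.
    out.assign(n_seqs, 0.0f);
    std::vector<float> window(K);
    for (size_t s = 0; s < n_seqs; ++s) {
        float * dst = cache.data() + ((size_t) rs_head + s) * taps;
        const bool identity = ids[s] == rs_head + (int32_t) s;
        const float * src = identity ? dst : scratch.data() + s * taps;
        // Copy the window to locals before the (possibly aliasing) write.
        for (size_t t = 0; t < taps; ++t) window[t] = src[t];
        window[taps] = x[s];
        float sum = window[0] * w[0];
        for (size_t t = 1; t < K; ++t) sum += window[t] * w[t];
        sum += 0.0f;
        out[s] = sum;
        for (size_t t = 0; t < taps; ++t) dst[t] = window[t + 1];
    }
}
```

With `ids = {1, 0}` and `rs_head = 0` the two sequences swap slots; without pass 1, sequence 1 would read slot 0 after sequence 0 had already overwritten it.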
### Net new kernels vs reuse, per op

- OP A: NOT a new compute kernel - a write-target redirection of the EXISTING
  GDN kernel + 1 buffer binding + a supports_op/op-handler branch.
- OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) +
  1 ids binding + rs_head param + 1 tiny gather kernel.
- OP C: a GENUINELY NEW kernel on each backend. The existing `ssm_conv` computes
  a windowed reduction over a PRE-concatenated input; it does not assemble the
  window from cached taps + the current token, fold silu, or write the shifted
  ring state. This is the largest net-new piece.
- OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1
  tiny conv gather kernel.

The `ggml.h` / `ggml.c` builders, the CPU reference kernels, the model-graph
emission (`delta-net-base.cpp`, qwen35*), and the `test-backend-ops` cases are
SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend
work is the kernel(s) + the backend wiring.

--------------------------------------------------------------------------------
## 2. Per-backend: authoring model, effort, gotchas, wiring

### 2.1 Metal (MSL)

Authoring model: `.metal` MSL source (`ggml-metal.metal`), function-constant
specialization (e.g. `FC_GATED_DELTA_NET`), kernels templated on `NSG`; host
glue split across `ggml-metal-ops.cpp` (`ggml_metal_op_*` encode), the pipeline
lookup in `ggml-metal-device.cpp`/`.m`, the kargs struct in `ggml-metal-impl.h`,
and `supports_op` in `ggml-metal-device.m`. Threadgroup model; Apple GPU
simdgroup width is a FIXED 32, `simd_sum` for the per-column reduce.

Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist
and are ergonomic to extend. OP A is a write-base redirect of the existing
`kernel_gated_delta_net_impl` (its tail already does
`dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j]` after
loading `ls[]` into registers - just point `dst_state` at the `state_dst` buffer
and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu
ssm_conv variant today - only plain + `_4` + batched - so the silu-fold and ring
write are both new). Host glue spans 3-4 files.

Gotchas:
- In-place race: the existing kernel ALREADY snapshots the state column into
  `ls[NSG]` registers before writing, so OP A/B are safe with no barrier; OP C/D
  must mirror the `float window[K]` local-copy-before-write that CPU/CUDA use.
- Discriminated SSM_CONV: `supports_op` for `GGML_OP_SSM_CONV` currently returns
  `has_simdgroup_reduction` with NO check of `src[3]`/`src[4]`; GDN returns
  `has_simdgroup_reduction && src[2]->ne[0] % 32 == 0` with NO check of
  `src[6]`/`src[7]`. Both must be tightened (accept the discriminated variant
  only once the kernel exists) AND `ggml_metal_op_ssm_conv` /
  `ggml_metal_op_gated_delta_net` must branch on the extra src to pick the kernel.
- Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three -
  the fused variant only redirects addresses, so it is bit-identical to Metal's
  own non-fused path by construction (the conv per-channel FMA needs the exact
  ascending order + the `+0.0f`).
- The kargs struct grows by the `state_dst` / `ids` / `rs_head` fields; a new
  pipeline name (or a function-constant branch) distinguishes the variants.
### 2.2 Vulkan (GLSL .comp -> SPIR-V)

Authoring model: GLSL `.comp` in `vulkan-shaders/`, compiled at build time by
`vulkan-shaders-gen` into embedded SPIR-V byte arrays (`gated_delta_net_f32_data`
etc.); pipeline creation in `ggml-vulkan.cpp` declares the binding count +
push-constant size; a push-constant struct per op; host dispatch `ggml_vk_*`
binds subbuffers; `supports_op` in the device support function. Subgroup size
VARIES by vendor (NVIDIA 32, AMD 64, Intel 8/16/32).

Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as
Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a
new `ggml_vk_create_pipeline` registration with an explicit binding count and
push-constant size, a new/extended push-constant struct (add `rs_head`), and
GROWING the descriptor binding set from the current 7 (`src[0..5]` + dst) to 8-9
(`state_dst`, `ids`). The GDN host dispatch hardcodes a 6-src bind loop and the
pipeline is created with `"main", 7, ...` - both must change.

Gotchas:
- Subgroup variance interacts with the EXISTING variant matrix: the GDN comp
  already ships shmem / cluster / nocluster variants keyed on subgroup size and
  relies on `S_V % COLS_PER_WG == 0`. The OP-A/B read/write redirect must be
  applied across ALL of those variants, and re-validated per vendor.
- In-place race: GLSL must read the full column shard into local registers before
  the state write-back (same pattern as the other backends); confirm the SPIR-V
  memory model is not relied on for cross-invocation ordering (it is not - blocks
  are disjoint per (seq,head)). OP C/D need the explicit window-to-local copy.
- Discriminated SSM_CONV: `supports_op` returns `op->src[0]->type == F32` with NO
  discriminator check; GDN loops `src[0..5]` F32 with NO `src[6]`/`src[7]` check.
  Both must be tightened. This is the backend where the 0030 hazard is most
  concrete (a present plain-conv kernel + a permissive supports_op = silent
  miscompute) - Vulkan is the exact case 0030 was written for.
- conv-update is per-channel (one invocation per channel) so it is
  subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden.
- Vulkan's `ssm_conv.comp` ALREADY has APPLY_SILU + APPLY_BIAS specialization
  constants, so the silu-fold half of OP C is partly precedented here (unlike
  Metal); the ring write-back + tap-window assembly are still new.
### 2.3 SYCL (single-source DPC++)

Authoring model: plain C++ `.cpp`/`.hpp` per op (`gated_delta_net.cpp`,
`ssm_conv.cpp`); a SYCL `queue.parallel_for` over an `nd_range` with
`reqd_sub_group_size(WARP_SIZE)`; sub-group reductions (`warp_reduce_sum`);
`supports_op` in `ggml-sycl.cpp`. NO separate shader-compile step (single
source).

Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are
near-VERBATIM mirrors of the CUDA ones (`launch_gated_delta_net<KDA,keep_rs>`,
`s_shard`, `curr_state`, `state = dst + attn_score_elems`, `warp_reduce_sum`) -
a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port
almost line-for-line: add the `state_dst` param, the `ids`/`rs_head` params, the
read-base select, the two tiny gather kernels, and the new conv-update kernel.
No pipeline/push-constant/binding bookkeeping.

Gotchas:
- In-place race: the `s_shard[]` / window arrays are per-work-item private, so
  the register-snapshot-before-write pattern carries over directly. Safe.
- Discriminated SSM_CONV: `supports_op` checks `src[0]`/`src[1]` F32 with NO
  discriminator check; GDN returns a BARE `true` (the MOST permissive, so the
  hazard is worst here). Both must be tightened, and `ggml_sycl_op_ssm_conv` /
  `ggml_sycl_op_gated_delta_net` must branch on the extra src.
- Bit-exactness: `WARP_SIZE` is compile-fixed (Intel sub-group 8/16/32), same
  situation as CUDA; the fused variant matches SYCL's own non-fused path by
  construction. conv-update is per-channel -> subgroup-agnostic.
### 2.4 Common wiring (all three) + the 0030 emission-gate change

Per backend, four wiring touch-points beyond the kernel body:
1. `supports_op`: tighten the `GGML_OP_SSM_CONV` and `GGML_OP_GATED_DELTA_NET`
   entries so the discriminated/extra-src node is reported supported ONLY when
   the new kernel handles it (and rejected otherwise, instead of today's
   silently-true-for-the-plain-kernel).
2. op handler: branch on `src[3]`/`src[4]` (conv) and `src[6]`/`src[7]` (GDN) to
   dispatch the fused kernel.
3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor
   bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions).
4. The patch-0030 gate in `src/llama-context.cpp`.

The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg
name is not `"CUDA"`/`"ROCm"`/`"MUSA"` forces `fused_gdn_ar = fused_gdn_ch =
auto_fgdn = false`. As each backend gains kernels this must become
capability-driven, in one of two ways:
- minimal: add the backend's reg name (e.g. `"Metal"`) to the allow-list once its
  kernels + tightened supports_op ship; OR
- clean (recommended upstream form): DELETE the name allow-list and make
  `supports_op` authoritative - have the `auto_fgdn` resolution probe
  `ggml_backend_dev_supports_op` on a representative node that carries the
  discriminated `src[]` slots. Then routing falls out of the normal scheduler
  fallback and no backend name is ever hard-coded. This also fixes 0030's stated
  weakness that the upstream `auto_fgdn` check only inspects GATED_DELTA_NET
  nodes and covered the discriminated SSM_CONV only incidentally.
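The shape of the "clean" option can be sketched with mock types. Nothing below is the ggml API: `Op`, `Node`, `Backend`, `supports_op` and `resolve_fused_gdn` are hypothetical stand-ins, and the capability flags are invented for illustration. The point is only that a tightened `supports_op` rejects discriminated nodes it has no kernel for, and that the gate probes representative nodes instead of comparing backend names.

```cpp
#include <array>
#include <cassert>
#include <initializer_list>
#include <string>

enum class Op { SSM_CONV, GATED_DELTA_NET };

// has_src[i] mirrors "src[i] != nullptr" on a graph node (mock type).
struct Node {
    Op op;
    std::array<bool, 8> has_src;
};

static Node make_node(Op op, std::initializer_list<int> slots) {
    Node n{op, {}};
    for (int i : slots) n.has_src[i] = true;
    return n;
}

// Hypothetical per-backend capability flags.
struct Backend {
    std::string name;
    bool has_fused_conv;
    bool has_fused_gdn;
};

// Tightened supports_op: the plain node is always accepted here, the
// discriminated node only when a fused kernel exists.
static bool supports_op(const Backend & b, const Node & n) {
    switch (n.op) {
        case Op::SSM_CONV: {
            const bool fused = n.has_src[3] || n.has_src[4];
            return !fused || b.has_fused_conv;
        }
        case Op::GATED_DELTA_NET: {
            const bool fused = n.has_src[6] || n.has_src[7];
            return !fused || b.has_fused_gdn;
        }
    }
    return false;
}

// Capability-driven gate: probe one representative node per fusion, covering
// the SSM_CONV discriminators as well as the GDN ones. No name allow-list.
static bool resolve_fused_gdn(const Backend & b) {
    const Node probes[] = {
        make_node(Op::SSM_CONV, {0, 1, 2, 3}),
        make_node(Op::SSM_CONV, {0, 1, 2, 3, 4}),
        make_node(Op::GATED_DELTA_NET, {0, 1, 2, 3, 4, 5, 6}),
        make_node(Op::GATED_DELTA_NET, {0, 1, 2, 3, 4, 5, 6, 7}),
    };
    for (const Node & n : probes) {
        if (!supports_op(b, n)) return false;
    }
    return true;
}
```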

--------------------------------------------------------------------------------

## 3. Bit-exactness per backend (the md5 gate question)

Feasible on ALL THREE, and not actually constraining, because of how the gate is
scoped:

- The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has
  its own f32 reduction order (Metal `simd_sum`, Vulkan subgroup reduce, SYCL
  `warp_reduce_sum`) that differs from CUDA's and from CPU's. There is no
  cross-backend md5 and none is expected.
- The relevant per-backend invariant is: the FUSED variant must equal that
  backend's OWN non-fused path. The fusions change only the read source
  (gather -> indexed read; the gather is a value-preserving memcpy) and the write
  target (appended output -> in-place cache slot). They do NOT touch the
  per-column FMA/reduce order. So the fused op is bit-identical to the
  non-fused op on the same backend BY CONSTRUCTION.
- Two arithmetic details each port MUST preserve exactly: (a) the conv
  ascending-tap order plus the `+0.0f` that matches plain `ssm_conv`'s
  `sumf += b` with b==0; (b) the existing GDN per-column subgroup reduce (do not
  re-order it). Get those right and `test-backend-ops` (backendX-vs-CPU, already
  registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS /
  GATED_DELTA_NET) is the per-backend gate.
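Why redirecting addresses is harmless while re-ordering a reduction is not can be shown with three floats. This is a host-side illustration only, with hypothetical helper names; the `volatile` accumulator is there to force each partial sum to be rounded to f32, and the build is assumed not to use fast-math.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Raw bit pattern of a float, for exact (md5-style) comparison.
static uint32_t bits(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    return u;
}

// f32 reduction in ascending index order.
static float sum_ascending(const float * v, size_t n) {
    volatile float acc = 0.0f;  // round every partial sum to f32
    for (size_t i = 0; i < n; ++i) acc = acc + v[i];
    return acc;
}

// Same values, opposite order: a different reduction order.
static float sum_descending(const float * v, size_t n) {
    volatile float acc = 0.0f;
    for (size_t i = n; i > 0; --i) acc = acc + v[i - 1];
    return acc;
}

// "Fused" read: the same values behind a different address (a slot inside a
// larger cache), reduced in the SAME order as the original.
static float sum_from_cache_slot(const std::vector<float> & cache,
                                 size_t slot, size_t n) {
    return sum_ascending(cache.data() + slot * n, n);
}
```

For `{1e8f, -1e8f, 1.0f}` the ascending sum is `1.0f` and the descending sum is `0.0f` (the `1.0f` is absorbed when added to `-1e8f`), whereas copying the three values into slot 2 of a cache and summing them ascending from there reproduces the ascending bits exactly.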

--------------------------------------------------------------------------------

## 4. Upstream path and ranked recommendation

### Ops-first, then one PR per backend (NOT one big PR)

Recommended sequence:

1. PR #1 - OPS (already essentially done, upstreamable as-is): the
   `ggml.h`/`ggml.c` builders, the CPU reference kernels, the CUDA kernels, the
   `test-backend-ops` cases, and the capability-driven gate (the clean
   `supports_op`-authoritative version of 0030). This is independently mergeable
   and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed
   that way).
2. PR #2 - Metal kernels + wiring.
3. PR #3 - SYCL kernels + wiring.
4. PR #4 - Vulkan kernels + wiring.

Do NOT bundle the backends: each needs its own hardware to validate
`test-backend-ops`, reviewers are backend-specialized, and a regression in one
must not block the others.
### Value x effort ranking (which backend first)

| backend | user base / value | author effort | bit-exact difficulty | net rank |
|---------|-------------------|---------------|----------------------|----------|
| Metal | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32 simdgroup) | **1st** |
| SYCL | LOW-MED (Intel GPU) | LOWEST (near-verbatim CUDA mirror) | LOW | **2nd** |
| Vulkan | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | **3rd** |

Recommendation: **Metal first.** It banks the biggest user-facing decode win at
medium effort, the base GDN + conv kernels already exist, and Apple's fixed
simdgroup width makes bit-exactness the simplest. **SYCL second** as a cheap,
nearly mechanical follow-on (the port is a line-for-line CUDA mirror, so it is
low-cost insurance even though the Intel-GPU audience is smaller). **Vulkan last**
as the high-effort / high-breadth capstone - it reaches the widest hardware
(AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the
existing variant matrix, the subgroup-width variance, and the per-vendor
validation burden make it the right capstone once the pattern is proven on
Metal + SYCL.

A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR
(both are register-snapshot ports with no shader-gen step) and treat Vulkan as a
separate later effort.

--------------------------------------------------------------------------------
## 5. Summary

- GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL
  (the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all
  three today via the non-fused path; Layer-2 is about the decode SPEEDUP.
- Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids
  read (OP B) to the existing GDN kernel, write ONE new conv-update kernel
  (OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten
  supports_op + the op-handler branch + (Vulkan) the
  pipeline/push-constant/descriptor wiring. The builders, CPU refs, model graph
  and tests are shared and already done.
- Bit-exactness is feasible everywhere and per-backend by construction (the
  fusions redirect addresses, not the f32 reduction order); `test-backend-ops`
  (backendX-vs-CPU) is the gate.
- Sequence: ops-first PR (incl. the capability-driven replacement for 0030's
  name allow-list), then Metal, then SYCL, then Vulkan.