LocalAI/backend/cpp/llama-cpp-localai-paged/docs/UPSTREAM_LAYER2_SCOPE.md
Ettore Di Giacinto 08b754f910 chore(paged): keep patches/ patch-only; README to backend root, docs to docs/
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.

Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
  final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)

Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)

Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.

The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 13:20:05 +00:00


Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL

Source-only analysis (no GPU, no build) of what it would take to give the gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute backends, so the patch-series decode win extends past CUDA-family hardware.

In our changeset (patches 0018-0030) these fusions ship with CUDA native kernels + CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan / SYCL (a CPU-fallback fused op would regress via the device round-trip, and a backend that ran the plain op on the discriminated node would silently miscompute). "Layer 2" is the upstream work that adds the missing native kernels.

This doc was written against the ggml backend trees in backend/cpp/llama-cpp-paged-dev (upstream base #24732, one commit OLDER than the series pin c299a92c #25045, with only the two paged-KV patches applied - neither touches GDN/SSM). So every "kernel already exists" statement below is a conservative lower bound: the pin has at least these kernels.


0. Headline finding (correct a stale assumption first)

The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there." That is now stale. All three backends already carry the BASE compute ops:

| op | Metal | Vulkan | SYCL |
|---|---|---|---|
| GGML_OP_GATED_DELTA_NET | kernel_gated_delta_net_impl (f32, NSG 1/2/4) | gated_delta_net.comp (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | gated_delta_net.cpp (launch_gated_delta_net<KDA,keep_rs>) |
| GGML_OP_SSM_CONV | kernel_ssm_conv_f32_f32 (+ _4, + batched) | ssm_conv.comp (+ APPLY_BIAS, APPLY_SILU specialization consts) | ssm_conv.cpp (kernel_ssm_conv) |
| GGML_OP_SSM_SCAN | yes | ssm_scan.comp (mamba2) | ssm_scan.cpp (mamba2) |

Verified: Vulkan gated_delta_net.comp was last touched at the upstream base commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today (via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is the decode SPEEDUP from the fusions, NOT enabling the model to run at all.

Consequence: the GDN-compute op being "partly there" is true on every backend, not just Metal. What is still missing per backend is only the FUSION plumbing (in-place write-back target, the ids gather read, and the conv-update kernel) - a materially smaller scope than "port GDN from scratch."


1. Per-op semantics (the four fusions to port)

All four reuse an existing GGML_OP enum with extra src[] slots as a discriminator; none adds a new enum value. f32 throughout. The arithmetic core is IDENTICAL to the upstream non-fused op; only the read source and/or the write target are redirected. That single fact drives the whole bit-exactness story (section 3).

OP A - ggml_gated_delta_net_inplace (patch 0018)

  • Enum GGML_OP_GATED_DELTA_NET, discriminated by a non-null src[6] = state_dst (a contiguous [S_v*S_v*H, n_seqs] view into the recurrent-state cache at kv_head). K == 1 only.
  • Semantics: run the standard GDN recurrence, but write the FINAL recurrent state directly into state_dst instead of appending it to the op output. The op output then carries only the attention scores. Removes the per-layer per-step ~full-state D2D copy-back (the 0018 win).
  • Race (in-place read == write): each (seq, head) block owns a disjoint cache slot. The kernel loads the whole prior state s0 into per-thread registers (s_shard on CUDA, ls[NSG] on Metal, the column shard on Vulkan/SYCL) BEFORE the ring write, so reading and writing the same slot is safe.
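
The snapshot-before-write rule above can be illustrated on the host. This is a minimal sketch, not the kernel: `update_state_block` is a hypothetical name, and the arithmetic (a scalar decay applied to a mirrored element) is a placeholder chosen only so that an aliasing bug would be visible; it is NOT the GDN recurrence.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one (seq, head) block of the in-place state update.
// ASSUMPTION: the per-element arithmetic is a placeholder, not the GDN recurrence.
void update_state_block(const float* state_src, float* state_dst, int n, float decay) {
    assert(n > 0);
    // 1. Snapshot the whole prior state first (registers on a GPU, a vector here).
    const std::vector<float> s0(state_src, state_src + n);
    // 2. Only then write. state_dst may be the very slot state_src points at,
    //    and each output reads a DIFFERENT element of the prior state.
    for (int i = 0; i < n; ++i) {
        state_dst[i] = decay * s0[n - 1 - i];
    }
}
```

Calling it with `state_src == state_dst` gives the same result as calling it with two separate buffers, which is the property the fused op relies on. Without step 1 the later iterations would read values already overwritten by the earlier ones.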

OP B - ggml_gated_delta_net_inplace_ids (patch 0019)

  • Adds src[5] = FULL state cache [S_v,S_v,H,n_rs_slots], src[7] = ids (I32, per-seq source slot == the recurrent-state s_copy), op_param[1] = rs_head (destination base slot). Still has the OP-A src[6] in-place target.
  • Semantics: read each sequence's prior state directly from cache[ids[seq]] (mirrors ggml_ssm_scan's ids source), eliminating the ggml_get_rows materialization. Combined with OP A the op now reads AND writes the cache in place.
  • Race: identity sequences (ids[s] == rs_head + s, the steady AR-decode case) read s0 in place from the destination slot (safe via the register snapshot above). Non-identity sequences (reorder / rs_zero remap) are first copied by a TINY separate gather kernel (gdn_gather_nonident, one block/seq) into a DISJOINT scratch that the recurrence then reads, so the recurrence never reads a slot another block is writing. Value-preserving memcpy -> bit-identical to the get_rows path.
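
The identity / non-identity split can be sketched as a host-side read-base select. This assumes a flat cache of `slot_elems` floats per slot and a scratch buffer with one slot per sequence; `select_read_bases` and its parameters are hypothetical names for illustration, not the patch's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns, per sequence, the pointer the recurrence should read its prior state from.
// All gathers finish inside this function, i.e. before any recurrence write happens,
// which mirrors running the gather as a separate pass ahead of the main kernel.
std::vector<const float*> select_read_bases(const float* cache, float* scratch,
                                            const std::int32_t* ids, std::int32_t rs_head,
                                            int n_seqs, std::size_t slot_elems) {
    assert(n_seqs > 0 && slot_elems > 0);
    std::vector<const float*> bases(static_cast<std::size_t>(n_seqs));
    for (int s = 0; s < n_seqs; ++s) {
        const std::size_t seq = static_cast<std::size_t>(s);
        if (ids[s] == rs_head + s) {
            // Identity: read in place from the destination slot itself.
            bases[seq] = cache + static_cast<std::size_t>(rs_head + s) * slot_elems;
        } else {
            // Non-identity: value-preserving copy into a disjoint scratch slot.
            float* dst = scratch + seq * slot_elems;
            std::memcpy(dst, cache + static_cast<std::size_t>(ids[s]) * slot_elems,
                        slot_elems * sizeof(float));
            bases[seq] = dst;
        }
    }
    return bases;
}
```

Because the non-identity path is a plain `memcpy`, the values the recurrence reads are the same bits a get_rows materialization would have produced.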

OP C - ggml_ssm_conv_update_inplace (patch 0021)

  • Enum GGML_OP_SSM_CONV, discriminated by a non-null src[3] = conv_state_dst ([(K-1)*channels, n_seqs] in-place ring view). src[0] = conv_states [K-1, channels, n_seqs], src[1] = conv_kernel [K, channels], src[2] = x_cur [channels, 1, n_seqs]. op_param[0] = fuse_silu.
  • Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the width-K conv window in registers from the K-1 cached taps + the current token, compute the depthwise conv with the SAME ascending-tap FMA order as plain ssm_conv (tap0*w0 + ... + xc*w_{K-1}, then +0.0f to match plain conv's sumf += b with b==0), optionally fold SiLU, write the conv output [channels,1,n_seqs], and write the 1-token-shifted ring state back in place. Replaces the multi-op decode conv chain (transpose + concat + conv + silu + ring cpy).
  • Race: read source (gathered taps) and write target (cache view) are disjoint buffers -> race-free by construction, no ids/identity logic.
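
A CPU-style sketch of the OP C semantics, under stated assumptions: the first dimension of each shape is the fastest-varying one, `MAX_K` is an arbitrary bound for the local window, and `ssm_conv_update_ref` is a hypothetical name. It is meant to show the window assembly, the ascending-tap order with the trailing +0.0f, and the shifted ring write; it is not the patch's reference kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr int MAX_K = 8;  // ASSUMPTION: upper bound on the conv width for the local window

// conv_states    [K-1, channels, n_seqs]   cached taps (read)
// conv_kernel    [K, channels]             depthwise weights
// x_cur          [channels, 1, n_seqs]     current token
// conv_state_dst [(K-1)*channels, n_seqs]  ring view (write)
// out            [channels, 1, n_seqs]     conv output
void ssm_conv_update_ref(const float* conv_states, const float* conv_kernel, const float* x_cur,
                         float* conv_state_dst, float* out,
                         int K, int channels, int n_seqs, bool fuse_silu) {
    assert(K >= 2 && K <= MAX_K);
    for (int s = 0; s < n_seqs; ++s) {
        for (int c = 0; c < channels; ++c) {
            const std::size_t col  = static_cast<std::size_t>(c) + static_cast<std::size_t>(channels) * s;
            const std::size_t slot = static_cast<std::size_t>(K - 1) * col;
            // Assemble the width-K window: K-1 cached taps, then the current token.
            float window[MAX_K];
            for (int t = 0; t < K - 1; ++t) window[t] = conv_states[slot + t];
            window[K - 1] = x_cur[col];
            // Ascending-tap accumulation, then +0.0f (mirrors sumf += b with b == 0).
            const float* w = conv_kernel + static_cast<std::size_t>(K) * c;
            float sum = window[0] * w[0];
            for (int k = 1; k < K; ++k) sum += window[k] * w[k];
            sum += 0.0f;
            if (fuse_silu) sum = sum / (1.0f + std::exp(-sum));  // SiLU: x * sigmoid(x)
            out[col] = sum;
            // Write the 1-token-shifted ring state from the local window.
            for (int t = 0; t < K - 1; ++t) conv_state_dst[slot + t] = window[t + 1];
        }
    }
}
```

The trailing `sum += 0.0f` is not a no-op: when every product is -0.0f the accumulated sum is -0.0f, and adding +0.0f turns it into +0.0f, which is what a path that adds a zero bias would produce.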

OP D - ggml_ssm_conv_update_inplace_ids (patch 0028)

  • Same enum, discriminated by a non-null src[4] = ids; src[0] becomes the FULL conv cache [K-1, channels, n_cells]; op_param[1] = rs_head.
  • Semantics: gather-free conv-update - read each sequence's prior taps from cache[ids[s]] in-kernel (no get_rows). Identity reads in place from conv_state_dst; non-identity gathered into a disjoint scratch first by a tiny ssm_conv_gather_nonident kernel. The window is copied to a local array BEFORE the (possibly aliasing) ring write so the identity read==write slot is correct. Bit-identical to get_rows + OP C.

Net new kernels vs reuse, per op

  • OP A: NOT a new compute kernel - a write-target redirection of the EXISTING GDN kernel + 1 buffer binding + a supports_op/op-handler branch.
  • OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) + 1 ids binding + rs_head param + 1 tiny gather kernel.
  • OP C: a GENUINELY NEW kernel on each backend. The existing ssm_conv computes a windowed reduction over a PRE-concatenated input; it does not assemble the window from cached taps + the current token, fold silu, or write the shifted ring state. This is the largest net-new piece.
  • OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1 tiny conv gather kernel.

The ggml.h / ggml.c builders, the CPU reference kernels, the model-graph emission (delta-net-base.cpp, qwen35*), and the test-backend-ops cases are SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend work is the kernel(s) + the backend wiring.


2. Per-backend: authoring model, effort, gotchas, wiring

2.1 Metal (MSL)

Authoring model: .metal MSL source (ggml-metal.metal), function-constant specialization (e.g. FC_GATED_DELTA_NET), kernels templated on NSG; host glue split across ggml-metal-ops.cpp (ggml_metal_op_* encode), the pipeline lookup in ggml-metal-device.cpp/.m, the kargs struct in ggml-metal-impl.h, and supports_op in ggml-metal-device.m. Threadgroup model; the Apple GPU simdgroup width is a FIXED 32, and simd_sum does the per-column reduce.

Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist and are ergonomic to extend. OP A is a write-base redirect of the existing kernel_gated_delta_net_impl (its tail already does dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j] after loading ls[] into registers - just point dst_state at the state_dst buffer and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu ssm_conv variant today - only plain + _4 + batched - so the silu-fold and ring write are both new). Host glue spans 3-4 files.

Gotchas:

  • In-place race: the existing kernel ALREADY snapshots the state column into ls[NSG] registers before writing, so OP A/B are safe with no barrier; OP C/D must mirror the float window[K] local-copy-before-write that CPU/CUDA use.
  • Discriminated SSM_CONV: supports_op for GGML_OP_SSM_CONV currently returns has_simdgroup_reduction with NO check of src[3]/src[4]; GDN returns has_simdgroup_reduction && src[2]->ne[0] % 32 == 0 with NO check of src[6]/src[7]. Both must be tightened (accept the discriminated variant only once the kernel exists) AND ggml_metal_op_ssm_conv / ggml_metal_op_gated_delta_net must branch on the extra src to pick the kernel.
  • Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three - the fused variant only redirects addresses, so it is bit-identical to Metal's own non-fused path by construction (the conv per-channel FMA needs the exact ascending order + the +0.0f).
  • The kargs struct grows by the state_dst / ids / rs_head fields; a new pipeline name (or a function-constant branch) distinguishes the variants.

2.2 Vulkan (GLSL .comp -> SPIR-V)

Authoring model: GLSL .comp in vulkan-shaders/, compiled at build time by vulkan-shaders-gen into embedded SPIR-V byte arrays (gated_delta_net_f32_data etc.); pipeline creation in ggml-vulkan.cpp declares the binding count + push-constant size; a push-constant struct per op; host dispatch ggml_vk_* binds subbuffers; supports_op in the device support function. Subgroup size VARIES by vendor (NVIDIA 32, AMD 64, Intel 8/16/32).

Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a new ggml_vk_create_pipeline registration with an explicit binding count and push-constant size, a new/extended push-constant struct (add rs_head), and GROWING the descriptor binding set from the current 7 (src[0..5] + dst) to 8-9 (state_dst, ids). The GDN host dispatch hardcodes a 6-src bind loop and the pipeline is created with "main", 7, ... - both must change.

Gotchas:

  • Subgroup variance interacts with the EXISTING variant matrix: the GDN comp already ships shmem / cluster / nocluster variants keyed on subgroup size and relies on S_V % COLS_PER_WG == 0. The OP-A/B read/write redirect must be applied across ALL of those variants, and re-validated per vendor.
  • In-place race: GLSL must read the full column shard into local registers before the ring write (same pattern); confirm the SPIR-V memory model is not relied on for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)). OP C/D need the explicit window-to-local copy.
  • Discriminated SSM_CONV: supports_op returns op->src[0]->type == F32 with NO discriminator check; GDN loops src[0..5] F32 with NO src[6]/src[7] check. Both must be tightened. This is the backend where the 0030 hazard is most concrete (a present plain-conv kernel + a permissive supports_op = silent miscompute) - Vulkan is the exact case 0030 was written for.
  • conv-update is per-channel (one invocation per channel) so it is subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden.
  • Vulkan's ssm_conv.comp ALREADY has APPLY_SILU + APPLY_BIAS specialization constants, so the silu-fold half of OP C is partly precedented here (unlike Metal); the ring write-back + tap-window assembly are still new.

2.3 SYCL (single-source DPC++)

Authoring model: plain C++ .cpp/.hpp per op (gated_delta_net.cpp, ssm_conv.cpp); a SYCL queue.parallel_for over an nd_range with reqd_sub_group_size(WARP_SIZE); sub-group reductions (warp_reduce_sum); supports_op in ggml-sycl.cpp. NO separate shader-compile step (single source).

Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are near-VERBATIM mirrors of the CUDA ones (launch_gated_delta_net<KDA,keep_rs>, s_shard, curr_state, state = dst + attn_score_elems, warp_reduce_sum) - a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port almost line-for-line: add the state_dst param, the ids/rs_head params, the read-base select, the two tiny gather kernels, and the new conv-update kernel. No pipeline/push-constant/binding bookkeeping.

Gotchas:

  • In-place race: the s_shard[] / window arrays are per-work-item private, so the register-snapshot-before-write pattern carries over directly. Safe.
  • Discriminated SSM_CONV: supports_op checks src[0]/src[1] F32 with NO discriminator check; GDN returns a BARE true (the MOST permissive, so the hazard is worst here). Both must be tightened, and ggml_sycl_op_ssm_conv / ggml_sycl_op_gated_delta_net must branch on the extra src.
  • Bit-exactness: WARP_SIZE is compile-fixed (Intel sub-group 8/16/32), same situation as CUDA; the fused variant matches SYCL's own non-fused path by construction. conv-update is per-channel -> subgroup-agnostic.

2.4 Common wiring (all three) + the 0030 emission-gate change

Per backend, four wiring touch-points beyond the kernel body:

  1. supports_op: tighten the GGML_OP_SSM_CONV and GGML_OP_GATED_DELTA_NET entries so the discriminated/extra-src node is reported supported ONLY when the new kernel handles it (and rejected otherwise, instead of today's silently-true-for-the-plain-kernel).
  2. op handler: branch on src[3]/src[4] (conv) and src[6]/src[7] (GDN) to dispatch the fused kernel.
  3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions).
  4. The patch-0030 gate in src/llama-context.cpp.
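
Touch-points 1 and 2 share one piece of logic: look at the extra src[] slots and answer for the variant actually present. A minimal sketch follows, using simplified hypothetical types (`Node`, `KernelSet`, `supports_op_tightened`) rather than the real ggml structures; it covers only the two ops discussed here and elides each backend's other checks (types, shapes).

```cpp
#include <cassert>

enum class Op { SsmConv, GatedDeltaNet, Other };

struct Node {                    // hypothetical, simplified stand-in for a graph node
    Op op;
    const void* src[8];
};

struct KernelSet {               // which fused kernels this (hypothetical) backend ships
    bool fused_gdn_inplace;      // OP A
    bool fused_gdn_inplace_ids;  // OP B
    bool fused_conv_update;      // OP C
    bool fused_conv_update_ids;  // OP D
};

// Report a discriminated node as supported only when the matching kernel exists.
bool supports_op_tightened(const KernelSet& k, const Node& n) {
    switch (n.op) {
        case Op::SsmConv:
            if (n.src[4] != nullptr) return k.fused_conv_update_ids;  // ids present
            if (n.src[3] != nullptr) return k.fused_conv_update;      // conv_state_dst present
            return true;  // plain op; other checks elided
        case Op::GatedDeltaNet:
            if (n.src[7] != nullptr) return k.fused_gdn_inplace_ids;  // ids present
            if (n.src[6] != nullptr) return k.fused_gdn_inplace;      // state_dst present
            return true;  // plain op; other checks elided
        default:
            return false; // ops outside this sketch
    }
}
```

The op handler would branch on the same slots in the same order, so a node that supports_op accepts is always one the handler knows how to dispatch.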

The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg name is not "CUDA"/"ROCm"/"MUSA" forces fused_gdn_ar = fused_gdn_ch = auto_fgdn = false. As each backend gains kernels this must become capability-driven, in one of two ways:

  • minimal: add the backend's reg name (e.g. "Metal") to the allow-list once its kernels + tightened supports_op ship; OR
  • clean (recommended upstream form): DELETE the name allow-list and make supports_op authoritative - have the auto_fgdn resolution probe ggml_backend_dev_supports_op on a representative node that carries the discriminated src[] slots. Then routing falls out of the normal scheduler fallback and no backend name is ever hard-coded. This also fixes 0030's stated weakness that the upstream auto_fgdn check only inspects GATED_DELTA_NET nodes and covered the discriminated SSM_CONV only incidentally.
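
The two gate shapes can be contrasted in a small sketch. `Device`, `FusedNode` and both functions are hypothetical illustrations; the real code would query the backend through ggml_backend_dev_supports_op on an actual node, which this sketch replaces with a callback.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum class FusedNode { GdnInPlaceIds, ConvUpdateInPlaceIds };

struct Device {                                   // hypothetical stand-in for a backend device
    std::string reg_name;
    bool is_cpu;
    std::function<bool(FusedNode)> supports_op;   // stand-in for the supports_op probe
};

// Minimal form: a name allow-list, as the 0030 gate does today.
bool fused_allowed_by_name(const std::vector<Device>& devs) {
    for (const Device& d : devs) {
        if (d.is_cpu) continue;
        if (d.reg_name != "CUDA" && d.reg_name != "ROCm" && d.reg_name != "MUSA") return false;
    }
    return true;
}

// Clean form: no names; probe both discriminated node kinds on every non-CPU device.
bool fused_allowed_by_capability(const std::vector<Device>& devs) {
    for (const Device& d : devs) {
        if (d.is_cpu) continue;
        if (!d.supports_op(FusedNode::GdnInPlaceIds)) return false;
        if (!d.supports_op(FusedNode::ConvUpdateInPlaceIds)) return false;
    }
    return true;
}
```

Probing the discriminated SSM_CONV node explicitly is the point of the clean form: a backend that has only the fused GDN kernel is still gated off, instead of being covered incidentally.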

3. Bit-exactness per backend (the md5 gate question)

Feasible on ALL THREE, and not actually constraining, because of how the gate is scoped:

  • The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has its own f32 reduction order (Metal simd_sum, Vulkan subgroup reduce, SYCL warp_reduce_sum) that differs from CUDA's and from CPU's. There is no cross-backend md5 and none is expected.
  • The relevant per-backend invariant is: the FUSED variant must equal that backend's OWN non-fused path. The fusions change only the read source (gather -> indexed read; the gather is a value-preserving memcpy) and the write target (appended output -> in-place cache slot). They do NOT touch the per-column FMA/reduce order. So the fused op is bit-identical to the non-fused op on the same backend BY CONSTRUCTION.
  • Two arithmetic details each port MUST preserve exactly: (a) the conv ascending-tap order plus the +0.0f that matches plain ssm_conv's sumf += b with b==0; (b) the existing GDN per-column subgroup reduce (do not re-order it). Get those right and test-backend-ops (backendX-vs-CPU, already registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET) is the per-backend gate.
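
Why the two details matter can be shown with a bitwise comparison. This is an illustrative sketch, assuming IEEE-754 binary32 arithmetic without excess precision or fast-math; the helper names are hypothetical and are not part of test-backend-ops.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Bitwise (not numeric) equality: the comparison an md5-style gate effectively makes.
bool bit_identical(const float* a, const float* b, std::size_t n) {
    return std::memcmp(a, b, n * sizeof(float)) == 0;
}

// Accumulate starting from the first element, in ascending index order.
float sum_ascending(const float* v, int n) {
    assert(n > 0);
    float s = v[0];
    for (int i = 1; i < n; ++i) s += v[i];
    return s;
}

// Same terms, opposite order.
float sum_descending(const float* v, int n) {
    assert(n > 0);
    float s = v[n - 1];
    for (int i = n - 2; i >= 0; --i) s += v[i];
    return s;
}
```

For the terms {1e8f, -1e8f, 1.0f} the two orders round differently and the results are not bit-identical, which is why a port must not re-order the reduce. Likewise -0.0f and -0.0f + 0.0f compare equal with == but differ in their sign bit, which is why the +0.0f has to be kept.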

4. Upstream path and ranked recommendation

Ops-first, then one PR per backend (NOT one big PR)

Recommended sequence:

  1. PR #1 - OPS (already essentially done, upstreamable as-is): the ggml.h / ggml.c builders, the CPU reference kernels, the CUDA kernels, the test-backend-ops cases, and the capability-driven gate (the clean supports_op-authoritative version of 0030). This is independently mergeable and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed that way).
  2. PR #2 - Metal kernels + wiring.
  3. PR #3 - SYCL kernels + wiring.
  4. PR #4 - Vulkan kernels + wiring.

Do NOT bundle the backends: each needs its own hardware to validate test-backend-ops, reviewers are backend-specialized, and a regression in one must not block the others.

Value x effort ranking (which backend first)

| backend | user base / value | author effort | bit-exact difficulty | net rank |
|---|---|---|---|---|
| Metal | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32 simdgroup) | 1st |
| SYCL | LOW-MED (Intel GPU) | LOWEST (near-verbatim CUDA mirror) | LOW | 2nd |
| Vulkan | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | 3rd |

Recommendation: Metal first. It banks the biggest user-facing decode win at medium effort, the base GDN + conv kernels already exist, and Apple's fixed simdgroup width makes bit-exactness the simplest. SYCL second as a cheap, nearly mechanical follow-on (the port is a near line-for-line CUDA mirror, so it is low-cost insurance even though the Intel-GPU audience is smaller). Vulkan last as the high-effort / high-breadth capstone: it reaches the widest hardware (AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the existing variant matrix, the subgroup-width variance, and the per-vendor validation burden mean it is best tackled once the pattern is proven on Metal + SYCL.

A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR (both are register-snapshot ports with no shader-gen step) and treat Vulkan as a separate later effort.


5. Summary

  • GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL (the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all three today via the non-fused path; Layer-2 is about the decode SPEEDUP.
  • Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids read (OP B) to the existing GDN kernel, write ONE new conv-update kernel (OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten supports_op + the op-handler branch + (Vulkan) the pipeline/push-constant/descriptor wiring. The builders, CPU refs, model graph and tests are shared and already done.
  • Bit-exactness is feasible everywhere and per-backend by construction (the fusions redirect addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU) is the gate.
  • Sequence: ops-first PR (incl. the capability-driven replacement for 0030's name allow-list), then Metal, then SYCL, then Vulkan.