feat(paged): backend-gate fused GDN/discriminated SSM_CONV emission (patch 0030)

Closes audit RISKY-1 (the one latent silent-miscompute hazard). The fused/in-place
Gated Delta Net op (0018/0019/0026) and the discriminated SSM_CONV decode op
(0021/0028, which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET via a non-null
src[3]/src[4] discriminator) are CUDA+CPU-only but were emitted DEFAULT-ON
(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) with no backend guard. A backend
that supports plain SSM_CONV but ignores the discriminator (Vulkan/SYCL/Metal)
would run the wrong plain conv => silent corruption.

Fix: in llama_context::sched_reserve(), before the auto_fgdn resolution, force
fused_gdn_ar = fused_gdn_ch = auto_fgdn = false when any non-CPU compute backend
is not CUDA-family (reg name not "CUDA"/"ROCm"/"MUSA"). Every emission site keys
off these flags, so the graph falls back to the upstream non-fused plain
ggml_ssm_conv + ggml_silu path that every backend handles.

On CUDA the reg name is "CUDA", the flags are left untouched, and the decode graph
is byte-identical.

Mirror of DGX paged patch 0030; adds FUSED_OP_BACKEND_GATE_RESULTS.md.

Verified GPU-free: reconstructed pin 9d5d882d + paged 0001-0029 + 0030, CPU-only
build (GGML_CUDA=OFF) of libllama + test-backend-ops links with 0 errors; 0030
applies cleanly via git apply and patch -p1. test-backend-ops correctness for
SSM_CONV/SSM_CONV_UPDATE(_IDS)/GATED_DELTA_NET is CUDA0-vs-CPU (pending DGX,
tunnel offline this session); registered test cases will exercise it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:57:14 -04:00 · 2026-06-27 07:32:49 +00:00
parent 2332587fdc
commit 621a20d2b5
2 changed files with 202 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0030-fused-op-backend-gate.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0030-fused-op-backend-gate.patch
@@ -0,0 +1,106 @@
+From a095f4ebeefafd16dd54c514eb86148fa46daef3 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Sat, 27 Jun 2026 07:30:43 +0000
+Subject: [PATCH] feat(paged): backend-gate fused GDN/discriminated SSM_CONV
+ emission (patch 0030)
+
+Closes the latent silent-miscompute hazard (audit RISKY-1). The fused/in-place
+Gated Delta Net op (0018/0019/0026: ggml_gated_delta_net_inplace[_ids][_hybrid])
+and the discriminated SSM_CONV decode op (0021/0028: ggml_ssm_conv_update_inplace
+[_ids], which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET with extra src
+slots - a non-null src[3]/src[4] ring/ids discriminator) are emitted DEFAULT-ON
+(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) but are implemented for the
+CUDA-family TU (CUDA / HIP "ROCm" / "MUSA", hipified ggml-cuda) and the CPU
+reference ONLY.
+
+The hazard: a compute backend that supports PLAIN GGML_OP_SSM_CONV but ignores
+the src[3]/src[4] discriminator (Vulkan/SYCL/Metal) reports supports_op==true for
+the node and the scheduler assigns the discriminated conv to it; it then runs the
+wrong plain conv => SILENT corruption (not a crash). The upstream auto_fgdn
+device-mismatch resolution only inspects GATED_DELTA_NET nodes, so the
+discriminated-SSM_CONV safety was only incidentally covered (it happened to share
+backend coverage with the GDN op); the hazard becomes live the moment a non-CUDA
+paged build of a gated-DeltaNet model exists.
+
+FIX: gate the fused-op emission on the active compute backend type. Before the
+auto_fgdn resolution in llama_context::sched_reserve(), if any non-CPU compute
+backend is not CUDA-family (reg name != "CUDA"/"ROCm"/"MUSA"), force
+fused_gdn_ar = fused_gdn_ch = auto_fgdn = false. Every emission site keys off
+these flags (conv_decode_fused = ... && fused_gdn_ar; fused = ... fused_gdn_ar/ch),
+so disabling them routes the graph to the upstream non-fused path: a PLAIN
+ggml_ssm_conv (no discriminator) + ggml_silu, which every backend handles
+correctly. This makes the discriminated-op safety explicit and decoupled from the
+GDN-op device-mismatch heuristic.
+
+INVARIANT (CUDA byte-identical): on a CUDA backend the reg name is "CUDA", so
+fgdn_backend_ok stays true, the flags are left untouched, and the emitted decode
+graph is unchanged - byte-identical to pre-0030. The fix only changes behavior on
+non-CUDA/non-CPU backends.
+
+GATE compile: CPU-only build (GGML_CUDA=OFF) of the full series (pin 9d5d882d +
+0001-0029 + this) links libllama.so and test-backend-ops with 0 errors; the
+edited llama-context.cpp compiles clean (uses only already-included <cstring> +
+backend-reg API already used in this TU). test-backend-ops correctness for
+SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET is a
+CUDA0-vs-CPU comparison (a CPU-only run skips CPU-vs-CPU); the test cases are
+registered and will be exercised on the pending CUDA DGX run.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/llama-context.cpp | 39 +++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 39 insertions(+)
+
+diff --git a/src/llama-context.cpp b/src/llama-context.cpp
+index ad7939e..c408eef 100644
+--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
+@@ -521,6 +521,45 @@ void llama_context::sched_reserve() {
+         cparams.auto_fa = false;
+     }
+ 
+    // RISKY-1 guard: the fused/in-place Gated Delta Net op and the discriminated
+    // SSM_CONV (which reuse GGML_OP_GATED_DELTA_NET / GGML_OP_SSM_CONV with extra
+    // src slots - a non-null src[3]/src[4] ring/ids discriminator) are only
+    // implemented for the CUDA-family backends (CUDA / HIP "ROCm" / "MUSA" - all
+    // built from the hipified ggml-cuda TU) and the CPU reference. Any other
+    // compute backend (Vulkan/SYCL/Metal/...) that supports *plain* SSM_CONV but
+    // ignores the discriminator src would silently run the WRONG conv. The
+    // upstream auto_fgdn device-mismatch check below only inspects
+    // GATED_DELTA_NET nodes, so couple the discriminated-SSM_CONV safety
+    // explicitly to the backend type here: keep the fused path enabled only when
+    // every non-CPU compute backend is CUDA-family. On CUDA this leaves the flags
+    // untouched, so the emitted decode graph is byte-identical.
+    if (cparams.fused_gdn_ar || cparams.fused_gdn_ch) {
+        bool fgdn_backend_ok = true;
+        for (auto & backend : backends) {
+            ggml_backend_dev_t dev = ggml_backend_get_device(backend.get());
+            if (!dev || ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) {
+                // CPU reference handles the fused/discriminated ops
+                continue;
+            }
+            ggml_backend_reg_t reg  = ggml_backend_dev_backend_reg(dev);
+            const char *       name = reg ? ggml_backend_reg_name(reg) : "";
+            // GGML_CUDA_NAME is "CUDA" / "ROCm" (HIP) / "MUSA"; all three are the
+            // same ggml-cuda TU that carries the discriminated-op handling.
+            if (strcmp(name, "CUDA") != 0 && strcmp(name, "ROCm") != 0 && strcmp(name, "MUSA") != 0) {
+                fgdn_backend_ok = false;
+                break;
+            }
+        }
+
+        if (!fgdn_backend_ok) {
+            cparams.fused_gdn_ar = false;
+            cparams.fused_gdn_ch = false;
+            cparams.auto_fgdn    = false;
+            LLAMA_LOG_INFO("%s: fused Gated Delta Net / discriminated SSM_CONV disabled "
+                    "(a compute backend is not CUDA/ROCm/MUSA or CPU)\n", __func__);
+        }
+    }
+
+     if (cparams.auto_fgdn) {
+         LLAMA_LOG_INFO("%s: resolving fused Gated Delta Net support:\n", __func__);
+ 
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md
@@ -0,0 +1,96 @@
+# Patch 0030 - fused-op backend gate (audit RISKY-1 fix) - RESULTS
+
+Closes the single latent silent-miscompute hazard from `ARCH_GENERALITY_AUDIT.md`
+(RISKY-1): the fused GDN / discriminated-SSM_CONV decode ops are CUDA+CPU-only but
+were emitted DEFAULT-ON with no backend guard.
+
+## The hazard
+
+- `cparams.fused_gdn_ar = fused_gdn_ch = auto_fgdn = true` are set unconditionally
+  in the `llama_context` constructor (`src/llama-context.cpp`).
+- Patches 0018/0019/0026 add `ggml_gated_delta_net_inplace[_ids][_hybrid]`
+  (reuse `GGML_OP_GATED_DELTA_NET` with extra src slots).
+- Patches 0021/0028 add `ggml_ssm_conv_update_inplace[_ids]` which **reuse
+  `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]`/`src[4]`** (ring/ids).
+- Both families have CUDA + CPU kernels only. No `supports_op` change was made for
+  the discriminated variants.
+- A backend that supports **plain** `SSM_CONV` but ignores the discriminator
+  (Vulkan/SYCL/Metal) returns `supports_op==true` for the node; the scheduler
+  assigns the discriminated conv to it; it runs the **wrong plain conv** =>
+  SILENT corruption (not a crash).
+- The upstream `auto_fgdn` resolution only inspects `GATED_DELTA_NET` nodes, so the
+  discriminated-`SSM_CONV` safety was only **incidentally** covered (the GDN op and
+  the discriminated conv happened to share backend coverage). The hazard goes live
+  the moment a non-CUDA paged build of a gated-DeltaNet model exists.
+
+## The fix (emission gate, not supports_op)
+
+Chosen route: **gate the emission on the active compute backend type.** The
+`supports_op` route would require editing every other backend's per-device
+`supports_op` (Vulkan/SYCL/Metal/...) to reject the discriminated `SSM_CONV` -
+invasive, fragile, and not centrally exposed by the ggml backend interface. The
+emission gate is self-contained in the fork's own code.
+
+`src/llama-context.cpp`, in `llama_context::sched_reserve()`, immediately before
+the existing `if (cparams.auto_fgdn)` resolution block: if any **non-CPU** compute
+backend has a reg name other than `"CUDA"` / `"ROCm"` (HIP) / `"MUSA"` (the three
+`GGML_CUDA_NAME` values - all the same hipified ggml-cuda TU that carries the
+discriminated-op handling), force
+`fused_gdn_ar = fused_gdn_ch = auto_fgdn = false`.
+
+Every emission site keys off these flags:
+`conv_decode_fused = (n_seq_tokens==1) && (n_rs_seq==0) && fused_gdn_ar`
+(qwen35/qwen35moe/qwen3next + `build_conv_state_fused`) and
+`fused = (n_seq_tokens==1) ? fused_gdn_ar : fused_gdn_ch` (delta-net-base). With
+the flags false the graph takes the upstream non-fused branch: a **plain
+`ggml_ssm_conv` (no discriminator) + `ggml_silu`**, which every backend handles
+correctly.
+
+## CUDA byte-identical invariant
+
+On a CUDA backend the reg name is `"CUDA"`, so `fgdn_backend_ok` stays true and
+the flags are left untouched. The fix only changes behavior when a non-CPU
+compute backend is outside the CUDA family. The CUDA decode graph is therefore
+byte-identical to pre-0030 **by construction** (no flag flips on CUDA), so the
+existing greedy md5 gates are unaffected on the validated GB10 target.
+
+## Verification
+
+- COMPILE (GPU-free, done on a CPU box): reconstructed the exact source state
+  (upstream pin `9d5d882d` + paged patches `0001-0029`, .md docs stripped) and
+  applied 0030. CPU-only build (`-DGGML_CUDA=OFF`) of `llama` + `test-backend-ops`
+  links `libllama.so` and the test binary with **0 errors**; the edited
+  `llama-context.cpp` compiles clean (uses only the already-included `<cstring>`
+  and the backend-reg API already used in this TU:
+  `ggml_backend_dev_backend_reg` / `ggml_backend_reg_name` /
+  `ggml_backend_dev_type`).
+- 0030 applies cleanly on a fresh pin+0001-0029 tree via both `git apply --check`
+  (Makefile path) and `patch -p1 -N` (prepare.sh path).
+- test-backend-ops correctness is a **CUDA0-vs-CPU** comparison; a CPU-only run
+  skips CPU-vs-CPU by design ("Skipping CPU backend"). The relevant test cases are
+  registered and will be exercised by the DGX CUDA run:
+  `test_ssm_conv` / `test_ssm_conv_update` (SSM_CONV_UPDATE) /
+  `test_ssm_conv_update_ids` (SSM_CONV_UPDATE_IDS) /
+  `test_gated_delta_net` (+ `_hybrid`).
+
+## Pending on the DGX (GPU)
+
+The CUDA-side confirmation could not be run from the CPU box: the DGX cloudflared
+tunnel `jp-6.prem.io` returned `websocket: bad handshake` for the whole session
+(origin offline). Once the DGX is reachable, run the following in
+`~/llama-paged-dev` (branch `paged`), then commit 0030 there too:
+
+```
+test-backend-ops test -o SSM_CONV
+test-backend-ops test -o SSM_CONV_UPDATE
+test-backend-ops test -o SSM_CONV_UPDATE_IDS
+test-backend-ops test -o GATED_DELTA_NET   # expect: 2/2 backends passed, OK
+```
+
+Greedy md5 (only if >40GB VRAM free; must equal the established baselines):
+`q36-27b-nvfp4 == 5951a5b4d624ce891e22ab5fca9bc439`,
+`q36-35b-a3b-nvfp4 == 07db32c2bcb78d17a43ed18bc22705cd`. Since 0030 does not flip
+any flag on CUDA, the md5 is unchanged by code-path argument; the run is a
+belt-and-suspenders confirmation, not a correctness dependency.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
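
For readers who want to reason about the gate without building ggml, the decision made in `sched_reserve()` can be sketched as a standalone predicate. This is only an illustrative model, not the patched code: `BackendInfo`, `is_cuda_family` and `fused_ops_allowed` are hypothetical names, and the struct stands in for what the real code reads through `ggml_backend_dev_type` and `ggml_backend_reg_name`. The real gate also skips backends with a null device, which this sketch folds into `is_cpu`.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the two facts the gate reads per backend:
// whether the device is the CPU reference, and the backend registry name.
struct BackendInfo {
    bool        is_cpu;    // true for the CPU reference backend
    std::string reg_name;  // e.g. "CUDA", "ROCm", "MUSA", "Vulkan"
};

// Registry names treated as CUDA-family, as listed in the patch description.
inline bool is_cuda_family(const std::string & name) {
    return name == "CUDA" || name == "ROCm" || name == "MUSA";
}

// Returns true only when every non-CPU backend is CUDA-family.
// When this is false, the patch forces fused_gdn_ar, fused_gdn_ch and
// auto_fgdn to false so the graph takes the plain ggml_ssm_conv + ggml_silu path.
inline bool fused_ops_allowed(const std::vector<BackendInfo> & backends) {
    for (const BackendInfo & b : backends) {
        if (b.is_cpu) {
            continue;  // the CPU reference handles the fused/discriminated ops
        }
        if (!is_cuda_family(b.reg_name)) {
            return false;  // one unsupported backend disables the fused path
        }
    }
    return true;
}
```

Under this model a mixed CUDA + Vulkan setup disables the fused path for the whole context, which matches the "any non-CPU compute backend is not CUDA-family" wording above; a CPU-only or CUDA + CPU setup leaves the flags untouched.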