feat(paged): backend-gate fused GDN/discriminated SSM_CONV emission (patch 0030)

Closes audit RISKY-1 (the one latent silent-miscompute hazard). The fused/in-place
Gated Delta Net op (0018/0019/0026) and the discriminated SSM_CONV decode op
(0021/0028) REUSE GGML_OP_GATED_DELTA_NET / GGML_OP_SSM_CONV with extra src slots
(a non-null src[3]/src[4] discriminator). Both are implemented for the CUDA-family
and CPU backends only, yet were emitted DEFAULT-ON (cparams.fused_gdn_ar/ch=true,
auto_fgdn=true) with no backend guard. A backend that supports plain SSM_CONV but
ignores the discriminator (Vulkan/SYCL/Metal) would run the wrong plain conv =>
silent corruption.

Fix: in llama_context::sched_reserve(), before the auto_fgdn resolution, force
fused_gdn_ar = fused_gdn_ch = auto_fgdn = false when any non-CPU compute backend
is not CUDA-family (reg name not "CUDA"/"ROCm"/"MUSA"). Every emission site keys
off these flags, so the graph falls back to the upstream non-fused plain
ggml_ssm_conv + ggml_silu path that every backend handles. On CUDA the reg name is
"CUDA", the flags are left untouched, and the decode graph is byte-identical.

Mirror of DGX paged patch 0030; adds FUSED_OP_BACKEND_GATE_RESULTS.md.

Verified GPU-free: reconstructed pin 9d5d882d + paged 0001-0029 + 0030, CPU-only
build (GGML_CUDA=OFF) of libllama + test-backend-ops links with 0 errors; 0030
applies cleanly via git apply and patch -p1. test-backend-ops correctness for
SSM_CONV/SSM_CONV_UPDATE(_IDS)/GATED_DELTA_NET is a CUDA0-vs-CPU comparison and is
still pending on the DGX (tunnel offline this session); the registered test cases
will exercise it there.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-27 07:32:49 +00:00
parent 2332587fdc
commit 621a20d2b5
2 changed files with 202 additions and 0 deletions


@@ -0,0 +1,106 @@
From a095f4ebeefafd16dd54c514eb86148fa46daef3 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sat, 27 Jun 2026 07:30:43 +0000
Subject: [PATCH] feat(paged): backend-gate fused GDN/discriminated SSM_CONV
emission (patch 0030)
Closes the latent silent-miscompute hazard (audit RISKY-1). The fused/in-place
Gated Delta Net op (0018/0019/0026: ggml_gated_delta_net_inplace[_ids][_hybrid])
and the discriminated SSM_CONV decode op (0021/0028: ggml_ssm_conv_update_inplace
[_ids], which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET with extra src
slots - a non-null src[3]/src[4] ring/ids discriminator) are emitted DEFAULT-ON
(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) but are implemented for the
CUDA-family TU (CUDA / HIP "ROCm" / "MUSA", hipified ggml-cuda) and the CPU
reference ONLY.
The hazard: a compute backend that supports PLAIN GGML_OP_SSM_CONV but ignores
the src[3]/src[4] discriminator (Vulkan/SYCL/Metal) reports supports_op==true for
the node and the scheduler assigns the discriminated conv to it; it then runs the
wrong plain conv => SILENT corruption (not a crash). The upstream auto_fgdn
device-mismatch resolution only inspects GATED_DELTA_NET nodes, so the
discriminated-SSM_CONV safety was only incidentally covered (it happened to share
backend coverage with the GDN op); the hazard becomes live the moment a non-CUDA
paged build of a gated-DeltaNet model exists.
FIX: gate the fused-op emission on the active compute backend type. Before the
auto_fgdn resolution in llama_context::sched_reserve(), if any non-CPU compute
backend is not CUDA-family (reg name != "CUDA"/"ROCm"/"MUSA"), force
fused_gdn_ar = fused_gdn_ch = auto_fgdn = false. Every emission site keys off
these flags (conv_decode_fused = ... && fused_gdn_ar; fused = ... fused_gdn_ar/ch),
so disabling them routes the graph to the upstream non-fused path: a PLAIN
ggml_ssm_conv (no discriminator) + ggml_silu, which every backend handles
correctly. This makes the discriminated-op safety explicit and decoupled from the
GDN-op device-mismatch heuristic.
INVARIANT (CUDA byte-identical): on a CUDA backend the reg name is "CUDA", so
fgdn_backend_ok stays true, the flags are left untouched, and the emitted decode
graph is unchanged - byte-identical to pre-0030. The fix only changes behavior on
non-CUDA/non-CPU backends.
GATE compile: CPU-only build (GGML_CUDA=OFF) of the full series (pin 9d5d882d +
0001-0029 + this) links libllama.so and test-backend-ops with 0 errors; the
edited llama-context.cpp compiles clean (uses only already-included <cstring> +
backend-reg API already used in this TU). test-backend-ops correctness for
SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET is a
CUDA0-vs-CPU comparison (a CPU-only run skips CPU-vs-CPU); the test cases are
registered and will be exercised on the CUDA DGX run, which is still pending.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/llama-context.cpp | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index ad7939e..c408eef 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -521,6 +521,45 @@ void llama_context::sched_reserve() {
cparams.auto_fa = false;
}
+    // RISKY-1 guard: the fused/in-place Gated Delta Net op and the discriminated
+    // SSM_CONV (which reuse GGML_OP_GATED_DELTA_NET / GGML_OP_SSM_CONV with extra
+    // src slots - a non-null src[3]/src[4] ring/ids discriminator) are only
+    // implemented for the CUDA-family backends (CUDA / HIP "ROCm" / "MUSA" - all
+    // built from the hipified ggml-cuda TU) and the CPU reference. Any other
+    // compute backend (Vulkan/SYCL/Metal/...) that supports *plain* SSM_CONV but
+    // ignores the discriminator src would silently run the WRONG conv. The
+    // upstream auto_fgdn device-mismatch check below only inspects
+    // GATED_DELTA_NET nodes, so couple the discriminated-SSM_CONV safety
+    // explicitly to the backend type here: keep the fused path enabled only when
+    // every non-CPU compute backend is CUDA-family. On CUDA this leaves the flags
+    // untouched, so the emitted decode graph is byte-identical.
+    if (cparams.fused_gdn_ar || cparams.fused_gdn_ch) {
+        bool fgdn_backend_ok = true;
+        for (auto & backend : backends) {
+            ggml_backend_dev_t dev = ggml_backend_get_device(backend.get());
+            if (!dev || ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) {
+                // CPU reference handles the fused/discriminated ops
+                continue;
+            }
+            ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev);
+            const char * name = reg ? ggml_backend_reg_name(reg) : "";
+            // GGML_CUDA_NAME is "CUDA" / "ROCm" (HIP) / "MUSA"; all three are the
+            // same ggml-cuda TU that carries the discriminated-op handling.
+            if (strcmp(name, "CUDA") != 0 && strcmp(name, "ROCm") != 0 && strcmp(name, "MUSA") != 0) {
+                fgdn_backend_ok = false;
+                break;
+            }
+        }
+
+        if (!fgdn_backend_ok) {
+            cparams.fused_gdn_ar = false;
+            cparams.fused_gdn_ch = false;
+            cparams.auto_fgdn = false;
+            LLAMA_LOG_INFO("%s: fused Gated Delta Net / discriminated SSM_CONV disabled "
+                           "(compute backend is not CUDA/HIP/CPU)\n", __func__);
+        }
+    }
+
if (cparams.auto_fgdn) {
LLAMA_LOG_INFO("%s: resolving fused Gated Delta Net support:\n", __func__);
--
2.43.0


@@ -0,0 +1,96 @@
# Patch 0030 - fused-op backend gate (audit RISKY-1 fix) - RESULTS
Closes the single latent silent-miscompute hazard from `ARCH_GENERALITY_AUDIT.md`
(RISKY-1): the fused GDN / discriminated-SSM_CONV decode ops are CUDA+CPU-only but
were emitted DEFAULT-ON with no backend guard.
## The hazard
- `cparams.fused_gdn_ar = fused_gdn_ch = auto_fgdn = true` are set unconditionally
in the `llama_context` constructor (`src/llama-context.cpp`).
- Patches 0018/0019/0026 add `ggml_gated_delta_net_inplace[_ids][_hybrid]`
(reuse `GGML_OP_GATED_DELTA_NET` with extra src slots).
- Patches 0021/0028 add `ggml_ssm_conv_update_inplace[_ids]` which **reuse
`GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]`/`src[4]`** (ring/ids).
- Both families have CUDA + CPU kernels only. No `supports_op` change was made for
the discriminated variants.
- A backend that supports **plain** `SSM_CONV` but ignores the discriminator
(Vulkan/SYCL/Metal) returns `supports_op==true` for the node; the scheduler
assigns the discriminated conv to it; it runs the **wrong plain conv** =>
SILENT corruption (not a crash).
- The upstream `auto_fgdn` resolution only inspects `GATED_DELTA_NET` nodes, so the
discriminated-`SSM_CONV` safety was only **incidentally** covered (GDN-op and
discriminated-conv happened to share backend coverage). The hazard goes live the
moment a non-CUDA paged build of a gated-DeltaNet model exists.
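
The mechanism in the list above can be modelled with a minimal sketch. Everything in it (`Op`, `Node`, `plain_backend_supports_op`, `strict_backend_supports_op`, `make_conv`) is a hypothetical stand-in, not the ggml API; it only assumes a capability check that looks at the op code and never at the extra src slots.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-ins for illustration only; not the real ggml types.
enum class Op { SSM_CONV, GATED_DELTA_NET, OTHER };

struct Node {
    Op op = Op::OTHER;
    std::array<const void *, 5> src{}; // src[3]/src[4] model the ring/ids discriminator
};

// True when the node is the discriminated (fused/in-place) variant.
inline bool is_discriminated(const Node & n) {
    return n.src[3] != nullptr || n.src[4] != nullptr;
}

// A backend that only knows the plain op: it checks the op code and nothing
// else, so it also claims the discriminated variant.
inline bool plain_backend_supports_op(const Node & n) {
    return n.op == Op::SSM_CONV;
}

// Assumption: what a discriminator-aware check would have to look like in a
// backend that has no kernel for the discriminated variant.
inline bool strict_backend_supports_op(const Node & n) {
    return n.op == Op::SSM_CONV && !is_discriminated(n);
}

// Builds a plain conv node or a discriminated one.
inline Node make_conv(bool discriminated) {
    static const int ring = 0; // any non-null pointer serves as the discriminator
    Node n;
    n.op = Op::SSM_CONV;
    if (discriminated) {
        n.src[3] = &ring;
    }
    return n;
}

// Self-check of the scenario: the plain check accepts a node it would miscompute.
inline void check_hazard() {
    const Node fused = make_conv(true);
    assert(plain_backend_supports_op(fused));
    assert(!strict_backend_supports_op(fused));
}
```

In this model the scheduler has no signal to keep the discriminated node away from the plain backend, which is why the failure is a wrong result rather than a crash.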
## The fix (emission gate, not supports_op)
Chosen route: **gate the emission on the active compute backend type.** The
`supports_op` route would require editing every other backend's per-device
`supports_op` (Vulkan/SYCL/Metal/...) to reject the discriminated `SSM_CONV` -
invasive, fragile, and not centrally exposed by the ggml backend interface. The
emission gate is self-contained in the fork's own code.
`src/llama-context.cpp`, in `llama_context::sched_reserve()`, immediately before
the existing `if (cparams.auto_fgdn)` resolution block: if any **non-CPU** compute
backend has a reg name other than `"CUDA"` / `"ROCm"` (HIP) / `"MUSA"` (the three
`GGML_CUDA_NAME` values - all the same hipified ggml-cuda TU that carries the
discriminated-op handling), force
`fused_gdn_ar = fused_gdn_ch = auto_fgdn = false`.
Every emission site keys off these flags:
`conv_decode_fused = (n_seq_tokens==1) && (n_rs_seq==0) && fused_gdn_ar`
(qwen35/qwen35moe/qwen3next + `build_conv_state_fused`) and
`fused = (n_seq_tokens==1) ? fused_gdn_ar : fused_gdn_ch` (delta-net-base). With
the flags false the graph takes the upstream non-fused branch: a **plain
`ggml_ssm_conv` (no discriminator) + `ggml_silu`**, which every backend handles
correctly.
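
A rough sketch of how the two flag expressions quoted above select the graph path once the gate has run. Only the boolean expressions come from this document; the `FusedFlags` struct, the free-function wrappers, and the `int64_t` parameter types are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the three cparams flags named in this document.
struct FusedFlags {
    bool fused_gdn_ar = true;
    bool fused_gdn_ch = true;
    bool auto_fgdn    = true;
};

// conv_decode_fused = (n_seq_tokens==1) && (n_rs_seq==0) && fused_gdn_ar
inline bool conv_decode_fused(const FusedFlags & f, int64_t n_seq_tokens, int64_t n_rs_seq) {
    return n_seq_tokens == 1 && n_rs_seq == 0 && f.fused_gdn_ar;
}

// fused = (n_seq_tokens==1) ? fused_gdn_ar : fused_gdn_ch
inline bool gdn_fused(const FusedFlags & f, int64_t n_seq_tokens) {
    return n_seq_tokens == 1 ? f.fused_gdn_ar : f.fused_gdn_ch;
}

// What the gate does when a non-CUDA-family compute backend is present.
inline void disable_fused(FusedFlags & f) {
    f.fused_gdn_ar = false;
    f.fused_gdn_ch = false;
    f.auto_fgdn    = false;
}

// Self-check: with the flags cleared, no emission site selects the fused branch.
inline void check_emission() {
    FusedFlags f;
    assert(conv_decode_fused(f, 1, 0)); // single-token decode takes the fused conv
    disable_fused(f);
    assert(!conv_decode_fused(f, 1, 0)); // now the plain conv + silu branch
    assert(!gdn_fused(f, 1) && !gdn_fused(f, 8));
}
```

Because both expressions end in one of the gated flags, clearing the flags is sufficient; no emission site needs its own backend check.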
## CUDA byte-identical invariant
On a CUDA backend the reg name is `"CUDA"`, so `fgdn_backend_ok` stays true, the
flags are left untouched, and the emitted decode graph is unchanged. The fix only
changes behavior on backends that are neither CUDA-family nor CPU. The CUDA decode
graph is byte-identical to pre-0030 **by construction** (no flag flips on CUDA), so
the existing greedy md5 gates are unaffected on the validated GB10 target.
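
The invariant can be checked without a GPU by restating the gate predicate as a pure function over backend descriptions. This is a sketch under assumptions: `BackendInfo`, `is_cuda_family` and the free function `fgdn_backend_ok` are hypothetical, and the `"Vulkan"` string is only an illustrative non-CUDA reg name; the three accepted names are the ones listed in this document.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical description of one compute backend: whether it is the CPU
// device, and the name its backend registry reports.
struct BackendInfo {
    bool         is_cpu;
    const char * reg_name;
};

// True for the three registry names the patch treats as CUDA-family.
inline bool is_cuda_family(const char * name) {
    return std::strcmp(name, "CUDA") == 0 ||
           std::strcmp(name, "ROCm") == 0 ||
           std::strcmp(name, "MUSA") == 0;
}

// Mirrors the gate's predicate: CPU backends are skipped, and every other
// backend must be CUDA-family for the fused path to stay enabled.
inline bool fgdn_backend_ok(const std::vector<BackendInfo> & backends) {
    for (const BackendInfo & b : backends) {
        if (b.is_cpu) {
            continue;
        }
        if (!is_cuda_family(b.reg_name)) {
            return false;
        }
    }
    return true;
}

// Self-check: a CUDA + CPU setup keeps the flags, a Vulkan + CPU setup does not.
inline void check_gate() {
    assert(fgdn_backend_ok({{false, "CUDA"}, {true, "CPU"}}));
    assert(!fgdn_backend_ok({{false, "Vulkan"}, {true, "CPU"}}));
}
```

When the predicate returns true nothing is written to the flags, which is the whole of the byte-identical argument: the CUDA graph is built from the same inputs as before 0030.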
## Verification
- COMPILE (GPU-free, done on a CPU box): reconstructed the exact source state
(upstream pin `9d5d882d` + paged patches `0001-0029`, .md docs stripped) and
applied 0030. CPU-only build (`-DGGML_CUDA=OFF`) of `llama` + `test-backend-ops`
links `libllama.so` and the test binary with **0 errors**; the edited
`llama-context.cpp` compiles clean (uses only the already-included `<cstring>`
and the backend-reg API already used in this TU:
`ggml_backend_dev_backend_reg` / `ggml_backend_reg_name` /
`ggml_backend_dev_type`).
- 0030 applies cleanly on a fresh pin+0001-0029 tree via both `git apply --check`
(Makefile path) and `patch -p1 -N` (prepare.sh path).
- test-backend-ops correctness is a **CUDA0-vs-CPU** comparison; a CPU-only run
skips CPU-vs-CPU by design ("Skipping CPU backend"). The relevant test cases are
registered and will be exercised by the DGX CUDA run:
`test_ssm_conv` / `test_ssm_conv_update` (SSM_CONV_UPDATE) /
`test_ssm_conv_update_ids` (SSM_CONV_UPDATE_IDS) /
`test_gated_delta_net` (+ `_hybrid`).
## Pending on the DGX (GPU)
The CUDA-side confirmation could not be run from the CPU box: the DGX cloudflared
tunnel `jp-6.prem.io` returned `websocket: bad handshake` for the whole session
(origin offline). Once the DGX is reachable, run the following in
`~/llama-paged-dev` (branch `paged`), then commit 0030 there too:
```
test-backend-ops test -o SSM_CONV
test-backend-ops test -o SSM_CONV_UPDATE
test-backend-ops test -o SSM_CONV_UPDATE_IDS
test-backend-ops test -o GATED_DELTA_NET # expect: 2/2 backends passed, OK
```
Greedy md5 (only if >40GB VRAM free; must equal the established baselines):
`q36-27b-nvfp4 == 5951a5b4d624ce891e22ab5fca9bc439`,
`q36-35b-a3b-nvfp4 == 07db32c2bcb78d17a43ed18bc22705cd`. Since 0030 does not flip
any flag on CUDA, the md5 is unchanged by code-path argument; the run is a
belt-and-suspenders confirmation, not a correctness dependency.
Assisted-by: Claude:opus-4.8 [Claude Code]