From 621a20d2b513708bea10167ab533c36d1a4944dc Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Sat, 27 Jun 2026 07:32:49 +0000
Subject: [PATCH] feat(paged): backend-gate fused GDN/discriminated SSM_CONV
 emission (patch 0030)

Closes audit RISKY-1 (the one latent silent-miscompute hazard). The
fused/in-place Gated Delta Net op (0018/0019/0026) and the discriminated
SSM_CONV decode op (0021/0028, which REUSE GGML_OP_SSM_CONV /
GGML_OP_GATED_DELTA_NET via a non-null src[3]/src[4] discriminator) are
CUDA+CPU-only but were emitted DEFAULT-ON (cparams.fused_gdn_ar/ch=true,
auto_fgdn=true) with no backend guard. A backend that supports plain
SSM_CONV but ignores the discriminator (Vulkan/SYCL/Metal) would run the
wrong plain conv => silent corruption.

Fix: in llama_context::sched_reserve(), before the auto_fgdn resolution,
force fused_gdn_ar = fused_gdn_ch = auto_fgdn = false when any non-CPU
compute backend is not CUDA-family (reg name not "CUDA"/"ROCm"/"MUSA").
Every emission site keys off these flags, so the graph falls back to the
upstream non-fused plain ggml_ssm_conv + ggml_silu path that every backend
handles. On CUDA the reg name is "CUDA", the flags are left untouched, and
the decode graph is byte-identical.

Mirror of DGX paged patch 0030; adds FUSED_OP_BACKEND_GATE_RESULTS.md.

Verified GPU-free: reconstructed pin 9d5d882d + paged 0001-0029 + 0030,
CPU-only build (GGML_CUDA=OFF) of libllama + test-backend-ops links with 0
errors; 0030 applies cleanly via git apply and patch -p1. test-backend-ops
correctness for SSM_CONV/SSM_CONV_UPDATE(_IDS)/GATED_DELTA_NET is
CUDA0-vs-CPU (pending DGX, tunnel offline this session); registered test
cases will exercise it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../paged/0030-fused-op-backend-gate.patch | 106 ++++++++++++++++++
 .../paged/FUSED_OP_BACKEND_GATE_RESULTS.md |  96 ++++++++++++++++
 2 files changed, 202 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/0030-fused-op-backend-gate.patch
 create mode 100644 backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md

diff --git a/backend/cpp/llama-cpp/patches/paged/0030-fused-op-backend-gate.patch b/backend/cpp/llama-cpp/patches/paged/0030-fused-op-backend-gate.patch
new file mode 100644
index 000000000..8d3ad8f43
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/0030-fused-op-backend-gate.patch
@@ -0,0 +1,106 @@
+From a095f4ebeefafd16dd54c514eb86148fa46daef3 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Sat, 27 Jun 2026 07:30:43 +0000
+Subject: [PATCH] feat(paged): backend-gate fused GDN/discriminated SSM_CONV
+ emission (patch 0030)
+
+Closes the latent silent-miscompute hazard (audit RISKY-1). The fused/in-place
+Gated Delta Net op (0018/0019/0026: ggml_gated_delta_net_inplace[_ids][_hybrid])
+and the discriminated SSM_CONV decode op (0021/0028: ggml_ssm_conv_update_inplace
+[_ids], which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET with extra src
+slots - a non-null src[3]/src[4] ring/ids discriminator) are emitted DEFAULT-ON
+(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) but are implemented for the
+CUDA-family TU (CUDA / HIP "ROCm" / "MUSA", hipified ggml-cuda) and the CPU
+reference ONLY.
+
+The hazard: a compute backend that supports PLAIN GGML_OP_SSM_CONV but ignores
+the src[3]/src[4] discriminator (Vulkan/SYCL/Metal) reports supports_op==true for
+the node and the scheduler assigns the discriminated conv to it; it then runs the
+wrong plain conv => SILENT corruption (not a crash).
The upstream auto_fgdn
+device-mismatch resolution only inspects GATED_DELTA_NET nodes, so the
+discriminated-SSM_CONV safety was only incidentally covered (it happened to share
+backend coverage with the GDN op); it becomes live the moment a non-CUDA paged
+build of a gated-DeltaNet model exists.
+
+FIX: gate the fused-op emission on the active compute backend type. Before the
+auto_fgdn resolution in llama_context::sched_reserve(), if any non-CPU compute
+backend is not CUDA-family (reg name != "CUDA"/"ROCm"/"MUSA"), force
+fused_gdn_ar = fused_gdn_ch = auto_fgdn = false. Every emission site keys off
+these flags (conv_decode_fused = ... && fused_gdn_ar; fused = ... fused_gdn_ar/ch),
+so disabling them routes the graph to the upstream non-fused path: a PLAIN
+ggml_ssm_conv (no discriminator) + ggml_silu, which every backend handles
+correctly. This makes the discriminated-op safety explicit and decoupled from the
+GDN-op device-mismatch heuristic.
+
+INVARIANT (CUDA byte-identical): on a CUDA backend the reg name is "CUDA", so
+fgdn_backend_ok stays true, the flags are left untouched, and the emitted decode
+graph is unchanged - byte-identical to pre-0030. The fix only changes behavior on
+non-CUDA/non-CPU backends.
+
+GATE compile: CPU-only build (GGML_CUDA=OFF) of the full series (pin 9d5d882d +
+0001-0029 + this) links libllama.so and test-backend-ops with 0 errors; the
+edited llama-context.cpp compiles clean (uses only already-included +
+backend-reg API already used in this TU). test-backend-ops correctness for
+SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET is a
+CUDA0-vs-CPU comparison (CPU-only run skips CPU-vs-CPU); the test cases are
+registered and exercised on the CUDA DGX run.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto
+---
+ src/llama-context.cpp | 39 +++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 39 insertions(+)
+
+diff --git a/src/llama-context.cpp b/src/llama-context.cpp
+index ad7939e..c408eef 100644
+--- a/src/llama-context.cpp
++++ b/src/llama-context.cpp
+@@ -521,6 +521,45 @@ void llama_context::sched_reserve() {
+         cparams.auto_fa = false;
+     }
+ 
++    // RISKY-1 guard: the fused/in-place Gated Delta Net op and the discriminated
++    // SSM_CONV (which reuse GGML_OP_GATED_DELTA_NET / GGML_OP_SSM_CONV with extra
++    // src slots - a non-null src[3]/src[4] ring/ids discriminator) are only
++    // implemented for the CUDA-family backends (CUDA / HIP "ROCm" / "MUSA" - all
++    // built from the hipified ggml-cuda TU) and the CPU reference. Any other
++    // compute backend (Vulkan/SYCL/Metal/...) that supports *plain* SSM_CONV but
++    // ignores the discriminator src would silently run the WRONG conv. The
++    // upstream auto_fgdn device-mismatch check below only inspects
++    // GATED_DELTA_NET nodes, so couple the discriminated-SSM_CONV safety
++    // explicitly to the backend type here: keep the fused path enabled only when
++    // every non-CPU compute backend is CUDA-family. On CUDA this leaves the flags
++    // untouched, so the emitted decode graph is byte-identical.
++    if (cparams.fused_gdn_ar || cparams.fused_gdn_ch) {
++        bool fgdn_backend_ok = true;
++        for (auto & backend : backends) {
++            ggml_backend_dev_t dev = ggml_backend_get_device(backend.get());
++            if (!dev || ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) {
++                // CPU reference handles the fused/discriminated ops
++                continue;
++            }
++            ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev);
++            const char * name = reg ? ggml_backend_reg_name(reg) : "";
++            // GGML_CUDA_NAME is "CUDA" / "ROCm" (HIP) / "MUSA"; all three are the
++            // same ggml-cuda TU that carries the discriminated-op handling.
++            if (strcmp(name, "CUDA") != 0 && strcmp(name, "ROCm") != 0 && strcmp(name, "MUSA") != 0) {
++                fgdn_backend_ok = false;
++                break;
++            }
++        }
++
++        if (!fgdn_backend_ok) {
++            cparams.fused_gdn_ar = false;
++            cparams.fused_gdn_ch = false;
++            cparams.auto_fgdn = false;
++            LLAMA_LOG_INFO("%s: fused Gated Delta Net / discriminated SSM_CONV disabled "
++                           "(compute backend is not CUDA/HIP/CPU)\n", __func__);
++        }
++    }
++
+     if (cparams.auto_fgdn) {
+         LLAMA_LOG_INFO("%s: resolving fused Gated Delta Net support:\n", __func__);
+ 
+-- 
+2.43.0
+
diff --git a/backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md
new file mode 100644
index 000000000..27bf10829
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md
@@ -0,0 +1,96 @@
+# Patch 0030 - fused-op backend gate (audit RISKY-1 fix) - RESULTS
+
+Closes the single latent silent-miscompute hazard from `ARCH_GENERALITY_AUDIT.md`
+(RISKY-1): the fused GDN / discriminated-SSM_CONV decode ops are CUDA+CPU-only but
+were emitted DEFAULT-ON with no backend guard.
+
+## The hazard
+
+- `cparams.fused_gdn_ar = fused_gdn_ch = auto_fgdn = true` are set unconditionally
+  in the `llama_context` constructor (`src/llama-context.cpp`).
+- Patches 0018/0019/0026 add `ggml_gated_delta_net_inplace[_ids][_hybrid]`
+  (reuse `GGML_OP_GATED_DELTA_NET` with extra src slots).
+- Patches 0021/0028 add `ggml_ssm_conv_update_inplace[_ids]` which **reuse
+  `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]`/`src[4]`** (ring/ids).
+- Both families have CUDA + CPU kernels only. No `supports_op` change was made for
+  the discriminated variants.
+- A backend that supports **plain** `SSM_CONV` but ignores the discriminator
+  (Vulkan/SYCL/Metal) returns `supports_op==true` for the node; the scheduler
+  assigns the discriminated conv to it; it runs the **wrong plain conv** =>
+  SILENT corruption (not a crash).
+- The upstream `auto_fgdn` resolution only inspects `GATED_DELTA_NET` nodes, so the
+  discriminated-`SSM_CONV` safety was only **incidentally** covered (GDN-op and
+  discriminated-conv happened to share backend coverage). It goes live the moment a
+  non-CUDA paged build of a gated-DeltaNet model exists.
+
+## The fix (emission gate, not supports_op)
+
+Chosen route: **gate the emission on the active compute backend type.** The
+`supports_op` route would require editing every other backend's per-device
+`supports_op` (Vulkan/SYCL/Metal/...) to reject the discriminated `SSM_CONV` -
+invasive, fragile, and not centrally exposed by the ggml backend interface. The
+emission gate is self-contained in the fork's own code.
+
+`src/llama-context.cpp`, in `llama_context::sched_reserve()`, immediately before
+the existing `if (cparams.auto_fgdn)` resolution block: if any **non-CPU** compute
+backend has a reg name other than `"CUDA"` / `"ROCm"` (HIP) / `"MUSA"` (the three
+`GGML_CUDA_NAME` values - all the same hipified ggml-cuda TU that carries the
+discriminated-op handling), force
+`fused_gdn_ar = fused_gdn_ch = auto_fgdn = false`.
+
+Every emission site keys off these flags:
+`conv_decode_fused = (n_seq_tokens==1) && (n_rs_seq==0) && fused_gdn_ar`
+(qwen35/qwen35moe/qwen3next + `build_conv_state_fused`) and
+`fused = (n_seq_tokens==1) ? fused_gdn_ar : fused_gdn_ch` (delta-net-base). With
+the flags false the graph takes the upstream non-fused branch: a **plain
+`ggml_ssm_conv` (no discriminator) + `ggml_silu`**, which every backend handles
+correctly.
+
+## CUDA byte-identical invariant
+
+On a CUDA backend the reg name is `"CUDA"`, so `fgdn_backend_ok` stays true, the
+flags are left untouched, and the emitted decode graph is unchanged. The fix only
+changes behavior on non-CUDA/non-CPU backends. CUDA decode graph is byte-identical
+to pre-0030 **by construction** (no flag flips on CUDA), so the existing greedy
+md5 gates are unaffected on the validated GB10 target.
+
+## Verification
+
+- COMPILE (GPU-free, done on a CPU box): reconstructed the exact source state
+  (upstream pin `9d5d882d` + paged patches `0001-0029`, .md docs stripped) and
+  applied 0030. CPU-only build (`-DGGML_CUDA=OFF`) of `llama` + `test-backend-ops`
+  links `libllama.so` and the test binary with **0 errors**; the edited
+  `llama-context.cpp` compiles clean (uses only the already-included ``
+  and the backend-reg API already used in this TU:
+  `ggml_backend_dev_backend_reg` / `ggml_backend_reg_name` /
+  `ggml_backend_dev_type`).
+- 0030 applies cleanly on a fresh pin+0001-0029 tree via both `git apply --check`
+  (Makefile path) and `patch -p1 -N` (prepare.sh path).
+- test-backend-ops correctness is a **CUDA0-vs-CPU** comparison; a CPU-only run
+  skips CPU-vs-CPU by design ("Skipping CPU backend"). The relevant test cases are
+  registered and will be exercised by the DGX CUDA run:
+  `test_ssm_conv` / `test_ssm_conv_update` (SSM_CONV_UPDATE) /
+  `test_ssm_conv_update_ids` (SSM_CONV_UPDATE_IDS) /
+  `test_gated_delta_net` (+ `_hybrid`).
+
+## Pending on the DGX (GPU)
+
+The CUDA-side confirmation could not be run from the CPU box (the DGX cloudflared
+tunnel `jp-6.prem.io` was returning `websocket: bad handshake` for the whole
+session - origin offline). To run on the DGX `~/llama-paged-dev` (branch `paged`)
+once reachable, then commit 0030 there too:
+
+```
+test-backend-ops test -o SSM_CONV
+test-backend-ops test -o SSM_CONV_UPDATE
+test-backend-ops test -o SSM_CONV_UPDATE_IDS
+test-backend-ops test -o GATED_DELTA_NET   # expect: 2/2 backends passed, OK
+```
+
+Greedy md5 (only if >40GB VRAM free; must equal the established baselines):
+`q36-27b-nvfp4 == 5951a5b4d624ce891e22ab5fca9bc439`,
+`q36-35b-a3b-nvfp4 == 07db32c2bcb78d17a43ed18bc22705cd`. Since 0030 does not flip
+any flag on CUDA, the md5 is unchanged by code-path argument; the run is a
+belt-and-suspenders confirmation, not a correctness dependency.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
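
Reviewer note: the gate's decision rule can be checked without a ggml build. The block below is a minimal standalone sketch of that rule, not the patch's code. `BackendInfo`, `is_cuda_family`, `fused_ops_allowed` and `gate_self_check` are hypothetical names introduced here for illustration only; the patch itself reads the same two facts (device type and registry name) through `ggml_backend_dev_type` and `ggml_backend_reg_name`, as the diff above shows.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the two facts the gate reads per backend:
// whether the device is a CPU device, and the backend registry name.
struct BackendInfo {
    bool        is_cpu;
    std::string reg_name;
};

// The three registry names the patch treats as CUDA-family.
static bool is_cuda_family(const std::string & name) {
    return name == "CUDA" || name == "ROCm" || name == "MUSA";
}

// Fused/discriminated ops stay enabled only if every non-CPU backend
// is CUDA-family; CPU backends are skipped, as in the patch.
static bool fused_ops_allowed(const std::vector<BackendInfo> & backends) {
    for (const BackendInfo & b : backends) {
        if (b.is_cpu) {
            continue;
        }
        if (!is_cuda_family(b.reg_name)) {
            return false;
        }
    }
    return true;
}

// Small self-check covering the three situations the commit message names.
static void gate_self_check() {
    const std::vector<BackendInfo> cuda_box   = {BackendInfo{false, "CUDA"}, BackendInfo{true, "CPU"}};
    const std::vector<BackendInfo> vulkan_box = {BackendInfo{false, "Vulkan"}, BackendInfo{true, "CPU"}};
    const std::vector<BackendInfo> cpu_only   = {BackendInfo{true, "CPU"}};
    assert(fused_ops_allowed(cuda_box));     // flags left untouched on CUDA
    assert(!fused_ops_allowed(vulkan_box));  // flags forced to false
    assert(fused_ops_allowed(cpu_only));     // CPU reference handles the ops
}
```

One consequence visible in the sketch: a single non-CUDA-family compute backend disables the fused path for the whole context, even when a CUDA backend is also present. That matches the `break` on the first mismatch in the patched loop.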