mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
feat(paged): backend-gate fused GDN/discriminated SSM_CONV emission (patch 0030)
Closes audit RISKY-1 (the one latent silent-miscompute hazard). The fused/in-place Gated Delta Net op (0018/0019/0026) and the discriminated SSM_CONV decode op (0021/0028, which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET via a non-null src[3]/src[4] discriminator) are CUDA+CPU-only but were emitted DEFAULT-ON (cparams.fused_gdn_ar/ch=true, auto_fgdn=true) with no backend guard. A backend that supports plain SSM_CONV but ignores the discriminator (Vulkan/SYCL/Metal) would run the wrong plain conv => silent corruption.

Fix: in llama_context::sched_reserve(), before the auto_fgdn resolution, force fused_gdn_ar = fused_gdn_ch = auto_fgdn = false when any non-CPU compute backend is not CUDA-family (reg name not "CUDA"/"ROCm"/"MUSA"). Every emission site keys off these flags, so the graph falls back to the upstream non-fused plain ggml_ssm_conv + ggml_silu path that every backend handles. On CUDA the reg name is "CUDA", the flags are left untouched, and the decode graph is byte-identical.

Mirror of DGX paged patch 0030; adds FUSED_OP_BACKEND_GATE_RESULTS.md.

Verified GPU-free: reconstructed pin 9d5d882d + paged 0001-0029 + 0030, CPU-only build (GGML_CUDA=OFF) of libllama + test-backend-ops links with 0 errors; 0030 applies cleanly via git apply and patch -p1. test-backend-ops correctness for SSM_CONV/SSM_CONV_UPDATE(_IDS)/GATED_DELTA_NET is CUDA0-vs-CPU (pending DGX, tunnel offline this session); registered test cases will exercise it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -0,0 +1,106 @@
From a095f4ebeefafd16dd54c514eb86148fa46daef3 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sat, 27 Jun 2026 07:30:43 +0000
Subject: [PATCH] feat(paged): backend-gate fused GDN/discriminated SSM_CONV
 emission (patch 0030)

Closes the latent silent-miscompute hazard (audit RISKY-1). The fused/in-place
Gated Delta Net op (0018/0019/0026: ggml_gated_delta_net_inplace[_ids][_hybrid])
and the discriminated SSM_CONV decode op (0021/0028: ggml_ssm_conv_update_inplace
[_ids], which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET with extra src
slots - a non-null src[3]/src[4] ring/ids discriminator) are emitted DEFAULT-ON
(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) but are implemented for the
CUDA-family TU (CUDA / HIP "ROCm" / "MUSA", hipified ggml-cuda) and the CPU
reference ONLY.

The hazard: a compute backend that supports PLAIN GGML_OP_SSM_CONV but ignores
the src[3]/src[4] discriminator (Vulkan/SYCL/Metal) reports supports_op==true for
the node and the scheduler assigns the discriminated conv to it; it then runs the
wrong plain conv => SILENT corruption (not a crash). The upstream auto_fgdn
device-mismatch resolution only inspects GATED_DELTA_NET nodes, so the
discriminated-SSM_CONV safety was only incidentally covered (it happened to share
backend coverage with the GDN op); it becomes live the moment a non-CUDA paged
build of a gated-DeltaNet model exists.

FIX: gate the fused-op emission on the active compute backend type. Before the
auto_fgdn resolution in llama_context::sched_reserve(), if any non-CPU compute
backend is not CUDA-family (reg name != "CUDA"/"ROCm"/"MUSA"), force
fused_gdn_ar = fused_gdn_ch = auto_fgdn = false. Every emission site keys off
these flags (conv_decode_fused = ... && fused_gdn_ar; fused = ... fused_gdn_ar/ch),
so disabling them routes the graph to the upstream non-fused path: a PLAIN
ggml_ssm_conv (no discriminator) + ggml_silu, which every backend handles
correctly. This makes the discriminated-op safety explicit and decoupled from the
GDN-op device-mismatch heuristic.

INVARIANT (CUDA byte-identical): on a CUDA backend the reg name is "CUDA", so
fgdn_backend_ok stays true, the flags are left untouched, and the emitted decode
graph is unchanged - byte-identical to pre-0030. The fix only changes behavior on
non-CUDA/non-CPU backends.

GATE compile: CPU-only build (GGML_CUDA=OFF) of the full series (pin 9d5d882d +
0001-0029 + this) links libllama.so and test-backend-ops with 0 errors; the
edited llama-context.cpp compiles clean (uses only already-included <cstring> +
backend-reg API already used in this TU). test-backend-ops correctness for
SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET is a
CUDA0-vs-CPU comparison (CPU-only run skips CPU-vs-CPU); the test cases are
registered and exercised on the CUDA DGX run.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 src/llama-context.cpp | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index ad7939e..c408eef 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -521,6 +521,45 @@ void llama_context::sched_reserve() {
         cparams.auto_fa = false;
     }
 
+    // RISKY-1 guard: the fused/in-place Gated Delta Net op and the discriminated
+    // SSM_CONV (which reuse GGML_OP_GATED_DELTA_NET / GGML_OP_SSM_CONV with extra
+    // src slots - a non-null src[3]/src[4] ring/ids discriminator) are only
+    // implemented for the CUDA-family backends (CUDA / HIP "ROCm" / "MUSA" - all
+    // built from the hipified ggml-cuda TU) and the CPU reference. Any other
+    // compute backend (Vulkan/SYCL/Metal/...) that supports *plain* SSM_CONV but
+    // ignores the discriminator src would silently run the WRONG conv. The
+    // upstream auto_fgdn device-mismatch check below only inspects
+    // GATED_DELTA_NET nodes, so couple the discriminated-SSM_CONV safety
+    // explicitly to the backend type here: keep the fused path enabled only when
+    // every non-CPU compute backend is CUDA-family. On CUDA this leaves the flags
+    // untouched, so the emitted decode graph is byte-identical.
+    if (cparams.fused_gdn_ar || cparams.fused_gdn_ch) {
+        bool fgdn_backend_ok = true;
+        for (auto & backend : backends) {
+            ggml_backend_dev_t dev = ggml_backend_get_device(backend.get());
+            if (!dev || ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) {
+                // CPU reference handles the fused/discriminated ops
+                continue;
+            }
+            ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev);
+            const char * name = reg ? ggml_backend_reg_name(reg) : "";
+            // GGML_CUDA_NAME is "CUDA" / "ROCm" (HIP) / "MUSA"; all three are the
+            // same ggml-cuda TU that carries the discriminated-op handling.
+            if (strcmp(name, "CUDA") != 0 && strcmp(name, "ROCm") != 0 && strcmp(name, "MUSA") != 0) {
+                fgdn_backend_ok = false;
+                break;
+            }
+        }
+
+        if (!fgdn_backend_ok) {
+            cparams.fused_gdn_ar = false;
+            cparams.fused_gdn_ch = false;
+            cparams.auto_fgdn = false;
+            LLAMA_LOG_INFO("%s: fused Gated Delta Net / discriminated SSM_CONV disabled "
+                           "(compute backend is not CUDA/HIP/CPU)\n", __func__);
+        }
+    }
+
     if (cparams.auto_fgdn) {
         LLAMA_LOG_INFO("%s: resolving fused Gated Delta Net support:\n", __func__);
 
--
2.43.0

@@ -0,0 +1,96 @@
# Patch 0030 - fused-op backend gate (audit RISKY-1 fix) - RESULTS

Closes the single latent silent-miscompute hazard from `ARCH_GENERALITY_AUDIT.md`
(RISKY-1): the fused GDN / discriminated-SSM_CONV decode ops are CUDA+CPU-only but
were emitted DEFAULT-ON with no backend guard.

## The hazard

- `cparams.fused_gdn_ar = fused_gdn_ch = auto_fgdn = true` are set unconditionally
  in the `llama_context` constructor (`src/llama-context.cpp`).
- Patches 0018/0019/0026 add `ggml_gated_delta_net_inplace[_ids][_hybrid]`
  (reuse `GGML_OP_GATED_DELTA_NET` with extra src slots).
- Patches 0021/0028 add `ggml_ssm_conv_update_inplace[_ids]` which **reuse
  `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]`/`src[4]`** (ring/ids).
- Both families have CUDA + CPU kernels only. No `supports_op` change was made for
  the discriminated variants.
- A backend that supports **plain** `SSM_CONV` but ignores the discriminator
  (Vulkan/SYCL/Metal) returns `supports_op==true` for the node; the scheduler
  assigns the discriminated conv to it; it runs the **wrong plain conv** =>
  SILENT corruption (not a crash).
- The upstream `auto_fgdn` resolution only inspects `GATED_DELTA_NET` nodes, so the
  discriminated-`SSM_CONV` safety was only **incidentally** covered (GDN-op and
  discriminated-conv happened to share backend coverage). It goes live the moment a
  non-CUDA paged build of a gated-DeltaNet model exists.

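The failure mode listed above can be illustrated with a toy model. This is a minimal sketch, not ggml code: `ToyOp`, `ToyNode`, and the `toy_*` functions are hypothetical stand-ins invented for this example. It only shows why a support check keyed on the op code alone cannot tell the plain conv from the discriminated one.

```cpp
#include <array>
#include <cassert>

// Toy stand-ins (assumptions, NOT the ggml types): an op code plus source slots.
enum class ToyOp { SSM_CONV, GATED_DELTA_NET };

struct ToyNode {
    ToyOp op = ToyOp::SSM_CONV;
    // A non-null src[3] or src[4] marks the discriminated (ring/ids) variant.
    std::array<const void *, 5> src{};
};

// True when the node carries the extra discriminator slots.
inline bool toy_is_discriminated(const ToyNode & n) {
    return n.src[3] != nullptr || n.src[4] != nullptr;
}

// A backend that only knows the plain op: it checks the op code and nothing else.
inline bool toy_supports_op_plain_only(const ToyNode & n) {
    return n.op == ToyOp::SSM_CONV;
}

// A discriminator-aware backend check that rejects the variant it cannot run.
inline bool toy_supports_op_aware(const ToyNode & n) {
    return n.op == ToyOp::SSM_CONV && !toy_is_discriminated(n);
}

// Usage sketch: the op-code-only check accepts both nodes, which is the hazard.
inline void toy_hazard_demo() {
    int ring = 0;  // any non-null pointer serves as the discriminator here
    ToyNode plain;
    ToyNode discriminated;
    discriminated.src[3] = &ring;

    assert(toy_supports_op_plain_only(plain));
    assert(toy_supports_op_plain_only(discriminated));  // accepted, but would run the plain conv
    assert(!toy_supports_op_aware(discriminated));      // an aware check would refuse it
}
```

In this model the scheduler has no signal to keep the discriminated node off the plain-only backend, which matches the "silent corruption, not a crash" behavior described in the list.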
## The fix (emission gate, not supports_op)

Chosen route: **gate the emission on the active compute backend type.** The
`supports_op` route would require editing every other backend's per-device
`supports_op` (Vulkan/SYCL/Metal/...) to reject the discriminated `SSM_CONV` -
invasive, fragile, and not centrally exposed by the ggml backend interface. The
emission gate is self-contained in the fork's own code.

`src/llama-context.cpp`, in `llama_context::sched_reserve()`, immediately before
the existing `if (cparams.auto_fgdn)` resolution block: if any **non-CPU** compute
backend has a reg name other than `"CUDA"` / `"ROCm"` (HIP) / `"MUSA"` (the three
`GGML_CUDA_NAME` values - all the same hipified ggml-cuda TU that carries the
discriminated-op handling), force
`fused_gdn_ar = fused_gdn_ch = auto_fgdn = false`.

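The gate decision described above can be sketched in isolation from ggml. This is an illustrative model under stated assumptions: `ToyBackend` and `ToyParams` are hypothetical stand-ins for the real backend handles and `cparams`, and the `"Vulkan"` string in the usage function is only an example of a non-CUDA-family reg name. The real code reads the device type and reg name through the ggml backend API.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a compute backend as seen by the gate.
struct ToyBackend {
    bool         is_cpu;    // plays the role of the CPU device-type check
    const char * reg_name;  // plays the role of the backend registry name
};

// Hypothetical stand-in for the three flags the gate may clear.
struct ToyParams {
    bool fused_gdn_ar = true;
    bool fused_gdn_ch = true;
    bool auto_fgdn    = true;
};

// True for the three reg names the patch treats as CUDA-family.
inline bool toy_is_cuda_family(const char * name) {
    return std::strcmp(name, "CUDA") == 0 ||
           std::strcmp(name, "ROCm") == 0 ||
           std::strcmp(name, "MUSA") == 0;
}

// CPU backends are skipped; any other non-CUDA-family backend clears all flags.
inline void toy_apply_fgdn_gate(ToyParams & p, const std::vector<ToyBackend> & backends) {
    if (!p.fused_gdn_ar && !p.fused_gdn_ch) {
        return;  // fused path already off, nothing to gate
    }
    bool fgdn_backend_ok = true;
    for (const ToyBackend & b : backends) {
        if (b.is_cpu) {
            continue;
        }
        if (!toy_is_cuda_family(b.reg_name)) {
            fgdn_backend_ok = false;
            break;
        }
    }
    if (!fgdn_backend_ok) {
        p.fused_gdn_ar = false;
        p.fused_gdn_ch = false;
        p.auto_fgdn    = false;
    }
}

// Usage sketch: CPU + CUDA leaves the flags alone, CPU + Vulkan clears them.
inline void toy_gate_demo() {
    ToyParams cuda_case;
    toy_apply_fgdn_gate(cuda_case, {{true, "CPU"}, {false, "CUDA"}});
    assert(cuda_case.fused_gdn_ar && cuda_case.fused_gdn_ch && cuda_case.auto_fgdn);

    ToyParams vulkan_case;
    toy_apply_fgdn_gate(vulkan_case, {{true, "CPU"}, {false, "Vulkan"}});
    assert(!vulkan_case.fused_gdn_ar && !vulkan_case.fused_gdn_ch && !vulkan_case.auto_fgdn);
}
```

Note that one non-CUDA-family device is enough to disable the fused path for the whole context, even when a CUDA device is also present; the sketch keeps that all-or-nothing behavior.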
Every emission site keys off these flags:
`conv_decode_fused = (n_seq_tokens==1) && (n_rs_seq==0) && fused_gdn_ar`
(qwen35/qwen35moe/qwen3next + `build_conv_state_fused`) and
`fused = (n_seq_tokens==1) ? fused_gdn_ar : fused_gdn_ch` (delta-net-base). With
the flags false the graph takes the upstream non-fused branch: a **plain
`ggml_ssm_conv` (no discriminator) + `ggml_silu`**, which every backend handles
correctly.

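The two selection expressions quoted above can be written as small predicates to show that clearing the flags disables the fused branch for every input shape. This is a sketch only: `ToyFlags` and the `toy_*` function names are hypothetical, while the boolean expressions are copied from the text.

```cpp
#include <cassert>

// Hypothetical holder for the two flags the emission sites read.
struct ToyFlags {
    bool fused_gdn_ar;  // consulted when n_seq_tokens == 1
    bool fused_gdn_ch;  // consulted when n_seq_tokens != 1
};

// conv_decode_fused = (n_seq_tokens==1) && (n_rs_seq==0) && fused_gdn_ar
inline bool toy_conv_decode_fused(int n_seq_tokens, int n_rs_seq, const ToyFlags & f) {
    return n_seq_tokens == 1 && n_rs_seq == 0 && f.fused_gdn_ar;
}

// fused = (n_seq_tokens==1) ? fused_gdn_ar : fused_gdn_ch
inline bool toy_delta_net_fused(int n_seq_tokens, const ToyFlags & f) {
    return n_seq_tokens == 1 ? f.fused_gdn_ar : f.fused_gdn_ch;
}

// Usage sketch: with both flags false, neither site selects the fused branch.
inline void toy_emission_demo() {
    const ToyFlags gated_off{false, false};
    assert(!toy_conv_decode_fused(1, 0, gated_off));
    assert(!toy_delta_net_fused(1, gated_off));
    assert(!toy_delta_net_fused(8, gated_off));
}
```

Because both predicates are conjunctions or selections over the flags, the gate does not need to touch the emission sites themselves.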
## CUDA byte-identical invariant

On a CUDA backend the reg name is `"CUDA"`, so `fgdn_backend_ok` stays true, the
flags are left untouched, and the emitted decode graph is unchanged. The fix only
changes behavior on non-CUDA/non-CPU backends. CUDA decode graph is byte-identical
to pre-0030 **by construction** (no flag flips on CUDA), so the existing greedy
md5 gates are unaffected on the validated GB10 target.

## Verification

- COMPILE (GPU-free, done on a CPU box): reconstructed the exact source state
  (upstream pin `9d5d882d` + paged patches `0001-0029`, .md docs stripped) and
  applied 0030. CPU-only build (`-DGGML_CUDA=OFF`) of `llama` + `test-backend-ops`
  links `libllama.so` and the test binary with **0 errors**; the edited
  `llama-context.cpp` compiles clean (uses only the already-included `<cstring>`
  and the backend-reg API already used in this TU:
  `ggml_backend_dev_backend_reg` / `ggml_backend_reg_name` /
  `ggml_backend_dev_type`).
- 0030 applies cleanly on a fresh pin+0001-0029 tree via both `git apply --check`
  (Makefile path) and `patch -p1 -N` (prepare.sh path).
- test-backend-ops correctness is a **CUDA0-vs-CPU** comparison; a CPU-only run
  skips CPU-vs-CPU by design ("Skipping CPU backend"). The relevant test cases are
  registered and will be exercised by the DGX CUDA run:
  `test_ssm_conv` / `test_ssm_conv_update` (SSM_CONV_UPDATE) /
  `test_ssm_conv_update_ids` (SSM_CONV_UPDATE_IDS) /
  `test_gated_delta_net` (+ `_hybrid`).

## Pending on the DGX (GPU)

The CUDA-side confirmation could not be run from the CPU box (the DGX cloudflared
tunnel `jp-6.prem.io` was returning `websocket: bad handshake` for the whole
session - origin offline). Once the DGX is reachable, run the following in
`~/llama-paged-dev` (branch `paged`), then commit 0030 there too:

```
test-backend-ops test -o SSM_CONV
test-backend-ops test -o SSM_CONV_UPDATE
test-backend-ops test -o SSM_CONV_UPDATE_IDS
test-backend-ops test -o GATED_DELTA_NET   # expect: 2/2 backends passed, OK
```

Greedy md5 (only if >40GB VRAM free; must equal the established baselines):
`q36-27b-nvfp4 == 5951a5b4d624ce891e22ab5fca9bc439`,
`q36-35b-a3b-nvfp4 == 07db32c2bcb78d17a43ed18bc22705cd`. Since 0030 does not flip
any flag on CUDA, the md5 is unchanged by code-path argument; the run is a
belt-and-suspenders confirmation, not a correctness dependency.

Assisted-by: Claude:opus-4.8 [Claude Code]