Commit Graph

7112 Commits

Author SHA1 Message Date
Ettore Di Giacinto
cced07c7fe docs(paged): add MTP shape trace patch
Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results.

Assisted-by: Codex:gpt-5
2026-07-01 02:54:29 +00:00
Ettore Di Giacinto
6e35476340 docs(paged): scope MTP graph-shape follow-up
Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment.

Assisted-by: Codex:gpt-5
2026-07-01 02:37:21 +00:00
Ettore Di Giacinto
ae76d42a96 docs(paged): profile MTP graph reuse loss
Record Phase 16 nsys evidence that current MTP serving loses paged decode graph reuse and increases GPU work, explaining the Phase 15 serving regression.

Assisted-by: Codex:gpt-5
2026-07-01 02:32:49 +00:00
Ettore Di Giacinto
4d171e62bb docs(paged): reject MTP serving lever
Add the repeatable MTP serving A/B runner and record Phase 15 results showing current llama-server MTP regresses GB10 serving throughput despite passing inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 02:29:28 +00:00
Ettore Di Giacinto
70394364a3 docs(paged): gate MTP rollback safety
Record Phase 14 MTP rollback evidence, normalized greedy-prefix checks, and canonical inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 02:15:11 +00:00
Ettore Di Giacinto
e169058e73 chore(paged): add DGX inference gate runner
Add a reusable paged llama.cpp gate script for DGX work. It checks docker/local-ai-worker/GPU lock state, runs the canonical MoE and dense transcript md5 gates, and runs selected test-backend-ops filters.

Verified on dgx.casa: MoE 8cb0ce23777bf55f92f63d0292c756b0, dense 5951a5b4d624ce891e22ab5fca9bc439, MUL_MAT_ID 806/806. Artifact: /home/mudler/bench/paged_inference_gates/20260701_040048.

Assisted-by: Codex:gpt-5
2026-07-01 02:01:55 +00:00
Ettore Di Giacinto
ede23df333 docs(paged): close W4A16 pad checklist
Mark the rejected-branch disposition as not taken because Phase 4 was kept as patch 0050 with recorded md5, op, perf, and mirror gates.

Assisted-by: Codex:gpt-5
2026-07-01 01:58:22 +00:00
Ettore Di Giacinto
abc70c209e docs(paged): close ragged MoE dispatch shortcut
Record the Phase 8 safety rerun, canonical transcript md5 gates, full and ragged MUL_MAT_ID op gates, and the no-production-patch decision for metadata-only fused dispatch work.

Assisted-by: Codex:gpt-5
2026-07-01 01:57:45 +00:00
Ettore Di Giacinto
2074b4fb5b docs(paged): reject GDN global Ai32 prototype
Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10.

Assisted-by: Codex:gpt-5
2026-07-01 01:51:53 +00:00
Ettore Di Giacinto
adabd11919 docs(paged): scope GDN global Ai32 prototype
Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.

Assisted-by: Codex:gpt-5
2026-07-01 01:38:51 +00:00
Ettore Di Giacinto
1b5ae227eb docs(paged): reject GDN M5 QS-early phase
Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact.

Assisted-by: Codex:gpt-5
2026-07-01 01:29:44 +00:00
Ettore Di Giacinto
24e778de47 docs(paged): scope GDN M5 state-boundary phase
Add the Phase 11 design and implementation plan for a default-off C16 M5 QS-early GDN experiment after rejecting C32 slabs.

Assisted-by: Codex:gpt-5
2026-07-01 01:21:05 +00:00
Ettore Di Giacinto
3da3b169fb docs(paged): reject GDN C32 slab phase
Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch.

Assisted-by: Codex:gpt-5
2026-07-01 01:15:00 +00:00
Ettore Di Giacinto
ff3ad84191 docs(paged): record GDN C32 slab baseline
Record the Phase 10 current-M5 prefill baseline and the source inspection finding that C32 M5 needs a real U-staging implementation rather than a launcher-only shortcut.

Assisted-by: Codex:gpt-5
2026-07-01 00:58:54 +00:00
Ettore Di Giacinto
9bbe02c161 fix(paged): gate MTP backend sampling
Record the Phase 9 MTP smoke gate, mirror the fork patch that disables backend sampling for MTP drafts, and scope the follow-up C32 slab GDN prefill phase.

Assisted-by: Codex:gpt-5
2026-07-01 00:54:25 +00:00
Ettore Di Giacinto
b862e2c568 docs(paged): stop ragged dispatch source shortcut
Assisted-by: Codex:gpt-5
2026-07-01 00:42:36 +00:00
Ettore Di Giacinto
b009de0ee0 test(paged): mirror ragged MoE dispatch gate
Assisted-by: Codex:gpt-5
2026-07-01 00:41:21 +00:00
Ettore Di Giacinto
89ef3a4020 docs(paged): record ragged MoE profile gate
Assisted-by: Codex:gpt-5
2026-07-01 00:35:21 +00:00
Ettore Di Giacinto
ef14748f06 docs(paged): scope ragged MoE dispatch phase
Assisted-by: Codex:gpt-5
2026-07-01 00:26:01 +00:00
Ettore Di Giacinto
b6885aa446 docs(paged): reject weighted combine fusion candidate
Assisted-by: Codex:gpt-5
2026-07-01 00:20:53 +00:00
Ettore Di Giacinto
4b6fc0fa1c test(paged): mirror MoE weighted combine gate
Assisted-by: Codex:gpt-5
2026-06-30 23:51:52 +00:00
Ettore Di Giacinto
22a93ce1a3 docs(paged): select weighted combine candidate
Assisted-by: Codex:gpt-5
2026-06-30 23:47:34 +00:00
Ettore Di Giacinto
3cf7fa1715 docs(paged): reject swiglu down fusion candidate
Assisted-by: Codex:gpt-5
2026-06-30 23:41:38 +00:00
Ettore Di Giacinto
d0fa463eac test(paged): mirror MoE swiglu down gate
Mirror the llama.cpp Phase 7 test gate for the merged MoE gate_up/SWIGLU/down chain and record the DGX md5/op gate evidence.

Assisted-by: Codex:gpt-5
2026-06-30 23:20:52 +00:00
Ettore Di Giacinto
34c4b5ce8d docs(paged): scope phase7 serving candidates
Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.

Assisted-by: Codex:gpt-5
2026-06-30 23:12:09 +00:00
Ettore Di Giacinto
b647460dee docs(paged): record phase6 serving classifier
Record both-engine serving nsys buckets, rejected sampler short-circuit, and rejected GDN/MMQ env grids for the GB10 parity work.

Assisted-by: Codex:gpt-5
2026-06-30 23:04:15 +00:00
Ettore Di Giacinto
f9e015d8e2 docs(paged): record W4A16 Wq padding rejection
Record the Phase 5 Wq shared-memory padding experiment, its gates, sub-threshold benchmark gain, and the decision to ship no 0051 patch.

Assisted-by: Codex:gpt-5
2026-06-30 22:23:14 +00:00
Ettore Di Giacinto
85c88320ef patches(paged): pad W4A16 A shared tile stride
Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification.

Assisted-by: Codex:gpt-5
2026-06-30 22:15:21 +00:00
Ettore Di Giacinto
8b413d1cbd docs(paged): record W4A16 scale broadcast rejection
Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID gates, the prefill regression, and the decision to ship no 0050 patch.

Assisted-by: Codex:gpt-5
2026-06-30 22:06:17 +00:00
Ettore Di Giacinto
c5f2545cdd patches(paged): tune W4A16 grouped tile shape
Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification.

Assisted-by: Codex:gpt-5
2026-06-30 21:57:42 +00:00
Ettore Di Giacinto
d8edc615e7 patches(paged): mirror W4A16 packed metadata
Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off.

Assisted-by: Codex:gpt-5
2026-06-30 21:21:53 +00:00
Ettore Di Giacinto
1c0709b700 docs(paged): record W4A16 phase1 kill gate
Record the clean forced W4A16 baseline, default comparison, selected metadata target, and completed plan checkpoint for the GB10 parity reopen.

Assisted-by: Codex:gpt-5
2026-06-30 20:40:40 +00:00
Ettore Di Giacinto
337ebb8a37 docs(paged): record phase0 decode repro
Record comparable graph-node-traced paged and vLLM decode difference-method artifacts for the GB10 parity reopen.

Assisted-by: Codex:gpt-5
2026-06-30 20:35:43 +00:00
Ettore Di Giacinto
ef5d4af203 docs(paged): record phase0 prefill baseline
Record clean-source MoE and dense prefill baselines for the GB10 parity reopen and mark the plan checkpoint complete.

Assisted-by: Codex:gpt-5
2026-06-30 20:22:18 +00:00
Ettore Di Giacinto
a9a2efb296 docs(paged): record phase0 clean build gates
Record the clean DGX build retry, binary provenance, canonical greedy md5 gates, and completed plan steps for the GB10 parity reopen.

Assisted-by: Codex:gpt-5
2026-06-30 20:19:14 +00:00
Ettore Di Giacinto
b1a1b721bd docs(paged): record GB10 parity artifact gaps
Add the read-only DGX artifact review for the Phase 0 parity reopen, including supported paged measurements and missing vLLM difference-method evidence.

Assisted-by: Codex:gpt-5
2026-06-30 15:55:16 +00:00
Ettore Di Giacinto
b3cfdfac4a docs(paged): record GB10 parity source provenance
Add the clean llama.cpp fork state, base merge point, patch count, and tree-match result for the Phase 0 parity reopen workflow.

Assisted-by: Codex:gpt-5
2026-06-30 15:54:23 +00:00
Ettore Di Giacinto
6ac06734e9 docs(paged): start GB10 parity phase0 record
Create the Phase 0 results record for the parity reopen workflow, including preflight, provenance, and baseline sections.

Assisted-by: Codex:gpt-5
2026-06-30 15:51:57 +00:00
Ettore Di Giacinto
d288a0300f docs(paged): add GB10 parity implementation plan
Add the Superpowers implementation plan for the GB10 parity reopen, including Phase 0 provenance, decode repro, W4A16 kill gates, and later kernel workstream entry criteria.

Assisted-by: Codex:gpt-5
2026-06-30 15:50:01 +00:00
Ettore Di Giacinto
f8d7b026cf docs(paged): scope GB10 parity reopen plan
Add a phased follow-up spec for challenging the GB10 vLLM-parity closure, including provenance gates, W4A16/GDN/MoE workstreams, and subagent ownership boundaries.

Assisted-by: Codex:gpt-5
2026-06-30 15:44:11 +00:00
Ettore Di Giacinto
de34cd5954 docs(paged): refresh parity handoff state
Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight.

Assisted-by: Codex:gpt-5
2026-06-30 15:27:44 +00:00
Ettore Di Giacinto
1b9176c2c8 docs(paged): codify fork-first patch workflow as mandatory policy
The fork mudler/llama.cpp branch localai-paged is the canonical source of
truth for all paged-backend kernel/patch work. Always update it FIRST: commit
the change on the fork branch and push it, then regenerate the LocalAI patch
series (backend/cpp/llama-cpp-localai-paged/patches/paged/) from the fork via
git format-patch so the series is a 1:1 drift-free mirror of the branch. Never
edit the LocalAI patch files directly, and never add a patch with no
corresponding fork-branch commit. The series is a derivative; the fork is the
source. The fork branch is also where the build and the per-path bit-exact md5
gate actually run, so it is the only place a change is truly validated.

Codified in two places:
- .agents/llama-cpp-localai-paged-backend.md: new "Fork-first workflow
  (MANDATORY)" section at the top of the patch/pin-sync material, plus the
  "Encapsulating your work" bullet now points at it.
- backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md: strengthened the
  hard-gate (section 2.5) into "Fork-first is MANDATORY", and corrected a stale
  numbering example (fork 51168c5ee "patch 0044" maps to worktree 0044, not the
  f32-only M5 which is worktree 0047).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-30 15:12:36 +00:00
Ettore Di Giacinto
2033086f60 patches(paged): track 0044 GatedRMSNorm patch, sync LocalAI series to fork 51168c5
The fork mudler/llama.cpp branch localai-paged is the canonical source of
truth for the paged-backend patch series. This file is the git format-patch
of fork commit 51168c5ee ("feat(paged): fused gated RMSNorm + SiLU gate-mul
CUDA op (patch 0044)"), verified byte-identical to that commit's format-patch
output. The full on-disk series applies clean in numeric order on the pinned
base and the resulting tree is byte-identical to the fork commit tree (tree
hash a73d759350277532a14e853e1fe78f08bbb74ce8), so the LocalAI series is a
drift-free 1:1 mirror of the fork branch.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-30 15:10:13 +00:00
Ettore Di Giacinto
8bb47e5a8a docs(paged): correct PARITY_HANDOFF ahead/behind + note dense CDEF gate md5
Ground-check follow-up to 2431090ff. Two factual corrections:

- Section 7 worktree line had the ahead/behind counts swapped ("25 ahead,
  197 behind"); the branch is actually ~199 ahead / 25 behind origin/master.
- Discrepancy item 5 flagged only the MoE CDEF PAGED_GATE_MD5 (0921716...);
  the dense run is symmetric (COMBINED_DEFINITIVE.txt records ecfe924d... for
  dense, which likewise differs from the canonical dense gate 5951a5b4). Both
  CDEF values come from combined_definitive.sh's own gate command, not the
  canonical bit-exact gate in section 3.3, so neither is sanctioned and both
  must be KL-validated.

Everything else in the handoff verified accurate: fork branch localai-paged
HEAD 51168c5ee (patch 0044) on dgx:~/llama-paged-fork, dev-tree HEAD a7d439e,
all md5/KL numbers, the 86%/1078/924 decode record, bench env, and all
referenced file/artifact paths.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-30 14:49:06 +00:00
Ettore Di Giacinto
2431090ff3 docs(paged): future-agent vLLM-parity HANDOFF guide (GB10, how-to companion to FINAL)
Adds docs/PARITY_HANDOFF.md: the operational how-to for an agent with zero
context picking up the GB10 vLLM-parity work. Complements VLLM_PARITY_FINAL.md
(the why/record) with TL;DR state, the hard gates (per-path bit-exact md5,
KL-gate, no LLAMA_MAX_BATCH_TOKENS, fork-is-canonical), a copy-pasteable
operational quickstart (ssh/lock/build/bench + the --cuda-graph-trace=node
decode-profiling rule that caused 4 wrong analyses), the complete tested-and-
rejected lever map, methodology lessons, the three forward directions, and a
key file/artifact index with the open discrepancies to reconcile.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-30 14:42:44 +00:00
Ettore Di Giacinto
baf1025245 docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced)
The decode-serving section characterized the high-N gap as "BW-floored, vLLM
pays equally / 56-68%". A clean uncontended graph-node-traced profile
(dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling
artifact: decode runs as a replayed CUDA graph, and nsys without
--cuda-graph-trace=node collapses each replay into one opaque launch, so every
prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient")
was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16
difference method.

Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound. GDN
recurrent scan 553 us/tok (51%, linear in batch, dominant), NVFP4 expert GEMM
254 (23%), bf16 proj 73 (7%), elementwise 57, SSM conv 31. Gap reconciled:
vLLM-server 1177 -> vLLM true GPU-steady 1078 (chunked-prefill overlap inflates
its window ~8pt) -> llama GPU-steady 924 (= 86% of 1078) -> llama-server 718
(61%, the ~17pt S3-recoverable serving graph-reuse overhead). So vs vLLM's true
GPU-steady decode we are ~86%, not 56%. GDN is a shared BW floor where paged
leads (83% vs 79% of 273 GB/s peak; both 1.17-1.18x for 2x batch).

The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton
elementwise +10ms); both ggml fusions rejected: act-quant-into-MMQ -79.4%
(ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling),
norm+quant+silu infeasible via ggml_cuda_can_fuse. Added rejected levers:
Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from
hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block
GDN occupancy to -1.04% (wave-hidden).

Revised verdict: PREFILL genuinely capped (36-43%, not graph-replayed so real);
DECODE-SERVING near-parity ~86% of vLLM true GPU-steady (headline 56% was a
measurement/operating-point artifact). GB10-vs-datacenter framing kept.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-30 14:16:06 +00:00
Ettore Di Giacinto
6edbb56b06 docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED)
Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the
exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid
models. Captures the definitive same-session both-engine benchmark (prefill
S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for
both the MoE 35B-A3B and dense 27B models), the complete lever map (every
prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict
and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality,
GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the
shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on
GB10, not missing optimizations; the path to parity is datacenter Blackwell.

Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the
marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP,
DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are
marked estimated. Add a section-9 summary + link in the backend README.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-30 11:57:36 +00:00
Ettore Di Giacinto
bd100dd20a fix(paged): repair the patch series, sync to the fork branch (drop dev-tree 0044/0045, add f32-only M5 as 0047)
The 0044/0045 patches were exported from the old bf16/hybrid dev tree and no
longer apply on the f32-only series (0026 ssm_bf16_tau is dropped), so the
build broke at `git apply`. Re-sync the vendored series to the now
feature-complete fork branch mudler/llama.cpp:localai-paged, which is the
canonical source (pin 0ed235ea + the paged patch commits in order).

- git rm the dev-tree-based 0044 (GDN M5, bf16-machinery base) and 0045
  (Marlin W4A16 offline-repack, not part of the fork branch).
- Add the fork branch's newest commit (2c32ab8b7, "GDN M5 tensor-core
  chunked-scan prefill, f32-only re-port") as 0047, generated with a single
  git format-patch off that branch. It sequences after 0046 (its parent on
  the branch) and recovers the prefill win 0044 encoded (+3.5% S_PP @npp512,
  +17.7% @npp2048), bit-exact per-path (test-backend-ops GATED_DELTA_NET
  46/46 default and force-M5; greedy md5 default-on == M5-forced == canonical).
- Track patch 0046 (dense-prefill geometry gate), which was on disk but never
  committed, so the series is complete in git.
- README: patch-table header 0001-0046 -> 0001-0047, replace the 0044 row with
  the f32-only 0047 row, fix the dangling 0044 prose references, note the
  bf16 M6/M7/M8 variants are not part of this f32-only series, and add a
  maintenance bullet that the series is now generated from the fork branch so
  there is no more patch-export drift.

Verified: on a pristine llama.cpp at pin 0ed235ea the full series 0001-0043,
0046, 0047 applies clean in sorted order with the Makefile's exact
`git apply --verbose` method (37/37 OK), and the resulting tree is
byte-identical to the fork branch tip 2c32ab8b7.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-30 07:54:46 +00:00
Ettore Di Giacinto
be65438eac docs(paged): record MoE-prefill engine-gap decomposition + GEMM-port negatives (default-off)
nsys cross-engine decomposition: the MoE prefill 64% gap vs vLLM is engine plumbing, not the kernel (GPU 97% busy, 443 vs 197 us/tok). Three buckets: per-expert W4A4 M-fragmentation (58%), GDN scan (24%), f32<->bf16 casts (15%). Offline-repack (0045) and verbatim vLLM-marlin port both trail FP4-MMQ via wrapper overhead, kept default-off as recorded negatives.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 17:20:07 +00:00
Ettore Di Giacinto
7b38c6b2a3 feat(paged): GDN M5 tensor-core chunked-scan prefill, default-on under paged KV (patch 0044)
Land the tensor-core forms of the chunked gated-DeltaNet prefill scan (0031)
as a single GDN_TC-selected build and ship the M5 variant (full TC form-T
solve + state-update mma) default-ON when LLAMA_KV_PAGED is set.

The dispatch defaults GDN_TC=5 and GDN_CHUNK_MIN=64 under paged KV (both
env-overridable; OFF/INT_MAX when not paged, so stock/non-paged stays
regression-free). GDN_CHUNK_MIN is the per-call engage threshold and stays > 1
so decode (1 tok/call) keeps the sequential recurrence; 64 was tuned from a
{1,32,64,128,256} sweep (32/64/128 all win on prefill, 256 barely fires because
the MoE-prefill per-call count is < 256, 1 collapses decode S_TG ~25%).

Measured GB10, q36-35b-a3b-nvfp4, LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1,
llama-batched-bench -ngl 99 -fa on -ntg 4 -npl 32:
  -npp 512  S_PP 2208.96 -> 2286.5 t/s  (+3.5%, mean of 3 interleaved A/B)
  -npp 2048 S_PP 2021.5  -> 2379.8 t/s  (+17.7%)
Decode S_TG unchanged (~399 vs ~397 t/s, within noise).

Bit-exactness (per-path greedy md5, n=48 --temp 0 --seed 1, paged): default-on
== M5-forced == canonical on the gate prompt - MoE 8cb0ce23, dense 5951a5b4.
test-backend-ops GATED_DELTA_NET 94/94 vs CPU with M5 forced (incl. multi-chunk
up to n_tokens=256). On a long MoE prompt the default (M5 fires at >=64 tokens)
and the sequential path agree word-for-word until one benign greedy token-flip;
dense is byte-identical. The chunked scan is a NEW per-path result (different FP
reduction order), NMSE-validated benign.

CUDA-only, gencode arch=compute_121a,code=sm_121a (GB10 / sm_121a). README
sections 3 (0044 row, 0031 superseded note) and 5 (dev-notes verdict) updated.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 06:42:11 +00:00