LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	cced07c7fe	docs(paged): add MTP shape trace patch Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results. Assisted-by: Codex:gpt-5	2026-07-01 02:54:29 +00:00
Ettore Di Giacinto	6e35476340	docs(paged): scope MTP graph-shape follow-up Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment. Assisted-by: Codex:gpt-5	2026-07-01 02:37:21 +00:00
Ettore Di Giacinto	ae76d42a96	docs(paged): profile MTP graph reuse loss Record Phase 16 nsys evidence that current MTP serving loses paged decode graph reuse and increases GPU work, explaining the Phase 15 serving regression. Assisted-by: Codex:gpt-5	2026-07-01 02:32:49 +00:00
Ettore Di Giacinto	4d171e62bb	docs(paged): reject MTP serving lever Add the repeatable MTP serving A/B runner and record Phase 15 results showing current llama-server MTP regresses GB10 serving throughput despite passing inference gates. Assisted-by: Codex:gpt-5	2026-07-01 02:29:28 +00:00
Ettore Di Giacinto	70394364a3	docs(paged): gate MTP rollback safety Record Phase 14 MTP rollback evidence, normalized greedy-prefix checks, and canonical inference gates. Assisted-by: Codex:gpt-5	2026-07-01 02:15:11 +00:00
Ettore Di Giacinto	e169058e73	chore(paged): add DGX inference gate runner Add a reusable paged llama.cpp gate script for DGX work. It checks docker/local-ai-worker/GPU lock state, runs the canonical MoE and dense transcript md5 gates, and runs selected test-backend-ops filters. Verified on dgx.casa: MoE 8cb0ce23777bf55f92f63d0292c756b0, dense 5951a5b4d624ce891e22ab5fca9bc439, MUL_MAT_ID 806/806. Artifact: /home/mudler/bench/paged_inference_gates/20260701_040048. Assisted-by: Codex:gpt-5	2026-07-01 02:01:55 +00:00
Ettore Di Giacinto	ede23df333	docs(paged): close W4A16 pad checklist Mark the rejected-branch disposition as not taken because Phase 4 was kept as patch 0050 with recorded md5, op, perf, and mirror gates. Assisted-by: Codex:gpt-5	2026-07-01 01:58:22 +00:00
Ettore Di Giacinto	abc70c209e	docs(paged): close ragged MoE dispatch shortcut Record the Phase 8 safety rerun, canonical transcript md5 gates, full and ragged MUL_MAT_ID op gates, and the no-production-patch decision for metadata-only fused dispatch work. Assisted-by: Codex:gpt-5	2026-07-01 01:57:45 +00:00
Ettore Di Giacinto	2074b4fb5b	docs(paged): reject GDN global Ai32 prototype Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10. Assisted-by: Codex:gpt-5	2026-07-01 01:51:53 +00:00
Ettore Di Giacinto	adabd11919	docs(paged): scope GDN global Ai32 prototype Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan. Assisted-by: Codex:gpt-5	2026-07-01 01:38:51 +00:00
Ettore Di Giacinto	1b5ae227eb	docs(paged): reject GDN M5 QS-early phase Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact. Assisted-by: Codex:gpt-5	2026-07-01 01:29:44 +00:00
Ettore Di Giacinto	24e778de47	docs(paged): scope GDN M5 state-boundary phase Add the Phase 11 design and implementation plan for a default-off C16 M5 QS-early GDN experiment after rejecting C32 slabs. Assisted-by: Codex:gpt-5	2026-07-01 01:21:05 +00:00
Ettore Di Giacinto	3da3b169fb	docs(paged): reject GDN C32 slab phase Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch. Assisted-by: Codex:gpt-5	2026-07-01 01:15:00 +00:00
Ettore Di Giacinto	ff3ad84191	docs(paged): record GDN C32 slab baseline Record the Phase 10 current-M5 prefill baseline and the source inspection finding that C32 M5 needs a real U-staging implementation rather than a launcher-only shortcut. Assisted-by: Codex:gpt-5	2026-07-01 00:58:54 +00:00
Ettore Di Giacinto	9bbe02c161	fix(paged): gate MTP backend sampling Record the Phase 9 MTP smoke gate, mirror the fork patch that disables backend sampling for MTP drafts, and scope the follow-up C32 slab GDN prefill phase. Assisted-by: Codex:gpt-5	2026-07-01 00:54:25 +00:00
Ettore Di Giacinto	b862e2c568	docs(paged): stop ragged dispatch source shortcut Assisted-by: Codex:gpt-5	2026-07-01 00:42:36 +00:00
Ettore Di Giacinto	b009de0ee0	test(paged): mirror ragged MoE dispatch gate Assisted-by: Codex:gpt-5	2026-07-01 00:41:21 +00:00
Ettore Di Giacinto	89ef3a4020	docs(paged): record ragged MoE profile gate Assisted-by: Codex:gpt-5	2026-07-01 00:35:21 +00:00
Ettore Di Giacinto	ef14748f06	docs(paged): scope ragged MoE dispatch phase Assisted-by: Codex:gpt-5	2026-07-01 00:26:01 +00:00
Ettore Di Giacinto	b6885aa446	docs(paged): reject weighted combine fusion candidate Assisted-by: Codex:gpt-5	2026-07-01 00:20:53 +00:00
Ettore Di Giacinto	4b6fc0fa1c	test(paged): mirror MoE weighted combine gate Assisted-by: Codex:gpt-5	2026-06-30 23:51:52 +00:00
Ettore Di Giacinto	22a93ce1a3	docs(paged): select weighted combine candidate Assisted-by: Codex:gpt-5	2026-06-30 23:47:34 +00:00
Ettore Di Giacinto	3cf7fa1715	docs(paged): reject swiglu down fusion candidate Assisted-by: Codex:gpt-5	2026-06-30 23:41:38 +00:00
Ettore Di Giacinto	d0fa463eac	test(paged): mirror MoE swiglu down gate Mirror the llama.cpp Phase 7 test gate for the merged MoE gate_up/SWIGLU/down chain and record the DGX md5/op gate evidence. Assisted-by: Codex:gpt-5	2026-06-30 23:20:52 +00:00
Ettore Di Giacinto	34c4b5ce8d	docs(paged): scope phase7 serving candidates Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates. Assisted-by: Codex:gpt-5	2026-06-30 23:12:09 +00:00
Ettore Di Giacinto	b647460dee	docs(paged): record phase6 serving classifier Record both-engine serving nsys buckets, rejected sampler short-circuit, and rejected GDN/MMQ env grids for the GB10 parity work. Assisted-by: Codex:gpt-5	2026-06-30 23:04:15 +00:00
Ettore Di Giacinto	f9e015d8e2	docs(paged): record W4A16 Wq padding rejection Record the Phase 5 Wq shared-memory padding experiment, its gates, sub-threshold benchmark gain, and the decision to ship no 0051 patch. Assisted-by: Codex:gpt-5	2026-06-30 22:23:14 +00:00
Ettore Di Giacinto	85c88320ef	patches(paged): pad W4A16 A shared tile stride Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification. Assisted-by: Codex:gpt-5	2026-06-30 22:15:21 +00:00
Ettore Di Giacinto	8b413d1cbd	docs(paged): record W4A16 scale broadcast rejection Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID gates, the prefill regression, and the decision to ship no 0050 patch. Assisted-by: Codex:gpt-5	2026-06-30 22:06:17 +00:00
Ettore Di Giacinto	c5f2545cdd	patches(paged): tune W4A16 grouped tile shape Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification. Assisted-by: Codex:gpt-5	2026-06-30 21:57:42 +00:00
Ettore Di Giacinto	d8edc615e7	patches(paged): mirror W4A16 packed metadata Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off. Assisted-by: Codex:gpt-5	2026-06-30 21:21:53 +00:00
Ettore Di Giacinto	1c0709b700	docs(paged): record W4A16 phase1 kill gate Record the clean forced W4A16 baseline, default comparison, selected metadata target, and completed plan checkpoint for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:40:40 +00:00
Ettore Di Giacinto	337ebb8a37	docs(paged): record phase0 decode repro Record comparable graph-node-traced paged and vLLM decode difference-method artifacts for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:35:43 +00:00
Ettore Di Giacinto	ef5d4af203	docs(paged): record phase0 prefill baseline Record clean-source MoE and dense prefill baselines for the GB10 parity reopen and mark the plan checkpoint complete. Assisted-by: Codex:gpt-5	2026-06-30 20:22:18 +00:00
Ettore Di Giacinto	a9a2efb296	docs(paged): record phase0 clean build gates Record the clean DGX build retry, binary provenance, canonical greedy md5 gates, and completed plan steps for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:19:14 +00:00
Ettore Di Giacinto	b1a1b721bd	docs(paged): record GB10 parity artifact gaps Add the read-only DGX artifact review for the Phase 0 parity reopen, including supported paged measurements and missing vLLM difference-method evidence. Assisted-by: Codex:gpt-5	2026-06-30 15:55:16 +00:00
Ettore Di Giacinto	b3cfdfac4a	docs(paged): record GB10 parity source provenance Add the clean llama.cpp fork state, base merge point, patch count, and tree-match result for the Phase 0 parity reopen workflow. Assisted-by: Codex:gpt-5	2026-06-30 15:54:23 +00:00
Ettore Di Giacinto	6ac06734e9	docs(paged): start GB10 parity phase0 record Create the Phase 0 results record for the parity reopen workflow, including preflight, provenance, and baseline sections. Assisted-by: Codex:gpt-5	2026-06-30 15:51:57 +00:00
Ettore Di Giacinto	d288a0300f	docs(paged): add GB10 parity implementation plan Add the Superpowers implementation plan for the GB10 parity reopen, including Phase 0 provenance, decode repro, W4A16 kill gates, and later kernel workstream entry criteria. Assisted-by: Codex:gpt-5	2026-06-30 15:50:01 +00:00
Ettore Di Giacinto	f8d7b026cf	docs(paged): scope GB10 parity reopen plan Add a phased follow-up spec for challenging the GB10 vLLM-parity closure, including provenance gates, W4A16/GDN/MoE workstreams, and subagent ownership boundaries. Assisted-by: Codex:gpt-5	2026-06-30 15:44:11 +00:00
Ettore Di Giacinto	de34cd5954	docs(paged): refresh parity handoff state Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight. Assisted-by: Codex:gpt-5	2026-06-30 15:27:44 +00:00
Ettore Di Giacinto	1b9176c2c8	docs(paged): codify fork-first patch workflow as mandatory policy The fork mudler/llama.cpp branch localai-paged is the canonical source of truth for all paged-backend kernel/patch work. Always update it FIRST: commit the change on the fork branch and push it, then regenerate the LocalAI patch series (backend/cpp/llama-cpp-localai-paged/patches/paged/) from the fork via git format-patch so the series is a 1:1 drift-free mirror of the branch. Never edit the LocalAI patch files directly, and never add a patch with no corresponding fork-branch commit. The series is a derivative; the fork is the source. The fork branch is also where the build and the per-path bit-exact md5 gate actually run, so it is the only place a change is truly validated. Codified in two places: - .agents/llama-cpp-localai-paged-backend.md: new "Fork-first workflow (MANDATORY)" section at the top of the patch/pin-sync material, plus the "Encapsulating your work" bullet now points at it. - backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md: strengthened the hard-gate (section 2.5) into "Fork-first is MANDATORY", and corrected a stale numbering example (fork 51168c5ee "patch 0044" maps to worktree 0044, not the f32-only M5 which is worktree 0047). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 15:12:36 +00:00
Ettore Di Giacinto	2033086f60	patches(paged): track 0044 GatedRMSNorm patch, sync LocalAI series to fork 51168c5 The fork mudler/llama.cpp branch localai-paged is the canonical source of truth for the paged-backend patch series. This file is the git format-patch of fork commit 51168c5ee ("feat(paged): fused gated RMSNorm + SiLU gate-mul CUDA op (patch 0044)"), verified byte-identical to that commit's format-patch output. The full on-disk series applies clean in numeric order on the pinned base and the resulting tree is byte-identical to the fork commit tree (tree hash a73d759350277532a14e853e1fe78f08bbb74ce8), so the LocalAI series is a drift-free 1:1 mirror of the fork branch. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 15:10:13 +00:00
Ettore Di Giacinto	8bb47e5a8a	docs(paged): correct PARITY_HANDOFF ahead/behind + note dense CDEF gate md5 Ground-check follow-up to `2431090ff`. Two factual corrections: - Section 7 worktree line had the ahead/behind counts swapped ("25 ahead, 197 behind"); the branch is actually ~199 ahead / 25 behind origin/master. - Discrepancy item 5 flagged only the MoE CDEF PAGED_GATE_MD5 (0921716...); the dense run is symmetric (COMBINED_DEFINITIVE.txt records ecfe924d... for dense, which likewise differs from the canonical dense gate 5951a5b4). Both CDEF values come from combined_definitive.sh's own gate command, not the canonical bit-exact gate in section 3.3, so neither is sanctioned and both must be KL-validated. Everything else in the handoff verified accurate: fork branch localai-paged HEAD 51168c5ee (patch 0044) on dgx:~/llama-paged-fork, dev-tree HEAD a7d439e, all md5/KL numbers, the 86%/1078/924 decode record, bench env, and all referenced file/artifact paths. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:49:06 +00:00
Ettore Di Giacinto	2431090ff3	docs(paged): future-agent vLLM-parity HANDOFF guide (GB10, how-to companion to FINAL) Adds docs/PARITY_HANDOFF.md: the operational how-to for an agent with zero context picking up the GB10 vLLM-parity work. Complements VLLM_PARITY_FINAL.md (the why/record) with TL;DR state, the hard gates (per-path bit-exact md5, KL-gate, no LLAMA_MAX_BATCH_TOKENS, fork-is-canonical), a copy-pasteable operational quickstart (ssh/lock/build/bench + the --cuda-graph-trace=node decode-profiling rule that caused 4 wrong analyses), the complete tested-and-rejected lever map, methodology lessons, the three forward directions, and a key file/artifact index with the open discrepancies to reconcile. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:42:44 +00:00
Ettore Di Giacinto	baf1025245	docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced) The decode-serving section characterized the high-N gap as "BW-floored, vLLM pays equally / 56-68%". A clean uncontended graph-node-traced profile (dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling artifact: decode runs as a replayed CUDA graph, and nsys without --cuda-graph-trace=node collapses each replay into one opaque launch, so every prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient") was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16 difference method. Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound. GDN recurrent scan 553 us/tok (51%, linear in batch, dominant), NVFP4 expert GEMM 254 (23%), bf16 proj 73 (7%), elementwise 57, SSM conv 31. Gap reconciled: vLLM-server 1177 -> vLLM true GPU-steady 1078 (chunked-prefill overlap inflates its window ~8pt) -> llama GPU-steady 924 (= 86% of 1078) -> llama-server 718 (61%, the ~17pt S3-recoverable serving graph-reuse overhead). So vs vLLM's true GPU-steady decode we are ~86%, not 56%. GDN is a shared BW floor where paged leads (83% vs 79% of 273 GB/s peak; both 1.17-1.18x for 2x batch). The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton elementwise +10ms); both ggml fusions rejected: act-quant-into-MMQ -79.4% (ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling), norm+quant+silu infeasible via ggml_cuda_can_fuse. Added rejected levers: Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block GDN occupancy to -1.04% (wave-hidden). Revised verdict: PREFILL genuinely capped (36-43%, not graph-replayed so real); DECODE-SERVING near-parity ~86% of vLLM true GPU-steady (headline 56% was a measurement/operating-point artifact). GB10-vs-datacenter framing kept. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:16:06 +00:00
Ettore Di Giacinto	6edbb56b06	docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED) Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid models. Captures the definitive same-session both-engine benchmark (prefill S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for both the MoE 35B-A3B and dense 27B models), the complete lever map (every prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality, GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on GB10, not missing optimizations; the path to parity is datacenter Blackwell. Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP, DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are marked estimated. Add a section-9 summary + link in the backend README. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 11:57:36 +00:00
Ettore Di Giacinto	bd100dd20a	fix(paged): repair the patch series, sync to the fork branch (drop dev-tree 0044/0045, add f32-only M5 as 0047) The 0044/0045 patches were exported from the old bf16/hybrid dev tree and no longer apply on the f32-only series (0026 ssm_bf16_tau is dropped), so the build broke at `git apply`. Re-sync the vendored series to the now feature-complete fork branch mudler/llama.cpp:localai-paged, which is the canonical source (pin 0ed235ea + the paged patch commits in order). - git rm the dev-tree-based 0044 (GDN M5, bf16-machinery base) and 0045 (Marlin W4A16 offline-repack, not part of the fork branch). - Add the fork branch's newest commit (2c32ab8b7, "GDN M5 tensor-core chunked-scan prefill, f32-only re-port") as 0047, generated with a single git format-patch off that branch. It sequences after 0046 (its parent on the branch) and recovers the prefill win 0044 encoded (+3.5% S_PP @npp512, +17.7% @npp2048), bit-exact per-path (test-backend-ops GATED_DELTA_NET 46/46 default and force-M5; greedy md5 default-on == M5-forced == canonical). - Track patch 0046 (dense-prefill geometry gate), which was on disk but never committed, so the series is complete in git. - README: patch-table header 0001-0046 -> 0001-0047, replace the 0044 row with the f32-only 0047 row, fix the dangling 0044 prose references, note the bf16 M6/M7/M8 variants are not part of this f32-only series, and add a maintenance bullet that the series is now generated from the fork branch so there is no more patch-export drift. Verified: on a pristine llama.cpp at pin 0ed235ea the full series 0001-0043, 0046, 0047 applies clean in sorted order with the Makefile's exact `git apply --verbose` method (37/37 OK), and the resulting tree is byte-identical to the fork branch tip 2c32ab8b7. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 07:54:46 +00:00
Ettore Di Giacinto	be65438eac	docs(paged): record MoE-prefill engine-gap decomposition + GEMM-port negatives (default-off) nsys cross-engine decomposition: the MoE prefill 64% gap vs vLLM is engine plumbing, not the kernel (GPU 97% busy, 443 vs 197 us/tok). Three buckets: per-expert W4A4 M-fragmentation (58%), GDN scan (24%), f32<->bf16 casts (15%). Offline-repack (0045) and verbatim vLLM-marlin port both trail FP4-MMQ via wrapper overhead, kept default-off as recorded negatives. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 17:20:07 +00:00
Ettore Di Giacinto	7b38c6b2a3	feat(paged): GDN M5 tensor-core chunked-scan prefill, default-on under paged KV (patch 0044) Land the tensor-core forms of the chunked gated-DeltaNet prefill scan (0031) as a single GDN_TC-selected build and ship the M5 variant (full TC form-T solve + state-update mma) default-ON when LLAMA_KV_PAGED is set. The dispatch defaults GDN_TC=5 and GDN_CHUNK_MIN=64 under paged KV (both env-overridable; OFF/INT_MAX when not paged, so stock/non-paged stays regression-free). GDN_CHUNK_MIN is the per-call engage threshold and stays > 1 so decode (1 tok/call) keeps the sequential recurrence; 64 was tuned from a {1,32,64,128,256} sweep (32/64/128 all win on prefill, 256 barely fires because the MoE-prefill per-call count is < 256, 1 collapses decode S_TG ~25%). Measured GB10, q36-35b-a3b-nvfp4, LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1, llama-batched-bench -ngl 99 -fa on -ntg 4 -npl 32: -npp 512 S_PP 2208.96 -> 2286.5 t/s (+3.5%, mean of 3 interleaved A/B) -npp 2048 S_PP 2021.5 -> 2379.8 t/s (+17.7%) Decode S_TG unchanged (~399 vs ~397 t/s, within noise). Bit-exactness (per-path greedy md5, n=48 --temp 0 --seed 1, paged): default-on == M5-forced == canonical on the gate prompt - MoE 8cb0ce23, dense 5951a5b4. test-backend-ops GATED_DELTA_NET 94/94 vs CPU with M5 forced (incl. multi-chunk up to n_tokens=256). On a long MoE prompt the default (M5 fires at >=64 tokens) and the sequential path agree word-for-word until one benign greedy token-flip; dense is byte-identical. The chunked scan is a NEW per-path result (different FP reduction order), NMSE-validated benign. CUDA-only, gencode arch=compute_121a,code=sm_121a (GB10 / sm_121a). README sections 3 (0044 row, 0031 superseded note) and 5 (dev-notes verdict) updated. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 06:42:11 +00:00

7112 Commits