LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-27 09:57:14 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	9b0e4e544c	docs(paged): residual-assess FINAL - MoE at bit-exact ceiling, hunt DONE Conclude the MoE-parity hunt. The two remaining sub-levers in the 20.3-vs-13.8 ms projection bucket are both bit-changing or at the BW floor: - convert-glue (3.24 ms/step, measured: 1.73 input f32->bf16 + 1.52 output bf16->f32): NOT bit-exact eliminable. ggml-cuda.cu:1663-1690 rounds the f32 GEMM accumulator to bf16 (CUDA_R_16BF dst) then widens to f32; that bf16-rounded value is load-bearing for the shipped md5. Removing the round-trip (f32-direct output, bf16 residual stream, or NVFP4 weights) all rebaseline md5. A precision boundary, like lever 4. - bf16 projection GEMM (17.27 ms/step): BW-bound at the LPDDR5x floor (~4.7 GB/step at 273 GB/s; M=128 -> 128 FLOP/byte vs >900 ridge). nvjet already TMA-streams the weights; cutlass reads the same bytes. No kernel lever; only fewer bytes (quantize) helps - rejected on quality. Corrects the body premise that vLLM runs these projections as NVFP4-Marlin: vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the projection bucket is a matched-precision gap, not a quant gap. Realistic bit-exact MoE ceiling ~86-88% of vLLM; shipped lever 1 (86.3%) is at it. No one-more-lever for MoE. Only clean win left is DENSE (+0.41% lever 5), gated behind resolving the paged-MoE baseline md5 drift. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 01:05:06 +00:00
Ettore Di Giacinto	e3f8149f3b	docs(paged): lever-4 KL-gate FAIL - NVFP4 MoE projections cost ~6% PPL, no-ship Re-quantizing the MoE GGUF's bf16 GDN/attn projections to NVFP4 (the lever-4 scope hypothesis) fails the KL gate on every axis vs the shipping NVFP4 baseline: PPL +6.51% (FULL) / +6.15% (CONS) against a <1% gate, mean KLD-to-f16 0.164/0.172 vs baseline 0.137, top-1 argmax agreement down ~2.2-2.6 points. Both projq variants rejected; in_proj_ba being kept bf16 (CONS) recovered almost nothing, so the damage is in the bulk attn/GDN projections. Root cause: the bf16 projections are a deliberate modelopt precision choice, not a provenance accident. vLLM runs the same modelopt checkpoint, so it keeps these projections bf16 too - the baseline GGUF already matches vLLM. The ~20.3ms projection-GEMM bucket is the price of high-precision projections that vLLM also pays; it is not the llama-vs-vLLM lever it appeared to be. The speed win is only purchasable with a 6% PPL regression. MoE stays at 86.3% of vLLM @ npl128. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 23:36:38 +00:00
Ettore Di Giacinto	9a1be79f04	docs(paged): lever-4 scope - NVFP4 the still-bf16 MoE GDN/attn projections (+6.5ms bucket) Scopes lever 4 (read-only, no GPU) on top of the flat levers 2+3. Root cause: the MoE GGUF (nvidia modelopt, 241 NVFP4 tensors) quantized only the experts and left the GDN/attn linear projections in BF16, while the dense GGUF (unsloth, 304 NVFP4 tensors) already has them NVFP4 (proven: dense ssm_out runs FP4 MMQ; dense decode at 96.6% of vLLM). Lever 4 = re-quantize the MoE GGUF's bf16 GDN/attn projections to NVFP4, the same move vLLM makes on the identical weights - the +6.5ms projections bucket, the largest single banked MoE gain available. Path: offline re-quantize to a new GGUF variant (expanded --tensor-type); zero kernel code - the loader sidecar-scale path + tuned mul_mat_q<NVFP4> are already in tree and proven by the dense GGUF. Bit-changing => KL-gate, not md5. KL expected to pass (per-step non-accumulating weight quant, unlike the failed bf16-state; experts already W4A4-clean); lm_head is the one risky tensor (gate on argmax-agreement). Expected ~+4-6.5ms => MoE 86.3% -> ~88-91% of vLLM. Recommend a separate OPT-IN gallery variant (preserve the bit-exact default; promote to default only if the KL gate is clean). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 22:57:05 +00:00
Ettore Di Giacinto	62c407ed55	docs(paged): lever1 gather-fusion bench landed - checkpoint + attribution (patch 0028) Anchors the rigorous same-session A/B validation of patch 0028 (residual conv-state tap k_get_rows fusion) on this worktree branch with sign-off attribution. The regenerated 0028 patch + bench-updated LEVER1_GATHER_RESULTS.md first landed via a concurrent origin/master merge (`c1f1d1e8e`) that swept the staged files; this records the provenance and the bench summary in the checkpoint. Gate (bit-exact, greedy --temp 0 --seed 1 -n 48): dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (both == baseline; base == lever1). decode_agg npl128: dense 369.95 -> 377.83 t/s (+2.13%, 96.6% of vLLM), MoE 763.47 -> 777.95 t/s (+1.90%, 86.3% of vLLM). nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920), 358.37 -> 133.52 ms, step -3.13 ms. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 21:41:45 +00:00
Ettore Di Giacinto	c1f1d1e8ea	Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention # Conflicts: # gallery/index.yaml	2026-06-26 21:38:56 +00:00
Ettore Di Giacinto	79edfd26a3	feat(gallery): -paged suffix rename + qwopus NVFP4-MTP paged variants Rename the two base NVFP4 entries to a consistent -paged suffix (qwen3.6-27b-nvfp4 -> qwen3.6-27b-nvfp4-paged, qwen3.6-35b-a3b-nvfp4 -> qwen3.6-35b-a3b-nvfp4-paged) so all four base/MTP paged entries share the naming convention. Update the two matching examples in the backend plan doc. Add qwopus3.6-27b-v2-mtp-nvfp4-paged and qwopus3.6-27b-coder-mtp-nvfp4-paged: verbatim copies of the stock qwopus NVFP4-MTP entries (same GGUF uri/sha256, sampling, template, tags, function block) rewired onto the LocalAI paged-attention stack (backend llama-cpp-localai-paged; f16, flash_attention, 131072 context, 99 gpu_layers, batch 512; paged_kv + max_batch_tokens:512 + kv_unified:false + parallel:128). The stock entries are left untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 21:26:14 +00:00
Ettore Di Giacinto	b1667b48ea	feat(paged): qwen35 recurrent-state gather fusion (patch 0028) Fuse the residual k_get_rows_float in the gated-DeltaNet decode path (the biggest single kernel vLLM lacks per MOE_GAP_VS_VLLM.md, ~5.2 ms/step MoE). 0019 fused the SSM-state gather, 0021 fused the conv compute but kept a build_rs gather for the conv taps; nsys located that conv-state tap gather (n_embd_r=24576 floats x 128 seqs, ~720 x ~115 us per 24-step window) as the last k_get_rows in the GDN path. New op ggml_ssm_conv_update_inplace_ids reads each sequence's prior conv taps from cache[ids[s]] in-kernel (identity in place from the write slot, non-identity via a disjoint scratch), mirroring the 0019 in-place + ids fusion. Bit-exact: read VALUES unchanged, only the read path changes. Helps both dense and MoE (shared GDN conv). GATE test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS, SSM_CONV_UPDATE, SSM_CONV, GATED_DELTA_NET, GET_ROWS all PASS. GATE greedy md5 (-temp 0 -seed 1 -n 48) BYTE-IDENTICAL both models: q36-27b-nvfp4 5951a5b4..., q36-35b-a3b-nvfp4 07db32c2... nsys: k_get_rows<float,float> 10174 -> 9454 instances, 186.3 -> 102.8 ms (720 conv gathers eliminated, replaced by a ~1.1 us no-op gather). Built and gated on the DGX llama tree (branch paged, commit 944636c, f32 default). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 20:59:59 +00:00
Ettore Di Giacinto	6c6a925213	docs(paged): MoE-vs-vLLM DECIDE synthesis - reject W4A16 Marlin, the GEMM is a llama win Cross-agent synthesis on top of the both-engine nsys decomposition (`3b5957157`): settle the user's "can we do what vLLM does on MoE?" question with the three converging investigations (groundtruth measurement + vllm-marlin source-read + marlin-port feasibility). Verdict: vLLM's ~15% MoE-decode lead is NOT the Marlin GEMM (that bucket is a -1.7 ms llama WIN: native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 at the ragged tiny-M decode shape, both at the LPDDR5x BW floor). The gap is bf16 dense-projection bandwidth (+6.5), recurrence state-gather plumbing (+6.6, led by k_get_rows 5.2), graph/stream-overlap overhead (~+7), W4A4 act-quant tax (+3.3), and router/glue (+5.4). A W4A16/Marlin grouped MoE GEMM is REJECTED (default and opt-in): it would regress the 27% GEMM bucket to half-rate bf16 MMA, re-enter the GB10 occupancy wall the dense scaffold already STOPPED at, and its entire intrinsic upside is the ~2% act-quant tax - smaller than the bit-exact +1.9% the 0025 re-graph already banked, and closeable bit-exactly by fusing the act-quant. Recommended build (none a new MoE GEMM): (1) fuse the k_get_rows SSM-state gather (bit-exact, ~+5, biggest single-kernel win); (2) extend CUDA-graph coverage + stream overlap (bit-exact, ~+7); (3) fuse the W4A4 act-quant into RMSNorm/SiLU (bit-exact, +3.3); (4) NVFP4-quantize the still-bf16 GDN/attn projections + lm_head (bit-changing, +6.5, the same NVFP4-dense-quant move vLLM makes). Bit-exact levers alone reach ~94% of vLLM; with the projection quant ~96-97%, parity-or-better physically in reach since both heaviest kernels (SSM core, MoE GEMM) are already llama wins. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 20:14:30 +00:00
Ettore Di Giacinto	3b59571579	docs(paged): both-engine MoE decode decomposition - the 15% is NOT the Marlin GEMM Ground-truth side-by-side per-kernel ms/step of the MoE decode gap on DGX GB10. llama (752 t/s, step 169.8ms) vs vLLM graphs-on (901-equiv, step 142.0ms): 27.8ms gap. Headline: the grouped MoE-expert GEMM is a llama WIN - native FP4-MMA W4A4 47.3ms vs vLLM Marlin W4A16 50.0ms at the tiny-M decode shape. A Marlin-style W4A16 MoE GEMM would be slower; it is not the lever (extends the w4a16-marlin DENSE verdict). The 15% lives elsewhere: bf16 projections + convert glue (+6.5ms), recurrence state-gather plumbing (+6.6ms, led by k_get_rows 5.2ms), graph coverage + stream overlap (~+7ms), W4A4 act-quant tax (+3.3ms), router/glue (+5.4ms). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 20:11:40 +00:00
Ettore Di Giacinto	b3d3323105	feat(paged): wire ssm_bf16_tau model option for hybrid SSM-state fast mode Patch 0026 added the hybrid per-head bf16 SSM-state opt-in as the ssm_hybrid_tau_thresh cparam + the --ssm-bf16-tau CLI flag (default 0 = bit-exact f32). Expose it per-model via the LocalAI gallery/model YAML `options:` list, mirroring the paged_kv / max_batch_tokens setenv hooks. - grpc-server.cpp: new `ssm_bf16_tau` (alias `ssm_hybrid_tau`) option -> setenv(LLAMA_SSM_BF16_TAU) when the value parses to a positive float. It does NOT reference the paged-only common_params field, so the turboquant fork (which lacks patch 0026) stays byte-clean. - patch 0026 (common.cpp common_context_params_to_llama): getenv fallback feeds cparams.ssm_hybrid_tau_thresh from LLAMA_SSM_BF16_TAU only when the --ssm-bf16-tau CLI flag is unset (0). Absent/non-positive env => untouched, so stock stays bit-exact; the CLI flag takes precedence when set. - docs: backend/index.yaml note, docs backends.md, gallery header NOTE (referencing A_HYBRID_SSM_RESULTS.md; the 2 NVFP4 entries stay bit-exact). Byte-safe when unset: with no ssm_bf16_tau option the env is never touched and the default f32 bit-exact recurrence is preserved. Verified the parse + consume code paths with a standalone compile-and-run (option string -> LLAMA_SSM_BF16_TAU -> tau, plus 0 / garbage / CLI-precedence / unset cases). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 19:51:00 +00:00
Ettore Di Giacinto	9c1c2a6a16	docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceiling ~85% of vLLM B-3 (the 0017-deferred mmq_y-down warp-remap of the NVFP4 grouped FP4-MMA mul_mat_q) was built bit-exact on the clean 0025 base and measured: the grouped GEMM kernel itself runs -1.3% (occupancy did rise via the nwarps=4 warp-remap / 128 threads-per-CTA), but end-to-end MoE decode is FLAT (npl128 +0.4%, npl32 +0.3%, within noise) because the stream-k fixup grows +42% (mmq_y=64 doubles the row-tiles) and the step is SSM/BW-bound. md5 PASS both models, test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806 PASS. No patch 0028; DGX dev tree reverted to pristine 0025. Assessment: the bit-exact MoE GEMM/launch track is exhausted (B-1 re-graph banked ~82->85%; B-2 and B-3 are 0). Honest bit-exact MoE ceiling = ~85% of vLLM @npl128. The residual is the structural Marlin-NvFp4 grouped-GEMM gap that no bit-exact lever closes. Recommend shipping the ~85% bit-exact default and exposing the held 0026 bf16-SSM as a default-off opt-in (it reaches ~95% but is non-bit-exact and fails the MoE KL gate). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 19:10:24 +00:00
Ettore Di Giacinto	1f857f179e	docs(paged): B-2 down_proj act-quant retune RESULT - negative (no headroom) B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode step). Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted - it does not lift. Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128 ~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this 256-tiny-expert BW-bound model. Bit-exactness proven: md5 identical at block 64/128/256 for both models (the per-thread quant body is untouched; thread->output map is invariant to blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS. MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM). No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 18:31:51 +00:00
Ettore Di Giacinto	33dfe7fd41	feat(paged): qwen35 hybrid per-head f32/bf16 SSM state - carry fix + gate sweep (patch 0026) Regenerate patch 0026 with the hybrid-decode carry fix and record the KL/throughput gate-sweep results. Fix: clear(data=true) zeroes the whole recurrent buffer including the head_slot maps, which were uploaded only once at construction; after the post-warmup reset every head read head_slot==0 (f32-local-0), collapsing the split and producing incoherent decode. Persist head_slot_host and re-upload via upload_head_slots() after every buffer clear. Hybrid decode is now coherent and the cross-op state carry is byte-exact (write==read, both partitions). Gate result: de-risk PASS (test-backend-ops 84/84; T=0 md5 == 0023 baseline, both models). Ship gate FAILS - no T_thresh meets MeanKLD<1e-3 AND same-top-p>=99.5% with a meaningful speedup. The premise that the bf16 error concentrates in long-memory heads is refuted: KL scales with the bf16 head count and saturates ~0.06/~91% (MoE saturates at the minimal split). The carry is byte-exact, so this is genuine bf16 sensitivity, not a bug. The byte-saving lever is real (dense +12.4%, MoE +11.5% decode @npl128 at T=128) but cannot meet the strict KL bar. Shipped default-off (f32, bit-exact opt-out); hybrid is opt-in only and not recommended in the gallery config. Full tables in A_HYBRID_SSM_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 17:44:05 +00:00
Ettore Di Giacinto	fe5bd3f53d	feat(paged): qwen35 hybrid per-head f32/bf16 SSM state (patch 0026) Lever A patch + build/de-risk results. Splits the persisted gated-DeltaNet recurrent state per head: f32 on long-memory heads (where bf16 rounding does not contract and the KL error concentrates), bf16 on fast-decaying heads, classified at model load by tau_h = 1/(\|ssm_a\|*softplus(ssm_dt)). Default ssm_hybrid_tau_thresh = 0.0 keeps every head f32 (bit-exact opt-out). De-risk gates BOTH PASS: test-backend-ops GATED_DELTA_NET CUDA0 OK (incl 32 hybrid mixed CUDA-vs-CPU cases); default all-f32 greedy md5 == 0023 baseline both models (dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd). Known open issue (opt-in hybrid only; default unaffected): hybrid-ON model decode (ids in-place path) is incoherent; classifier/cache/kernel-params verified correct, bug isolated to the ids in-place cross-step state path. See A_HYBRID_SSM_RESULTS.md. Not ready for the GateSweep until fixed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code]	2026-06-26 16:21:33 +00:00
Ettore Di Giacinto	6bfca146d6	docs(paged): speedup-hunt C section + final RANK + PLAN synthesis Append lever C (structural dense residual: lm_head + scheduling) findings and the master RANK + PLAN section to SPEEDUP_HUNT.md. Per-lever scorecard (gain x tractability x gate), ranked build order, the concrete A build plan for the hybrid per-head f32/bf16 SSM state cache, and the ordered B/C/D queue with each one's build trigger. Verdict: ship the MoE re-graph (patch 0025, measured +1.9-4.4%, both gates PASSED) now; build A as the lead (only lever ABOVE vLLM on dense, KL-gated, ~430-454 t/s = 103-108% of vLLM); bank B-2/B-3 on MoE; C last (<1% bit-exact, dead-end); D opt-in-only and dense-only behind the same KL gate bf16-SSM failed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 14:56:53 +00:00
Ettore Di Giacinto	4d3fecd524	docs(paged): MoE decode re-graph lever (patch 0025) + speedup-hunt B findings Mirror of llama.cpp dev-tree patch 0025 (qwen35moe NVFP4 MoE-decode re-graph) and the GPU-agent B findings in SPEEDUP_HUNT.md: re-confirmed MoE decode decomposition @npl128, the measured re-graph lever (+4.4%/+2.9%/+1.9% decode_agg at npl 32/64/128; bit-exact: test-backend-ops MUL_MAT_ID 806/806 + parallel-greedy np16 byte-identical ON==OFF), grouped-GEMM occupancy headroom (exhausted on this bandwidth-bound model), and the W4A16 assessment (rejected: non-bit-exact, slower BF16 MMA). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 14:53:14 +00:00
Ettore Di Giacinto	ec7c1b1f68	feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches) The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the paged patch-stack (0001-0024) to the new tip: the stack was rebased onto 9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated bit-exact before re-exporting the LocalAI .patch files. Re-exporting each shipped patch from its rebased commit and diffing body-to-body against the committed files identifies exactly 4 that changed and no longer git-apply to 9d5d882d: - 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to the refactored update_slots() lambda (continue->return, batch.n_tokens-> batch.size()); identical env-guarded logic. - 0013 static prefill budget: budget var-block / while-gate / admission-break re-expressed against the refactored loop (add_ok=false idiom). - 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The inserted lines are unchanged. (This one rebased cleanly via 3-way but its committed .patch no longer applies with plain git apply, so it is caught by the per-patch apply-check, not by the rebase conflict count.) - 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch = batch.size() + add_ok=false against the refactored loop. All four are byte-faithful format-patch exports of the gate-green rebased commits. Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green tree byte-for-byte across every code file. The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only (hunk bodies byte-identical, only @@ line-numbers shifted) and still apply cleanly, so they are left unchanged. The remaining patches are identical. Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121): - test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 all OK. - greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd, both == baseline. - decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s (ref 745.7, +0.7%), both within noise. Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 14:12:36 +00:00
Ettore Di Giacinto	30a2b590d9	Merge branch 'master' into worktree-feat+paged-attention (llama.cpp pin -> 9d5d882d) Sync to master (12 commits) + the llama.cpp pin bump 8be759e6 -> 9d5d882d. Conflicts resolved: - Makefile .NOTPARALLEL: union (keep both backends/llama-cpp-localai-paged and master's backends/privacy-filter-darwin). - gallery/index.yaml: our 2 base NVFP4 entries (qwen3.6-27b-nvfp4, qwen3.6-35b-a3b-nvfp4) for the paged backend prepended to master's full list; master keeps its own *-nvfp4-mtp variants (distinct entries). Go build + YAML validated; the 8 duplicate gallery names are pre-existing in master, not introduced here. The patchset still needs re-verification against the new tip (pin-sync, next step). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 13:16:13 +00:00
Ettore Di Giacinto	167768cac3	feat(backend): llama-cpp-localai-paged variant + NVFP4 Qwen3.6 gallery New backend = stock llama-cpp grpc-server + the paged patchset (forces LLAMA_PAGED=on), shipped as its own meta-backend (mirrors turboquant, simpler: no fork pin, no grpc-server patching - the paged runtime hooks already exist in grpc-server.cpp). Stock llama-cpp untouched (LLAMA_PAGED?=on retained; the de-risk flip deferred for sign-off). Gallery: qwen3.6-27b-nvfp4 (dense) + qwen3.6-35b-a3b-nvfp4 (MoE) with the benchmark run config (paged_kv, max_batch_tokens, parallel, flash_attention, f16), mudler/ GGUF uris (sha256 TODO until publish). Importer dropdown entry + tests. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 12:58:56 +00:00
Ettore Di Giacinto	125d10a782	feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024) Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill draws from a depleted/fragmented pool and its throughput collapses (the benchmark's "restart per npl" crutch). Decode is unaffected. The fix changes only host-side block accounting and placement, never KV values or compute, and is gated behind LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior). Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly. Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving content-cache hashes. Fix-3 release on slot completion: server_slot::release() issues prompt_clear() under the paged engine so a finished-idle slot returns its blocks promptly. Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023): - Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM (5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops unaffected (no ggml op touched). - Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores ascending popleft order. UNIT PASS. - Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998; 64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527). - Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses 488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to 532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96 (Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both arms (bit-exact serving). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 10:44:33 +00:00
Ettore Di Giacinto	b061e4aef0	docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM, B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph): - A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is bounded on this bandwidth-bound a3b model. Keep down_proj quantize retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a later shared GDN+MoE decode-graph project. - B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair vs vLLM. Dead end. Rank last. - C: paged-pool burst-degradation BUG (Part 2) is a true correctness defect (prefill collapses 507->65 t/s after a burst, restart cures it): reclamation gap on partial seq_rm + free-queue fragmentation. Plus the static decode-first budget (Part 1) explains 903s/213s burst TTFT and the chunked-interleave fix. - D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's bimodality was the shelved BF16 SSM path. Closed. First build target: the paged-pool burst-degradation bug fix (Fix-1 truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot- completion). Small, localized, default-off byte-identical, crisp repro (npl64 burst then npl8: prefill within 10% of fresh + num_free restored). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:42:55 +00:00
Ettore Di Giacinto	89e62fc74f	docs(paged): finalize f16 glue probe - cost analysis + build verdict Synthesize the GPU kernel-budget probe with the read-only glue source map. Add (4) the implementation cost - llama has no model-compute-dtype knob, the residual stream is F32 by construction (ggml_mul_mat hardcodes F32 output), so f16 glue is not a flag but an opt-in multi-file change (norm.cu f16 kernels + f16 residual stream). Add the final verdict: precision is not the dominant cause of the 8% residual (83% of the step is already f32/W4A4-matched), f16 recovers only 40-60% of the gap and is non-bit-exact, so do not build it as the default; ship the 95%-bit-exact f32 plateau and target the structural cublas/graph-launch ~3-4% instead. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:12:55 +00:00
Ettore Di Giacinto	001d833426	docs(paged): f16/bf16 glue probe - dense decode residual ceiling Empirical probe on q36-27b-nvfp4 @npl128 (build f7409c2, patch 0023): - attention KV cache default is ALREADY f16 (K/V f16) -> --cache-type f16 is a no-op; q8_0 within noise -> KV dtype is not a decode lever - nsys node-trace decode budget: f32-glue (norms/elementwise/activations/attn, excl. SSM recurrence + NVFP4 GEMM) = 28.7 ms = 8.4% of step (40.9 ms = 12% incl. the non-FP4 cublas GEMM) - f16 realistically recovers ~11-16 ms of the ~27 ms/step gap = ~40-60% of the 8.2% residual -> ~95-96% parity, not a full close; non-bit-exact opt-in only Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:11:21 +00:00
Ettore Di Giacinto	00f92659f8	docs(paged): correct vLLM recurrent-state precision (f32, not bf16) Earlier text claimed bf16 = vLLM's own precision; that was a refuted byte-gate draft re-surfacing. The settled finding (BITEXACT_VS_VLLM.md, proven 3 ways) is that vLLM keeps the gated-DeltaNet TEMPORAL state in f32 (only its conv state is bf16). So bf16 temporal is BELOW vLLM's recurrent precision, not a match; and at equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 06:22:08 +00:00
Ettore Di Giacinto	7dd3431040	docs(paged): promote TTFT/prefill + paged-pool burst-degradation bug (benchmark finding) The final benchmark exposed TTFT as the weakest number (dense npl128 903s vs vLLM 6-18s, decode-first budget throttling burst-prefill) plus a concrete paged-pool burst-degradation bug (post-burst low-npl prefill collapses 507->65 t/s; decode unaffected). Highest-value serving fix; decode + memory already strong. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 03:54:27 +00:00
Ettore Di Giacinto	ae0042f214	docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship: the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0 on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides). - final_benchmark.csv: clean 8-column plot-ready schema (model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb), 16 rows (2 models x 2 engines x npl 8/32/64/128). - QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on both engines, no stale-baseline carry-over). - decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels. Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode (ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's flat ~107 GB pool). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 03:51:35 +00:00
Ettore Di Giacinto	aaaa90ae4b	bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs vLLM 0.23.0, GB10) Publishable, plot-ready head-to-head on GB10 / DGX Spark with matched NVFP4 weights, both engines at their best realistic config (CUDA graphs ON both sides; vLLM util 0.85 max-model-len 4096 max-num-seqs 256; llama -c 131072 --parallel 128 LLAMA_KV_PAGED=1 LLAMA_MAX_BATCH_TOKENS=512). Identical async client: 512-tok unique-nonce prompt (fresh full prefill), max_tokens=256, temp 0, ignore_eos, stream+usage; npl 8/32/64/128. llama = clean patch 0023 (dev tree f7409c2, bf16 GDN-state work reverted, build-cuda rebuilt). llama runs at HIGHER precision (f32 GDN state + q8 act) than vLLM (bf16 + w4a4). decode_agg t/s, llama as % of vLLM: DENSE q36-27b-nvfp4: npl8 117% npl32 91% npl64 90% npl128 92% MoE q36-35b-a3b: npl8 83% npl32 78% npl64 77% npl128 82% memory: llama on-demand paged KV 50-90 GB (dense) / 36-58 GB (MoE) vs vLLM fixed ~107 GB pool at all npl (1.5-3x lower). TTFT: vLLM wins under synchronized burst (llama decode-first budget trades burst-prefill for decode; decode + memory unaffected). Outputs: final_benchmark.csv (16 rows, 5 metrics each), refreshed QWEN36_NVFP4_BENCH.md (FINAL section), BENCHMARK_PROGRESS.md (per-row checkpoint log). Methodology notes: per-npl llama server restart (paged-pool degrades after high-npl bursts; decode robust), vLLM npl8 re-check confirms no degradation; clean env (service containers stopped for the run, restored after). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 03:47:24 +00:00
Ettore Di Giacinto	7c45447c9e	docs(paged): FUTURE_LEVERS - parked decode-parity exploration trail Ranked pick-up points after the 95%-bit-exact plateau: hybrid-precision SSM state (per-head f32/bf16 split - the bf16 error is concentrated in long-memory heads, so a split could capture most of the +25-31% while passing the f32 KL gate), dense CUDA-graph instability, the rms_norm->fp4 fold (flat-risk), datacenter Blackwell sm_100 (no LPDDR5x floor), adaptive prefill budget, MoE-specific recurrence tuning. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 00:53:09 +00:00
Ettore Di Giacinto	24833f0966	docs(paged): bf16 SSM-state NO-SHIP - fails f32 KL gate (= vLLM's own precision) De-risk passed (test-backend-ops 52/52 bf16, f32 default byte-identical to 0023), and the throughput lever is real (recurrence -49%/call, dense ~490 t/s = 125% of vLLM clean). But bf16-vs-f32 KLD is 0.06-0.17 at >=1024 ctx (threshold 1e-3) with ~90% top-token agreement: intrinsic bf16 error over gated-DeltaNet long-memory heads, not a bug. That is exactly vLLM's own bf16 GDN precision. Shelved; ship the 95% bit-exact f32 plateau (0018-0023). bf16 work backed up on DGX (BF16_SSM_STATE.diff). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 00:49:49 +00:00
LocalAI [bot]	d388f874de	feat(backends): darwin/Metal build for the privacy-filter backend (#10513 ) * feat(backends): darwin/Metal build for the privacy-filter backend (timeboxed try) The privacy-filter.cpp engine is already Metal-capable on Apple Silicon: it pulls ggml and never forces GGML_METAL=OFF, and ggml defaults Metal ON on Apple, so a plain Darwin build is Metal-enabled. grpc++/protobuf resolve from Homebrew via find_package(... CONFIG). It just had no darwin build path - the existing package.sh and run.sh are Linux-only and there was no make target / workflow step. Adds the bespoke darwin path, modeled on the ds4 one: - scripts/build/privacy-filter-darwin.sh: native make grpc-server, otool -L dylib bundling, create-oci-image (no Linux package.sh). - Makefile: backends/privacy-filter-darwin target (+ .NOTPARALLEL). - .github/workflows/backend_build_darwin.yml: gated build step for privacy-filter. - scripts/changed-backends.js: inferBackendPathDarwin special-case -> backend/cpp. - .github/backend-matrix.yml: includeDarwin entry (lang go, like ds4/llama-cpp). - backend/index.yaml: metal: capability + metal-privacy-filter(-development) entries. - backend/cpp/privacy-filter/run.sh: DYLD_LIBRARY_PATH branch on Darwin. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] * fix(privacy-filter): macOS proto include + bundle ggml dylibs Validated natively on an M4 (the build/package/load chain now works with Metal): - CMakeLists.txt: hw_grpc_proto compiles the generated proto/grpc sources but only linked the binary dir, so on macOS it could not find protobuf's headers (runtime_version.h) - Homebrew puts them under /opt/homebrew, not /usr/include. Link protobuf::libprotobuf + gRPC::grpc++ so their include dirs propagate. No-op on Linux (apt headers are already on the default search path). - privacy-filter-darwin.sh: bundle the ggml shared libs the binary @rpath-links (libggml{,-base,-cpu,-blas,-metal}); the otool -L walk only catches on-disk absolute deps and missed them. Resolved at runtime by run.sh's DYLD_LIBRARY_PATH. M4 check: arm64 grpc-server links @rpath/libggml-metal.0.dylib; with the 15 ggml dylibs + grpc/protobuf bundled, it loads clean (no dyld errors) and prints usage. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 01:18:41 +02:00
LocalAI [bot]	86677495a2	chore: ⬆️ Update ggml-org/llama.cpp to `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1` (#10514 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-26 01:15:40 +02:00
LocalAI [bot]	74f07ecc35	fix(backends): quote $CURDIR in run.sh (fixes backends in paths with spaces) (#10519 ) fix(backends): quote $CURDIR in run.sh so backends work in paths with spaces The backend launcher scripts derive their own directory with CURDIR=$(dirname "$(realpath $0)") and then referenced it unquoted as $CURDIR (e.g. [ -f $CURDIR/lib/ld.so ], export LD_LIBRARY_PATH=$CURDIR/lib:..., exec $CURDIR/<binary> "$@"). When a backend is installed under a path that contains a space - notably macOS's ~/Library/Application Support/... - bash word-splits the unquoted $CURDIR, so the test builtin fails with "binary operator expected" and exec tries to run ".../Library/Application", yielding "No such file or directory". The backend never starts, surfacing as a gRPC "service not ready" error and an HTTP 500. Quote $CURDIR (and the realpath "$0") in every affected run.sh; no logic changes. Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 01:02:48 +02:00
Ettore Di Giacinto	634c0e5a0f	docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling at 95% of vLLM The standalone quantize fold is empirically flat (Lever-2 precedent) with the worst gain/plumbing ratio; no bit-exact lever remains. Dense 371.81 t/s @npl128 = 95.0% of vLLM 391, recurrence past vLLM at the LPDDR5x DRAM floor, all byte-identical to llama f32. Only bf16 state (shelved) goes further. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 22:42:08 +00:00
Ettore Di Giacinto	64766ecc85	Merge branch 'master' into worktree-feat+paged-attention Resolve pkg/xsysinfo/gpu.go: keep master's NVIDIAComputeCapability + parseComputeCap (the #10485 multi-GPU work); re-express our IsNVIDIABlackwell as a thin wrapper over NVIDIAComputeCapability instead of a duplicate nvidia-smi probe. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 21:56:35 +00:00
Ettore Di Giacinto	02cbae5ea9	feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023) Mirror patch 0023 + results into the paged series. Bit-exact MoE decode/prefill lever: ggml mul_mat_id re-quantizes each token's activation once per expert for the broadcast up/gate proj (ne11==1); quantize_mmq_nvfp4 has no cross-thread reduction, so the gathered blocks are byte-identical across experts. The lever quantizes the ne12 unique tokens once and gathers the block_fp4_mmq rows into the expert-gathered layout with a coalesced uint4 copy (144 B = 9 uint4); the GEMM is untouched and down_proj keeps the stock path. Measured (DGX GB10, on top of patch 0022, q36-35b-a3b-nvfp4): decode S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%, prefill T_PP -4%; dense q36-27b-nvfp4 byte-flat. nsys: quantize_mmq_nvfp4 868 -> 457 ms, gather +32 ms (net -379 ms). Bit-exact: q36-27b 5951a5b4..., q36-35b-a3b 07db32c2... (on == off == 0022); test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores stock. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 21:49:15 +00:00
LocalAI [bot]	6c03e46390	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `b84902d2ad27c34f989f23947200c4b91b1568fd` (#10515 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-25 23:42:21 +02:00
Ettore Di Giacinto	3c1ed67b4b	feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune (patch 0022) Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence kernel, carried as paged patch 0022. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak), the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA order byte-identical (md5-gateable). The bf16-state plan stays shelved. Column folding: each warp owns COLS_PER_WARP columns of the 128x128 recurrent state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a per-block column tile; grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The per-lane strided row sharding and the warp_reduce butterfly are unchanged, so only the (warp,block)->column assignment differs and the result is bit-identical; per-warp memory-level parallelism rises ~COLS_PER_WARP-fold, covering more DRAM latency on this bandwidth-bound kernel. Default tile is the measured GB10 winner (NUM_WARPS=16, COLS_PER_WARP=8), env-selectable via GDN_NW / GDN_CPW. GB10: gated_delta_net decode 4.02 -> 3.49 ms/call, 73.4% -> 84.6% of peak (above vLLM's 82.4%; 102.6% of vLLM recurrence BW). decode S_TG t/s: dense 27b npl128 335.9 -> 373.2 (+11.1%), MoE 35b-a3b npl128 688.4 -> 745.7 (+8.3%). Greedy md5 byte-identical to the 0021 baseline on both q36-27b-nvfp4 and q36-35b-a3b-nvfp4; test-backend-ops -o GATED_DELTA_NET 36/36 PASS. Bench/method in OCCUPANCY_RETUNE_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 18:34:17 +00:00
Ettore Di Giacinto	8f8777e0f4	feat(paged): qwen35 decode conv-state in-place fusion (patch 0021) The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet recurrence is already single-pass at the f32 byte floor), the decode conv path was the only remaining bit-exact lever. New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated by a non-null src[3]). On the single-token decode path it replaces the four-op conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu + ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per (channel, sequence), assembles the width-K window in registers from the K-1 cached taps plus the current qkv_mixed token, computes the depthwise conv with the SAME ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv output, and writes the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018 in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and write target (the cache view) are disjoint buffers, so it is race-free by construction with no ids/identity logic. - ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel, src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring; op_params[0]=fuse_silu) - ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel + ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv - ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels) + branch in ggml_compute_forward_ssm_conv - delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs conv-tap gather; fuses conv+silu+shifted write-back) - qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path (n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep the original chain - tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90. Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1 (0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5 ac163882... both BYTE-IDENTICAL. decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session: dense q36-27b-nvfp4 : npl 32 199.76 -> 202.99 (+1.6%) npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391) MoE q36-35b-a3b : npl 32 421.72 -> 432.39 (+2.5%) npl 128 689.74 -> 713.54 (+3.5%) Lift holds in eager too (dense npl128 333.62 -> 342.97). Step -11.9 ms/step (dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152); conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state conv-cache plumbing. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 16:56:35 +00:00
Ettore Di Giacinto	5cec1a6a21	docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correction Synthesize the cross-engine bit-exactness and f32-preserving-parity study. Resolve the contradiction between sub-agents (one f32, two bf16) by reading every link of vLLM's state-dtype chain on live source: - config.json text_config.mamba_ssm_dtype = "float32" (both served models) - cache.py default mamba_ssm_cache_dtype = "auto"; bench passes no override - vllm.py __post_init__ -> try_verify_and_update_config (config finalize) - Qwen3_5ForConditionalGenerationConfig override copies "float32" into mamba_ssm_cache_dtype before state-dtype resolution - mamba_utils._mamba_state_dtype -> temporal = torch.float32 (conv = bf16) - qwen_gdn_linear_attn allocates the temporal cache at f32 Verdicts: B1 TRUE (sub-claim 'more efficient than vLLM' refuted); B2 REFUTED (equal f32 bytes both sides, ~10pct efficiency gap not 2x width); B3 REFUTED (vLLM hits throughput with f32 state; a bit-exact occupancy/coalescing retune of gated_delta_net_cuda 74->81pct peak is the f32-preserving parity lever); B4 CONFIRMED (bit-exact-vs-vLLM impossible: A1 FP4 GEMM 8/4/16-bit operand gap + A2 recurrence g.Sigma vs Sigma.g reassociation on different reduction trees, plus general FP non-associativity). bf16 temporal state degrades BELOW vLLM's f32 recurrent precision -> an over-clock, not a parity requirement. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 16:55:25 +00:00
Ettore Di Giacinto	17855735c7	docs(paged): bf16 SSM-state build plan (PART C synthesis: edits, KL gate, bench, risks) Synthesizes the bf16 SSM recurrent-state-cache plan into a build-agent brief: ordered file-by-file edit list (kernel/op dtype-generic first, then cparams default flip, gRPC/YAML, back-compat), the KL<1e-3 + PPL-delta + coherence + long-context-drift acceptance gate that REPLACES the bit-exact md5 gate (bf16 is intentionally non-bit-exact, equal precision to vLLM), bench targets (recurrence 3.98->2-3 ms/call, step 384->289-339 ms, 360-443 tok/s dense) + nsys check, the default-bf16/f32-opt-out semantics + state-file back-compat, the risk register, and the single biggest risk (silent corruption on the prefill/keep_rs_t/gather paths) with the de-risk-first test-backend-ops step. Conv state stays f32 in v1. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 16:46:59 +00:00
Ettore Di Giacinto	2a8103c419	docs(paged): FINAL DECISION - NO-BUILD fused recurrence, BUILD conv fusion + bf16 state Synthesis of the byte-gate workflow (ncu-byte-gate measurement + vllm-fused-recurrence-study + llama-fused-recurrence-design + conv-fusion-design). Verdict closes all five decision points: (1) Byte ratio: llama re-stream ~1.0x (cap <=1.33x); recurrence at 74% GB10 peak, MORE BW-efficient than vLLM packed_decode at 41%. The 2x DRAM gap is 100% f32-vs-bf16 state-cache width, not extra passes. (2) Fused single-pass recurrence: NO-BUILD - already one R + one W of f32 state, gate ops touch tiny q/k/g/beta not the 805 MB state -> recovers ~0 bytes. (3) Conv-state in-place fusion: GO - bit-exact, no-regret, +12-14 ms/step (~+3%), eliminates concat_cont + cpy_scalar + folds silu. (4) bf16 SSM state: BUILD (KL<1e-3 gated product call) - only lever on the dominant 50% recurrence term, +45-95 ms/step -> step 289-339 ms = parity-to-ahead of vLLM. Bit-exact parity unreachable on this term (f32 bytes irreducible); bf16 = equal precision to vLLM, which is itself bf16. (5) Build order: conv fusion next (no-regret, bit-exact), then bf16 state (highest value, gated). Confirming measurements stated per step. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:27:04 +00:00
Ettore Di Giacinto	fd4332e8f0	docs(paged): GDN recurrence byte-gate SETTLED - re-stream ~1.0x, build bf16 state not fused kernel Decisive measurement (ncu-byte-gate agent, DGX GB10). ncu HW DRAM counters were blocked (ERR_NVGPUCTRPERM, root-only NVreg param; no passwordless sudo), so the byte ratio was settled via CUPTI kernel timing + exact byte geometry: bytes moved <= peak_BW x duration caps the re-stream factor. llama gated_delta_net_cuda decode (B=128, f32 state): 3.98 ms/call, 805 MB R+W, 202 GB/s = 74% of GB10 peak. vLLM fused_recurrent_packed_decode (B=128, bf16 state): 3.62 ms/call, 402 MB R+W, 111 GB/s = 41% peak. Both single-pass (load-once/store-once, verified in source). llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x needs >peak BW = impossible). VERDICT: NO-BUILD the fused single-pass recurrence - the kernel is already single-pass, coalesced, and MORE bandwidth-efficient than vLLM's triton kernel; the gate ops touch the tiny q/k/g/beta projections, not the 805 MB state, so fusion recovers ~0 state bytes. The entire 2x DRAM gap vs vLLM is f32 (llama) vs bf16 (vLLM) state-cache width. BUILD bf16 SSM state instead: halves 805->413 MB, ~45-95 ms/step, step 384 -> 289-339 ms = parity-to-ahead of vLLM 327 (non-bit-exact vs f32 but equal to vLLM's own bf16 precision). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:24:49 +00:00
Ettore Di Giacinto	5825b073a5	docs(paged): SYNTHESIS - validated decode-parity picture, ranked plan, verdict Final synthesis of the critical-path gap analysis: the decode step is 99.94% GPU-busy single-stream (idle 0.225ms = 0.06%), so the 14% gap to vLLM is kernel GPU-time dominated by the bandwidth-bound gated_delta_net recurrence (196.37ms = 51.6%), not launch bubbles. Claims A/B/C all REFUTED as worded; the single residual is the unmeasured DRAM byte ratio of llama's recurrence vs vLLM's fused kernel. Ranked plan: single-pass fused GDN recurrence (gap-closer, gate on ncu byte-ratio test) + conv-state concat fusion (no-regret +2-3%, bit-exact); gate-fold alone tops out at ~89% of vLLM; bf16 state is the only floor-mover but breaks bit-exactness. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:03:18 +00:00
Ettore Di Giacinto	a72385257a	docs(paged): decisive node-level decode timeline gap - bubbles refuted Fresh nsys --cuda-graph-trace=node capture of one steady decode step on q36-27b-nvfp4 dense at npl128 (clean Lever-1 build-cuda-base). The decode step is a single CUDA graph; node-level expansion shows it is 99.94% GPU-busy on a single stream with 0.225 ms/step inter-kernel idle (0.06%, zero gaps >5us). This refutes the "~60% idle bubbles / 57 ms = 100% bubble" hypothesis and confirms the cudagraph-coverage source verdict. Real decode mix: gated_delta_net 196 ms = 51.6% of the step (4.08 ms/call x48; the prior 1.47 ms/call "near-vLLM" was a prefill-contaminated eager average), FP4 GEMM+quantize 29%, gating glue (Lever 3 target) only 3.35%, gdn_gather 0.06 ms. By roofline-decode's own sizing test (idle < 57 ms => gap is elsewhere) the 14% gap to vLLM lives in kernel GPU-time, dominated by the bandwidth-bound GDN recurrence, not in bubbles; Lever 3 fusion is resized to ~3% and reframed as byte-reduction, not bubble removal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 14:57:37 +00:00
Ettore Di Giacinto	2b57997df0	docs(paged): cudagraph-coverage - GDN serial chain IS graph-covered at B=128 Determine whether the ggml CUDA graph covers the gated-DeltaNet serial chain at batch=128. It does: nothing in the GDN region forces graph-disable (check_compability lists only split-buffers and large-batch MUL_MAT_ID), and the recurrent head is constant for a steady 128-seq batch so the inplace_ids state_dst offset + rs_head op_param + SSM input shapes are stable across steps. The fused op does no host-sync / capture-time cudaMalloc. The only re-warm is the per-256-token full-attention block-table cadence (not a GDN op). The ~40% util is bandwidth-roofline (SSM state traffic 66% of step bytes), not launch-gap idle - so no GDN graph-safe lever; the only non-covered idle is the ~0.4% between-step host cgraph rebuild. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 14:45:51 +00:00
Ettore Di Giacinto	e597a8ac78	docs(paged): vLLM GDN decode = 2 fused kernels under CUDA graph vs llama ~8 ops Read-only source comparison of the gated-DeltaNet decode region. vLLM folds conv-silu, q/k l2norm, scale, softplus+A_log gate, sigmoid-beta, the delta-rule recurrence and the SSM state write-back into ONE Triton kernel (fused_recurrent_gated_delta_rule_packed_decode), with the output gate fused into a gated rms_norm, and captures the whole decode forward in a full CUDA graph (GDNAttentionMetadata UNIFORM_BATCH, decode-only full cudagraph). llama runs the same region as ~8 separate host-launched, serially-dependent ggml nodes. That launch/bubble delta - not GEMM throughput - is the candidate 62%-vs-40% busy gap. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 14:43:01 +00:00
LocalAI [bot]	4ac67d255d	feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) (#10497 ) * feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on x86 with a single grpc-server plus the dlopen-able libggml-cpu-.so set that ggml's backend registry selects at runtime by probing host CPU features. One build instead of four, broader microarch coverage (adds alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the shell-side /proc/cpuinfo probing in run.sh goes away. Build/link notes: - CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so ggml/llama become shared objects. SHARED_LIBS is now a make variable (default OFF) so the override survives the recursive sub-make into the VARIANT build dir instead of being re-clobbered by the base flags. - The cpu-all target also builds "--target ggml": the per-microarch backends are runtime-dlopened, not link deps, so they only compile via ggml's add_dependencies(). - hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would otherwise become a DSO referencing hidden-visibility symbols in the static libprotobuf.a, which fails to link ("hidden symbol ... is referenced by DSO"). Keeping it static links gRPC/protobuf into the executable while only ggml/llama stay shared, so no PIC or base-image change is required. - package.sh bundles the libggml-.so set into package/lib; ggml finds them by scanning the bundled ld.so directory (/proc/self/exe), which run.sh launches from. Scope: x86 only. arm64/darwin keep the single fallback build. The ik-llama-cpp / turboquant forks and the other ggml C++ backends are unchanged; the same recipe applies but is out of scope here. Validated with a full docker build plus a live inference smoke test: the model loads, ggml selects the AVX512_BF16 variant on a Zen-class host, and tokens generate correctly. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant - llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build (only hipblas keeps the fallback build). ggml's arm64 variant table (armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime. - turboquant: same recipe via a turboquant-cpu-all target. turboquant copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL flags and --target ggml through, then collects the .so set. run.sh and package.sh updated to ship/select turboquant-cpu-all. - Makefile lib-collection find now also matches .dylib (for the darwin build, which emits dylibs rather than .so). ik-llama-cpp is intentionally left unchanged: its pinned ggml has no CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the per-microarch dynamic backend set does not apply. Scope still excludes the darwin packaging wiring (separate change). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging - arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with gcc-14 (installed in the compile step). The host only selects a variant it actually supports at runtime, but every variant must still compile. - darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds ggml-metal). The per-microarch libggml-cpu-.dylib are placed in the package root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib) ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a .so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/ llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends go in the package root for ggml's executable-directory scan, .dylib core libs go in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the variants. Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model loads and generates correct tokens. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so the GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the whole CPU microarch variant matrix on top of their already-huge GPU backend - blowing the build time (the sycl job was only 59% done after 2h11m) - and the arm64 l4t build failed at `apt-get install gcc-14` (exit 100) on the Jetson base. Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in .github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build gets a single fallback CPU grpc-server, since the accelerator does the compute. This also confines the arm64 gcc-14 step (needed for the armv9.2 SME variants) to the CPU build, away from the GPU base images. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only); only GPU images ship fallback-only. Fix the stale comment to match. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:47:03 +02:00
Ettore Di Giacinto	b895f4dff8	feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020) Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling both engines pinned the largest llama-specific overage to the gated-DeltaNet output projection (ssm_out). The GDN op left its output in SSM layout and the graph reshaped it to 3D [value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the ssm_out weight read across the 128 sequences. vLLM packs the same projection into one M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D. The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs] (= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128 routes to the MMQ M=128 tensor-core GEMM. The result is then already 2D, so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical. Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs untouched. Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical. test-backend-ops MUL_MAT and MUL_MAT_ID OK. decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128): dense q36-27b: 170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%) MoE q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%) Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit). nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) at a LOWER per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call vs 2.77 ms/call for the old GEMV. Mirrors DGX dev-tree commit df1cc97. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 10:41:38 +00:00
Ettore Di Giacinto	c0e0ed3865	docs(paged): synthesize decode-parity exploration - the o_proj MMVQ lever Cross-check the adversarial validation against the profiler ground-truth and finalize DECODE_PARITY_EXPLORE.md. The post-SSM 254->391 decode gap is one llama-specific defect: the gated-DeltaNet output projection (ssm_out) runs as an FP4 GEMV (mul_mat_vec_q, 132 ms/step = 26% of decode) at batch 128 instead of a tensor-core MMQ GEMM. Mechanism confirmed at source: final_output is 3D [6144,1,n_seqs] so src1->ne[1]=1 trips the MMVQ dispatch (<=8), with the 128 sequences in ne[2]. vLLM packs the same projection into a cutlass M=128 GEMM. GDN recurrence is only +11%/call (not the lever); P2a optimized the wrong FP4 kernel (the 17% MMQ, not the 26% MMVQ); CUDA graphs, host loop, and DRAM bytes are all ruled out. Decode parity is reachable in software (not a hardware floor): identical bytes/floor, vLLM hits 62% util vs llama 40% on the same GB10. Highest-value next step (~free, bit-exact): collapse final_output to 2D before ssm_out so M=128 routes to MMQ. Ranked levers + cumulative ceilings toward 391 documented. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 09:06:50 +00:00
Ettore Di Giacinto	ee13fd18ce	docs(paged): profile-both-engines post-SSM ground-truth decode decomposition Fresh post-SSM nsys of llama (build-cuda-base, patch 0019) AND vLLM 0.23.0 at npl128 decode. Reproduces the 391 reference (vLLM 394 t/s eager / 420 graphs, graphs +6% only) and confirms llama 245 t/s. Both ~98% GPU-busy; the gap is GPU kernel-time, not idle/host/graphs. GDN compute comparable (llama 4.03 vs vLLM 3.62 ms/call, +11%). bytes/step: llama not higher (131 vs 85 MB memcpy; SSM-fix 18GB/step DtoD removal confirmed in-trace). Single biggest llama-specific overage = FP4 matmul path 236 vs 117 ms/step (+119 ms = 64% of the gap), dominated by mul_mat_vec_q (FP4 GEMV at batch 128, 132 ms/step, 26%, one per GDN layer). Track B optimized the wrong FP4 kernel (mul_mat_q, not the GEMV). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 08:56:37 +00:00

1 2 3 4 5 ...

717 Commits