LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-27 09:57:14 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	79edfd26a3	feat(gallery): -paged suffix rename + qwopus NVFP4-MTP paged variants Rename the two base NVFP4 entries to a consistent -paged suffix (qwen3.6-27b-nvfp4 -> qwen3.6-27b-nvfp4-paged, qwen3.6-35b-a3b-nvfp4 -> qwen3.6-35b-a3b-nvfp4-paged) so all four base/MTP paged entries share the naming convention. Update the two matching examples in the backend plan doc. Add qwopus3.6-27b-v2-mtp-nvfp4-paged and qwopus3.6-27b-coder-mtp-nvfp4-paged: verbatim copies of the stock qwopus NVFP4-MTP entries (same GGUF uri/sha256, sampling, template, tags, function block) rewired onto the LocalAI paged-attention stack (backend llama-cpp-localai-paged; f16, flash_attention, 131072 context, 99 gpu_layers, batch 512; paged_kv + max_batch_tokens:512 + kv_unified:false + parallel:128). The stock entries are left untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 21:26:14 +00:00
Ettore Di Giacinto	b1667b48ea	feat(paged): qwen35 recurrent-state gather fusion (patch 0028) Fuse the residual k_get_rows_float in the gated-DeltaNet decode path (the biggest single kernel vLLM lacks per MOE_GAP_VS_VLLM.md, ~5.2 ms/step MoE). 0019 fused the SSM-state gather, 0021 fused the conv compute but kept a build_rs gather for the conv taps; nsys located that conv-state tap gather (n_embd_r=24576 floats x 128 seqs, ~720 x ~115 us per 24-step window) as the last k_get_rows in the GDN path. New op ggml_ssm_conv_update_inplace_ids reads each sequence's prior conv taps from cache[ids[s]] in-kernel (identity in place from the write slot, non-identity via a disjoint scratch), mirroring the 0019 in-place + ids fusion. Bit-exact: read VALUES unchanged, only the read path changes. Helps both dense and MoE (shared GDN conv). GATE test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS, SSM_CONV_UPDATE, SSM_CONV, GATED_DELTA_NET, GET_ROWS all PASS. GATE greedy md5 (-temp 0 -seed 1 -n 48) BYTE-IDENTICAL both models: q36-27b-nvfp4 5951a5b4..., q36-35b-a3b-nvfp4 07db32c2... nsys: k_get_rows<float,float> 10174 -> 9454 instances, 186.3 -> 102.8 ms (720 conv gathers eliminated, replaced by a ~1.1 us no-op gather). Built and gated on the DGX llama tree (branch paged, commit 944636c, f32 default). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 20:59:59 +00:00
Ettore Di Giacinto	6c6a925213	docs(paged): MoE-vs-vLLM DECIDE synthesis - reject W4A16 Marlin, the GEMM is a llama win Cross-agent synthesis on top of the both-engine nsys decomposition (`3b5957157`): settle the user's "can we do what vLLM does on MoE?" question with the three converging investigations (groundtruth measurement + vllm-marlin source-read + marlin-port feasibility). Verdict: vLLM's ~15% MoE-decode lead is NOT the Marlin GEMM (that bucket is a -1.7 ms llama WIN: native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 at the ragged tiny-M decode shape, both at the LPDDR5x BW floor). The gap is bf16 dense-projection bandwidth (+6.5), recurrence state-gather plumbing (+6.6, led by k_get_rows 5.2), graph/stream-overlap overhead (~+7), W4A4 act-quant tax (+3.3), and router/glue (+5.4). A W4A16/Marlin grouped MoE GEMM is REJECTED (default and opt-in): it would regress the 27% GEMM bucket to half-rate bf16 MMA, re-enter the GB10 occupancy wall the dense scaffold already STOPPED at, and its entire intrinsic upside is the ~2% act-quant tax - smaller than the bit-exact +1.9% the 0025 re-graph already banked, and closeable bit-exactly by fusing the act-quant. Recommended build (none a new MoE GEMM): (1) fuse the k_get_rows SSM-state gather (bit-exact, ~+5, biggest single-kernel win); (2) extend CUDA-graph coverage + stream overlap (bit-exact, ~+7); (3) fuse the W4A4 act-quant into RMSNorm/SiLU (bit-exact, +3.3); (4) NVFP4-quantize the still-bf16 GDN/attn projections + lm_head (bit-changing, +6.5, the same NVFP4-dense-quant move vLLM makes). Bit-exact levers alone reach ~94% of vLLM; with the projection quant ~96-97%, parity-or-better physically in reach since both heaviest kernels (SSM core, MoE GEMM) are already llama wins. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 20:14:30 +00:00
Ettore Di Giacinto	3b59571579	docs(paged): both-engine MoE decode decomposition - the 15% is NOT the Marlin GEMM Ground-truth side-by-side per-kernel ms/step of the MoE decode gap on DGX GB10. llama (752 t/s, step 169.8ms) vs vLLM graphs-on (901-equiv, step 142.0ms): 27.8ms gap. Headline: the grouped MoE-expert GEMM is a llama WIN - native FP4-MMA W4A4 47.3ms vs vLLM Marlin W4A16 50.0ms at the tiny-M decode shape. A Marlin-style W4A16 MoE GEMM would be slower; it is not the lever (extends the w4a16-marlin DENSE verdict). The 15% lives elsewhere: bf16 projections + convert glue (+6.5ms), recurrence state-gather plumbing (+6.6ms, led by k_get_rows 5.2ms), graph coverage + stream overlap (~+7ms), W4A4 act-quant tax (+3.3ms), router/glue (+5.4ms). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 20:11:40 +00:00
Ettore Di Giacinto	b3d3323105	feat(paged): wire ssm_bf16_tau model option for hybrid SSM-state fast mode Patch 0026 added the hybrid per-head bf16 SSM-state opt-in as the ssm_hybrid_tau_thresh cparam + the --ssm-bf16-tau CLI flag (default 0 = bit-exact f32). Expose it per-model via the LocalAI gallery/model YAML `options:` list, mirroring the paged_kv / max_batch_tokens setenv hooks. - grpc-server.cpp: new `ssm_bf16_tau` (alias `ssm_hybrid_tau`) option -> setenv(LLAMA_SSM_BF16_TAU) when the value parses to a positive float. It does NOT reference the paged-only common_params field, so the turboquant fork (which lacks patch 0026) stays byte-clean. - patch 0026 (common.cpp common_context_params_to_llama): getenv fallback feeds cparams.ssm_hybrid_tau_thresh from LLAMA_SSM_BF16_TAU only when the --ssm-bf16-tau CLI flag is unset (0). Absent/non-positive env => untouched, so stock stays bit-exact; the CLI flag takes precedence when set. - docs: backend/index.yaml note, docs backends.md, gallery header NOTE (referencing A_HYBRID_SSM_RESULTS.md; the 2 NVFP4 entries stay bit-exact). Byte-safe when unset: with no ssm_bf16_tau option the env is never touched and the default f32 bit-exact recurrence is preserved. Verified the parse + consume code paths with a standalone compile-and-run (option string -> LLAMA_SSM_BF16_TAU -> tau, plus 0 / garbage / CLI-precedence / unset cases). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 19:51:00 +00:00
Ettore Di Giacinto	9c1c2a6a16	docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceiling ~85% of vLLM B-3 (the 0017-deferred mmq_y-down warp-remap of the NVFP4 grouped FP4-MMA mul_mat_q) was built bit-exact on the clean 0025 base and measured: the grouped GEMM kernel itself runs -1.3% (occupancy did rise via the nwarps=4 warp-remap / 128 threads-per-CTA), but end-to-end MoE decode is FLAT (npl128 +0.4%, npl32 +0.3%, within noise) because the stream-k fixup grows +42% (mmq_y=64 doubles the row-tiles) and the step is SSM/BW-bound. md5 PASS both models, test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806 PASS. No patch 0028; DGX dev tree reverted to pristine 0025. Assessment: the bit-exact MoE GEMM/launch track is exhausted (B-1 re-graph banked ~82->85%; B-2 and B-3 are 0). Honest bit-exact MoE ceiling = ~85% of vLLM @npl128. The residual is the structural Marlin-NvFp4 grouped-GEMM gap that no bit-exact lever closes. Recommend shipping the ~85% bit-exact default and exposing the held 0026 bf16-SSM as a default-off opt-in (it reaches ~95% but is non-bit-exact and fails the MoE KL gate). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 19:10:24 +00:00
Ettore Di Giacinto	1f857f179e	docs(paged): B-2 down_proj act-quant retune RESULT - negative (no headroom) B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode step). Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted - it does not lift. Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128 ~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this 256-tiny-expert BW-bound model. Bit-exactness proven: md5 identical at block 64/128/256 for both models (the per-thread quant body is untouched; thread->output map is invariant to blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS. MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM). No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 18:31:51 +00:00
Ettore Di Giacinto	33dfe7fd41	feat(paged): qwen35 hybrid per-head f32/bf16 SSM state - carry fix + gate sweep (patch 0026) Regenerate patch 0026 with the hybrid-decode carry fix and record the KL/throughput gate-sweep results. Fix: clear(data=true) zeroes the whole recurrent buffer including the head_slot maps, which were uploaded only once at construction; after the post-warmup reset every head read head_slot==0 (f32-local-0), collapsing the split and producing incoherent decode. Persist head_slot_host and re-upload via upload_head_slots() after every buffer clear. Hybrid decode is now coherent and the cross-op state carry is byte-exact (write==read, both partitions). Gate result: de-risk PASS (test-backend-ops 84/84; T=0 md5 == 0023 baseline, both models). Ship gate FAILS - no T_thresh meets MeanKLD<1e-3 AND same-top-p>=99.5% with a meaningful speedup. The premise that the bf16 error concentrates in long-memory heads is refuted: KL scales with the bf16 head count and saturates ~0.06/~91% (MoE saturates at the minimal split). The carry is byte-exact, so this is genuine bf16 sensitivity, not a bug. The byte-saving lever is real (dense +12.4%, MoE +11.5% decode @npl128 at T=128) but cannot meet the strict KL bar. Shipped default-off (f32, bit-exact opt-out); hybrid is opt-in only and not recommended in the gallery config. Full tables in A_HYBRID_SSM_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 17:44:05 +00:00
Ettore Di Giacinto	fe5bd3f53d	feat(paged): qwen35 hybrid per-head f32/bf16 SSM state (patch 0026) Lever A patch + build/de-risk results. Splits the persisted gated-DeltaNet recurrent state per head: f32 on long-memory heads (where bf16 rounding does not contract and the KL error concentrates), bf16 on fast-decaying heads, classified at model load by tau_h = 1/(\|ssm_a\|*softplus(ssm_dt)). Default ssm_hybrid_tau_thresh = 0.0 keeps every head f32 (bit-exact opt-out). De-risk gates BOTH PASS: test-backend-ops GATED_DELTA_NET CUDA0 OK (incl 32 hybrid mixed CUDA-vs-CPU cases); default all-f32 greedy md5 == 0023 baseline both models (dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd). Known open issue (opt-in hybrid only; default unaffected): hybrid-ON model decode (ids in-place path) is incoherent; classifier/cache/kernel-params verified correct, bug isolated to the ids in-place cross-step state path. See A_HYBRID_SSM_RESULTS.md. Not ready for the GateSweep until fixed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code]	2026-06-26 16:21:33 +00:00
Ettore Di Giacinto	6bfca146d6	docs(paged): speedup-hunt C section + final RANK + PLAN synthesis Append lever C (structural dense residual: lm_head + scheduling) findings and the master RANK + PLAN section to SPEEDUP_HUNT.md. Per-lever scorecard (gain x tractability x gate), ranked build order, the concrete A build plan for the hybrid per-head f32/bf16 SSM state cache, and the ordered B/C/D queue with each one's build trigger. Verdict: ship the MoE re-graph (patch 0025, measured +1.9-4.4%, both gates PASSED) now; build A as the lead (only lever ABOVE vLLM on dense, KL-gated, ~430-454 t/s = 103-108% of vLLM); bank B-2/B-3 on MoE; C last (<1% bit-exact, dead-end); D opt-in-only and dense-only behind the same KL gate bf16-SSM failed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 14:56:53 +00:00
Ettore Di Giacinto	4d3fecd524	docs(paged): MoE decode re-graph lever (patch 0025) + speedup-hunt B findings Mirror of llama.cpp dev-tree patch 0025 (qwen35moe NVFP4 MoE-decode re-graph) and the GPU-agent B findings in SPEEDUP_HUNT.md: re-confirmed MoE decode decomposition @npl128, the measured re-graph lever (+4.4%/+2.9%/+1.9% decode_agg at npl 32/64/128; bit-exact: test-backend-ops MUL_MAT_ID 806/806 + parallel-greedy np16 byte-identical ON==OFF), grouped-GEMM occupancy headroom (exhausted on this bandwidth-bound model), and the W4A16 assessment (rejected: non-bit-exact, slower BF16 MMA). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 14:53:14 +00:00
Ettore Di Giacinto	ec7c1b1f68	feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches) The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the paged patch-stack (0001-0024) to the new tip: the stack was rebased onto 9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated bit-exact before re-exporting the LocalAI .patch files. Re-exporting each shipped patch from its rebased commit and diffing body-to-body against the committed files identifies exactly 4 that changed and no longer git-apply to 9d5d882d: - 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to the refactored update_slots() lambda (continue->return, batch.n_tokens-> batch.size()); identical env-guarded logic. - 0013 static prefill budget: budget var-block / while-gate / admission-break re-expressed against the refactored loop (add_ok=false idiom). - 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The inserted lines are unchanged. (This one rebased cleanly via 3-way but its committed .patch no longer applies with plain git apply, so it is caught by the per-patch apply-check, not by the rebase conflict count.) - 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch = batch.size() + add_ok=false against the refactored loop. All four are byte-faithful format-patch exports of the gate-green rebased commits. Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green tree byte-for-byte across every code file. The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only (hunk bodies byte-identical, only @@ line-numbers shifted) and still apply cleanly, so they are left unchanged. The remaining patches are identical. Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121): - test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 all OK. - greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd, both == baseline. - decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s (ref 745.7, +0.7%), both within noise. Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 14:12:36 +00:00
Ettore Di Giacinto	167768cac3	feat(backend): llama-cpp-localai-paged variant + NVFP4 Qwen3.6 gallery New backend = stock llama-cpp grpc-server + the paged patchset (forces LLAMA_PAGED=on), shipped as its own meta-backend (mirrors turboquant, simpler: no fork pin, no grpc-server patching - the paged runtime hooks already exist in grpc-server.cpp). Stock llama-cpp untouched (LLAMA_PAGED?=on retained; the de-risk flip deferred for sign-off). Gallery: qwen3.6-27b-nvfp4 (dense) + qwen3.6-35b-a3b-nvfp4 (MoE) with the benchmark run config (paged_kv, max_batch_tokens, parallel, flash_attention, f16), mudler/ GGUF uris (sha256 TODO until publish). Importer dropdown entry + tests. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 12:58:56 +00:00
Ettore Di Giacinto	125d10a782	feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024) Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill draws from a depleted/fragmented pool and its throughput collapses (the benchmark's "restart per npl" crutch). Decode is unaffected. The fix changes only host-side block accounting and placement, never KV values or compute, and is gated behind LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior). Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly. Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving content-cache hashes. Fix-3 release on slot completion: server_slot::release() issues prompt_clear() under the paged engine so a finished-idle slot returns its blocks promptly. Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023): - Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM (5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops unaffected (no ggml op touched). - Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores ascending popleft order. UNIT PASS. - Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998; 64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527). - Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses 488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to 532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96 (Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both arms (bit-exact serving). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 10:44:33 +00:00
Ettore Di Giacinto	b061e4aef0	docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM, B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph): - A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is bounded on this bandwidth-bound a3b model. Keep down_proj quantize retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a later shared GDN+MoE decode-graph project. - B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair vs vLLM. Dead end. Rank last. - C: paged-pool burst-degradation BUG (Part 2) is a true correctness defect (prefill collapses 507->65 t/s after a burst, restart cures it): reclamation gap on partial seq_rm + free-queue fragmentation. Plus the static decode-first budget (Part 1) explains 903s/213s burst TTFT and the chunked-interleave fix. - D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's bimodality was the shelved BF16 SSM path. Closed. First build target: the paged-pool burst-degradation bug fix (Fix-1 truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot- completion). Small, localized, default-off byte-identical, crisp repro (npl64 burst then npl8: prefill within 10% of fresh + num_free restored). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:42:55 +00:00
Ettore Di Giacinto	89e62fc74f	docs(paged): finalize f16 glue probe - cost analysis + build verdict Synthesize the GPU kernel-budget probe with the read-only glue source map. Add (4) the implementation cost - llama has no model-compute-dtype knob, the residual stream is F32 by construction (ggml_mul_mat hardcodes F32 output), so f16 glue is not a flag but an opt-in multi-file change (norm.cu f16 kernels + f16 residual stream). Add the final verdict: precision is not the dominant cause of the 8% residual (83% of the step is already f32/W4A4-matched), f16 recovers only 40-60% of the gap and is non-bit-exact, so do not build it as the default; ship the 95%-bit-exact f32 plateau and target the structural cublas/graph-launch ~3-4% instead. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:12:55 +00:00
Ettore Di Giacinto	001d833426	docs(paged): f16/bf16 glue probe - dense decode residual ceiling Empirical probe on q36-27b-nvfp4 @npl128 (build f7409c2, patch 0023): - attention KV cache default is ALREADY f16 (K/V f16) -> --cache-type f16 is a no-op; q8_0 within noise -> KV dtype is not a decode lever - nsys node-trace decode budget: f32-glue (norms/elementwise/activations/attn, excl. SSM recurrence + NVFP4 GEMM) = 28.7 ms = 8.4% of step (40.9 ms = 12% incl. the non-FP4 cublas GEMM) - f16 realistically recovers ~11-16 ms of the ~27 ms/step gap = ~40-60% of the 8.2% residual -> ~95-96% parity, not a full close; non-bit-exact opt-in only Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:11:21 +00:00
Ettore Di Giacinto	00f92659f8	docs(paged): correct vLLM recurrent-state precision (f32, not bf16) Earlier text claimed bf16 = vLLM's own precision; that was a refuted byte-gate draft re-surfacing. The settled finding (BITEXACT_VS_VLLM.md, proven 3 ways) is that vLLM keeps the gated-DeltaNet TEMPORAL state in f32 (only its conv state is bf16). So bf16 temporal is BELOW vLLM's recurrent precision, not a match; and at equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 06:22:08 +00:00
Ettore Di Giacinto	7dd3431040	docs(paged): promote TTFT/prefill + paged-pool burst-degradation bug (benchmark finding) The final benchmark exposed TTFT as the weakest number (dense npl128 903s vs vLLM 6-18s, decode-first budget throttling burst-prefill) plus a concrete paged-pool burst-degradation bug (post-burst low-npl prefill collapses 507->65 t/s; decode unaffected). Highest-value serving fix; decode + memory already strong. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 03:54:27 +00:00
Ettore Di Giacinto	ae0042f214	docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship: the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0 on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides). - final_benchmark.csv: clean 8-column plot-ready schema (model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb), 16 rows (2 models x 2 engines x npl 8/32/64/128). - QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on both engines, no stale-baseline carry-over). - decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels. Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode (ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's flat ~107 GB pool). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 03:51:35 +00:00
Ettore Di Giacinto	aaaa90ae4b	bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs vLLM 0.23.0, GB10) Publishable, plot-ready head-to-head on GB10 / DGX Spark with matched NVFP4 weights, both engines at their best realistic config (CUDA graphs ON both sides; vLLM util 0.85 max-model-len 4096 max-num-seqs 256; llama -c 131072 --parallel 128 LLAMA_KV_PAGED=1 LLAMA_MAX_BATCH_TOKENS=512). Identical async client: 512-tok unique-nonce prompt (fresh full prefill), max_tokens=256, temp 0, ignore_eos, stream+usage; npl 8/32/64/128. llama = clean patch 0023 (dev tree f7409c2, bf16 GDN-state work reverted, build-cuda rebuilt). llama runs at HIGHER precision (f32 GDN state + q8 act) than vLLM (bf16 + w4a4). decode_agg t/s, llama as % of vLLM: DENSE q36-27b-nvfp4: npl8 117% npl32 91% npl64 90% npl128 92% MoE q36-35b-a3b: npl8 83% npl32 78% npl64 77% npl128 82% memory: llama on-demand paged KV 50-90 GB (dense) / 36-58 GB (MoE) vs vLLM fixed ~107 GB pool at all npl (1.5-3x lower). TTFT: vLLM wins under synchronized burst (llama decode-first budget trades burst-prefill for decode; decode + memory unaffected). Outputs: final_benchmark.csv (16 rows, 5 metrics each), refreshed QWEN36_NVFP4_BENCH.md (FINAL section), BENCHMARK_PROGRESS.md (per-row checkpoint log). Methodology notes: per-npl llama server restart (paged-pool degrades after high-npl bursts; decode robust), vLLM npl8 re-check confirms no degradation; clean env (service containers stopped for the run, restored after). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 03:47:24 +00:00
Ettore Di Giacinto	7c45447c9e	docs(paged): FUTURE_LEVERS - parked decode-parity exploration trail Ranked pick-up points after the 95%-bit-exact plateau: hybrid-precision SSM state (per-head f32/bf16 split - the bf16 error is concentrated in long-memory heads, so a split could capture most of the +25-31% while passing the f32 KL gate), dense CUDA-graph instability, the rms_norm->fp4 fold (flat-risk), datacenter Blackwell sm_100 (no LPDDR5x floor), adaptive prefill budget, MoE-specific recurrence tuning. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 00:53:09 +00:00
Ettore Di Giacinto	24833f0966	docs(paged): bf16 SSM-state NO-SHIP - fails f32 KL gate (= vLLM's own precision) De-risk passed (test-backend-ops 52/52 bf16, f32 default byte-identical to 0023), and the throughput lever is real (recurrence -49%/call, dense ~490 t/s = 125% of vLLM clean). But bf16-vs-f32 KLD is 0.06-0.17 at >=1024 ctx (threshold 1e-3) with ~90% top-token agreement: intrinsic bf16 error over gated-DeltaNet long-memory heads, not a bug. That is exactly vLLM's own bf16 GDN precision. Shelved; ship the 95% bit-exact f32 plateau (0018-0023). bf16 work backed up on DGX (BF16_SSM_STATE.diff). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 00:49:49 +00:00
Ettore Di Giacinto	634c0e5a0f	docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling at 95% of vLLM The standalone quantize fold is empirically flat (Lever-2 precedent) with the worst gain/plumbing ratio; no bit-exact lever remains. Dense 371.81 t/s @npl128 = 95.0% of vLLM 391, recurrence past vLLM at the LPDDR5x DRAM floor, all byte-identical to llama f32. Only bf16 state (shelved) goes further. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 22:42:08 +00:00
Ettore Di Giacinto	02cbae5ea9	feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023) Mirror patch 0023 + results into the paged series. Bit-exact MoE decode/prefill lever: ggml mul_mat_id re-quantizes each token's activation once per expert for the broadcast up/gate proj (ne11==1); quantize_mmq_nvfp4 has no cross-thread reduction, so the gathered blocks are byte-identical across experts. The lever quantizes the ne12 unique tokens once and gathers the block_fp4_mmq rows into the expert-gathered layout with a coalesced uint4 copy (144 B = 9 uint4); the GEMM is untouched and down_proj keeps the stock path. Measured (DGX GB10, on top of patch 0022, q36-35b-a3b-nvfp4): decode S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%, prefill T_PP -4%; dense q36-27b-nvfp4 byte-flat. nsys: quantize_mmq_nvfp4 868 -> 457 ms, gather +32 ms (net -379 ms). Bit-exact: q36-27b 5951a5b4..., q36-35b-a3b 07db32c2... (on == off == 0022); test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores stock. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 21:49:15 +00:00
Ettore Di Giacinto	3c1ed67b4b	feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune (patch 0022) Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence kernel, carried as paged patch 0022. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak), the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA order byte-identical (md5-gateable). The bf16-state plan stays shelved. Column folding: each warp owns COLS_PER_WARP columns of the 128x128 recurrent state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a per-block column tile; grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The per-lane strided row sharding and the warp_reduce butterfly are unchanged, so only the (warp,block)->column assignment differs and the result is bit-identical; per-warp memory-level parallelism rises ~COLS_PER_WARP-fold, covering more DRAM latency on this bandwidth-bound kernel. Default tile is the measured GB10 winner (NUM_WARPS=16, COLS_PER_WARP=8), env-selectable via GDN_NW / GDN_CPW. GB10: gated_delta_net decode 4.02 -> 3.49 ms/call, 73.4% -> 84.6% of peak (above vLLM's 82.4%; 102.6% of vLLM recurrence BW). decode S_TG t/s: dense 27b npl128 335.9 -> 373.2 (+11.1%), MoE 35b-a3b npl128 688.4 -> 745.7 (+8.3%). Greedy md5 byte-identical to the 0021 baseline on both q36-27b-nvfp4 and q36-35b-a3b-nvfp4; test-backend-ops -o GATED_DELTA_NET 36/36 PASS. Bench/method in OCCUPANCY_RETUNE_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 18:34:17 +00:00
Ettore Di Giacinto	8f8777e0f4	feat(paged): qwen35 decode conv-state in-place fusion (patch 0021) The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet recurrence is already single-pass at the f32 byte floor), the decode conv path was the only remaining bit-exact lever. New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated by a non-null src[3]). On the single-token decode path it replaces the four-op conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu + ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per (channel, sequence), assembles the width-K window in registers from the K-1 cached taps plus the current qkv_mixed token, computes the depthwise conv with the SAME ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv output, and writes the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018 in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and write target (the cache view) are disjoint buffers, so it is race-free by construction with no ids/identity logic. - ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel, src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring; op_params[0]=fuse_silu) - ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel + ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv - ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels) + branch in ggml_compute_forward_ssm_conv - delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs conv-tap gather; fuses conv+silu+shifted write-back) - qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path (n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep the original chain - tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90. Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1 (0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5 ac163882... both BYTE-IDENTICAL. decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session: dense q36-27b-nvfp4 : npl 32 199.76 -> 202.99 (+1.6%) npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391) MoE q36-35b-a3b : npl 32 421.72 -> 432.39 (+2.5%) npl 128 689.74 -> 713.54 (+3.5%) Lift holds in eager too (dense npl128 333.62 -> 342.97). Step -11.9 ms/step (dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152); conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state conv-cache plumbing. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 16:56:35 +00:00
Ettore Di Giacinto	5cec1a6a21	docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correction Synthesize the cross-engine bit-exactness and f32-preserving-parity study. Resolve the contradiction between sub-agents (one f32, two bf16) by reading every link of vLLM's state-dtype chain on live source: - config.json text_config.mamba_ssm_dtype = "float32" (both served models) - cache.py default mamba_ssm_cache_dtype = "auto"; bench passes no override - vllm.py __post_init__ -> try_verify_and_update_config (config finalize) - Qwen3_5ForConditionalGenerationConfig override copies "float32" into mamba_ssm_cache_dtype before state-dtype resolution - mamba_utils._mamba_state_dtype -> temporal = torch.float32 (conv = bf16) - qwen_gdn_linear_attn allocates the temporal cache at f32 Verdicts: B1 TRUE (sub-claim 'more efficient than vLLM' refuted); B2 REFUTED (equal f32 bytes both sides, ~10pct efficiency gap not 2x width); B3 REFUTED (vLLM hits throughput with f32 state; a bit-exact occupancy/coalescing retune of gated_delta_net_cuda 74->81pct peak is the f32-preserving parity lever); B4 CONFIRMED (bit-exact-vs-vLLM impossible: A1 FP4 GEMM 8/4/16-bit operand gap + A2 recurrence g.Sigma vs Sigma.g reassociation on different reduction trees, plus general FP non-associativity). bf16 temporal state degrades BELOW vLLM's f32 recurrent precision -> an over-clock, not a parity requirement. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 16:55:25 +00:00
Ettore Di Giacinto	17855735c7	docs(paged): bf16 SSM-state build plan (PART C synthesis: edits, KL gate, bench, risks) Synthesizes the bf16 SSM recurrent-state-cache plan into a build-agent brief: ordered file-by-file edit list (kernel/op dtype-generic first, then cparams default flip, gRPC/YAML, back-compat), the KL<1e-3 + PPL-delta + coherence + long-context-drift acceptance gate that REPLACES the bit-exact md5 gate (bf16 is intentionally non-bit-exact, equal precision to vLLM), bench targets (recurrence 3.98->2-3 ms/call, step 384->289-339 ms, 360-443 tok/s dense) + nsys check, the default-bf16/f32-opt-out semantics + state-file back-compat, the risk register, and the single biggest risk (silent corruption on the prefill/keep_rs_t/gather paths) with the de-risk-first test-backend-ops step. Conv state stays f32 in v1. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 16:46:59 +00:00
Ettore Di Giacinto	2a8103c419	docs(paged): FINAL DECISION - NO-BUILD fused recurrence, BUILD conv fusion + bf16 state Synthesis of the byte-gate workflow (ncu-byte-gate measurement + vllm-fused-recurrence-study + llama-fused-recurrence-design + conv-fusion-design). Verdict closes all five decision points: (1) Byte ratio: llama re-stream ~1.0x (cap <=1.33x); recurrence at 74% GB10 peak, MORE BW-efficient than vLLM packed_decode at 41%. The 2x DRAM gap is 100% f32-vs-bf16 state-cache width, not extra passes. (2) Fused single-pass recurrence: NO-BUILD - already one R + one W of f32 state, gate ops touch tiny q/k/g/beta not the 805 MB state -> recovers ~0 bytes. (3) Conv-state in-place fusion: GO - bit-exact, no-regret, +12-14 ms/step (~+3%), eliminates concat_cont + cpy_scalar + folds silu. (4) bf16 SSM state: BUILD (KL<1e-3 gated product call) - only lever on the dominant 50% recurrence term, +45-95 ms/step -> step 289-339 ms = parity-to-ahead of vLLM. Bit-exact parity unreachable on this term (f32 bytes irreducible); bf16 = equal precision to vLLM, which is itself bf16. (5) Build order: conv fusion next (no-regret, bit-exact), then bf16 state (highest value, gated). Confirming measurements stated per step. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:27:04 +00:00
Ettore Di Giacinto	fd4332e8f0	docs(paged): GDN recurrence byte-gate SETTLED - re-stream ~1.0x, build bf16 state not fused kernel Decisive measurement (ncu-byte-gate agent, DGX GB10). ncu HW DRAM counters were blocked (ERR_NVGPUCTRPERM, root-only NVreg param; no passwordless sudo), so the byte ratio was settled via CUPTI kernel timing + exact byte geometry: bytes moved <= peak_BW x duration caps the re-stream factor. llama gated_delta_net_cuda decode (B=128, f32 state): 3.98 ms/call, 805 MB R+W, 202 GB/s = 74% of GB10 peak. vLLM fused_recurrent_packed_decode (B=128, bf16 state): 3.62 ms/call, 402 MB R+W, 111 GB/s = 41% peak. Both single-pass (load-once/store-once, verified in source). llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x needs >peak BW = impossible). VERDICT: NO-BUILD the fused single-pass recurrence - the kernel is already single-pass, coalesced, and MORE bandwidth-efficient than vLLM's triton kernel; the gate ops touch the tiny q/k/g/beta projections, not the 805 MB state, so fusion recovers ~0 state bytes. The entire 2x DRAM gap vs vLLM is f32 (llama) vs bf16 (vLLM) state-cache width. BUILD bf16 SSM state instead: halves 805->413 MB, ~45-95 ms/step, step 384 -> 289-339 ms = parity-to-ahead of vLLM 327 (non-bit-exact vs f32 but equal to vLLM's own bf16 precision). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:24:49 +00:00
Ettore Di Giacinto	5825b073a5	docs(paged): SYNTHESIS - validated decode-parity picture, ranked plan, verdict Final synthesis of the critical-path gap analysis: the decode step is 99.94% GPU-busy single-stream (idle 0.225ms = 0.06%), so the 14% gap to vLLM is kernel GPU-time dominated by the bandwidth-bound gated_delta_net recurrence (196.37ms = 51.6%), not launch bubbles. Claims A/B/C all REFUTED as worded; the single residual is the unmeasured DRAM byte ratio of llama's recurrence vs vLLM's fused kernel. Ranked plan: single-pass fused GDN recurrence (gap-closer, gate on ncu byte-ratio test) + conv-state concat fusion (no-regret +2-3%, bit-exact); gate-fold alone tops out at ~89% of vLLM; bf16 state is the only floor-mover but breaks bit-exactness. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:03:18 +00:00
Ettore Di Giacinto	a72385257a	docs(paged): decisive node-level decode timeline gap - bubbles refuted Fresh nsys --cuda-graph-trace=node capture of one steady decode step on q36-27b-nvfp4 dense at npl128 (clean Lever-1 build-cuda-base). The decode step is a single CUDA graph; node-level expansion shows it is 99.94% GPU-busy on a single stream with 0.225 ms/step inter-kernel idle (0.06%, zero gaps >5us). This refutes the "~60% idle bubbles / 57 ms = 100% bubble" hypothesis and confirms the cudagraph-coverage source verdict. Real decode mix: gated_delta_net 196 ms = 51.6% of the step (4.08 ms/call x48; the prior 1.47 ms/call "near-vLLM" was a prefill-contaminated eager average), FP4 GEMM+quantize 29%, gating glue (Lever 3 target) only 3.35%, gdn_gather 0.06 ms. By roofline-decode's own sizing test (idle < 57 ms => gap is elsewhere) the 14% gap to vLLM lives in kernel GPU-time, dominated by the bandwidth-bound GDN recurrence, not in bubbles; Lever 3 fusion is resized to ~3% and reframed as byte-reduction, not bubble removal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 14:57:37 +00:00
Ettore Di Giacinto	2b57997df0	docs(paged): cudagraph-coverage - GDN serial chain IS graph-covered at B=128 Determine whether the ggml CUDA graph covers the gated-DeltaNet serial chain at batch=128. It does: nothing in the GDN region forces graph-disable (check_compability lists only split-buffers and large-batch MUL_MAT_ID), and the recurrent head is constant for a steady 128-seq batch so the inplace_ids state_dst offset + rs_head op_param + SSM input shapes are stable across steps. The fused op does no host-sync / capture-time cudaMalloc. The only re-warm is the per-256-token full-attention block-table cadence (not a GDN op). The ~40% util is bandwidth-roofline (SSM state traffic 66% of step bytes), not launch-gap idle - so no GDN graph-safe lever; the only non-covered idle is the ~0.4% between-step host cgraph rebuild. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 14:45:51 +00:00
Ettore Di Giacinto	e597a8ac78	docs(paged): vLLM GDN decode = 2 fused kernels under CUDA graph vs llama ~8 ops Read-only source comparison of the gated-DeltaNet decode region. vLLM folds conv-silu, q/k l2norm, scale, softplus+A_log gate, sigmoid-beta, the delta-rule recurrence and the SSM state write-back into ONE Triton kernel (fused_recurrent_gated_delta_rule_packed_decode), with the output gate fused into a gated rms_norm, and captures the whole decode forward in a full CUDA graph (GDNAttentionMetadata UNIFORM_BATCH, decode-only full cudagraph). llama runs the same region as ~8 separate host-launched, serially-dependent ggml nodes. That launch/bubble delta - not GEMM throughput - is the candidate 62%-vs-40% busy gap. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 14:43:01 +00:00
Ettore Di Giacinto	b895f4dff8	feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020) Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling both engines pinned the largest llama-specific overage to the gated-DeltaNet output projection (ssm_out). The GDN op left its output in SSM layout and the graph reshaped it to 3D [value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the ssm_out weight read across the 128 sequences. vLLM packs the same projection into one M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D. The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs] (= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128 routes to the MMQ M=128 tensor-core GEMM. The result is then already 2D, so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical. Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs untouched. Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical. test-backend-ops MUL_MAT and MUL_MAT_ID OK. decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128): dense q36-27b: 170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%) MoE q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%) Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit). nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) at a LOWER per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call vs 2.77 ms/call for the old GEMV. Mirrors DGX dev-tree commit df1cc97. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 10:41:38 +00:00
Ettore Di Giacinto	c0e0ed3865	docs(paged): synthesize decode-parity exploration - the o_proj MMVQ lever Cross-check the adversarial validation against the profiler ground-truth and finalize DECODE_PARITY_EXPLORE.md. The post-SSM 254->391 decode gap is one llama-specific defect: the gated-DeltaNet output projection (ssm_out) runs as an FP4 GEMV (mul_mat_vec_q, 132 ms/step = 26% of decode) at batch 128 instead of a tensor-core MMQ GEMM. Mechanism confirmed at source: final_output is 3D [6144,1,n_seqs] so src1->ne[1]=1 trips the MMVQ dispatch (<=8), with the 128 sequences in ne[2]. vLLM packs the same projection into a cutlass M=128 GEMM. GDN recurrence is only +11%/call (not the lever); P2a optimized the wrong FP4 kernel (the 17% MMQ, not the 26% MMVQ); CUDA graphs, host loop, and DRAM bytes are all ruled out. Decode parity is reachable in software (not a hardware floor): identical bytes/floor, vLLM hits 62% util vs llama 40% on the same GB10. Highest-value next step (~free, bit-exact): collapse final_output to 2D before ssm_out so M=128 routes to MMQ. Ranked levers + cumulative ceilings toward 391 documented. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 09:06:50 +00:00
Ettore Di Giacinto	ee13fd18ce	docs(paged): profile-both-engines post-SSM ground-truth decode decomposition Fresh post-SSM nsys of llama (build-cuda-base, patch 0019) AND vLLM 0.23.0 at npl128 decode. Reproduces the 391 reference (vLLM 394 t/s eager / 420 graphs, graphs +6% only) and confirms llama 245 t/s. Both ~98% GPU-busy; the gap is GPU kernel-time, not idle/host/graphs. GDN compute comparable (llama 4.03 vs vLLM 3.62 ms/call, +11%). bytes/step: llama not higher (131 vs 85 MB memcpy; SSM-fix 18GB/step DtoD removal confirmed in-trace). Single biggest llama-specific overage = FP4 matmul path 236 vs 117 ms/step (+119 ms = 64% of the gap), dominated by mul_mat_vec_q (FP4 GEMV at batch 128, 132 ms/step, 26%, one per GDN layer). Track B optimized the wrong FP4 kernel (mul_mat_q, not the GEMV). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 08:56:37 +00:00
Ettore Di Giacinto	6f0792c3be	feat(paged): qwen35 SSM decode fused recurrent-state gather (patch 0019) Mirror of the llama-paged-dev patch 0019 engine change plus the measured results. Step 2 of the SSM decode work: after Step 1 (in-place state write-back, patch 0018) the largest non-GEMM decode bucket was the recurrent-state get_rows gather (18.8 percent of decode GPU time). This removes that materialization, mirroring ggml_ssm_scan's ids source: ggml_gated_delta_net_inplace_ids reads each sequence's prior state directly from cache[ids[seq]] (src[5] = full cache, src[7] = ids), so combined with Step 1's in-place write the op reads AND writes the cache directly with no state materialization at all. Race-free by construction: identity sequences (ids[seq] == rs_head + seq, the whole AR decode path) read s0 in place from the destination slot; non-identity sequences (reorder / rs_zero, e.g. multi-new-seq prefill) read from a disjoint scratch a small gather kernel populates first. ids stays a device pointer. Bit-identical to the get_rows path. Gated to qwen35 + qwen35moe; qwen3next, kimi-linear, the non-fused and rollback paths are unchanged. Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off): q36-27b-nvfp4 dense: npl32 137.64 -> 170.68 (+24.0 percent), npl128 186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391). q36-35b-a3b-nvfp4 MoE: npl32 299.68 -> 366.69 (+22.4 percent), npl128 409.30 -> 553.63 (+35.3 percent). Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build (dense + MoE). nsys k_get_rows_float bucket 18.8 -> 0.7 percent. The residual decode gap to vLLM is now the FP4 GEMM (~48 percent of decode). See SSM_DECODE_FIX_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 23:47:51 +00:00
Ettore Di Giacinto	5ce2f1df51	feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (patch 0018) Mirror of the llama-paged-dev patch 0018 engine change plus the measured results. Per SSM layer per step decode no longer D2D-copies the full ~225 MB recurrent state into the cache: the fused gated_delta_net op writes the final state in place at the active sequences cache slot (new ggml_gated_delta_net_inplace, src[6] = state_dst), mirroring vLLM fused_recurrent_gated_delta_rule. SSM math unchanged (bit-identical greedy). Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on): q36-27b-nvfp4 dense: npl32 113.74 -> 136.39 (+19.9 percent), npl128 146.23 -> 180.53 (+23.5 percent, = predicted copy-removal ceiling). q36-35b-a3b-nvfp4 MoE: npl128 313.36 -> 372.62 (+18.9 percent). nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share (391 @128) 37.4 -> 46.2 percent. See SSM_DECODE_FIX_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 22:45:49 +00:00
Ettore Di Giacinto	34cadb64af	docs(paged): A.2 final synthesis - CUDA-graph decode verdict Append the four-point synthesis to A2_CUDAGRAPH_DECODE.md: measured CUDA-graph lever size (<1%, not the guessed 10-20%), the corrected 'eager' premise (default paged decode already captures), the unchanged 37-38% of vLLM at npl128, and the honest verdict that A.2 closes none of the 2.6x gap because paged attention touches ~0.4% of decode on this hybrid-SSM model. Residual lever is the qwen35 gated-DeltaNet SSM path (state D2D copy + get_rows gather), orthogonal to paged attention. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 21:45:42 +00:00
Ettore Di Giacinto	2dd5d68e6d	docs(paged): A.2 Phase 2 - locate the real decode lever (gated-DeltaNet SSM path) Phase 1 ruled out CUDA graphs as the paged-decode lever (GPU 99.4% busy, decode_agg flat graphs on-vs-off) and attributed the 2.6x gap to vLLM to the per-step GPU kernel work (FP4 GEMM + attention at batch 128). Phase 2 decomposed that kernel work directly on the Phase-1 nsys reps and corrects the attribution. Findings (q36-27b-nvfp4 = gguf arch qwen35, a 48:16 hybrid gated-DeltaNet linear-attention + full-attention model; DGX GB10 sm_121, fusion off): - Graphs re-confirmed not the lever: fresh paged graphs-ON 146.03 vs OFF 144.90 t/s (+0.78%, noise); the captured rep is 99.5% busy with the same ~3267ms memcpy (graphs capture memcpy nodes too). - The 99.4% busy is real but ~19% of it is D2D memcpy, not compute: an overlap-correct interval-union sweep gives kernels-only 80.2% busy, the gap filled by 1584 D2D copies/run (~80/step, ~230MB each = the gated-DeltaNet recurrent state). Phase 1's cuda_gpu_trace lumped this into compute. - Decode GPU-time decomposition (% of kernel+memcpy busy): gated_delta_net 23.4%, get_rows 21.9%, D2D state copy 18.9%, FP4 GEMV 15.5%, FP4 GEMM 10.4%, full attention 0.4%. Grouped: SSM/gated-DeltaNet machinery ~67%, FP4 matmul ~28%, full attention (all paged-attn optimizes) ~0.4%. Verdict: not graphs, not the host loop, not primarily FP4 GEMM, not attention. Paged attention touches ~0.4% of decode on this model, so no paged/graph/ block-table change can move decode_agg. The lever is the ggml qwen35 gated-DeltaNet decode: kill the per-layer recurrent-state D2D copy and fuse the get_rows gather into the recurrence (vLLM's fused_recurrent_gated_delta_rule keeps state in place). Ceiling: -copy ~146->180; -copy-and-gather ~146->247 t/s. No code patch (the lever is an SSM-path rewrite, orthogonal to paged attention); patches/paged/0018 stays free. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 21:44:22 +00:00
Ettore Di Giacinto	da67fd87e2	docs(paged): A.2 CUDA-graph decode lever measurement and gap diagnosis Phase 1 measures the CUDA-graph lever on the paged decode (q36-27b-nvfp4 dense, GB10 sm_121, fusion off). The 4-cell decode_agg {stock,paged} x {graphs on,off} is flat within ~1%: the graphs-on win is +0.13% at npl128 and +1.1% at npl32 (both within run noise). The default paged decode is not eager: it captures and replays graphs with a 256-token reset cadence identical to stock non-paged (block-table ne0 = GGML_PAD(n_gather,256) only steps at 256-token boundaries); only the gather fallback grows n_gather every step and runs pure eager. 'graphs reused=0' was a uid fast-path false negative (llama rebuilds the cgraph each step, so the reuse log never fires while the graph still replays via the instance path). nsys (reliable eager trace, plus the captured trace re-run with --cuda-graph-trace=node to defeat nsys omitting graph-internal kernels, an artifact that otherwise reads 0.3% busy) shows the steady decode is 99.4-99.5% GPU-busy. Idle is ~0.6% of the step: 0.37% within-step launch gaps (the only thing graphs remove, cut to 0.11% when captured) plus a 0.24% between-step host gap (~2ms per step). Throughput is identical on/off. Verdict: CUDA-graphing the paged decode is not a throughput lever; the decode is GPU-compute-bound and the 2.6x gap to vLLM (148 vs 391) is in the per-step GPU kernel work (FP4 GEMM + attention at batch 128), not launch overhead or the host loop. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 21:26:16 +00:00
Ettore Di Giacinto	40f019e761	docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate results (patch 0017) Mirror of llama.cpp dev-tree commit 089f78d. Track B P0 (bit-exact NVFP4 dense decode-shape MUL_MAT parity gate) + P1 (default-off occupancy levers) for the GB10 dense FP4 weight GEMM. P1 kill-gate TRIPPED: the cheap host/occupancy levers do not lift decode_agg on GB10 (sm_121). DENSE q36-27b-nvfp4 @npl128 149.5 -> minblocks2 147.9 (-1.1%) -> dense mmq_x=64 144.3 (-3.5%); MoE q36-35b-a3b mmq_x-down regresses (TILE16 -3.7%, TILE8 -5.9%, reproduces patch 0015). nsys: the FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s->3.025s (+8.7% slower) under register-capping (spilling). The dense M=128 tile is already weight-read/one-read-optimal; the only untested lever is the structural mmq_y-down (nwarps=4 warp-remap, blocked by nwarps*tile_C::I==mmq_y), deferred to P2. All levers default-off => default build byte-identical to stock. See THROUGHPUT_B_P1_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 17:58:00 +00:00
Ettore Di Giacinto	39e16cc2c4	docs(paged): adversarial review of track-B FP4-GEMM parity go/no-go Append section 9 (skeptical staff-CUDA-engineer review) to FP4_GEMM_SCOPE_B.md, stress-testing the dense/MoE parity verdict against the committed grounding. Key findings: - Not the W4A16 wall: the npl-sweep (dense 99/56/46/41% of vLLM at npl 8/32/64/128) shows llama's FP4-MMA kernel HITS the weight-read floor at M=8 and FALLS OFF it as M grows, while vLLM HOLDS it. Working-path tune, dual existence proof (M=8 + vLLM M=128), not a greenfield build. Same binding constraint as W4A16 though (hide LPDDR5x latency at the larger tile on an occupancy-dominated part). - The dense gap is ~82-87% GEMM, ~13-18% non-GEMM (467 ms total = 383-405 GEMM + 62-84 non-GEMM). B alone caps ~80%; track A is what tips dense over the parity line. - Sharpest omission: vLLM's M=128 floor is reached via cutlass TMA + deep pipeline - the technique the doc forbids on GB10. TMA != manual cp.async (lower occupancy cost); it must be an in-scope P2 fallback, not categorically banned. - Honest landing: dense ~80-90% (parity the optimistic tail, contingent on B+A+floor), MoE ~55-65% (parity not reachable from B). Low-regret: even a tripped P2 kill-gate lands B+A ~89%, doubling today's 41%. - Sequencing fix: land A first (defines B's interface + baseline + kill-gate), then run B's P2 against the post-A number. Verdict: DENSE conditional GO (scope as GEMM-gap-closing, not true parity; A-first, gate at P2, add TMA); MoE NO-GO for parity from B (do the cheap mmq_x-down win as a 1.7-1.85x, not parity). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 14:31:35 +00:00
Ettore Di Giacinto	7434d64c75	docs(paged): build-ready track-B FP4-GEMM scope - kernel decision + per-phase decode_agg Rewrite the track-B scope into the definitive build-ready plan for the NVFP4 FP4-MMA decode GEMM toward vLLM GB10 parity. Source-read of the mmq.cuh/mma.cuh/quantize.cu FP4 path on the dgx paged dev tree settles two load-bearing facts the prior draft got partly wrong: - llama's dense path is already TRUE W4A4 (block_fp4_mmq packs 256 e2m1 values + ue4m3 scales; the MMA is kind::mxf4nvf4 e2m1.e2m1...ue4m3), so there is no activation-bit-width work to do; the whole dense deficit is scheduling/occupancy. - the mmq_x selector minimizes ntiles_x, which PINS dense decode at mmq_x=128 (weights read once). Shrinking mmq_x re-reads the 18 GB weights, so the dense occupancy lever is mmq_y-down (BW-neutral), NOT mmq_x-down; MoE's free lever is the per-expert mmq_x-down (patch 0015). Adds the explicit kernel-approach decision (tune the existing FP4-MMA mul_mat_q; reject the cutlass-SM120 rewrite, dead on GB10 and broken on sm_121; reject the BF16-Marlin descent), the concrete build-ready changes (mmq_y/granularity/stream-k knobs, FP4-MMA fragment invariants, the ue4m3 scale path, and the block_fp4_mmq y-tile ABI contract for the track-A act-quant fusion handoff), the GB10-fit rules, the bit-exact test-backend-ops gate with decode-shape + ragged-M cases, and per-phase expected decode_agg tables. Verdict (honest, roofline-grounded): the decode GEMM is bandwidth-bound on the hardware roofline (M=128 << crossover 611; weight-read floors 4-6x above vLLM) but compute-bound in practice at ~3% FP4 eff, so 273 GB/s is not the wall. DENSE: GO (conditional) - B+A reaches 376-394 tok/s = 90-103% of vLLM 391, gated by a P2 occupancy kill-gate (<15% FP4 eff -> parity off). MoE: PARTIAL/NO-GO - ceiling ~76% of 811 (618) from the GEMM alone; full MoE parity needs the non-GEMM tracks too. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 14:21:48 +00:00
Ettore Di Giacinto	c1d7f336cb	docs(paged): enrich track-B scope with code-level FP4-GEMM inefficiencies Add the source-read kernel-mechanism map (no cp.async weight pipeline, mmq_x tile-maximizing selector vs GB10 occupancy, MoE per-expert M-tile waste, iter_k=512 coupling, ruled-out non-levers) and strip the stray trailing tags from the prior write. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 14:11:41 +00:00
Ettore Di Giacinto	ea634ee958	docs(paged): scope track B - FP4-MMA decode-GEMM roofline + parity go/no-go Roofline at the decode batch shape (M=128, NVFP4 weights) on GB10 (sm_121): the dense weight-read floor (~1,940 tok/s) and MoE floor (~1,590 tok/s) sit 4-6x above vLLM's 391/811, so 273 GB/s is NOT the wall. At FP4 peak the GEMM is bandwidth-bound (crossover M*~611 >> 128); at the kernel's ~3% achieved FP4 efficiency it is compute-bound by its own inefficiency (471 ms vs a 66 ms floor). Verdict: dense decode parity is plausibly reachable via a tuned FP4-MMA decode M-tile (track B) + fused act-quant (track A), landing 376-394 tok/s = 90-103% of vLLM 391, but only at the top of the demonstrated GB10 FP4 envelope (~17-21%) and with no margin (occupancy wall is the binding constraint, not bandwidth). MoE parity is NOT reachable from the GEMM alone (ceiling ~60-76% of 811): its floor is the hardest grouped-GEMM regime and ~24% of its step is non-GEMM work outside track B. GO (conditional) for dense, PARTIAL for MoE. Build-ready phased plan included; tune the existing block_fp4_mmq path, not a W4A16 rewrite. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 14:09:41 +00:00
Ettore Di Giacinto	e4c63179e0	docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.4x lever Closes lever 5 of VLLM_DECODE_GROUNDING.md. GGUF metadata + source reading on the paged dev tree plus nsys decode traces on Qwen3.6-27B NVFP4 (GB10 sm_121) confirm the Gated-Delta-Net linear-attention layers decode as a fused single CUDA kernel (gated_delta_net.cu) updating a fixed-size cached recurrent state: no context-length parameter, no KV re-scan. Matched-batch context-scaling control (npl4, pure decode) shows the GDN kernel flat (10.3 -> 8.0 us/launch) across 4x context while full-attention grows 3.1x (27 -> 85 us). GDN is a small, context-flat share (~0.4-10%% by batch); the FP4 weight GEMM dominates (~67%). Verdict: GDN decode is efficient, not the cheap model-specific fix; the 2.4x is the general GEMM + full-attention kernel work, as the grounding concluded. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 11:21:44 +00:00
Ettore Di Giacinto	f7500df64e	docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget The prior all-at-once BURST H2H is adversarial to any prefill budget (TTFT is prefill-rate-bound, a cap only slows the drain) and showed 0016 ~= 0013. Run a STAGGERED-arrival benchmark on the GB10 DGX (patch 0016 built @253cbae): a steady-rate client that keeps a mix of in-flight decoders + newly-arriving prefills, capturing per-request TTFT and the full inter-token-latency series. Append the metrics (in-flight decode protection + new-request TTFT, per arm) and an honest verdict to P1_DYNAMIC_BUDGET_RESULTS.md. On staggered traffic stock's in-flight decoders freeze multi-second on every prefill admission while both budget arms keep ITL flat; 0016 (mbt512) sits at a strictly better point on the protection/TTFT frontier than 0013-256 (equal spike-free protection, materially lower TTFT/throughput/wall) and adds a decode-adaptive single-T knob. It does not strictly dominate stock (Pareto tradeoff: smoothness vs raw TTFT). Verdict: 0016 earns its keep over 0013 on staggered traffic; recommend LLAMA_MAX_BATCH_TOKENS=512. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 10:56:13 +00:00

1 2

77 Commits