The llama-cpp-localai-paged backend sets NO explicit CUDA arch list anywhere
(CUDA_DOCKER_ARCH empty in every matrix row; compile.sh only injects
-DCMAKE_CUDA_ARCHITECTURES when non-empty), so it compiles the full upstream
ggml default arch fan - bit-identical targeting to stock llama-cpp, NOT
Blackwell-only. NVFP4 FP4-MMA is gated inside the kernel by
BLACKWELL_MMA_AVAILABLE, not by the build matrix, so the binary is arch-portable.
Variants: CUDA 12/13 + l4t arm64, ROCm, SYCL f32/f16, Vulkan amd64/arm64, CPU
amd64/arm64 (CPU_ALL_VARIANTS) - same Linux set as stock llama-cpp, not CUDA-only.
Single gap vs stock: NO Metal/Darwin row in includeDarwin and NO metal:
capability key in the meta-backend. macOS hosts fall back to the default cpu
(Linux) image, which will not run; they do not automatically fall back to
stock llama-cpp.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror of paged-dev commit e2acb3b (lever 5). get_block_table() is recomputed
once per full-attention layer per decode step, but the KV cell layout is fixed
for the whole step (it only changes in apply()). This caches the table the first
time it is built in a step and memcpy-reuses the identical bytes for the rest,
invalidating in apply(). Bit-exact; toggle off with LLAMA_PAGED_NO_BT_CACHE=1.
Host-side get_block_table time (llama-batched-bench, npp128 ntg128 npl128,
cache OFF -> ON): MoE 112.94 -> 14.82 ms (-87%), dense 193.78 -> 16.90 ms (-91%).
Dense decode is partly host-bound and gains (TG 364.8 -> 374.7 t/s, ~96% of the
vLLM 391 t/s @npl128 reference); MoE decode is compute-bound (FP4 GEMM) so the
saved host time is off the critical path and MoE TG is flat. Details in
LEVER5_HOSTPIPE_RESULTS.md.
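A minimal sketch of the caching pattern described above, assuming a
hypothetical BlockTableCache wrapper and build function (neither is the
actual llama.cpp code): the table is built once per step, later layers get a
copy of the identical bytes, and apply() invalidates it.

```python
class BlockTableCache:
    """Hypothetical stand-in for the per-step block-table cache."""

    def __init__(self, build_fn):
        self._build_fn = build_fn   # expensive host-side builder (assumed)
        self._cached = None         # bytes of the table built this step
        self.builds = 0             # how many times the builder actually ran

    def get_block_table(self):
        # The first full-attention layer of the step builds the table;
        # later layers reuse a copy of the identical bytes.
        if self._cached is None:
            self._cached = bytes(self._build_fn())
            self.builds += 1
        return bytearray(self._cached)  # copy, standing in for the memcpy

    def apply(self):
        # The KV cell layout may change here, so drop the cached table.
        self._cached = None


layout = [3, 7, 1, 4]
cache = BlockTableCache(lambda: layout)

tables = [cache.get_block_table() for _ in range(16)]  # 16 layers, one step
cache.apply()
layout[0] = 9                       # layout only changes across apply()
after = cache.get_block_table()     # rebuilt once for the next step
```

The builder runs twice in total here (once per step) instead of once per
layer, which is the host-side saving the measurements above refer to.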
Also records the per-path bit-exactness gate (PAGED_BITEXACT_NOTE.md): the
paged-MoE greedy md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a
benign FP-accumulation-order difference of the paged attention reduction, not a
bug. KL-validated vs the f16 reference (16 chunks, c512): KLD(paged||f16) =
0.13600 <= KLD(nonpaged||f16) = 0.13660, PPL(paged) = 7.4009 ~ PPL(nonpaged) =
7.3896 (within +/- 0.29). Canonical references are now per path: non-paged MoE
07db32c2 and paged MoE 8cb0ce23; dense is bit-exact across paths (5951a5b4).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Conclude the MoE-parity hunt. The two remaining sub-levers in the
20.3-vs-13.8 ms projection bucket are both bit-changing or at the BW floor:
- convert-glue (3.24 ms/step, measured: 1.73 input f32->bf16 + 1.52 output
bf16->f32): NOT bit-exact eliminable. ggml-cuda.cu:1663-1690 rounds the f32
GEMM accumulator to bf16 (CUDA_R_16BF dst) then widens to f32; that
bf16-rounded value is load-bearing for the shipped md5. Every way of removing
the round-trip (f32-direct output, bf16 residual stream, or NVFP4 weights)
rebaselines the md5. A precision boundary, like lever 4.
- bf16 projection GEMM (17.27 ms/step): BW-bound at the LPDDR5x floor
(~4.7 GB/step at 273 GB/s; M=128 -> 128 FLOP/byte vs >900 ridge). nvjet
already TMA-streams the weights; cutlass reads the same bytes. No kernel
lever; only fewer bytes (quantize) helps - rejected on quality.
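To illustrate why the f32 -> bf16 -> f32 round-trip in the first bullet is
bit-changing, here is a small sketch of bf16 rounding done on the f32 bit
pattern. It assumes round-to-nearest-even and ignores NaN/overflow handling;
the helper names are illustrative and not taken from ggml.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x after conversion to IEEE-754 binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Inverse of f32_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def round_to_bf16(x: float) -> float:
    """Round an f32 value to bf16 (nearest-even), then widen back to f32."""
    bits = f32_bits(x)
    # Rounding bias: 0x7FFF plus the lowest bit that survives truncation.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # Keep the top 16 bits (sign, exponent, 7 mantissa bits).
    return bits_f32(bits & 0xFFFF0000 & 0xFFFFFFFF)


acc = 1.0 + 2.0 ** -10           # representable in f32, not in bf16
round_tripped = round_to_bf16(acc)
```

Here round_tripped is 1.0 while acc is 1.0009765625, so a path that skips
the rounding would feed different values downstream and change the md5.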
Corrects the body premise that vLLM runs these projections as NVFP4-Marlin:
vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the
projection bucket is a matched-precision gap, not a quant gap.
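The bandwidth-floor figures in the second bullet above can be re-derived
with a few lines of arithmetic. This is only a sanity check under the stated
numbers; it assumes the byte count is dominated by bf16 weights (2 bytes
each) and that every weight contributes one multiply-add per row.

```python
bytes_per_step = 4.7e9        # ~4.7 GB of weights streamed per step
bandwidth = 273e9             # LPDDR5x bandwidth in bytes/s
floor_ms = bytes_per_step / bandwidth * 1e3      # about 17.2 ms

M = 128                       # rows per decode step
bytes_per_weight = 2          # bf16
flops_per_weight = 2 * M      # one multiply-add per row per weight
intensity = flops_per_weight / bytes_per_weight  # 128 FLOP/byte

ridge = 900                   # lower bound on the ridge point (">900")
bandwidth_bound = intensity < ridge
```

The computed floor lands on the measured 17.27 ms/step, which is why no
kernel change can help while the bytes stay the same.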
Realistic bit-exact MoE ceiling ~86-88% of vLLM; shipped lever 1 (86.3%) is
at it. No one-more-lever for MoE. Only clean win left is DENSE (+0.41% lever 5),
gated behind resolving the paged-MoE baseline md5 drift.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Re-quantizing the MoE GGUF's bf16 GDN/attn projections to NVFP4 (the lever-4
scope hypothesis) fails the KL gate on every axis vs the shipping NVFP4 baseline:
PPL +6.51% (FULL) / +6.15% (CONS) against a <1% gate, mean KLD-to-f16 0.164/0.172
vs baseline 0.137, top-1 argmax agreement down ~2.2-2.6 points. Both projq
variants rejected; keeping in_proj_ba in bf16 (CONS) recovered almost nothing, so
the damage is in the bulk attn/GDN projections.
Root cause: the bf16 projections are a deliberate modelopt precision choice, not a
provenance accident. vLLM runs the same modelopt checkpoint, so it keeps these
projections bf16 too - the baseline GGUF already matches vLLM. The ~20.3ms
projection-GEMM bucket is the price of high-precision projections that vLLM also
pays; it is not the llama-vs-vLLM lever it appeared to be. The speed win is only
purchasable with a 6% PPL regression. MoE stays at 86.3% of vLLM @ npl128.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Scopes lever 4 (read-only, no GPU) on top of the flat levers 2+3. Root cause: the
MoE GGUF (nvidia modelopt, 241 NVFP4 tensors) quantized only the experts and left the
GDN/attn linear projections in BF16, while the dense GGUF (unsloth, 304 NVFP4 tensors)
already has them NVFP4 (proven: dense ssm_out runs FP4 MMQ; dense decode at 96.6% of
vLLM). Lever 4 = re-quantize the MoE GGUF's bf16 GDN/attn projections to NVFP4, the same
move vLLM makes on the identical weights - the +6.5ms projections bucket, the largest
single banked MoE gain available.
Path: offline re-quantize to a new GGUF variant (expanded --tensor-type); zero kernel
code - the loader sidecar-scale path + tuned mul_mat_q<NVFP4> are already in tree and
proven by the dense GGUF. Bit-changing => KL-gate, not md5. KL expected to pass (per-step
non-accumulating weight quant, unlike the failed bf16-state; experts already W4A4-clean);
lm_head is the one risky tensor (gate on argmax-agreement). Expected ~+4-6.5ms => MoE
86.3% -> ~88-91% of vLLM. Recommend a separate OPT-IN gallery variant (preserve the
bit-exact default; promote to default only if the KL gate is clean).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The dense + MoE base NVFP4 GGUFs are live (huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF
and .../Qwen3.6-35B-A3B-NVFP4-GGUF), sha256 verified vs the Hub LFS hash, uris resolve.
Replaces the placeholder/not-yet-published TODO.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Rename the two base NVFP4 entries to a consistent -paged suffix
(qwen3.6-27b-nvfp4 -> qwen3.6-27b-nvfp4-paged, qwen3.6-35b-a3b-nvfp4 ->
qwen3.6-35b-a3b-nvfp4-paged) so all four base/MTP paged entries share the
naming convention. Update the two matching examples in the backend plan doc.
Add qwopus3.6-27b-v2-mtp-nvfp4-paged and qwopus3.6-27b-coder-mtp-nvfp4-paged:
verbatim copies of the stock qwopus NVFP4-MTP entries (same GGUF uri/sha256,
sampling, template, tags, function block) rewired onto the LocalAI
paged-attention stack (backend llama-cpp-localai-paged; f16, flash_attention,
131072 context, 99 gpu_layers, batch 512; paged_kv + max_batch_tokens:512 +
kv_unified:false + parallel:128). The stock entries are left untouched.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add qwen3.6-27b-nvfp4-mtp-paged and qwen3.6-35b-a3b-nvfp4-mtp-paged: the
existing michaelw9999 NVFP4-MTP GGUFs (same uri/sha256/filename and the
recommended Qwen3.6 sampling defaults) wired to backend
llama-cpp-localai-paged with our optimized paged options (f16, flash
attention, 128k context, gpu_layers 99, batch 512, paged_kv, decode-first
max_batch_tokens, kv_unified:false, parallel:128).
These coexist with the stock llama-cpp *-nvfp4-mtp entries (distinct
-paged names) so the four LocalAI-paged NVFP4 entries sit together at the
top of the gallery.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
fix(nodes): return labels in single-node GET so the detail view shows them
The node detail view (/app/nodes/:id) reads `node.labels` to render a
node's existing labels, but the single-node GET endpoint returned a bare
BackendNode whose Labels live in a separate table - so the list was always
empty and operators could only add labels, never see what was already set
(#10527). The same response also lacked in_flight_count and model_count.
Add NodeRegistry.GetWithExtras, mirroring the existing List vs ListWithExtras
split: bare Get stays cheap for the routing hot paths and existence checks,
while the detail endpoint uses the enriched variant to attach the labels map
and live counts. No frontend change is needed - the UI already renders
existing labels once the data is present.
Closes #10527
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Fuse the residual k_get_rows_float in the gated-DeltaNet decode path (the biggest
single kernel vLLM lacks per MOE_GAP_VS_VLLM.md, ~5.2 ms/step MoE). 0019 fused the
SSM-state gather, 0021 fused the conv compute but kept a build_rs gather for the
conv taps; nsys located that conv-state tap gather (n_embd_r=24576 floats x 128
seqs, ~720 x ~115 us per 24-step window) as the last k_get_rows in the GDN path.
New op ggml_ssm_conv_update_inplace_ids reads each sequence's prior conv taps from
cache[ids[s]] in-kernel (identity in place from the write slot, non-identity via a
disjoint scratch), mirroring the 0019 in-place + ids fusion. Bit-exact: read VALUES
unchanged, only the read path changes. Helps both dense and MoE (shared GDN conv).
GATE test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS, SSM_CONV_UPDATE, SSM_CONV,
GATED_DELTA_NET, GET_ROWS all PASS. GATE greedy md5 (-temp 0 -seed 1 -n 48)
BYTE-IDENTICAL both models: q36-27b-nvfp4 5951a5b4..., q36-35b-a3b-nvfp4 07db32c2...
nsys: k_get_rows<float,float> 10174 -> 9454 instances, 186.3 -> 102.8 ms (720 conv
gathers eliminated, replaced by a ~1.1 us no-op gather).
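The nsys numbers above are mutually consistent, as this short check
suggests. It uses only the figures quoted in this message; the ~115 us
per-gather cost is approximate, so the match is expected to be close rather
than exact.

```python
instances_before, instances_after = 10174, 9454
total_ms_before, total_ms_after = 186.3, 102.8

gathers_removed = instances_before - instances_after        # 720
measured_saving_ms = total_ms_before - total_ms_after       # about 83.5 ms

per_gather_us = 115.0         # "~115 us" per conv-tap gather
predicted_saving_ms = gathers_removed * per_gather_us / 1e3  # about 82.8 ms
```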
Built and gated on the DGX llama tree (branch paged, commit 944636c, f32 default).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
fix(backends): darwin packaging for silero-vad
silero-vad was the last Go backend with Linux-only darwin packaging:
- package.sh fell through to "Could not detect architecture" -> exit 1 on
macOS (no Darwin branch), so its darwin image never packaged.
- run.sh exported LD_LIBRARY_PATH, which macOS dyld ignores, so the bundled
libonnxruntime.dylib couldn't be found at runtime.
Add a Darwin branch to package.sh (skip the glibc/ld.so bundling; add an
@loader_path/lib rpath so @rpath resolves to package/lib/) and a
DYLD_LIBRARY_PATH branch to run.sh — mirroring the piper darwin fix (#10525).
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Cross-agent synthesis on top of the both-engine nsys decomposition (3b5957157):
settle the user's "can we do what vLLM does on MoE?" question with the three
converging investigations (groundtruth measurement + vllm-marlin source-read +
marlin-port feasibility).
Verdict: vLLM's ~15% MoE-decode lead is NOT the Marlin GEMM (that bucket is a
-1.7 ms llama WIN: native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 at the ragged
tiny-M decode shape, both at the LPDDR5x BW floor). The gap is bf16
dense-projection bandwidth (+6.5), recurrence state-gather plumbing (+6.6, led
by k_get_rows 5.2), graph/stream-overlap overhead (~+7), W4A4 act-quant tax
(+3.3), and router/glue (+5.4).
A W4A16/Marlin grouped MoE GEMM is REJECTED (default and opt-in): it would
regress the 27% GEMM bucket to half-rate bf16 MMA, re-enter the GB10 occupancy
wall the dense scaffold already STOPPED at, and its entire intrinsic upside is
the ~2% act-quant tax - smaller than the bit-exact +1.9% the 0025 re-graph
already banked, and closeable bit-exactly by fusing the act-quant.
Recommended build (none a new MoE GEMM): (1) fuse the k_get_rows SSM-state
gather (bit-exact, ~+5, biggest single-kernel win); (2) extend CUDA-graph
coverage + stream overlap (bit-exact, ~+7); (3) fuse the W4A4 act-quant into
RMSNorm/SiLU (bit-exact, +3.3); (4) NVFP4-quantize the still-bf16 GDN/attn
projections + lm_head (bit-changing, +6.5, the same NVFP4-dense-quant move vLLM
makes). Bit-exact levers alone reach ~94% of vLLM; with the projection quant
~96-97%, parity-or-better physically in reach since both heaviest kernels
(SSM core, MoE GEMM) are already llama wins.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ground-truth side-by-side per-kernel ms/step of the MoE decode gap on DGX GB10.
llama (752 t/s, step 169.8ms) vs vLLM graphs-on (901-equiv, step 142.0ms): 27.8ms gap.
Headline: the grouped MoE-expert GEMM is a llama WIN - native FP4-MMA W4A4 47.3ms
vs vLLM Marlin W4A16 50.0ms at the tiny-M decode shape. A Marlin-style W4A16 MoE
GEMM would be slower; it is not the lever (extends the w4a16-marlin DENSE verdict).
The 15% lives elsewhere: bf16 projections + convert glue (+6.5ms), recurrence
state-gather plumbing (+6.6ms, led by k_get_rows 5.2ms), graph coverage + stream
overlap (~+7ms), W4A4 act-quant tax (+3.3ms), router/glue (+5.4ms).
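A quick tally of the per-bucket numbers above against the measured step gap.
The bucket values are rounded ("~+7ms" in particular), so the sum is only
expected to land near the gap, not on it; the dictionary keys are shorthand
labels chosen for this sketch.

```python
llama_step_ms, vllm_step_ms = 169.8, 142.0
gap_ms = llama_step_ms - vllm_step_ms            # 27.8 ms

# Buckets where llama is slower (ms/step), as listed above.
behind = {
    "bf16_projections_and_convert_glue": 6.5,
    "recurrence_state_gather": 6.6,
    "graph_coverage_and_stream_overlap": 7.0,    # quoted as "~+7ms"
    "w4a4_act_quant_tax": 3.3,
    "router_glue": 5.4,
}

# Grouped MoE-expert GEMM: llama 47.3 ms vs vLLM Marlin 50.0 ms.
gemm_delta_ms = 47.3 - 50.0                      # negative: llama is ahead

net_ms = sum(behind.values()) + gemm_delta_ms    # about 26.1 ms
unattributed_ms = gap_ms - net_ms                # about 1.7 ms
```

The itemized buckets account for roughly 26 of the 27.8 ms; the remainder is
within the rounding of the individual entries.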
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Patch 0026 added the hybrid per-head bf16 SSM-state opt-in as the
ssm_hybrid_tau_thresh cparam + the --ssm-bf16-tau CLI flag (default 0 =
bit-exact f32). Expose it per-model via the LocalAI gallery/model YAML
`options:` list, mirroring the paged_kv / max_batch_tokens setenv hooks.
- grpc-server.cpp: new `ssm_bf16_tau` (alias `ssm_hybrid_tau`) option ->
setenv(LLAMA_SSM_BF16_TAU) when the value parses to a positive float. It
does NOT reference the paged-only common_params field, so the turboquant
fork (which lacks patch 0026) stays byte-clean.
- patch 0026 (common.cpp common_context_params_to_llama): getenv fallback
feeds cparams.ssm_hybrid_tau_thresh from LLAMA_SSM_BF16_TAU only when the
--ssm-bf16-tau CLI flag is unset (0). Absent/non-positive env => untouched,
so stock stays bit-exact; the CLI flag takes precedence when set.
- docs: backend/index.yaml note, docs backends.md, gallery header NOTE
(referencing A_HYBRID_SSM_RESULTS.md; the 2 NVFP4 entries stay bit-exact).
Byte-safe when unset: with no ssm_bf16_tau option the env is never touched
and the default f32 bit-exact recurrence is preserved. Verified the parse +
consume code paths with a standalone compile-and-run (option string ->
LLAMA_SSM_BF16_TAU -> tau, plus 0 / garbage / CLI-precedence / unset cases).
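The precedence rules described above can be sketched as follows. This is an
illustrative Python model of the behaviour, not the C++ in grpc-server.cpp
or patch 0026; the function names are hypothetical and a plain dict stands
in for the process environment.

```python
ENV_NAME = "LLAMA_SSM_BF16_TAU"


def parse_positive_float(text):
    """Return the value if it parses to a finite positive float, else None."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return value if 0.0 < value < float("inf") else None


def apply_option(option_value, env):
    """Option side: export the env var only for a positive float."""
    if parse_positive_float(option_value) is not None:
        env[ENV_NAME] = option_value


def resolve_tau(cli_tau, env):
    """Consumer side: the CLI flag wins; the env is a fallback when it is 0."""
    if cli_tau != 0.0:
        return cli_tau
    tau = parse_positive_float(env.get(ENV_NAME))
    return tau if tau is not None else 0.0   # 0 keeps the f32 default


env_set, env_garbage, env_zero = {}, {}, {}
apply_option("64", env_set)
apply_option("garbage", env_garbage)
apply_option("0", env_zero)
```

With these inputs only env_set carries the variable, so the unset, zero and
garbage cases all resolve to 0 (the bit-exact f32 default).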
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
B-3 (the 0017-deferred mmq_y-down warp-remap of the NVFP4 grouped FP4-MMA
mul_mat_q) was built bit-exact on the clean 0025 base and measured: the
grouped GEMM kernel itself runs -1.3% (occupancy did rise via the nwarps=4
warp-remap / 128 threads-per-CTA), but end-to-end MoE decode is FLAT
(npl128 +0.4%, npl32 +0.3%, within noise) because the stream-k fixup grows
+42% (mmq_y=64 doubles the row-tiles) and the step is SSM/BW-bound. md5 PASS
both models, test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806 PASS.
No patch 0028; DGX dev tree reverted to pristine 0025.
Assessment: the bit-exact MoE GEMM/launch track is exhausted (B-1 re-graph
banked ~82->85%; B-2 and B-3 are 0). Honest bit-exact MoE ceiling = ~85% of
vLLM @npl128. The residual is the structural Marlin-NvFp4 grouped-GEMM gap
that no bit-exact lever closes. Recommend shipping the ~85% bit-exact default
and exposing the held 0026 bf16-SSM as a default-off opt-in (it reaches ~95%
but is non-bit-exact and fails the MoE KL gate).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of
quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode
step). Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted -
it does not lift.
Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for
quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns
is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is
flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128
~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel
caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as
patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this
256-tiny-expert BW-bound model.
Bit-exactness proven: md5 identical at block 64/128/256 for both models (the
per-thread quant body is untouched; thread->output map is invariant to
blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref,
MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.
MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the
remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM).
No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Regenerate patch 0026 with the hybrid-decode carry fix and record the
KL/throughput gate-sweep results.
Fix: clear(data=true) zeroes the whole recurrent buffer including the head_slot
maps, which were uploaded only once at construction; after the post-warmup
reset every head read head_slot==0 (f32-local-0), collapsing the split and
producing incoherent decode. Persist head_slot_host and re-upload via
upload_head_slots() after every buffer clear. Hybrid decode is now coherent and
the cross-op state carry is byte-exact (write==read, both partitions).
Gate result: de-risk PASS (test-backend-ops 84/84; T=0 md5 == 0023 baseline,
both models). Ship gate FAILS - no T_thresh meets MeanKLD<1e-3 AND
same-top-p>=99.5% with a meaningful speedup. The premise that the bf16 error
concentrates in long-memory heads is refuted: KL scales with the bf16 head
count and saturates ~0.06/~91% (MoE saturates at the minimal split). The carry
is byte-exact, so this is genuine bf16 sensitivity, not a bug. The byte-saving
lever is real (dense +12.4%, MoE +11.5% decode @npl128 at T=128) but cannot
meet the strict KL bar. Shipped default-off (f32, bit-exact opt-out); hybrid is
opt-in only and not recommended in the gallery config. Full tables in
A_HYBRID_SSM_RESULTS.md.
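The ship gate above reads as a simple conjunction. A minimal sketch, where
the two quality thresholds come from this message and min_speedup_pct is a
hypothetical stand-in for "a meaningful speedup" (the message gives no
number for it):

```python
def ship_gate(mean_kld, same_top_p_pct, speedup_pct, min_speedup_pct=1.0):
    """True only if quality holds AND the speedup is worth shipping."""
    quality_ok = mean_kld < 1e-3 and same_top_p_pct >= 99.5
    speed_ok = speedup_pct >= min_speedup_pct   # assumed threshold
    return quality_ok and speed_ok


# Illustrative pairing of the saturated KL figures with the dense speedup.
hybrid_dense = ship_gate(mean_kld=0.06, same_top_p_pct=91.0, speedup_pct=12.4)
# All-f32 default: perfect agreement but nothing gained.
all_f32 = ship_gate(mean_kld=0.0, same_top_p_pct=100.0, speedup_pct=0.0)
```

Both calls return False: the hybrid split fails on quality, and the all-f32
setting passes quality but has no speedup, which is the "no T_thresh meets
the gate" outcome reported above.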
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Lever A patch + build/de-risk results. Splits the persisted gated-DeltaNet
recurrent state per head: f32 on long-memory heads (where bf16 rounding does not
contract and the KL error concentrates), bf16 on fast-decaying heads, classified
at model load by tau_h = 1/(|ssm_a|*softplus(ssm_dt)). Default ssm_hybrid_tau_thresh
= 0.0 keeps every head f32 (bit-exact opt-out).
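A sketch of the per-head classification, using the tau_h formula above. The
comparison direction (tau_h at or above the threshold stays f32) follows the
description of long-memory heads; the exact boundary handling and the sample
parameter values are assumptions made for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def head_tau(ssm_a, ssm_dt):
    """tau_h = 1 / (|ssm_a| * softplus(ssm_dt))."""
    return 1.0 / (abs(ssm_a) * softplus(ssm_dt))


def classify_heads(ssm_a, ssm_dt, tau_thresh):
    """Per-head state dtype; tau_thresh == 0 keeps every head f32."""
    if tau_thresh <= 0.0:
        return ["f32"] * len(ssm_a)
    return [
        "f32" if head_tau(a, dt) >= tau_thresh else "bf16"
        for a, dt in zip(ssm_a, ssm_dt)
    ]


# Two made-up heads: one slow-decaying (tau ~144), one fast (tau ~1.4).
a_vals, dt_vals = [-0.01, -1.0], [0.0, 0.0]
split = classify_heads(a_vals, dt_vals, tau_thresh=128.0)
default = classify_heads(a_vals, dt_vals, tau_thresh=0.0)
```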
De-risk gates BOTH PASS: test-backend-ops GATED_DELTA_NET CUDA0 OK (incl 32 hybrid
mixed CUDA-vs-CPU cases); default all-f32 greedy md5 == 0023 baseline both models
(dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd).
Known open issue (opt-in hybrid only; default unaffected): hybrid-ON model decode
(ids in-place path) is incoherent; classifier/cache/kernel-params verified correct,
bug isolated to the ids in-place cross-step state path. See A_HYBRID_SSM_RESULTS.md.
Not ready for the GateSweep until fixed.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
Append lever C (structural dense residual: lm_head + scheduling) findings
and the master RANK + PLAN section to SPEEDUP_HUNT.md. Per-lever scorecard
(gain x tractability x gate), ranked build order, the concrete A build plan
for the hybrid per-head f32/bf16 SSM state cache, and the ordered B/C/D queue
with each one's build trigger.
Verdict: ship the MoE re-graph (patch 0025, measured +1.9-4.4%, both gates
PASSED) now; build A as the lead (only lever ABOVE vLLM on dense, KL-gated,
~430-454 t/s = 103-108% of vLLM); bank B-2/B-3 on MoE; C last (<1% bit-exact,
dead-end); D opt-in-only and dense-only behind the same KL gate bf16-SSM failed.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the
paged patch-stack (0001-0024) to the new tip: the stack was rebased onto
9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated
bit-exact before re-exporting the LocalAI .patch files.
Re-exporting each shipped patch from its rebased commit and diffing body-to-body
against the committed files identifies exactly 4 that changed and no longer
git-apply to 9d5d882d:
- 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to
the refactored update_slots() lambda (continue->return, batch.n_tokens->
batch.size()); identical env-guarded logic.
- 0013 static prefill budget: budget var-block / while-gate / admission-break
re-expressed against the refactored loop (add_ok=false idiom).
- 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream
inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The
inserted lines are unchanged. (This one rebased cleanly via 3-way but its
committed .patch no longer applies with plain git apply, so it is caught by the
per-patch apply-check, not by the rebase conflict count.)
- 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch =
batch.size() + add_ok=false against the refactored loop.
All four are byte-faithful format-patch exports of the gate-green rebased commits.
Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green
tree byte-for-byte across every code file.
The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only
(hunk bodies byte-identical, only @@ line-numbers shifted) and still apply
cleanly, so they are left unchanged. The remaining patches are identical.
Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121):
- test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT
1146/1146, MUL_MAT_ID 806/806 all OK.
- greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4
5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4
07db32c2bcb78d17a43ed18bc22705cd, both == baseline.
- decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s
(ref 745.7, +0.7%), both within noise.
Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Sync to master (12 commits) + the llama.cpp pin bump 8be759e6 -> 9d5d882d.
Conflicts resolved:
- Makefile .NOTPARALLEL: union (keep both backends/llama-cpp-localai-paged and
master's backends/privacy-filter-darwin).
- gallery/index.yaml: our 2 base NVFP4 entries (qwen3.6-27b-nvfp4, qwen3.6-35b-a3b-nvfp4)
for the paged backend prepended to master's full list; master keeps its own
*-nvfp4-mtp variants (distinct entries). Go build + YAML validated; the 8 duplicate
gallery names are pre-existing in master, not introduced here.
The patchset still needs re-verification against the new tip (pin-sync, next step).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The metal-darwin-arm64-piper backend crashed at launch on macOS:
DYLD "Library missing"
Library not loaded: @rpath/libucd.dylib
Referenced from: .../piper
Reason: no LC_RPATH's found
The piper binary links libucd, libespeak-ng, libpiper_phonemize and
libonnxruntime via @rpath, but ships with no LC_RPATH, so dyld cannot
expand @rpath and aborts before piper runs. The libraries themselves are
already bundled in package/lib/ by package.sh.
Additionally, package.sh's architecture detection only handled the Linux
glibc loaders (/lib64/ld-linux-x86-64.so.2, /lib/ld-linux-aarch64.so.1)
and otherwise hit `echo "Error: Could not detect architecture"; exit 1`,
so on macOS packaging failed outright.
Add a Darwin branch (before the Linux checks) that skips the glibc/ld.so
bundling macOS has no use for and instead runs
`install_name_tool -add_rpath @loader_path/lib` on the piper binary, so
@rpath resolves to the bundled package/lib/ directory.
Also mirror sherpa-onnx/opus in run.sh: export DYLD_LIBRARY_PATH on
Darwin (LD_LIBRARY_PATH is Linux-only) as a defensive fallback.
Validated by hand on Apple Silicon: with the rpath added, piper
synthesized a real WAV. The darwin build is validated in CI.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
New backend = stock llama-cpp grpc-server + the paged patchset (forces LLAMA_PAGED=on),
shipped as its own meta-backend (mirrors turboquant, simpler: no fork pin, no
grpc-server patching - the paged runtime hooks already exist in grpc-server.cpp).
Stock llama-cpp untouched (LLAMA_PAGED?=on retained; the de-risk flip deferred for
sign-off). Gallery: qwen3.6-27b-nvfp4 (dense) + qwen3.6-35b-a3b-nvfp4 (MoE) with the
benchmark run config (paged_kv, max_batch_tokens, parallel, flash_attention, f16),
mudler/ GGUF uris (sha256 TODO until publish). Importer dropdown entry + tests.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Produce a Gatekeeper-clean macOS distribution with no user workaround:
- Launcher DMG + the LocalAI.app inside it are built via fyne, codesigned
with the Developer ID under the hardened runtime, then the DMG is signed,
notarized (notarytool) and stapled. Replaces macos-dmg-creator (which had
no signing hook) with fyne package + hdiutil so we control the .app before
packaging.
- The bare local-ai darwin server binary is signed + notarized via
GoReleaser's native notarize block (quill backend, runs on Linux).
- All signing is gated on secrets being present, so forks/PRs/local builds
stay unsigned and green (contrib/macos/sign-and-notarize.sh no-ops).
- Add hardened-runtime entitlements and FyneApp.toml for deterministic
packaging; update macOS install docs to drop the quarantine workaround.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C
Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill
burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill
draws from a depleted/fragmented pool and its throughput collapses (the benchmark's
"restart per npl" crutch). Decode is unaffected. The fix changes only host-side
block accounting and placement, never KV values or compute, and is gated behind
LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior).
Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every
block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for
the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly.
Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the
free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving
content-cache hashes.
Fix-3 release on slot completion: server_slot::release() issues prompt_clear()
under the paged engine so a finished-idle slot returns its blocks promptly.
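Fix-1 and Fix-2 can be modelled with a toy host-side pool. The class below
is a simplified illustration: the method names echo the ones above, but the
data structures are assumptions and content-cache hashes are not modelled.

```python
import math
from collections import deque


class PagedPool:
    """Toy block pool with ref-counted blocks and a FIFO free queue."""

    def __init__(self, n_blocks, block_size):
        self.bs = block_size
        self.n_blocks = n_blocks
        self.free = deque(range(n_blocks))
        self.refcount = [0] * n_blocks
        self.seq_blocks = {}

    def allocate(self, seq, n_tokens):
        need = math.ceil(n_tokens / self.bs)
        blocks = self.seq_blocks.setdefault(seq, [])
        while len(blocks) < need:
            b = self.free.popleft()
            self.refcount[b] += 1
            blocks.append(b)

    def truncate(self, seq, n_keep):
        """Free every block beyond ceil(n_keep / bs); return the count freed."""
        keep = math.ceil(n_keep / self.bs)
        blocks = self.seq_blocks.get(seq, [])
        freed = 0
        for b in blocks[keep:]:
            self.refcount[b] -= 1
            if self.refcount[b] == 0:      # only unreferenced blocks return
                self.free.append(b)
                freed += 1
        del blocks[keep:]
        return freed

    def defrag_free_pool(self):
        """When fully idle, relink the free queue in ascending block-id order."""
        if len(self.free) == self.n_blocks:
            self.free = deque(sorted(self.free))


pool = PagedPool(n_blocks=64, block_size=16)
pool.allocate("a", 512)                    # 32 blocks
pool.allocate("b", 256)                    # 16 blocks
freed_b = pool.truncate("b", 0)            # sequence b finished
freed_tail = pool.truncate("a", 256)       # partial-tail removal on a
pool.truncate("a", 0)
fragmented = list(pool.free)               # out of order after the burst
pool.defrag_free_pool()
```

After the last call the free queue is back in ascending order, so a later
request draws blocks in the same order as on a fresh pool.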
Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023):
- Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM
(5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops
unaffected (no ggml op touched).
- Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores
ascending popleft order. UNIT PASS.
- Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998;
64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527).
- Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses
488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to
532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96
(Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both
arms (bit-exact serving).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM,
B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph):
- A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than
vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is
bounded on this bandwidth-bound a3b model. Keep down_proj quantize
retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a
later shared GDN+MoE decode-graph project.
- B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact
ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair
vs vLLM. Dead end. Rank last.
- C: paged-pool burst-degradation BUG (Part 2) is a true correctness
defect (prefill collapses 507->65 t/s after a burst, restart cures it):
reclamation gap on partial seq_rm + free-queue fragmentation. Plus the
static decode-first budget (Part 1) explains 903s/213s burst TTFT and
the chunked-interleave fix.
- D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's
bimodality was the shelved BF16 SSM path. Closed.
First build target: the paged-pool burst-degradation bug fix (Fix-1
truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot-
completion). Small, localized, default-off byte-identical, crisp repro
(npl64 burst then npl8: prefill within 10% of fresh + num_free restored).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The opus Go backend (WebRTC audio codec) never built on macOS, so the
published master-metal-darwin-arm64-opus image shipped source only — no
opus binary and no libopusshim — because every step assumed Linux.
- Makefile: hardcoded libopusshim.so with no OS handling. Mirror
sherpa-onnx: SHIM_EXT=so / dylib on Darwin and build
libopusshim.$(SHIM_EXT). On Darwin link the shim with
-undefined dynamic_lookup so it resolves opus_encoder_ctl from the
already globally-loaded libopus (codec.go dlopens it RTLD_GLOBAL
first) instead of baking an absolute Homebrew path into the dylib,
keeping the packaged shim relocatable.
- run.sh: hardcoded LD_LIBRARY_PATH + libopusshim.so even on macOS. Add
a Darwin branch exporting DYLD_LIBRARY_PATH and the .dylib shim, like
sherpa-onnx/run.sh.
- package.sh: bundle libopusshim.$(SHIM_EXT) and libopus*.dylib (not
just .so) into package/lib so the OCI image (which ships package/.)
is self-contained on a runtime with no Homebrew; add a Darwin arch
branch so it doesn't warn/skip.
- backend_build_darwin.yml: install + link opus and pkg-config via brew
so the Makefile's `pkg-config opus` resolves on the macOS runner, and
cache opus' Cellar dir.
Go code is unchanged; darwin build is validated in CI.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Synthesize the GPU kernel-budget probe with the read-only glue source
map. Add (4) the implementation cost - llama has no model-compute-dtype
knob, the residual stream is F32 by construction (ggml_mul_mat hardcodes
F32 output), so f16 glue is not a flag but an opt-in multi-file change
(norm.cu f16 kernels + f16 residual stream). Add the final verdict:
precision is not the dominant cause of the 8% residual (83% of the step
is already f32/W4A4-matched), f16 recovers only 40-60% of the gap and is
non-bit-exact, so do not build it as the default; ship the 95%-bit-exact
f32 plateau and target the structural cublas/graph-launch ~3-4% instead.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Empirical probe on q36-27b-nvfp4 @npl128 (build f7409c2, patch 0023):
- attention KV cache default is ALREADY f16 (K/V f16) -> --cache-type f16 is a
no-op; q8_0 within noise -> KV dtype is not a decode lever
- nsys node-trace decode budget: f32-glue (norms/elementwise/activations/attn,
excl. SSM recurrence + NVFP4 GEMM) = 28.7 ms = 8.4% of step (40.9 ms = 12%
incl. the non-FP4 cublas GEMM)
- f16 realistically recovers ~11-16 ms of the ~27 ms/step gap = ~40-60% of the
8.2% residual -> ~95-96% parity, not a full close; non-bit-exact opt-in only
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
fix(backends): ship the package/ dir for darwin go backends
golang-darwin.sh packaged the whole backend source/build dir as the OCI
image (backend/go/$BACKEND/.), so the runtime dylibs ended up under
package/lib and backend-assets/lib while run.sh looks in $CURDIR/lib. As a
result a backend like sherpa-onnx could not dlopen its libsherpa-shim.dylib
at runtime and exited immediately (the model then 500s with "grpc service
not ready"); it started fine only when run from inside package/.
Ship package/. instead — the self-contained run.sh + binary + lib/ bundle —
matching the Linux Dockerfile.golang (`COPY .../package/. ./`). Backends
that don't assemble a package/ fall back to the backend dir, and the
binary-existence guard now checks the directory actually shipped.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Earlier text claimed bf16 = vLLM's own precision; that was a refuted byte-gate
draft re-surfacing. The settled finding (BITEXACT_VS_VLLM.md, proven 3 ways) is
that vLLM keeps the gated-DeltaNet TEMPORAL state in f32 (only its conv state is
bf16). So bf16 temporal is BELOW vLLM's recurrent precision, not a match; and at
equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
When LOCALAI_PREFER_DEV_BACKENDS is set, install the -development image as the
primary backend URI (keeping the released image reachable as the first
fallback), instead of only reaching development as a download fallback when the
released image is missing. This lets an operator force backends built from the
development branch — e.g. to pick up a fix already on master before a release.
Threads PreferDevelopmentBackends through SystemState so InstallBackend can see
it, and reuses the same development-URI convention as the existing failure-path
fallback (released tag -> branch tag + dev suffix). The unexported developmentURI
helper is covered by a Ginkgo spec.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship:
the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0
on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides).
- final_benchmark.csv: clean 8-column plot-ready schema
(model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb),
16 rows (2 models x 2 engines x npl 8/32/64/128).
- QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the
internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on
both engines, no stale-baseline carry-over).
- decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels.
Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode
(ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM
bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's
flat ~107 GB pool).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ranked pick-up points after the 95%-bit-exact plateau: hybrid-precision SSM state
(per-head f32/bf16 split - the bf16 error is concentrated in long-memory heads, so
a split could capture most of the +25-31% while passing the f32 KL gate), dense
CUDA-graph instability, the rms_norm->fp4 fold (flat-risk), datacenter Blackwell
sm_100 (no LPDDR5x floor), adaptive prefill budget, MoE-specific recurrence tuning.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
De-risk passed (test-backend-ops 52/52 bf16, f32 default byte-identical to 0023),
and the throughput lever is real (recurrence -49%/call, dense ~490 t/s = 125% of
vLLM clean). But bf16-vs-f32 KLD is 0.06-0.17 at >=1024 ctx (threshold 1e-3) with
~90% top-token agreement: intrinsic bf16 error over gated-DeltaNet long-memory
heads, not a bug. That is exactly vLLM's own bf16 GDN precision. Shelved; ship the
95% bit-exact f32 plateau (0018-0023). bf16 work backed up on DGX (BF16_SSM_STATE.diff).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backends): darwin/Metal build for the privacy-filter backend (timeboxed try)
The privacy-filter.cpp engine is already Metal-capable on Apple Silicon: it pulls
ggml and never forces GGML_METAL=OFF, and ggml defaults Metal ON on Apple, so a
plain Darwin build is Metal-enabled. grpc++/protobuf resolve from Homebrew via
find_package(... CONFIG). It just had no darwin build path - the existing
package.sh and run.sh are Linux-only and there was no make target / workflow step.
Adds the bespoke darwin path, modeled on the ds4 one:
- scripts/build/privacy-filter-darwin.sh: native make grpc-server, otool -L dylib
bundling, create-oci-image (no Linux package.sh).
- Makefile: backends/privacy-filter-darwin target (+ .NOTPARALLEL).
- .github/workflows/backend_build_darwin.yml: gated build step for privacy-filter.
- scripts/changed-backends.js: inferBackendPathDarwin special-case -> backend/cpp.
- .github/backend-matrix.yml: includeDarwin entry (lang go, like ds4/llama-cpp).
- backend/index.yaml: metal: capability + metal-privacy-filter(-development) entries.
- backend/cpp/privacy-filter/run.sh: DYLD_LIBRARY_PATH branch on Darwin.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(privacy-filter): macOS proto include + bundle ggml dylibs
Validated natively on an M4 (the build/package/load chain now works with Metal):
- CMakeLists.txt: hw_grpc_proto compiles the generated proto/grpc sources but
only linked the binary dir, so on macOS it could not find protobuf's headers
(runtime_version.h) - Homebrew puts them under /opt/homebrew, not /usr/include.
Link protobuf::libprotobuf + gRPC::grpc++ so their include dirs propagate. No-op
on Linux (apt headers are already on the default search path).
- privacy-filter-darwin.sh: bundle the ggml shared libs the binary @rpath-links
(libggml{,-base,-cpu,-blas,-metal}); the otool -L walk only catches on-disk
absolute deps and missed them. Resolved at runtime by run.sh's DYLD_LIBRARY_PATH.
M4 check: arm64 grpc-server links @rpath/libggml-metal.0.dylib; with the 15 ggml
dylibs + grpc/protobuf bundled, it loads clean (no dyld errors) and prints usage.
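The gap in the otool -L walk can be illustrated with a small filter. This sketch assumes otool -L-style output on stdin; the sample listing is hand-written for illustration (not captured from the M4 build) and `rpath_deps` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: list the @rpath-relative dependencies that a walk over
# absolute on-disk paths would skip. rpath_deps is a hypothetical helper.
rpath_deps() {
    # Keep lines whose first field starts with "@rpath/" and strip that prefix.
    awk '$1 ~ /^@rpath\// { sub(/^@rpath\//, "", $1); print $1 }'
}

# Hand-written sample in the shape of `otool -L` output (illustrative only).
sample=$(cat <<'EOF'
grpc-server:
    @rpath/libggml-metal.0.dylib (compatibility version 0.0.0, current version 0.0.0)
    @rpath/libggml-base.0.dylib (compatibility version 0.0.0, current version 0.0.0)
    /opt/homebrew/lib/libprotobuf.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1.0.0)
EOF
)

found=$(printf '%s\n' "$sample" | rpath_deps)
printf '%s\n' "$found"
```

On a macOS host the same filter could be fed with `otool -L grpc-server | rpath_deps`; the names it prints are the ones that need to be copied into the bundle explicitly.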
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(backends): quote $CURDIR in run.sh so backends work in paths with spaces
The backend launcher scripts derive their own directory with
CURDIR=$(dirname "$(realpath $0)") and then reference it unquoted as
$CURDIR (e.g. [ -f $CURDIR/lib/ld.so ], export LD_LIBRARY_PATH=$CURDIR/lib:...,
exec $CURDIR/<binary> "$@"). When a backend is installed under a path that
contains a space - notably macOS's ~/Library/Application Support/... - bash
word-splits the unquoted $CURDIR, so the test builtin fails with
"binary operator expected" and exec tries to run ".../Library/Application",
yielding "No such file or directory". The backend never starts, surfacing as
a gRPC "service not ready" error and an HTTP 500. Quote $CURDIR (and the
realpath "$0") in every affected run.sh; no logic changes.
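The failure mode can be reproduced outside any backend. A minimal sketch, assuming a throwaway directory created with mktemp; the directory layout and every variable name other than CURDIR are illustrative.

```shell
#!/bin/sh
# Sketch only: show how an unquoted $CURDIR breaks when the path has a space.
demo_root=$(mktemp -d)
CURDIR="$demo_root/Application Support/backend"
mkdir -p "$CURDIR/lib"
touch "$CURDIR/lib/ld.so"

# Unquoted: the shell word-splits the path at the space, so `[` receives
# two operands after -f and fails with an error instead of finding the file.
if [ -f $CURDIR/lib/ld.so ] 2>/dev/null; then
    unquoted=found
else
    unquoted=broken
fi

# Quoted: the whole path is passed as a single argument.
if [ -f "$CURDIR/lib/ld.so" ]; then
    quoted=found
else
    quoted=broken
fi

echo "unquoted=$unquoted quoted=$quoted"
rm -rf "$demo_root"
```

The same splitting applies to the export and exec lines, which is why every use of $CURDIR is quoted rather than only the file test.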
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#10517 pinned the pseudo-version of the postgres connection-timeout fix;
mudler/LocalRecall@v0.6.3 now tags that exact commit. Use the clean release
tag instead of the pseudo-version. No code change.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* chore: bump localrecall for postgres per-connection timeouts
Pulls mudler/LocalRecall#49: sets lock_timeout and
idle_in_transaction_session_timeout (both on by default) plus an opt-in
statement_timeout on every pooled connection, so a
corrupt/wedged index (e.g. a BM25 insert spinning on a buffer-content lock)
can no longer hold its relation lock forever and head-of-line block the
whole vector store.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(agents): document PostgreSQL connection safety timeouts
Note the POSTGRES_LOCK_TIMEOUT / POSTGRES_IDLE_IN_TRANSACTION_TIMEOUT /
POSTGRES_STATEMENT_TIMEOUT env vars read by the embedded vector store, and
that safe defaults are on automatically.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The standalone quantize fold is empirically flat (Lever-2 precedent) with the
worst gain/plumbing ratio; no bit-exact lever remains. Dense 371.81 t/s @npl128
= 95.0% of vLLM 391, recurrence past vLLM at the LPDDR5x DRAM floor, all
byte-identical to llama f32. Only bf16 state (shelved) goes further.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Resolve pkg/xsysinfo/gpu.go: keep master's NVIDIAComputeCapability +
parseComputeCap (the #10485 multi-GPU work); re-express our IsNVIDIABlackwell
as a thin wrapper over NVIDIAComputeCapability instead of a duplicate
nvidia-smi probe.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>