LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 09:26:55 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	b061e4aef0	docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM, B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph): - A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is bounded on this bandwidth-bound a3b model. Keep down_proj quantize retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a later shared GDN+MoE decode-graph project. - B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair vs vLLM. Dead end. Rank last. - C: paged-pool burst-degradation BUG (Part 2) is a true correctness defect (prefill collapses 507->65 t/s after a burst, restart cures it): reclamation gap on partial seq_rm + free-queue fragmentation. Plus the static decode-first budget (Part 1) explains 903s/213s burst TTFT and the chunked-interleave fix. - D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's bimodality was the shelved BF16 SSM path. Closed. First build target: the paged-pool burst-degradation bug fix (Fix-1 truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot- completion). Small, localized, default-off byte-identical, crisp repro (npl64 burst then npl8: prefill within 10% of fresh + num_free restored). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:42:55 +00:00
Ettore Di Giacinto	89e62fc74f	docs(paged): finalize f16 glue probe - cost analysis + build verdict Synthesize the GPU kernel-budget probe with the read-only glue source map. Add (4) the implementation cost - llama has no model-compute-dtype knob, the residual stream is F32 by construction (ggml_mul_mat hardcodes F32 output), so f16 glue is not a flag but an opt-in multi-file change (norm.cu f16 kernels + f16 residual stream). Add the final verdict: precision is not the dominant cause of the 8% residual (83% of the step is already f32/W4A4-matched), f16 recovers only 40-60% of the gap and is non-bit-exact, so do not build it as the default; ship the 95%-bit-exact f32 plateau and target the structural cublas/graph-launch ~3-4% instead. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:12:55 +00:00
Ettore Di Giacinto	001d833426	docs(paged): f16/bf16 glue probe - dense decode residual ceiling Empirical probe on q36-27b-nvfp4 @npl128 (build f7409c2, patch 0023): - attention KV cache default is ALREADY f16 (K/V f16) -> --cache-type f16 is a no-op; q8_0 within noise -> KV dtype is not a decode lever - nsys node-trace decode budget: f32-glue (norms/elementwise/activations/attn, excl. SSM recurrence + NVFP4 GEMM) = 28.7 ms = 8.4% of step (40.9 ms = 12% incl. the non-FP4 cublas GEMM) - f16 realistically recovers ~11-16 ms of the ~27 ms/step gap = ~40-60% of the 8.2% residual -> ~95-96% parity, not a full close; non-bit-exact opt-in only Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:11:21 +00:00
Ettore Di Giacinto	00f92659f8	docs(paged): correct vLLM recurrent-state precision (f32, not bf16) Earlier text claimed bf16 = vLLM's own precision; that was a refuted byte-gate draft re-surfacing. The settled finding (BITEXACT_VS_VLLM.md, proven 3 ways) is that vLLM keeps the gated-DeltaNet TEMPORAL state in f32 (only its conv state is bf16). So bf16 temporal is BELOW vLLM's recurrent precision, not a match; and at equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 06:22:08 +00:00
Ettore Di Giacinto	7dd3431040	docs(paged): promote TTFT/prefill + paged-pool burst-degradation bug (benchmark finding) The final benchmark exposed TTFT as the weakest number (dense npl128 903s vs vLLM 6-18s, decode-first budget throttling burst-prefill) plus a concrete paged-pool burst-degradation bug (post-burst low-npl prefill collapses 507->65 t/s; decode unaffected). Highest-value serving fix; decode + memory already strong. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 03:54:27 +00:00
Ettore Di Giacinto	ae0042f214	docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship: the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0 on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides). - final_benchmark.csv: clean 8-column plot-ready schema (model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb), 16 rows (2 models x 2 engines x npl 8/32/64/128). - QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on both engines, no stale-baseline carry-over). - decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels. Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode (ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's flat ~107 GB pool). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 03:51:35 +00:00
Ettore Di Giacinto	aaaa90ae4b	bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs vLLM 0.23.0, GB10) Publishable, plot-ready head-to-head on GB10 / DGX Spark with matched NVFP4 weights, both engines at their best realistic config (CUDA graphs ON both sides; vLLM util 0.85 max-model-len 4096 max-num-seqs 256; llama -c 131072 --parallel 128 LLAMA_KV_PAGED=1 LLAMA_MAX_BATCH_TOKENS=512). Identical async client: 512-tok unique-nonce prompt (fresh full prefill), max_tokens=256, temp 0, ignore_eos, stream+usage; npl 8/32/64/128. llama = clean patch 0023 (dev tree f7409c2, bf16 GDN-state work reverted, build-cuda rebuilt). llama runs at HIGHER precision (f32 GDN state + q8 act) than vLLM (bf16 + w4a4). decode_agg t/s, llama as % of vLLM: DENSE q36-27b-nvfp4: npl8 117% npl32 91% npl64 90% npl128 92% MoE q36-35b-a3b: npl8 83% npl32 78% npl64 77% npl128 82% memory: llama on-demand paged KV 50-90 GB (dense) / 36-58 GB (MoE) vs vLLM fixed ~107 GB pool at all npl (1.5-3x lower). TTFT: vLLM wins under synchronized burst (llama decode-first budget trades burst-prefill for decode; decode + memory unaffected). Outputs: final_benchmark.csv (16 rows, 5 metrics each), refreshed QWEN36_NVFP4_BENCH.md (FINAL section), BENCHMARK_PROGRESS.md (per-row checkpoint log). Methodology notes: per-npl llama server restart (paged-pool degrades after high-npl bursts; decode robust), vLLM npl8 re-check confirms no degradation; clean env (service containers stopped for the run, restored after). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 03:47:24 +00:00
Ettore Di Giacinto	7c45447c9e	docs(paged): FUTURE_LEVERS - parked decode-parity exploration trail Ranked pick-up points after the 95%-bit-exact plateau: hybrid-precision SSM state (per-head f32/bf16 split - the bf16 error is concentrated in long-memory heads, so a split could capture most of the +25-31% while passing the f32 KL gate), dense CUDA-graph instability, the rms_norm->fp4 fold (flat-risk), datacenter Blackwell sm_100 (no LPDDR5x floor), adaptive prefill budget, MoE-specific recurrence tuning. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 00:53:09 +00:00
Ettore Di Giacinto	24833f0966	docs(paged): bf16 SSM-state NO-SHIP - fails f32 KL gate (= vLLM's own precision) De-risk passed (test-backend-ops 52/52 bf16, f32 default byte-identical to 0023), and the throughput lever is real (recurrence -49%/call, dense ~490 t/s = 125% of vLLM clean). But bf16-vs-f32 KLD is 0.06-0.17 at >=1024 ctx (threshold 1e-3) with ~90% top-token agreement: intrinsic bf16 error over gated-DeltaNet long-memory heads, not a bug. That is exactly vLLM's own bf16 GDN precision. Shelved; ship the 95% bit-exact f32 plateau (0018-0023). bf16 work backed up on DGX (BF16_SSM_STATE.diff). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 00:49:49 +00:00
Ettore Di Giacinto	634c0e5a0f	docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling at 95% of vLLM The standalone quantize fold is empirically flat (Lever-2 precedent) with the worst gain/plumbing ratio; no bit-exact lever remains. Dense 371.81 t/s @npl128 = 95.0% of vLLM 391, recurrence past vLLM at the LPDDR5x DRAM floor, all byte-identical to llama f32. Only bf16 state (shelved) goes further. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 22:42:08 +00:00
Ettore Di Giacinto	64766ecc85	Merge branch 'master' into worktree-feat+paged-attention Resolve pkg/xsysinfo/gpu.go: keep master's NVIDIAComputeCapability + parseComputeCap (the #10485 multi-GPU work); re-express our IsNVIDIABlackwell as a thin wrapper over NVIDIAComputeCapability instead of a duplicate nvidia-smi probe. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 21:56:35 +00:00
Ettore Di Giacinto	02cbae5ea9	feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023) Mirror patch 0023 + results into the paged series. Bit-exact MoE decode/prefill lever: ggml mul_mat_id re-quantizes each token's activation once per expert for the broadcast up/gate proj (ne11==1); quantize_mmq_nvfp4 has no cross-thread reduction, so the gathered blocks are byte-identical across experts. The lever quantizes the ne12 unique tokens once and gathers the block_fp4_mmq rows into the expert-gathered layout with a coalesced uint4 copy (144 B = 9 uint4); the GEMM is untouched and down_proj keeps the stock path. Measured (DGX GB10, on top of patch 0022, q36-35b-a3b-nvfp4): decode S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%, prefill T_PP -4%; dense q36-27b-nvfp4 byte-flat. nsys: quantize_mmq_nvfp4 868 -> 457 ms, gather +32 ms (net -379 ms). Bit-exact: q36-27b 5951a5b4..., q36-35b-a3b 07db32c2... (on == off == 0022); test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores stock. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 21:49:15 +00:00
LocalAI [bot]	f2ed63e39a	docs(backends): make OS coverage explicit + require darwin support for new backends (#10516) docs(backends): make OS coverage explicit + require darwin for new backends The backend matrix is the source of truth for which OS a backend ships on, but that was never written down, so backends were landing Linux-only by default even when the engine builds fine on macOS. - .github/backend-matrix.yml: header block documenting the two matrices (include = Linux, includeDarwin = macOS/Apple Silicon) and the policy that new backends target every OS they can build for. - .agents/adding-backends.md: a 'Cover every OS' subsection in step 2 (full darwin wiring: includeDarwin entry, index.yaml metal: + metal-<backend> entries, run.sh DYLD branch + inferBackendPathDarwin case for C++ backends, the hw_grpc_proto protobuf/grpc link gotcha, and the path-filter touch) plus a verification-checklist item. - AGENTS.md (CLAUDE.md): Quick Reference pointer so it surfaces every session. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 23:26:39 +02:00
LocalAI [bot]	286c508ce0	feat(backends): darwin build for the localvqe backend (acoustic echo cancellation) (#10512) feat(backends): darwin build for the localvqe backend LocalVQE (acoustic echo cancellation / noise suppression / dereverberation) already builds on Darwin - its Makefile takes the OS=Darwin branch with GGML_METAL=OFF (upstream is CPU + Vulkan only), producing a native arm64 CPU image. It was just never wired into CI. - .github/backend-matrix.yml: add localvqe to includeDarwin (build-type metal, lang go) - the darwin/arm64 build profile; the backend itself stays CPU. - backend/index.yaml: metal: capability + concrete metal-localvqe(-development) entries pointing at the -metal-darwin-arm64-localvqe images. - backend/go/localvqe/Makefile: note on the existing Darwin branch (also the per-backend change the CI path filter needs to build it here). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 22:54:36 +02:00
LocalAI [bot]	d1a9d59917	feat(backends): darwin/Metal builds for vision C++/ggml backends (depth-anything, locate-anything, rfdetr-cpp, sam3-cpp) (#10511) feat(backends): darwin/Metal builds for the vision C++/ggml backends depth-anything-cpp, locate-anything-cpp, rfdetr-cpp and sam3-cpp already carry a Darwin/Metal path in their Makefiles (GGML_METAL=ON when build-type=metal), but were never wired into CI, so no Metal image was published and Apple Silicon could not install them. - .github/backend-matrix.yml: add the four to includeDarwin (build-type metal, lang go), matching the other go+ggml -cpp Metal entries. - backend/index.yaml: add metal: to each backend's capabilities map (main and -development) plus concrete metal-<backend>(-development) entries pointing at the latest/master -metal-darwin-arm64-<backend> images. - backend/go//Makefile: a one-line note on the existing Darwin branch (also the per-backend change the CI path filter needs to actually build them here). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 22:07:56 +02:00
Ettore Di Giacinto	3c1ed67b4b	feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune (patch 0022) Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence kernel, carried as paged patch 0022. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak), the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA order byte-identical (md5-gateable). The bf16-state plan stays shelved. Column folding: each warp owns COLS_PER_WARP columns of the 128x128 recurrent state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a per-block column tile; grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The per-lane strided row sharding and the warp_reduce butterfly are unchanged, so only the (warp,block)->column assignment differs and the result is bit-identical; per-warp memory-level parallelism rises ~COLS_PER_WARP-fold, covering more DRAM latency on this bandwidth-bound kernel. Default tile is the measured GB10 winner (NUM_WARPS=16, COLS_PER_WARP=8), env-selectable via GDN_NW / GDN_CPW. GB10: gated_delta_net decode 4.02 -> 3.49 ms/call, 73.4% -> 84.6% of peak (above vLLM's 82.4%; 102.6% of vLLM recurrence BW). decode S_TG t/s: dense 27b npl128 335.9 -> 373.2 (+11.1%), MoE 35b-a3b npl128 688.4 -> 745.7 (+8.3%). Greedy md5 byte-identical to the 0021 baseline on both q36-27b-nvfp4 and q36-35b-a3b-nvfp4; test-backend-ops -o GATED_DELTA_NET 36/36 PASS. Bench/method in OCCUPANCY_RETUNE_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 18:34:17 +00:00
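The column-folding claim in the patch 0022 entry (only the (warp, block)->column assignment changes, so every column is still computed exactly once) can be checked with a small index sketch. This is an illustration in Python, not the CUDA kernel: the function names and the exact index expression `base + warp + j * num_warps` are assumptions inferred from the wording "looping ... over col, col+NUM_WARPS, ... within a per-block column tile" and "grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)".

```python
def columns_for_warp(block_z, warp, num_warps=16, cols_per_warp=8):
    """Columns one warp visits inside its block's tile (assumed indexing)."""
    tile = num_warps * cols_per_warp      # columns covered by one block
    base = block_z * tile                 # first column of this block's tile
    # col, col + NUM_WARPS, col + 2*NUM_WARPS, ... (COLS_PER_WARP entries)
    return [base + warp + j * num_warps for j in range(cols_per_warp)]


def all_assignments(s_v=128, num_warps=16, cols_per_warp=8):
    """Return (grid_z, every column visited by every warp of every block)."""
    grid_z = s_v // (num_warps * cols_per_warp)
    seen = []
    for z in range(grid_z):
        for w in range(num_warps):
            seen.extend(columns_for_warp(z, w, num_warps, cols_per_warp))
    return grid_z, seen


# Default tile from the log (NUM_WARPS=16, COLS_PER_WARP=8) on a 128-column state.
grid_z, seen = all_assignments()
covered_once = sorted(seen) == list(range(128))
```

Under these assumptions the default tile needs a single block along z, and each of the 128 columns is visited exactly once, which is the precondition for the per-column reduction order (and hence the md5) staying unchanged.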
Ettore Di Giacinto	8f8777e0f4	feat(paged): qwen35 decode conv-state in-place fusion (patch 0021) The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet recurrence is already single-pass at the f32 byte floor), the decode conv path was the only remaining bit-exact lever. New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated by a non-null src[3]). On the single-token decode path it replaces the four-op conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu + ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per (channel, sequence), assembles the width-K window in registers from the K-1 cached taps plus the current qkv_mixed token, computes the depthwise conv with the SAME ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv output, and writes the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018 in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and write target (the cache view) are disjoint buffers, so it is race-free by construction with no ids/identity logic. 
- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel, src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring; op_params[0]=fuse_silu) - ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel + ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv - ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels) + branch in ggml_compute_forward_ssm_conv - delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs conv-tap gather; fuses conv+silu+shifted write-back) - qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path (n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep the original chain - tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90. Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1 (0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5 ac163882... both BYTE-IDENTICAL. decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session: dense q36-27b-nvfp4 : npl 32 199.76 -> 202.99 (+1.6%) npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391) MoE q36-35b-a3b : npl 32 421.72 -> 432.39 (+2.5%) npl 128 689.74 -> 713.54 (+3.5%) Lift holds in eager too (dense npl128 333.62 -> 342.97). Step -11.9 ms/step (dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152); conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state conv-cache plumbing. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 16:56:35 +00:00
Ettore Di Giacinto	5cec1a6a21	docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correction Synthesize the cross-engine bit-exactness and f32-preserving-parity study. Resolve the contradiction between sub-agents (one f32, two bf16) by reading every link of vLLM's state-dtype chain on live source: - config.json text_config.mamba_ssm_dtype = "float32" (both served models) - cache.py default mamba_ssm_cache_dtype = "auto"; bench passes no override - vllm.py __post_init__ -> try_verify_and_update_config (config finalize) - Qwen3_5ForConditionalGenerationConfig override copies "float32" into mamba_ssm_cache_dtype before state-dtype resolution - mamba_utils._mamba_state_dtype -> temporal = torch.float32 (conv = bf16) - qwen_gdn_linear_attn allocates the temporal cache at f32 Verdicts: B1 TRUE (sub-claim 'more efficient than vLLM' refuted); B2 REFUTED (equal f32 bytes both sides, ~10pct efficiency gap not 2x width); B3 REFUTED (vLLM hits throughput with f32 state; a bit-exact occupancy/coalescing retune of gated_delta_net_cuda 74->81pct peak is the f32-preserving parity lever); B4 CONFIRMED (bit-exact-vs-vLLM impossible: A1 FP4 GEMM 8/4/16-bit operand gap + A2 recurrence g.Sigma vs Sigma.g reassociation on different reduction trees, plus general FP non-associativity). bf16 temporal state degrades BELOW vLLM's f32 recurrent precision -> an over-clock, not a parity requirement. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 16:55:25 +00:00
Ettore Di Giacinto	17855735c7	docs(paged): bf16 SSM-state build plan (PART C synthesis: edits, KL gate, bench, risks) Synthesizes the bf16 SSM recurrent-state-cache plan into a build-agent brief: ordered file-by-file edit list (kernel/op dtype-generic first, then cparams default flip, gRPC/YAML, back-compat), the KL<1e-3 + PPL-delta + coherence + long-context-drift acceptance gate that REPLACES the bit-exact md5 gate (bf16 is intentionally non-bit-exact, equal precision to vLLM), bench targets (recurrence 3.98->2-3 ms/call, step 384->289-339 ms, 360-443 tok/s dense) + nsys check, the default-bf16/f32-opt-out semantics + state-file back-compat, the risk register, and the single biggest risk (silent corruption on the prefill/keep_rs_t/gather paths) with the de-risk-first test-backend-ops step. Conv state stays f32 in v1. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 16:46:59 +00:00
Ettore Di Giacinto	2a8103c419	docs(paged): FINAL DECISION - NO-BUILD fused recurrence, BUILD conv fusion + bf16 state Synthesis of the byte-gate workflow (ncu-byte-gate measurement + vllm-fused-recurrence-study + llama-fused-recurrence-design + conv-fusion-design). Verdict closes all five decision points: (1) Byte ratio: llama re-stream ~1.0x (cap <=1.33x); recurrence at 74% GB10 peak, MORE BW-efficient than vLLM packed_decode at 41%. The 2x DRAM gap is 100% f32-vs-bf16 state-cache width, not extra passes. (2) Fused single-pass recurrence: NO-BUILD - already one R + one W of f32 state, gate ops touch tiny q/k/g/beta not the 805 MB state -> recovers ~0 bytes. (3) Conv-state in-place fusion: GO - bit-exact, no-regret, +12-14 ms/step (~+3%), eliminates concat_cont + cpy_scalar + folds silu. (4) bf16 SSM state: BUILD (KL<1e-3 gated product call) - only lever on the dominant 50% recurrence term, +45-95 ms/step -> step 289-339 ms = parity-to-ahead of vLLM. Bit-exact parity unreachable on this term (f32 bytes irreducible); bf16 = equal precision to vLLM, which is itself bf16. (5) Build order: conv fusion next (no-regret, bit-exact), then bf16 state (highest value, gated). Confirming measurements stated per step. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:27:04 +00:00
Ettore Di Giacinto	fd4332e8f0	docs(paged): GDN recurrence byte-gate SETTLED - re-stream ~1.0x, build bf16 state not fused kernel Decisive measurement (ncu-byte-gate agent, DGX GB10). ncu HW DRAM counters were blocked (ERR_NVGPUCTRPERM, root-only NVreg param; no passwordless sudo), so the byte ratio was settled via CUPTI kernel timing + exact byte geometry: bytes moved <= peak_BW x duration caps the re-stream factor. llama gated_delta_net_cuda decode (B=128, f32 state): 3.98 ms/call, 805 MB R+W, 202 GB/s = 74% of GB10 peak. vLLM fused_recurrent_packed_decode (B=128, bf16 state): 3.62 ms/call, 402 MB R+W, 111 GB/s = 41% peak. Both single-pass (load-once/store-once, verified in source). llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x needs >peak BW = impossible). VERDICT: NO-BUILD the fused single-pass recurrence - the kernel is already single-pass, coalesced, and MORE bandwidth-efficient than vLLM's triton kernel; the gate ops touch the tiny q/k/g/beta projections, not the 805 MB state, so fusion recovers ~0 state bytes. The entire 2x DRAM gap vs vLLM is f32 (llama) vs bf16 (vLLM) state-cache width. BUILD bf16 SSM state instead: halves 805->413 MB, ~45-95 ms/step, step 384 -> 289-339 ms = parity-to-ahead of vLLM 327 (non-bit-exact vs f32 but equal to vLLM's own bf16 precision). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:24:49 +00:00
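The bandwidth figures in the byte-gate entry follow from bytes moved per call divided by kernel time, compared against the 273 GB/s GB10 peak quoted elsewhere in this log. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 1e6 bytes, 1 GB = 1e9 bytes); the helper names are illustrative:

```python
GB10_PEAK_GBPS = 273.0  # peak DRAM bandwidth quoted in the log


def effective_bandwidth(mb_moved, ms_per_call):
    """GB/s for a kernel that moves mb_moved MB (read + write) in ms_per_call ms."""
    return (mb_moved * 1e6) / (ms_per_call * 1e-3) / 1e9


def percent_of_peak(mb_moved, ms_per_call, peak=GB10_PEAK_GBPS):
    """Effective bandwidth as a percentage of the peak."""
    return 100.0 * effective_bandwidth(mb_moved, ms_per_call) / peak


# Inputs taken from the commit message above.
llama_gbps = effective_bandwidth(805, 3.98)   # f32 state, gated_delta_net_cuda
vllm_gbps = effective_bandwidth(402, 3.62)    # bf16 state, packed_decode
llama_pct = percent_of_peak(805, 3.98)
vllm_pct = percent_of_peak(402, 3.62)
```

Rounded to whole numbers these reproduce the quoted 202 GB/s (74% of peak) for llama and 111 GB/s (41% of peak) for vLLM.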
LocalAI [bot]	f72046b5b5	fix(auth): make advisory locks dialect-aware and harden SQLite DSN (#10509) * fix(auth): make advisory locks dialect-aware and harden SQLite DSN Fixes #10506. Two failures hit deployments that use the default SQLite auth database: 1. advisorylock executed PostgreSQL-only SQL (pg_advisory_lock / pg_try_advisory_lock) unconditionally. On a SQLite auth DB the job store, agent store and node registry migrations failed with "no such function: pg_advisory_lock". WithLockCtx/TryWithLockCtx now branch on the gorm dialect: PostgreSQL keeps the cross-process advisory lock, every other dialect uses a context-aware, per-key in-process lock (a SQLite auth DB is effectively single-process, so serializing within the process is sufficient). 2. The SQLite auth DSN set no busy timeout, so transient SQLITE_BUSY over network-backed storage (SMB/CIFS/NFS, e.g. Azure Files) failed the auth migration immediately with "database is locked". The DSN now sets _busy_timeout=5000 and _txlock=immediate (caller-supplied values are preserved). WAL is intentionally not enabled since its shared-memory mmap does not work over network filesystems. Docs note that PostgreSQL should be used when the data directory lives on shared storage. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * test(jobs): regression test for #10506 SQLite job store migration Exercises the exact caller chain that failed in the issue: auth.InitDB(sqlite) -> jobs.NewJobStore -> advisorylock.WithLockCtx -> AutoMigrate. Before the dialect-aware advisory lock fix this failed with "no such function: pg_advisory_lock"; the test now asserts it migrates cleanly on a SQLite auth DB. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 17:18:55 +02:00
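The DSN hardening in the auth fix above (add `_busy_timeout=5000` and `_txlock=immediate` while preserving caller-supplied values) amounts to a "set default unless present" merge on the query string. The sketch below shows that pattern in Python for illustration only. The actual change lives in LocalAI's Go code, and the function name `with_sqlite_defaults` and the parsing approach are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode

# Defaults named in the commit message; applied only when the caller did not set them.
DEFAULTS = {"_busy_timeout": "5000", "_txlock": "immediate"}


def with_sqlite_defaults(dsn: str) -> str:
    """Return dsn with the default parameters added, keeping caller-supplied ones."""
    path, _, query = dsn.partition("?")
    params = dict(parse_qsl(query, keep_blank_values=True))
    for key, value in DEFAULTS.items():
        params.setdefault(key, value)   # never overwrite a caller-supplied value
    return path + "?" + urlencode(params)


plain = with_sqlite_defaults("auth.db")
custom = with_sqlite_defaults("file:auth.db?_busy_timeout=100")
```

In this sketch a bare path gains both defaults, while a DSN that already sets `_busy_timeout` keeps its own value and only gains `_txlock`.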
Ettore Di Giacinto	5825b073a5	docs(paged): SYNTHESIS - validated decode-parity picture, ranked plan, verdict Final synthesis of the critical-path gap analysis: the decode step is 99.94% GPU-busy single-stream (idle 0.225ms = 0.06%), so the 14% gap to vLLM is kernel GPU-time dominated by the bandwidth-bound gated_delta_net recurrence (196.37ms = 51.6%), not launch bubbles. Claims A/B/C all REFUTED as worded; the single residual is the unmeasured DRAM byte ratio of llama's recurrence vs vLLM's fused kernel. Ranked plan: single-pass fused GDN recurrence (gap-closer, gate on ncu byte-ratio test) + conv-state concat fusion (no-regret +2-3%, bit-exact); gate-fold alone tops out at ~89% of vLLM; bf16 state is the only floor-mover but breaks bit-exactness. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:03:18 +00:00
Ettore Di Giacinto	a72385257a	docs(paged): decisive node-level decode timeline gap - bubbles refuted Fresh nsys --cuda-graph-trace=node capture of one steady decode step on q36-27b-nvfp4 dense at npl128 (clean Lever-1 build-cuda-base). The decode step is a single CUDA graph; node-level expansion shows it is 99.94% GPU-busy on a single stream with 0.225 ms/step inter-kernel idle (0.06%, zero gaps >5us). This refutes the "~60% idle bubbles / 57 ms = 100% bubble" hypothesis and confirms the cudagraph-coverage source verdict. Real decode mix: gated_delta_net 196 ms = 51.6% of the step (4.08 ms/call x48; the prior 1.47 ms/call "near-vLLM" was a prefill-contaminated eager average), FP4 GEMM+quantize 29%, gating glue (Lever 3 target) only 3.35%, gdn_gather 0.06 ms. By roofline-decode's own sizing test (idle < 57 ms => gap is elsewhere) the 14% gap to vLLM lives in kernel GPU-time, dominated by the bandwidth-bound GDN recurrence, not in bubbles; Lever 3 fusion is resized to ~3% and reframed as byte-reduction, not bubble removal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 14:57:37 +00:00
Ettore Di Giacinto	2b57997df0	docs(paged): cudagraph-coverage - GDN serial chain IS graph-covered at B=128 Determine whether the ggml CUDA graph covers the gated-DeltaNet serial chain at batch=128. It does: nothing in the GDN region forces graph-disable (check_compability lists only split-buffers and large-batch MUL_MAT_ID), and the recurrent head is constant for a steady 128-seq batch so the inplace_ids state_dst offset + rs_head op_param + SSM input shapes are stable across steps. The fused op does no host-sync / capture-time cudaMalloc. The only re-warm is the per-256-token full-attention block-table cadence (not a GDN op). The ~40% util is bandwidth-roofline (SSM state traffic 66% of step bytes), not launch-gap idle - so no GDN graph-safe lever; the only non-covered idle is the ~0.4% between-step host cgraph rebuild. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 14:45:51 +00:00
Ettore Di Giacinto	e597a8ac78	docs(paged): vLLM GDN decode = 2 fused kernels under CUDA graph vs llama ~8 ops Read-only source comparison of the gated-DeltaNet decode region. vLLM folds conv-silu, q/k l2norm, scale, softplus+A_log gate, sigmoid-beta, the delta-rule recurrence and the SSM state write-back into ONE Triton kernel (fused_recurrent_gated_delta_rule_packed_decode), with the output gate fused into a gated rms_norm, and captures the whole decode forward in a full CUDA graph (GDNAttentionMetadata UNIFORM_BATCH, decode-only full cudagraph). llama runs the same region as ~8 separate host-launched, serially-dependent ggml nodes. That launch/bubble delta - not GEMM throughput - is the candidate 62%-vs-40% busy gap. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 14:43:01 +00:00
LocalAI [bot]	79783120dd	fix(config): gate parallel-slot default on per-device VRAM too (#10485 ) (#10507 ) The first #10485 fix (#10494) made the Blackwell physical-batch boost per-device/context-aware, which neutralized the big compute-buffer OOM, but the reporter's 2x16 GiB consumer Blackwell still OOM'd. Tracing the post-fix log: the model now loads its weights, builds the main context and warms up fine, and dies only on the last allocation — the MTP draft context's 800 MiB KV cache on the tighter device. #10411 changed only two defaults: the physical batch (now gated) and a VRAM-scaled parallel-slot count. The KV cache is unified (n_ctx_seq == full context proves slots share the budget, so parallel doesn't multiply KV), but n_seq_max=4 still adds per-slot compute-graph / context-checkpoint / output scratch. On a device packed ~99% by a 27B model spanning both cards, that overhead is the few-hundred-MiB straw — which is why reverting #10411 (and only #10411) restores a working load. Gate the parallel-slot default on the same per-device headroom predicate as the batch boost: when a large context already fills a single card (largeContextForDevice), keep n_parallel=1. A user running one big-context model that barely fits across two consumer GPUs is not serving four concurrent tenants. Small contexts and large unified-memory devices (GB10) keep full concurrency. Applied on both the single-host path and the distributed router. Also make the auto-tuning visible and reversible (the debugging here needed DEBUG logs and a git bisect): - Log the effective performance-relevant runtime options at INFO once per model load ("effective runtime tuning …": context, n_batch, n_gpu_layers, parallel, flash_attention, f16) so an admin can see what will run and pin or override any value in the model YAML. - LOCALAI_DISABLE_HARDWARE_DEFAULTS=true skips the hardware auto-tuning entirely (mirrors LOCALAI_DISABLE_GUESSING) for stock llama.cpp behavior. 
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:48:23 +02:00
LocalAI [bot]	4ac67d255d	feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) (#10497) * feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on x86 with a single grpc-server plus the dlopen-able libggml-cpu-.so set that ggml's backend registry selects at runtime by probing host CPU features. One build instead of four, broader microarch coverage (adds alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the shell-side /proc/cpuinfo probing in run.sh goes away. Build/link notes: - CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so ggml/llama become shared objects. SHARED_LIBS is now a make variable (default OFF) so the override survives the recursive sub-make into the VARIANT build dir instead of being re-clobbered by the base flags. - The cpu-all target also builds "--target ggml": the per-microarch backends are runtime-dlopened, not link deps, so they only compile via ggml's add_dependencies(). - hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would otherwise become a DSO referencing hidden-visibility symbols in the static libprotobuf.a, which fails to link ("hidden symbol ... is referenced by DSO"). Keeping it static links gRPC/protobuf into the executable while only ggml/llama stay shared, so no PIC or base-image change is required. - package.sh bundles the libggml-.so set into package/lib; ggml finds them by scanning the bundled ld.so directory (/proc/self/exe), which run.sh launches from. Scope: x86 only. arm64/darwin keep the single fallback build. The ik-llama-cpp / turboquant forks and the other ggml C++ backends are unchanged; the same recipe applies but is out of scope here. Validated with a full docker build plus a live inference smoke test: the model loads, ggml selects the AVX512_BF16 variant on a Zen-class host, and tokens generate correctly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant - llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build (only hipblas keeps the fallback build). ggml's arm64 variant table (armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime. - turboquant: same recipe via a turboquant-cpu-all target. turboquant copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL flags and --target ggml through, then collects the .so set. run.sh and package.sh updated to ship/select turboquant-cpu-all. - Makefile lib-collection find now also matches .dylib (for the darwin build, which emits dylibs rather than .so). ik-llama-cpp is intentionally left unchanged: its pinned ggml has no CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the per-microarch dynamic backend set does not apply. Scope still excludes the darwin packaging wiring (separate change). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging - arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with gcc-14 (installed in the compile step). The host only selects a variant it actually supports at runtime, but every variant must still compile. - darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds ggml-metal). 
The per-microarch libggml-cpu-.dylib are placed in the package root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib) ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a .so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/ llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends go in the package root for ggml's executable-directory scan, .dylib core libs go in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the variants. Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model loads and generates correct tokens. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so the GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the whole CPU microarch variant matrix on top of their already-huge GPU backend - blowing the build time (the sycl job was only 59% done after 2h11m) - and the arm64 l4t build failed at `apt-get install gcc-14` (exit 100) on the Jetson base. Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in .github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build gets a single fallback CPU grpc-server, since the accelerator does the compute. This also confines the arm64 gcc-14 step (needed for the armv9.2 SME variants) to the CPU build, away from the GPU base images. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only); only GPU images ship fallback-only. Fix the stale comment to match. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:47:03 +02:00
LocalAI [bot]	3a87d9e48f	feat(vllm): macOS/Metal support via vllm-metal (MLX) (#10489) * feat(vllm): macOS/Metal support via vllm-metal (MLX) Add an additive Apple-Silicon path to the existing vllm Python backend so vLLM runs on macOS via vllm-metal (github.com/vllm-project/vllm-metal). Spike outcome (proven on a real M4 / macOS 26.5, Qwen3-0.6B): - vllm-metal registers through vLLM's platform-plugin entry point (metal -> vllm_metal:register); MetalPlatform activates and runs on the GPU through MLX. - LocalAI's backend.py is UNCHANGED: AsyncEngineArgs(...) -> AsyncLLMEngine.from_engine_args transparently resolves to vLLM 0.23's v1 AsyncLLM MLX engine, and async generate produced correct output. - backend.py is NOT touched: its only empty_cache() call is CUDA-only (guarded by torch.cuda.is_available()), so the benign shutdown-only "Allocator for mps is not a DeviceAllocator" noise comes from vLLM's internal EngineCore teardown, not from our code. Changes (all gated behind a darwin condition; Linux/CUDA/ROCm/Intel paths are byte-for-byte unchanged): - install.sh: darwin branch forces PYTHON_VERSION=3.12 (vllm-metal requirement), creates/activates LocalAI's managed venv via ensureVenv, then reproduces vllm-metal's installer INTO that venv (build vLLM 0.23.0 from the release source tarball against requirements/cpu.txt, then install the prebuilt vllm-metal wheel from its latest GitHub release), and runs runProtogen. installRequirements is skipped on darwin. - backend-matrix.yml: add a vllm includeDarwin entry (mps, python). - index.yaml: add metal capability + concrete metal-vllm / metal-vllm-development child entries mirroring the metal-kitten-tts template. Version coupling: vllm-metal pins vLLM 0.23.0, equal to LocalAI's current vllm pin. Bumping vllm must be coordinated with a supporting vllm-metal release; documented in install.sh and requirements-cublas13-after.txt.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] * chore(vllm): track the darwin vllm-metal pin via the autobumper The Apple Silicon build pinned vLLM 0.23.0 as a hidden string in install.sh while floating the vllm-metal wheel on releases/latest - the two could drift apart silently. Make both a tracked, reproducible pair (VLLM_METAL_VERSION + VLLM_VERSION), fetch the wheel by tag, and add .github/bump_vllm_metal.sh wired into bump_deps.yaml. It tracks vllm-project/vllm-metal (not vllm/vllm latest), reading the coupled vLLM source version from vllm-metal's own installer, and opens a bump PR - mirroring the existing bump_vllm_wheel.sh for the cu130 wheel. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] * chore(vllm): derive the darwin vLLM version, drop the second pin Follow-up: VLLM_VERSION was still a hardcoded string duplicating what VLLM_METAL_VERSION already determines. Derive it at install time from vllm-metal's own installer (vllm_v=) at the pinned tag - one source of truth, no second value to drift. The bumper now touches only VLLM_METAL_VERSION; the derivation is immutable per tag, so builds stay reproducible. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] * fix(vllm): fetch the vllm-metal wheel without the GitHub API The darwin build resolved the wheel URL via api.github.com, whose unauthenticated rate limit (60/hr per IP) 403s on shared macOS runners (observed after the 9-min vLLM source build). Construct the release-asset download URL deterministically from the pinned tag and the cp312/arm64 wheel name instead - no API call, no rate limit. Verified the URL resolves (200). 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] * fix(vllm): fail Score cleanly when the engine returns no prompt_logprobs Audit of the Score path against vllm-metal (MLX on macOS): the engine accepts SamplingParams(prompt_logprobs=1) but returns an all-None prompt_logprobs list rather than computing it, so scoring is not supported there. The old guard treated the truthy [None] list as valid and silently scored every candidate as 0. Detect the all-None case and return UNIMPLEMENTED instead. No-op on Linux/CUDA, which populate real entries. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 15:46:19 +02:00
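Note on the Score guard described in the last bullet of 3a87d9e48f: the bug comes from Python treating a non-empty list as truthy even when every element is None. A minimal sketch of the all-None check follows, assuming the engine hands back a plain list. The function name `has_usable_prompt_logprobs` is hypothetical and is not taken from backend.py.

```python
from typing import Optional, Sequence


def has_usable_prompt_logprobs(prompt_logprobs: Optional[Sequence[object]]) -> bool:
    """Return True only if at least one entry carries real data.

    Hypothetical helper for illustration. A list such as [None] is truthy,
    so a plain `if prompt_logprobs:` check would accept it even though the
    engine computed nothing.
    """
    if not prompt_logprobs:
        # None or an empty list: nothing to score against.
        return False
    # Accept the list as soon as one entry is populated.
    return any(entry is not None for entry in prompt_logprobs)
```

A caller following the commit message would return UNIMPLEMENTED when this check is False instead of scoring every candidate as 0. A list whose first entry is None but whose later entries are populated still counts as usable under this sketch.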
Ettore Di Giacinto	b895f4dff8	feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020) Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling both engines pinned the largest llama-specific overage to the gated-DeltaNet output projection (ssm_out). The GDN op left its output in SSM layout and the graph reshaped it to 3D [value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the ssm_out weight read across the 128 sequences. vLLM packs the same projection into one M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D. The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs] (= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128 routes to the MMQ M=128 tensor-core GEMM. The result is then already 2D, so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical. Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs untouched. Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical. test-backend-ops MUL_MAT and MUL_MAT_ID OK. decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128): dense q36-27b: 170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%) MoE q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%) Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit). 
nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) at a LOWER per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call vs 2.77 ms/call for the old GEMV. Mirrors DGX dev-tree commit df1cc97. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 10:41:38 +00:00
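Note on the shape argument in b895f4dff8: the reshape changes only which dimension carries the 128 sequences, and that alone flips the kernel choice. The toy model below looks at `ne[1]` only, using the `<= 8` threshold quoted in the message. The real ggml-cuda dispatch checks more conditions than this, and the names `pick_kernel` and `collapse_to_2d` are hypothetical.

```python
MMVQ_MAX_BATCH = 8  # threshold quoted in the commit message (ne[1] <= 8)


def pick_kernel(src1_ne):
    """Toy dispatch: choose the kernel family from src1->ne[1] alone."""
    return "MMVQ" if src1_ne[1] <= MMVQ_MAX_BATCH else "MMQ"


def collapse_to_2d(ne):
    """Fold [value_dim, n_seq_tokens, n_seqs] into [value_dim, n_seq_tokens * n_seqs].

    The element count is unchanged, which is why the commit describes the
    result as the same contiguous data seen through a 2D view.
    """
    value_dim, n_seq_tokens, n_seqs = ne
    return (value_dim, n_seq_tokens * n_seqs)


before = (6144, 1, 128)          # 3D GDN output at decode, ne[1] = 1
after = collapse_to_2d(before)   # 2D view, ne[1] = 128

print(pick_kernel(before), pick_kernel(after))
```

With the decode shapes from the message, the 3D layout lands on the GEMV-style path and the 2D layout lands on the M=128 GEMM path.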
Ettore Di Giacinto	c0e0ed3865	docs(paged): synthesize decode-parity exploration - the o_proj MMVQ lever Cross-check the adversarial validation against the profiler ground-truth and finalize DECODE_PARITY_EXPLORE.md. The post-SSM 254->391 decode gap is one llama-specific defect: the gated-DeltaNet output projection (ssm_out) runs as an FP4 GEMV (mul_mat_vec_q, 132 ms/step = 26% of decode) at batch 128 instead of a tensor-core MMQ GEMM. Mechanism confirmed at source: final_output is 3D [6144,1,n_seqs] so src1->ne[1]=1 trips the MMVQ dispatch (<=8), with the 128 sequences in ne[2]. vLLM packs the same projection into a cutlass M=128 GEMM. GDN recurrence is only +11%/call (not the lever); P2a optimized the wrong FP4 kernel (the 17% MMQ, not the 26% MMVQ); CUDA graphs, host loop, and DRAM bytes are all ruled out. Decode parity is reachable in software (not a hardware floor): identical bytes/floor, vLLM hits 62% util vs llama 40% on the same GB10. Highest-value next step (~free, bit-exact): collapse final_output to 2D before ssm_out so M=128 routes to MMQ. Ranked levers + cumulative ceilings toward 391 documented. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 09:06:50 +00:00
Ettore Di Giacinto	ee13fd18ce	docs(paged): profile-both-engines post-SSM ground-truth decode decomposition Fresh post-SSM nsys of llama (build-cuda-base, patch 0019) AND vLLM 0.23.0 at npl128 decode. Reproduces the 391 reference (vLLM 394 t/s eager / 420 graphs, graphs +6% only) and confirms llama 245 t/s. Both ~98% GPU-busy; the gap is GPU kernel-time, not idle/host/graphs. GDN compute comparable (llama 4.03 vs vLLM 3.62 ms/call, +11%). bytes/step: llama not higher (131 vs 85 MB memcpy; SSM-fix 18GB/step DtoD removal confirmed in-trace). Single biggest llama-specific overage = FP4 matmul path 236 vs 117 ms/step (+119 ms = 64% of the gap), dominated by mul_mat_vec_q (FP4 GEMV at batch 128, 132 ms/step, 26%, one per GDN layer). Track B optimized the wrong FP4 kernel (mul_mat_q, not the GEMV). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 08:56:37 +00:00
LocalAI [bot]	693e3eec05	chore(model gallery): 🤖 add 1 new models via gallery agent (#10505) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-25 08:11:52 +02:00
LocalAI [bot]	f1e5071321	chore: ⬆️ Update leejet/stable-diffusion.cpp to `8caa3f908ae6d4a4bef531e73b9a969f266a3d1f` (#10493) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-25 08:11:31 +02:00
LocalAI [bot]	93d6255de3	chore: ⬆️ Update ggml-org/llama.cpp to `8be759e6f70d629638a7eb70db3824cbdcea370b` (#10501) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-25 08:11:17 +02:00
LocalAI [bot]	fe4f425fb5	fix: correct scheme/host on self-referential URLs behind an HTTPS reverse proxy (#10482) (#10504) * fix(http): harden BaseURL proxy scheme/host detection Split comma-separated X-Forwarded-Proto and honor the RFC 7239 Forwarded header so generated links use https behind common reverse-proxy setups. Refs #10482 Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(http): honor explicit external base URL in BaseURL When _external_base_url is set in the request context it dictates the origin (scheme+host+port); the proxy path prefix is still appended. Refs #10482 Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(config): generalize LOCALAI_BASE_URL to ExternalBaseURL LOCALAI_BASE_URL now sets a single instance-wide external base URL used for OAuth callbacks and all self-referential links. A Pre middleware stamps it into the request context for middleware.BaseURL. Refs #10482 Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs: document LOCALAI_BASE_URL and reverse-proxy headers Refs #10482 Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(http): cover parseForwarded edge cases; clarify base-url flag group Adds direct unit coverage for quoted/malformed/multi-element Forwarded headers and regroups the external base URL flag away from auth-only. Refs #10482 Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 08:10:59 +02:00
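Note on the header handling in fe4f425fb5: the sketch below illustrates the two parsing steps the message names, written in Python for illustration and not taken from the Go implementation. It assumes that the first comma-separated element is the client-facing hop, which the commit message does not state. The comma split is naive, so a quoted value that itself contains a comma would be mis-split. Both function names are hypothetical.

```python
def first_forwarded_proto(x_forwarded_proto: str) -> str:
    """Take the first entry of a comma-separated X-Forwarded-Proto value."""
    return x_forwarded_proto.split(",")[0].strip().lower()


def parse_forwarded(header: str) -> dict:
    """Parse the first element of an RFC 7239 Forwarded header.

    Returns a dict of lowercase parameter names to values, with surrounding
    double quotes removed. Pairs without '=' are skipped.
    """
    first_element = header.split(",")[0]
    params = {}
    for pair in first_element.split(";"):
        name, sep, value = pair.partition("=")
        if not sep:
            continue  # malformed pair, no '=' present
        value = value.strip()
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]  # strip the quoted-string delimiters
        params[name.strip().lower()] = value
    return params


fwd = parse_forwarded('for=192.0.2.60;proto=https;host="example.com"')
print(fwd.get("proto"), fwd.get("host"))
```

A base-URL builder could then prefer an explicit external base URL, fall back to the parsed `proto` and `host`, and only after that use the request's own scheme and host.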
LocalAI [bot]	fae9f6356f	chore: ⬆️ Update ServeurpersoCom/qwentts.cpp to `9dbe7ea26a01b30fccb117ae5e86807c1dc23d42` (#10499) ⬆️ Update ServeurpersoCom/qwentts.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-25 08:10:41 +02:00
LocalAI [bot]	066abf82c0	feat(llama-cpp): cpu_moe/n_cpu_moe options + generic upstream-flag passthrough (#10490) * feat(llama-cpp): add main-model cpu_moe/n_cpu_moe options Mirror the existing draft_cpu_moe/draft_n_cpu_moe siblings for the main model, matching upstream --cpu-moe / --n-cpu-moe (common/arg.cpp). Lets users keep MoE expert weights on CPU to manage VRAM on large MoE models. Closes part of #10483 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(llama-cpp): forward unknown '-' options to upstream arg parser Any options: entry starting with '-' is collected and passed verbatim to llama.cpp's own common_params_parse (LLAMA_EXAMPLE_SERVER) at the end of params_parse, so every upstream llama-server flag works without a new hand-wired branch. Passthrough runs last and wins on overlap; n_parallel is snapshotted to survive parser_init's SERVER reset, and help/usage/completion flags are skipped to avoid exiting the backend. Closes #10483 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(llama-cpp): document cpu_moe/n_cpu_moe and option passthrough Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(llama-cpp): terminate tensor/kv override vectors after passthrough The tensor_buft_overrides padding and the kv/draft override terminators ran before the generic option passthrough, so a passthrough flag (--cpu-moe, --override-tensor, --override-kv, ...) appended a real entry after the null sentinel - tripping the model loader's back().pattern == nullptr assertion (crash) or being silently dropped. Move all three termination/padding blocks to the end of params_parse, after both the named-option loop and common_params_parse have pushed their real entries. Also widen the exit()-flag skip list so --version, --license, --list-devices and --cache-list cannot terminate the backend.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 08:10:08 +02:00
LocalAI [bot]	a7fec9a49d	feat(backends): add darwin/metal (MPS) build for trl (#10487) * feat(backends): add darwin/metal (MPS) build for trl Authors backend/python/trl/requirements-mps.txt and wires trl into the darwin CI matrix and gallery so the MPS training path can be built and validated on Apple Silicon. The MPS variant installs plain PyPI torch wheels (MPS-capable on macOS arm64) and the trl training stack; bitsandbytes is omitted as it is a CUDA-only dependency with poor Apple Silicon support. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] * fix(trl): guard uv-only --index-strategy for the pip/darwin path The darwin/MPS build installs with pip (USE_PIP=true), which rejects the uv-only --index-strategy flag and failed the darwin backend build. Add it only on the uv path; Linux/CUDA resolution is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 08:09:36 +02:00
LocalAI [bot]	c678530cf0	fix(backends): darwin/metal support across purego Go backends (#10481) * fix(parakeet-cpp): darwin/metal support (libparakeet.dylib + DYLD path) The parakeet-cpp backend had no macOS support and panicked at startup on Apple/Metal nodes when purego.Dlopen could not find "libparakeet.so". Fix it across the same four layers the sibling voxtral backend already handles correctly: - main.go: default the dlopen target to libparakeet.dylib on darwin (runtime.GOOS), libparakeet.so elsewhere; PARAKEET_LIBRARY still wins. - Makefile: also stage the built libparakeet.dylib next to the Go sources. - package.sh: accept either the Linux .so[.X.Y] or the macOS .dylib when bundling instead of hard-failing when no .so is present (the macOS case); note that on Darwin only system frameworks are linked. - run.sh: on Darwin set DYLD_LIBRARY_PATH and PARAKEET_LIBRARY to the packaged .dylib; keep LD_LIBRARY_PATH + .so on Linux. Mirrors backend/go/voxtral. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backends): darwin/metal support across purego Go backends The parakeet-cpp fix in the previous commit was an instance of a bug shared by nearly every purego/dlopen Go backend: the dlopen target was hardcoded to a .so name and run.sh exported only LD_LIBRARY_PATH, so the backend panicked at startup on macOS/Apple-Metal nodes (dyld needs the .dylib name and DYLD_LIBRARY_PATH). voxtral was the only backend handling this correctly. Apply the same four-layer fix (mirroring backend/go/voxtral) to the remaining affected backends: whisper, sherpa-onnx, ced, stablediffusion-ggml, vibevoice-cpp, qwen3-tts-cpp, omnivoice-cpp, crispasr, acestep-cpp, locate-anything-cpp, depth-anything-cpp, rfdetr-cpp, sam3-cpp, localvqe Per backend: - main.go (sherpa-onnx: backend.go, two libraries): default the dlopen target to the .dylib on darwin (runtime.GOOS), .so elsewhere; the existing <BACKEND>_LIBRARY env override still wins.
- run.sh: on Darwin set DYLD_LIBRARY_PATH and point <BACKEND>_LIBRARY at the packaged .dylib; keep LD_LIBRARY_PATH + the Linux CPU-variant (avx/avx2/avx512) selection unchanged in the else branch. - package.sh: also bundle the .dylib and stop hard-failing when no .so is present (the macOS case). - Makefile: also stage the built .dylib. Notes: - stablediffusion-ggml and acestep-cpp build their lib as a CMake MODULE, which emits .so (not .dylib) on macOS; run.sh prefers .dylib and falls back to .so so both layouts work. - sherpa-onnx was already partly darwin-aware (Makefile/package.sh); only run.sh and the two dlopen defaults needed fixing. Linux behavior is unchanged. Verified gofmt-clean and `CGO_ENABLED=0 go build` for every backend. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 08:09:18 +02:00
Ettore Di Giacinto	6f0792c3be	feat(paged): qwen35 SSM decode fused recurrent-state gather (patch 0019) Mirror of the llama-paged-dev patch 0019 engine change plus the measured results. Step 2 of the SSM decode work: after Step 1 (in-place state write-back, patch 0018) the largest non-GEMM decode bucket was the recurrent-state get_rows gather (18.8 percent of decode GPU time). This removes that materialization, mirroring ggml_ssm_scan's ids source: ggml_gated_delta_net_inplace_ids reads each sequence's prior state directly from cache[ids[seq]] (src[5] = full cache, src[7] = ids), so combined with Step 1's in-place write the op reads AND writes the cache directly with no state materialization at all. Race-free by construction: identity sequences (ids[seq] == rs_head + seq, the whole AR decode path) read s0 in place from the destination slot; non-identity sequences (reorder / rs_zero, e.g. multi-new-seq prefill) read from a disjoint scratch a small gather kernel populates first. ids stays a device pointer. Bit-identical to the get_rows path. Gated to qwen35 + qwen35moe; qwen3next, kimi-linear, the non-fused and rollback paths are unchanged. Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off): q36-27b-nvfp4 dense: npl32 137.64 -> 170.68 (+24.0 percent), npl128 186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391). q36-35b-a3b-nvfp4 MoE: npl32 299.68 -> 366.69 (+22.4 percent), npl128 409.30 -> 553.63 (+35.3 percent). Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build (dense + MoE). nsys k_get_rows_float bucket 18.8 -> 0.7 percent. The residual decode gap to vLLM is now the FP4 GEMM (~48 percent of decode). See SSM_DECODE_FIX_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 23:47:51 +00:00
LocalAI [bot]	3c63431e46	chore: ⬆️ Update ServeurpersoCom/omnivoice.cpp to `0f37401bebe9b20c0160a888e592108fc1d17607` (#10492) ⬆️ Update ServeurpersoCom/omnivoice.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-25 00:57:58 +02:00
LocalAI [bot]	3f647a2764	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `d5507e33ae7ee2b7b41475f08044d3bde3b839ee` (#10498) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-25 00:57:42 +02:00
Ettore Di Giacinto	5ce2f1df51	feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (patch 0018) Mirror of the llama-paged-dev patch 0018 engine change plus the measured results. Per SSM layer per step decode no longer D2D-copies the full ~225 MB recurrent state into the cache: the fused gated_delta_net op writes the final state in place at the active sequence's cache slot (new ggml_gated_delta_net_inplace, src[6] = state_dst), mirroring vLLM fused_recurrent_gated_delta_rule. SSM math unchanged (bit-identical greedy). Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on): q36-27b-nvfp4 dense: npl32 113.74 -> 136.39 (+19.9 percent), npl128 146.23 -> 180.53 (+23.5 percent, = predicted copy-removal ceiling). q36-35b-a3b-nvfp4 MoE: npl128 313.36 -> 372.62 (+18.9 percent). nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share (391 @128) 37.4 -> 46.2 percent. See SSM_DECODE_FIX_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 22:45:49 +00:00
LocalAI [bot]	f88981cdce	feat(ui): data-driven hardware model recommendations + gallery surfacing (#10500) * feat(ui): make hardware starter models data-driven The empty-state starter widget recommended from a hardcoded list, which drifts as the gallery evolves. Add useRecommendedModels: it queries the live gallery for chat-capable models (their natural curated order, since the gallery exposes no popularity signal), estimates size/VRAM for the top candidates via the existing estimate endpoint, and ranks by hardware fit - smallest on CPU-only boxes, largest-that-fits on GPUs. StarterModels now renders those live picks and keeps the curated static list only as an offline/trimmed-gallery fallback. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(ui): recommend models for your hardware in the gallery Hardware-aware recommendations were only shown on the first-run empty state. Surface them on the main Models gallery too: a dismissible "Recommended for your hardware" strip at the top, sharing the useRecommendedModels fit-ranking with the starter widget. CPU-only boxes get small models; GPUs get the largest picks that fit VRAM, with size and VRAM shown per card. One-click install; dismissal persists per browser. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(ui): gpu-mid tier + NVIDIA NVFP4 model recommendations Refine the hardware recommendation tiers and curated picks: - Add a gpu-mid tier (8-24GB VRAM) between gpu-small and gpu-large, so ~27B-class models are suggested separately from the 30B+ large tier. - Detect NVIDIA GPUs (resources.gpus[].vendor) and, on NVIDIA only, prefer NVFP4 + MTP variants (Blackwell-optimised); NVFP4 models are filtered out of recommendations on non-NVIDIA hardware where they can't run. This applies to both the live ranking and the static fallback, with an NVFP4 badge shown on those picks.
- Refresh the curated fallback to current models: Gemma-4 QAT Q4 builds at every tier, low qwen3.5 (4B distilled / 9B) on CPU/small, qwen3.6-27b and MTP variants at mid, qwen3.6/qwen3.5 35B-A3B apex/distilled at large. All names verified against gallery/index.yaml. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 00:22:45 +02:00
LocalAI [bot]	0d6de15ae9	fix(config): per-device VRAM headroom for Blackwell defaults (#10485) (#10494) The hardware-tuned defaults from #10411 were measured on a GB10 / DGX Spark (128 GiB unified memory) and over-provisioned multi-GPU consumer Blackwell (e.g. 2x16 GiB RTX 50-series) into CUDA OOM during model init: - The Blackwell physical batch (512 -> 2048) sets both n_batch and n_ubatch. The compute buffer scales ~n_ubatch * n_ctx and is allocated PER DEVICE (it can't be split across GPUs), so a large context turns ub2048 into multi-GiB of scratch that must fit one 16 GiB card. - The VRAM-scaled parallel-slot default tiered off TotalAvailableVRAM(), which SUMS all GPUs (2x16 -> "32 GiB" -> 8 slots), but the allocations are per-device. Make both decisions per-device and context-aware: - xsysinfo.MinPerGPUVRAM() reports the smallest device's VRAM; localGPU() uses it so the parallel tier and batch guard reason about one card. - PhysicalBatchForContext(gpu, ctx) raises the batch only when the extra compute buffer fits VRAM/4 at this model's context (16 GiB crosses over ~174k ctx, 32 GiB ~349k; GB10 reports system RAM so it still clears it). - Apply hardware defaults AFTER runBackendHooks in SetDefaults so the GGUF-guessed context is resolved before the batch decision. - The distributed router gates the node batch the same way. Unified-memory devices (GB10, Apple) report system RAM as their single device's VRAM, so they keep the prefill win. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 00:07:48 +02:00
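Note on the per-device predicate in 0d6de15ae9: the sketch below illustrates two points from the message, that summing VRAM across GPUs overstates what one card can hold, and that the batch boost is allowed only when the extra compute buffer fits in a quarter of one device's VRAM. The per-token cost of 24 KiB is an assumption back-derived from the quoted crossovers (4 GiB / ~174k ctx); it is not a constant from the LocalAI source. All function names are hypothetical Python stand-ins for the Go helpers named in the message.

```python
GIB = 1024 ** 3

# ASSUMPTION: extra compute-buffer bytes per context token when the physical
# batch is raised. Back-derived from the "16 GiB crosses over ~174k ctx" figure.
ASSUMED_EXTRA_BYTES_PER_CTX_TOKEN = 24 * 1024


def total_vram(gpus_bytes):
    """Sum across devices: the figure the old slot tiering relied on."""
    return sum(gpus_bytes)


def min_per_gpu_vram(gpus_bytes):
    """Smallest single device: the figure per-device allocations must fit."""
    return min(gpus_bytes)


def batch_boost_fits(per_device_vram, n_ctx):
    """Allow the larger batch only if the extra buffer fits VRAM/4 on one card."""
    extra = n_ctx * ASSUMED_EXTRA_BYTES_PER_CTX_TOKEN
    return extra <= per_device_vram // 4


gpus = [16 * GIB, 16 * GIB]
print(total_vram(gpus) // GIB, min_per_gpu_vram(gpus) // GIB)
print(batch_boost_fits(min_per_gpu_vram(gpus), 32_768))
print(batch_boost_fits(min_per_gpu_vram(gpus), 200_000))
```

Under the assumed constant the threshold for a 16 GiB card falls between 174k and 175k context tokens, and for a 32 GiB card between 349k and 350k, consistent with the crossovers quoted in the message.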
Ettore Di Giacinto	34cadb64af	docs(paged): A.2 final synthesis - CUDA-graph decode verdict Append the four-point synthesis to A2_CUDAGRAPH_DECODE.md: measured CUDA-graph lever size (<1%, not the guessed 10-20%), the corrected 'eager' premise (default paged decode already captures), the unchanged 37-38% of vLLM at npl128, and the honest verdict that A.2 closes none of the 2.6x gap because paged attention touches ~0.4% of decode on this hybrid-SSM model. Residual lever is the qwen35 gated-DeltaNet SSM path (state D2D copy + get_rows gather), orthogonal to paged attention. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 21:45:42 +00:00
Ettore Di Giacinto	2dd5d68e6d	docs(paged): A.2 Phase 2 - locate the real decode lever (gated-DeltaNet SSM path) Phase 1 ruled out CUDA graphs as the paged-decode lever (GPU 99.4% busy, decode_agg flat graphs on-vs-off) and attributed the 2.6x gap to vLLM to the per-step GPU kernel work (FP4 GEMM + attention at batch 128). Phase 2 decomposed that kernel work directly on the Phase-1 nsys reps and corrects the attribution. Findings (q36-27b-nvfp4 = gguf arch qwen35, a 48:16 hybrid gated-DeltaNet linear-attention + full-attention model; DGX GB10 sm_121, fusion off): - Graphs re-confirmed not the lever: fresh paged graphs-ON 146.03 vs OFF 144.90 t/s (+0.78%, noise); the captured rep is 99.5% busy with the same ~3267ms memcpy (graphs capture memcpy nodes too). - The 99.4% busy is real but ~19% of it is D2D memcpy, not compute: an overlap-correct interval-union sweep gives kernels-only 80.2% busy, the gap filled by 1584 D2D copies/run (~80/step, ~230MB each = the gated-DeltaNet recurrent state). Phase 1's cuda_gpu_trace lumped this into compute. - Decode GPU-time decomposition (% of kernel+memcpy busy): gated_delta_net 23.4%, get_rows 21.9%, D2D state copy 18.9%, FP4 GEMV 15.5%, FP4 GEMM 10.4%, full attention 0.4%. Grouped: SSM/gated-DeltaNet machinery ~67%, FP4 matmul ~28%, full attention (all paged-attn optimizes) ~0.4%. Verdict: not graphs, not the host loop, not primarily FP4 GEMM, not attention. Paged attention touches ~0.4% of decode on this model, so no paged/graph/ block-table change can move decode_agg. The lever is the ggml qwen35 gated-DeltaNet decode: kill the per-layer recurrent-state D2D copy and fuse the get_rows gather into the recurrence (vLLM's fused_recurrent_gated_delta_rule keeps state in place). Ceiling: -copy ~146->180; -copy-and-gather ~146->247 t/s. No code patch (the lever is an SSM-path rewrite, orthogonal to paged attention); patches/paged/0018 stays free. 
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 21:44:22 +00:00
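The "overlap-correct interval-union sweep" in commit 2dd5d68e6d refers to measuring GPU busy time without double-counting kernels that run concurrently. A minimal sketch of that technique, assuming the trace has already been reduced to start/end timestamps (the `Interval` type and `unionBusy` name are hypothetical, not from the investigation's tooling):

```go
package main

import (
	"fmt"
	"sort"
)

// Interval is one kernel or memcpy span on the GPU timeline (any time unit).
type Interval struct{ Start, End int64 }

// unionBusy returns the total time covered by at least one interval.
// Overlapping spans are merged, so concurrent work is counted once.
func unionBusy(spans []Interval) int64 {
	if len(spans) == 0 {
		return 0
	}
	sorted := append([]Interval(nil), spans...) // leave the caller's slice untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })

	var busy int64
	curStart, curEnd := sorted[0].Start, sorted[0].End
	for _, s := range sorted[1:] {
		if s.Start > curEnd {
			// Gap found: close the current merged run and start a new one.
			busy += curEnd - curStart
			curStart, curEnd = s.Start, s.End
		} else if s.End > curEnd {
			curEnd = s.End // extend the current merged run
		}
	}
	return busy + (curEnd - curStart)
}

func main() {
	spans := []Interval{{0, 10}, {5, 15}, {20, 30}}
	var naive int64
	for _, s := range spans {
		naive += s.End - s.Start
	}
	fmt.Println("naive sum:", naive, "union:", unionBusy(spans))
}
```

Running the sweep once over kernels only and once over kernels plus memcpy spans is one way to separate the two busy figures the commit compares.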
LocalAI [bot]	5c3d48ab50	feat(ui): usage & UX enhancements (last-used model, polling, starter models, usage cost, a11y) (#10496) * feat(ui): remember last-used model per capability ModelSelector auto-selected the first option whenever the bound value was empty or stale, so every visit to the Home chat box, Image, TTS or Talk pages reset the choice to whatever sorted first. Persist the user's pick in localStorage keyed by capability and prefer it on auto-select when the model is still available, falling back to the first option otherwise. Because every modality picker funnels through ModelSelector, this fixes the friction everywhere at once. External-options callers pass no capability and keep the previous first-item behaviour. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(ui): add visibility-aware polling hook The app had 26 hand-rolled setInterval polls, none of which paused when the browser tab was hidden, so backgrounded dashboards kept hitting the server every few seconds for data nobody was looking at. Add usePolling: runs immediately, polls on a fixed interval, pauses while document.hidden, fires a catch-up poll on return, and guards against overlapping slow requests. Route useResources (the highest-frequency shared poll) through it. Further callers can be migrated incrementally. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(ui): hardware-aware starter models on empty home A fresh install dropped admins straight into a 1000+ model gallery with no guidance. Add a StarterModels widget to the empty-state wizard that recommends a small, curated set tuned to the detected hardware: - CPU-only machines (no GPU VRAM) are steered to genuinely small models (1-4B, Q4) that stay responsive without a GPU. - GPU machines get suggestions scaled to available VRAM.
Curated names are real gallery entries, intersected against the live gallery at render time so a trimmed/custom gallery degrades gracefully. Install is one click via the existing model-install API. Also routes Home's cluster and system-info polls through usePolling so a backgrounded home page stops fetching. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(ui): optional token-cost estimates on usage dashboard The usage dashboard tracked tokens but had no monetary view. Multi-user deployments that bill back or budget compute had to export and compute cost elsewhere. Add an opt-in pricing control: admins set $ per 1M prompt/completion tokens (stored per-browser). When set, an estimated-cost summary card and per-model / per-user cost columns appear, computed from recorded token counts. The entire cost surface stays hidden until a price is entered, so the default view is unchanged. Cost is clearly labelled an estimate - LocalAI itself has no notion of price. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(ui): label icon-only send buttons for screen readers The chat and agent-chat send buttons were a bare paper-plane icon with no accessible name, so screen readers announced only "button". Add an aria-label/title ("Send message") and mark the icon aria-hidden. An audit of all icon-only buttons found these were the only two unlabeled controls; the rest already carry visible text. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 23:30:08 +02:00
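The token-cost estimate described in commit 5c3d48ab50 is plain arithmetic over recorded token counts. The feature itself lives in the web UI; the sketch below only illustrates the formula in Go, and the `Pricing` type and function names are hypothetical.

```go
package main

import "fmt"

// Pricing holds the admin-entered prices in $ per 1M tokens.
// Both fields are zero until a price is entered.
type Pricing struct {
	PromptPerMillion     float64
	CompletionPerMillion float64
}

// IsSet reports whether any price was entered; per the commit message the
// cost surface stays hidden until one is.
func (p Pricing) IsSet() bool {
	return p.PromptPerMillion > 0 || p.CompletionPerMillion > 0
}

// EstimateCost converts token counts into an estimated dollar amount.
func EstimateCost(p Pricing, promptTokens, completionTokens int64) float64 {
	return float64(promptTokens)/1e6*p.PromptPerMillion +
		float64(completionTokens)/1e6*p.CompletionPerMillion
}

func main() {
	price := Pricing{PromptPerMillion: 0.5, CompletionPerMillion: 2}
	if price.IsSet() {
		fmt.Printf("estimated cost: $%.2f\n", EstimateCost(price, 2_000_000, 500_000))
	}
}
```

The result is only an estimate derived from whatever prices the admin typed in, which matches the commit's note that LocalAI itself has no notion of price.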
Ettore Di Giacinto	da67fd87e2	docs(paged): A.2 CUDA-graph decode lever measurement and gap diagnosis Phase 1 measures the CUDA-graph lever on the paged decode (q36-27b-nvfp4 dense, GB10 sm_121, fusion off). The 4-cell decode_agg {stock,paged} x {graphs on,off} is flat within ~1%: the graphs-on win is +0.13% at npl128 and +1.1% at npl32 (both within run noise). The default paged decode is not eager: it captures and replays graphs with a 256-token reset cadence identical to stock non-paged (block-table ne0 = GGML_PAD(n_gather,256) only steps at 256-token boundaries); only the gather fallback grows n_gather every step and runs pure eager. 'graphs reused=0' was a uid fast-path false negative (llama rebuilds the cgraph each step, so the reuse log never fires while the graph still replays via the instance path). nsys (reliable eager trace, plus the captured trace re-run with --cuda-graph-trace=node to defeat nsys omitting graph-internal kernels, an artifact that otherwise reads 0.3% busy) shows the steady decode is 99.4-99.5% GPU-busy. Idle is ~0.6% of the step: 0.37% within-step launch gaps (the only thing graphs remove, cut to 0.11% when captured) plus a 0.24% between-step host gap (~2ms per step). Throughput is identical on/off. Verdict: CUDA-graphing the paged decode is not a throughput lever; the decode is GPU-compute-bound and the 2.6x gap to vLLM (148 vs 391) is in the per-step GPU kernel work (FP4 GEMM + attention at batch 128), not launch overhead or the host loop. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 21:26:16 +00:00
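Commit da67fd87e2 notes that the block-table dimension `GGML_PAD(n_gather,256)` only steps at 256-token boundaries, which is why the captured graph can be replayed between boundaries. A small sketch of that rounding, with `ggmlPad` as a Go rendering of ggml's power-of-two round-up macro and `shapeChanges` as a hypothetical helper that counts how often the padded size changes as the sequence grows:

```go
package main

import "fmt"

// ggmlPad rounds x up to the next multiple of n, where n is a power of two.
// It mirrors the bit trick used by ggml's GGML_PAD macro.
func ggmlPad(x, n int) int {
	return (x + n - 1) &^ (n - 1)
}

// shapeChanges counts how many times the padded size changes while the
// gathered length grows one token at a time from 1 to maxTokens.
func shapeChanges(maxTokens, pad int) int {
	changes := 0
	prev := ggmlPad(1, pad)
	for n := 2; n <= maxTokens; n++ {
		if cur := ggmlPad(n, pad); cur != prev {
			changes++
			prev = cur
		}
	}
	return changes
}

func main() {
	fmt.Println("padded sizes:", ggmlPad(1, 256), ggmlPad(256, 256), ggmlPad(257, 256))
	fmt.Println("shape changes over 1024 tokens, pad 256:", shapeChanges(1024, 256))
	fmt.Println("shape changes over 1024 tokens, unpadded:", shapeChanges(1024, 1))
}
```

The unpadded case stands in for the gather fallback the message describes, where the size grows every step and no stable shape exists to replay.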

6951 Commits