Commit Graph

1504 Commits

Author SHA1 Message Date
Ettore Di Giacinto
13e6ee89c7 kernel: validate cuBLAS dead-end (sm_80 fallback) + W4A16 Marlin impl plan
Decisive DGX experiment: rebuilt with -DGGML_CUDA_FORCE_CUBLAS (it's a compile
#ifdef, not the runtime env we'd been setting - so prior 'cuBLAS no-op' tests
never engaged it). Real result: cuBLAS is SLOWER than MMQ for dense Q4 (pp2048
690 vs 750) and runs an Ampere cutlass_80_tensorop kernel - CUDA-13 has no sm_121
GEMM, falls back to sm_80. So both MMQ and cuBLAS sit at ~46 TFLOP/s; no library
shortcut to the 213 ceiling on GB10. Confirms a hand-tuned sm_120a kernel is
required. Added the phased W4A16 Marlin-style implementation plan (P0 harness ->
P5 enable) as the committed multi-week build; corrected the cuBLAS note.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 21:16:13 +00:00
Ettore Di Giacinto
76cc0b6abc docs(paged): phased plan to make llama.cpp a viable vLLM alternative
Phase 1 (config, PR #10411, DONE): VRAM-scaled n_parallel + Blackwell batch.
Phase 2: paged KV (PR #22569, ~9.5x concurrency). Phase 3: chunked prefill +
n_batch/ubatch split. Phase 4: batched-GEMM kernel tuning. Phase 5: backend
sampling. Cross-cutting: spec-dec for dense.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 09:35:53 +00:00
Ettore Di Giacinto
122df1c620 analysis: vLLM throughput gap decomposed - spec-dec is the per-user lever
Per-user decode is at parity without spec-dec (10.2 vs 11.7, bandwidth-bound).
vLLM's per-user speed = speculative decoding (lossless, target-verified). GB10 is
best-case (bandwidth-bound + idle compute); llama.cpp spec-dec measured 2.9x on
dense Qwen2.5-32B. Qwen3-32B has no native MTP - use Qwen3-1.7B draft or EAGLE3
head. Recommendation: make spec-dec easy for dense >=14B on Blackwell (keeps
Q4_K_M quality, no kernel). Prefill-kernel + continuous-batching are separate
(TTFT / aggregate). Our own DGX run pending (box rebooted, llama-cli hangs).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 08:40:20 +00:00
Ettore Di Giacinto
14e3da25b6 kernel: dense MXFP4 test = free 1.44x (765->1153) but FP4-MMA untuned (~17% of ceiling)
MXFP4 dense moves prefill off int8-MMQ onto the FP4-MMA path (existing kernel) for
a free 1.44x - shippable as a Blackwell dense-quant recommendation. But it's ~17%
of the FP4 roofline, so the FP4-MMA kernel is itself untuned: ~4-6x still in the
kernel. Sharpens the target to TUNING the FP4-MMA (serves dense+MoE, only path to
beat vLLM). Marlin-style W4A16 BF16 is the alt to match on the BF16 ceiling.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 07:48:29 +00:00
Ettore Di Giacinto
f5e9caece1 kernel: reframed Blackwell kernel-gap map (research + profiles)
Key corrections: (1) vLLM 24k is AGGREGATE; single-stream roofline ~3300 t/s
(BF16) / 6600 (FP4). (2) GB10 is 1:1:2 BF16:INT8:FP4 - INT8 == BF16, only FP4 is
2x. (3) Measured: dense int8-MMQ at 21% of ceiling, MoE FP4-MMQ at ~5% - both
EXIST, just untuned for Blackwell. Strategy: to MATCH vLLM, tune MMQ or build a
Marlin-style W4A16 BF16 GEMM (FP4 NOT required); to BEAT, fix the existing FP4
MMA on sm_121 (build/miscompile, not greenfield). Dropped the tcgen05 grouped
GEMM rewrite. Cheap next test: dense MXFP4 quant + existing FP4-MMA.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 07:21:56 +00:00
Ettore Di Giacinto
d2651c86d9 bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline
Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for
sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on
consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward
pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16
(4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-
apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would
be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer
#2577/#3294, cutlass #3096.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 06:59:50 +00:00
Ettore Di Giacinto
19742aee64 bench(dense): FORCE_CUBLAS no-op for dense too (720.8 vs 721.8) - every flag lever exhausted
Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel;
no config/flag shortcut remains.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 03:59:27 +00:00
Ettore Di Giacinto
ce60737fc5 kernel(doc): dense scope resolved - two FP4 kernels (dense first, then grouped)
Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a
non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both
share the e2m1 block-scaled collective; dense is grouped-with-one-group.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 03:56:33 +00:00
Ettore Di Giacinto
37cbc089b0 bench(dense): Qwen3-32B dense parity - dense has the kernel gap too (PP 7.6-32x)
vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765,
vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at
B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision:
the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just
the MoE grouped GEMM (dense GEMM is the simpler first kernel to land).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 03:55:58 +00:00
Ettore Di Giacinto
b7b2e8291c kernel(fp4-grouped-moe): scaffold the FP4 grouped-GEMM MoE dispatch (Lever 3)
The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37%
prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven).
Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu}
entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always
falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation
phases + parity harness + the dense-path follow-up (#28).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:44:31 +00:00
Ettore Di Giacinto
cb28deda6b bench(paged): decode profile overturns 'engine-addressable' - decode is 54.6% MoE GEMM too
Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both
phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM
gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:27:35 +00:00
Ettore Di Giacinto
2a500c371f bench(paged): fresh GB10 head-to-head vs vLLM - two distinct gaps
Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix).
Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the
engine's domain (0004 pool + 0005 continuous batching target it). Baseline for
the series to improve on.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:20:22 +00:00
Ettore Di Giacinto
48fbb9384f docs(paged): refine 0003 plan - used-cell gather, per-ubatch rebuild, single-stream first
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:14:25 +00:00
Ettore Di Giacinto
145e45b6f2 docs(paged): exact executable plan for 0003 gather-read
Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by
one aligned index; n_kv compaction; gated so stock stays byte-identical) with
the token-identical gate and the known risks (mask transpose layout, v_trans).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:12:18 +00:00
Ettore Di Giacinto
c4b4f3a3e4 docs(paged): series status 0001/0002 done+verified; honest parity note
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:05:14 +00:00
Ettore Di Giacinto
61ff738177 patch(paged) 0002: LLAMA_KV_PAGED block placement, Gate 0 token-identical
find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy
generation is token-identical to stock (verified on Qwen3-0.6B at the pin),
branch confirmed firing. Default off. The placement substrate for the gather-read.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:04:28 +00:00
Ettore Di Giacinto
ce48cc0751 patch(paged) 0001: vendor PagedKVManager into llama.cpp src
First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the
CPU-verified vLLM-parity block manager) + CMake entry. No behavior change.
Generated against the pinned LLAMA_VERSION; applies clean.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 22:55:22 +00:00
Ettore Di Giacinto
ba3fa5a633 build(paged): stacking patch-series scaffolding for llama.cpp paged attention
Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against
the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one
small, independently-buildable patch so the work rebases cleanly across llama.cpp
bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix
caching) + the regen workflow.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 22:53:20 +00:00
Ettore Di Giacinto
62f0ae17e3 docs(paged): upstream survey - no FP4 MoE GEMM to patch in; phase 3 is from-scratch
No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/
draft); CUTLASS not a dep; no fork has one; activation-quant gather already
fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months,
maintainers deferring to cuTile). No tractable patch closes the 27x.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 22:44:11 +00:00
Ettore Di Giacinto
b14214620c docs(paged): Lever-3 phase-1 nwarps tweak = dead end (constants coupled)
static_assert(nwarps*tile_C::I == mmq_y) locks nwarps=8 for mmq_y=128; can't
raise occupancy without co-scaling mmq_y (blows Blackwell smem). MMQ kernel is
not freely tunable -> parity needs the tcgen05/CUTLASS rewrite, not knobs.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 22:32:02 +00:00
Ettore Di Giacinto
1449b806ab docs(paged): Lever-3 + paged-attention implementation plans + upstream ggml issue draft
Plan A (Lever 3): phased path to FP4 MoE GEMM parity — cheap tweaks, act-quant
fusion, then the real lever (tcgen05/CUTLASS grouped GEMM), full-model FP4.
Plan B (paged attention): on-demand pool, gather-read + Gate 0, continuous
batching, prefix sharing; benchmark in memory-pressured/mixed-length regimes.
Upstream issue draft: GB10 numbers, nsys profile, ruled-out config knobs,
tcgen05 proposal.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 22:28:28 +00:00
Ettore Di Giacinto
9f16a907be docs(paged): Lever 3 profiled + Q4/MXFP4 findings, auto-ubatch shipped
Prefill doesn't scale with bigger single prompts (attention O(N^2)); real gap
is batched MoE prefill (B=32: 27x vs vLLM, ~22 effective TFLOP/s). nsys pins
Lever 3 target: mul_mat_q<MXFP4> MoE GEMM 37% + un-fused act-quant 8%; native
FP4 MMA already engaged, inefficiency is the per-expert thin-tile scheduler.
Q4_K_M matches MXFP4 on decode (decode win is generic 4-bit); MXFP4's only edge
is prefill. Auto-ubatch=2048 on Blackwell shipped (PR #10411).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 20:56:46 +00:00
Ettore Di Giacinto
aba0bfd24f feat(backend): auto-default physical batch to 2048 on Blackwell GPUs
On NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10/DGX Spark) a larger
physical batch (n_ubatch) materially lifts MoE prefill throughput - measured on
a GB10 with Qwen3-30B-A3B to lift the prefill ceiling and saturate at ~2048.

When a model config leaves `batch:` unset, EffectiveBatchSize now picks 2048 on
Blackwell instead of 512; explicit `batch:` always overrides. Detection is a
shared, cached Go helper (xsysinfo.IsNVIDIABlackwell, nvidia-smi compute_cap
>= 12). Logic is isolated in core/backend/hardware_defaults.go and applied at
the common ModelOptions builder, so it covers the C++ llama.cpp backend too.

Measured (GB10, Qwen3-Coder-30B-A3B MXFP4): prefill ub512 2994 -> ub2048 3316
t/s; saturates past 2048. Also recorded in the DGX gap plan: 4-bit quant alone
captures the decode win (Q4_K_M 93.5 >= MXFP4 86.4 t/s), MXFP4's only edge is
prefill via Blackwell FP4 tensor cores.

Tests: hardware_defaults_internal_test.go; existing NBatch specs pinned to the
no-Blackwell branch for determinism.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 20:46:45 +00:00
Ettore Di Giacinto
7aa61d4c32 docs(paged): DGX Blackwell gap analysis + lever plan (living doc)
Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency
sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8
MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the
lever plan.

Measured wins:
- Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill
  plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48),
  near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap).
- Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch.

Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8
GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and
the complete paged attention (on-demand alloc + gather-read + continuous
batching + prefix sharing). Honest scope: each is multi-week kernel/systems
work.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 20:15:14 +00:00
Ettore Di Giacinto
bbc84a9889 feat(paged): Gate 0 in-model - token-identical generation with paged KV placement
Wire paged, non-contiguous fixed-size BLOCK placement into the real
llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate
Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to
the contiguous cache while its KV is physically scattered across permuted
blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via
LLAMA_KV_PAGED_DEBUG, not a silent fallback.

This retires the correctness premise of paged attention IN THE MODEL (not
just at the ggml-op level): attention is invariant to physical KV placement,
because reads use per-cell pos/seq metadata for masking. The patch lives at
patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f).

Scope: storage/placement layer, single sequence. Remaining (P4): the
gather-read compute path (attend only a seq's own blocks) for the throughput
win, and the multi-sequence driver. README updated with repro + status.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:51:42 +00:00
Ettore Di Giacinto
3ed3279739 docs(paged): status + integration map for in-model Gate 0
Capture verified state (P0 manager parity, P1 ggml write/gather, P2 attention
numerics 7.5e-08, P3 capacity 9.2x + prefix-sharing 11.3x) and the exact
remaining work: wire build_attn_paged into llama-graph.cpp and validate
token-identical generation on Qwen3-0.6B (Gate 0), then win-2 throughput.

Records the integration seams (create_memory, find_slot, get_k/get_v,
build_attn, mask) and the honest caveats (unified cache already shares a
pool; vLLM's classic kernel is deprecated) so the next session starts warm.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:45:51 +00:00
Ettore Di Giacinto
ddace5fb6a feat(paged): paged-bench - measure capacity & prefix-sharing wins
Quantify the two multi-tenant wins that are properties of the host-side
block model (vLLM-parity), independent of the in-model compute path:

  WIN 1 concurrency capacity @ 512-block budget
    contiguous (reserve n_ctx/seq): 4 sequences
    paged (on-demand blocks):       37 sequences
    --> 9.2x more concurrent sequences

  WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix)
    prefix-cache OFF: 2176 physical blocks
    prefix-cache ON:  192 physical blocks
    --> 11.3x less KV memory

WIN 2 (throughput) is deliberately reported as PENDING: it requires the
paged gather-read path wired into llama-graph.cpp (Gate 0) and is not
measurable at the allocation layer. The win-1 baseline is per-sequence
n_ctx reservation (stream mode); llama.cpp's unified cache already shares
one pool, so the honest win there is on-demand sizing + prefix dedup.

Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:44:41 +00:00
Ettore Di Giacinto
5a5d3df8c8 feat(paged): Phase 2 core - attention over paged KV matches reference
Retire the central numeric risk from the design: feeding gather-to-scratch
KV (a sequence whose blocks are non-contiguous in the shared pool, [2,1,5])
into ggml's standard attention ops produces correct attention.

Path under test: set_rows write -> get_rows gather (K and V) ->
mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared
against an independent host-computed softmax attention over the same K/V/Q.
Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4).

This proves the paged read path is numerically sound on CPU with no new
ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate
Gate 0 (token-identical greedy generation in a real model).

Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:35:35 +00:00
Ettore Di Giacinto
c6698dd4bf feat(paged): Phase 1 - ggml paged write/gather mechanism (CPU)
Validate the paged KV read/write path at the ggml-op level, driven by
PagedKVManager:

- write: ggml_set_rows(pool, k_src, slot_mapping)  scatter K rows by slot
- read:  ggml_get_rows(pool, gather_idx)           gather a seq's slots into
         contiguous scratch (the tensor an attention kernel consumes)

The test forces a non-contiguous, out-of-order physical block layout
(allocate seqA+seqB, free seqA, reallocate seqC -> blocks [2,1,5]) and
proves gather(write(x)) == x plus cross-sequence isolation in the shared
pool. This de-risks the central question (does slot-addressed paged storage
round-trip correctly through ggml) before the llama-graph integration.

Pool is statically allocated via ggml_backend_alloc_ctx_tensors, mirroring
how llama.cpp allocates its KV cache. CPU backend, no new ggml op.
Built against ggml from the vendored llama.cpp checkout.

Phase 1 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:33:26 +00:00
Ettore Di Giacinto
edb1a11abc feat(paged): vLLM-parity KV block manager (Phase 0, CPU-first prototype)
Host-side paged-attention block manager ported faithfully from vLLM V1
(block_pool.py, kv_cache_utils.py, single_type_kv_cache_manager.py):

- KVCacheBlock + intrusive LRU FreeBlockQueue (O(1) middle removal)
- BlockPool: get_new_blocks / touch / free_blocks eviction ordering /
  cache_full_blocks / lazy eviction on reuse
- PagedKVManager: on-demand allocate, block_table, slot arithmetic
  (slot = block_id*block_size + offset), free
- Prefix caching: chained block hashing + find_longest_cache_hit
  (first-miss stop), enabling automatic cross-tenant prefix sharing

Pure C++17, zero ggml/llama.cpp dependency, unit-tested to vLLM behavioral
parity (4/4 suites green). Parity is on algorithm/behavior, not hash bytes.

Phase 0 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Phases 1-5 (ggml storage, gather-to-scratch read path, Gate 0 correctness,
benchmark wins, prefix-share serving) follow.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:26:31 +00:00
LocalAI [bot]
4ad754eea3 chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-19 00:03:37 +02:00
Tai An
c3b3336654 fix(whisperx): use whisperx.diarize.DiarizationPipeline with token kwarg (#10389)
Signed-off-by: Anai-Guo <antai12232931@outlook.com>
2026-06-18 18:50:37 +02:00
Richard Palethorpe
3fa7b2955c feat(pii): NER tier engine — privacy-filter.cpp backend + NER-centric PII filter (#10360)
Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see
backup/pii-ner-tier-engine-prerebase). Net change:

- privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter
  PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan).
  TokenClassify moves off the patched llama.cpp path onto this backend.
- PII filter reworked to be NER-centric (encoder/NER detection tier scanning
  whole conversations as one document), with a recreated bounded restricted-
  regex secret-matching pattern detector tier alongside it (per-model
  pii_detection.builtins / .patterns + core/services/routing/piipattern).
- Detection labelled by source (ner vs pattern); backend trace / confidence /
  debug observability; analyze/redact exposed as a synchronous API.
- Instance-wide default detector policy + per-usecase default-on; request
  filtering extended to completions, embeddings, edits & Ollama.
- React UI: NER-centric PII editor, detector-models table, pattern/builtins
  editor, middleware default-policy UI.
- Gallery: privacy-filter-multilingual token-classify model + NER install
  filter; token_classify known_usecase; batch sized to context for NER models.
  privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13
  meta + image entries with a capabilities map) matching its CI matrix jobs,
  and an /import-model auto-detect importer (PrivacyFilterImporter, narrow
  privacy-filter GGUF detection) replacing the prior pref-only registration.

Reconciled against master's independent evolution:

- Dropped master's PIIPatternOverrides feature (global-pattern runtime
  overrides + /api/pii/patterns API + runtime_settings.json persistence). The
  per-model NER + pattern-detector design supersedes it; it was built on the
  global redactor pattern set this branch replaced.
- Reverted the llama.cpp Score carry-patch (0006-server-task-type-score):
  removed the patch and restored master's grpc-server.cpp Score RPC (direct
  llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's
  model_config validation forbidding score + chat/completion/embeddings on
  llama-cpp. token_classify is unaffected (it runs on the privacy-filter
  backend, not llama-cpp).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-18 11:45:22 +01:00
LocalAI [bot]
c133ca39dc chore: ⬆️ Update ggml-org/llama.cpp to f3e182816421c648188b5eab269853bf1531d950 (#10379)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-18 11:43:23 +02:00
LocalAI [bot]
91f97f2a54 chore: ⬆️ Update ggml-org/whisper.cpp to 86c40c3bd6fc86f1187fb751d111b49e0fc18e84 (#10382)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-18 08:34:43 +02:00
LocalAI [bot]
55f9ff6805 chore: ⬆️ Update mudler/parakeet.cpp to 92a5f0306be354c109150fe58ae4cc4f8a21ca45 (#10380)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-18 08:32:13 +02:00
LocalAI [bot]
5c2ae7857a chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-18 00:32:13 +02:00
LocalAI [bot]
4af360300f chore: ⬆️ Update ikawrakow/ik_llama.cpp to 71af16a6b7f6fb7315b346b4a51aad530599c3f5 (#10381)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-18 00:12:25 +02:00
LocalAI [bot]
95e7149c87 chore: ⬆️ Update ggml-org/llama.cpp to 74ade52741203e5c8f81eaf06a96cb1cfe15f2a3 (#10368)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-17 13:25:29 +02:00
LocalAI [bot]
fd26c8c753 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 064d23a6f816d50491d8c9b35a0cafe546eaf4b5 (#10367)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-17 13:25:14 +02:00
LocalAI [bot]
e60c094a7d feat(ds4): SSD streaming + quality engine options, 128GB DeepSeek gallery models (#10374)
feat(ds4): wire SSD streaming + quality engine options, add 128GB DeepSeek gallery models

The ds4 backend zero-initialized ds4_engine_options and exposed none of the
engine's tunable knobs, so SSD streaming (run a model larger than RAM by
streaming routed MoE experts from the GGUF on SSD) and the quality/perf knobs
were unreachable from LocalAI model YAMLs.

Map ModelOptions.Options onto ds4_engine_options through a declarative table
(kEngineOptSpecs + apply_engine_option) instead of per-field branches: the
struct is fixed C with no reflection, so the field set is enumerated once and a
future knob is a one-line table row. Two fields use ds4's own typed parsers
(GiB budgets, cache-experts count-or-NGB). Bare flags (e.g. "ssd_streaming")
mean true; path-type options (mtp_path, expert_profile_path,
directional_steering_file) resolve relative to the model directory so a gallery
entry can reference a companion file by bare filename. mtp_draft/mtp_margin are
now validated rather than parsed with throwing std::stoi/std::stof.

Add gallery entries for the 128 GB class:
- deepseek-v4-flash-q2-q4 (~91 GB, mixed q2/q4, fits RAM, higher quality)
- deepseek-v4-flash-q4-ssd (~153 GB full 4-bit, runs on 128 GB via SSD streaming)
- deepseek-v4-flash-q2-mtp (~81 GB + MTP speculative draft weights)
- deepseek-v4-pro-q2-ssd (~433 GB Pro, experimental SSD streaming)

SSD streaming is Metal (Darwin) only; the options are inert on CUDA/CPU.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-17 10:30:06 +02:00
LocalAI [bot]
980ec4a311 chore: ⬆️ Update antirez/ds4 to cafc134f78a5a1890d98808d3102f4313573a1bc (#10369)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-17 09:28:19 +02:00
LocalAI [bot]
dfd5a00e6f chore: ⬆️ Update ggml-org/whisper.cpp to 9efddafb9153e1fb22bdc3dd3057072c99165ed2 (#10366)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-17 09:27:52 +02:00
LocalAI [bot]
63be479066 chore: ⬆️ Update leejet/stable-diffusion.cpp to 7f0e728b7d42f2490dfa5dd9539082d904f2f6b2 (#10370)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-17 09:08:34 +02:00
LocalAI [bot]
4c6750fe6b feat(depth): metric-large + nested metric model gallery entries (#10363)
* feat(depth): add depth-anything-3-metric-large gallery entry

DA3METRIC-LARGE (ViT-L) single-file metric-scale depth + sky, served by the
existing depth-anything backend (same single-GGUF path as mono-large). GGUF
published at mudler/depth-anything.cpp-gguf.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(depth): serve nested metric model (two-file load)

The DA3 nested model needs both branches (anyview GIANT + metric ViT-L) loaded
together. Wire it through the backend:
- Load reads a 'metric_model:<file>' entry from ModelOptions.Options and, when
  present, calls da_capi_load_nested(anyview, metric) instead of da_capi_load
  (registers the new abi-4 symbol; helper optionValue + unit test).
- gallery: depth-anything-3-nested (model=anyview, options=metric branch, both
  GGUFs fetched) for metric-scale depth + pose.
- bump depth-anything.cpp pin to cce5edc (abi 4 / da_capi_load_nested).

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-16 22:03:58 +02:00
LocalAI [bot]
294170d3ed feat(backend): add depth-anything (Depth Anything 3) C++/ggml backend + gallery (#10352)
* feat(backend): add depth-anything (Depth Anything 3) C++/ggml backend + gallery

Mirrors the locate-anything-cpp backend to register a new depth-anything
backend that wraps the Depth Anything 3 ggml port (depth-anything.cpp) via
purego (cgo-less, no Python at inference).

- backend/go/depth-anything-cpp/: gRPC backend (Load + Predict + GenerateImage),
  purego binding to the da_capi_* C ABI, CMake/Makefile/run/package/test scripts
  building depth-anything.cpp's DA_SHARED static .so per CPU variant.
- backend/index.yaml: depth-anything backend meta + all hardware-variant
  capability entries (cpu/cuda12/cuda13/intel-sycl-f32+f16/vulkan/nvidia-l4t).
- gallery/index.yaml: 8 Depth Anything 3 GGUF models (base q4_k/q8_0/f16/f32,
  small, large, giant, mono-large).
- .github/backend-matrix.yml: one build entry per hardware variant.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(depth): typed Depth RPC + REST endpoint exposing full DA3 data

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): pin depth-anything.cpp to e0b6814 (ABI 3 dense C-API)

The Depth RPC handler calls da_capi_depth_dense / da_capi_points (C-API ABI 3);
pin the native build to the commit that exports them.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): pin depth-anything.cpp to v0.1.0 release (b515c31)

Repoint the native version from the now-orphaned e0b6814 to the
b515c31 release commit, kept alive by the upstream v0.1.0 tag.
C-API is unchanged (da_capi_abi_version == 3).

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): wire depth-anything-cpp into build, CI bump, and importer

The backend dir, gallery index, and CI build-matrix were present but the
backend was never wired into the integration points that adding-backends.md
requires:

- root Makefile: add to .NOTPARALLEL, the test-extra chain, a BACKEND_*
  definition, the docker-build target eval, and docker-build-backends
  (mirrors parakeet-cpp; the backend's own Makefile already documented that
  its `test` target is driven by test-extra).
- bump_deps.yaml: register the DEPTHANYTHING_VERSION pin so the daily
  auto-bump bot tracks mudler/depth-anything.cpp master (it cannot see an
  unregistered Makefile pin).
- import form: add a preference-only KnownBackend entry so depth-anything is
  selectable at /import-model (mirrors sam3-cpp; no reliable GGUF auto-detect
  signal, so pref-only per the doc's default).

changed-backends.js needs no entry: the generic golang suffix branch already
resolves backend/go/depth-anything-cpp/.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(depth): auto-detect importer for depth-anything GGUFs

Replace the preference-only entry with a real auto-detect importer
(mirrors parakeet-cpp / locate-anything):

- DepthAnythingImporter matches a .gguf whose name carries a
  depth-anything token (depth-anything-<size>-<quant>.gguf), so
  /import-model recognises mudler/depth-anything.cpp-gguf repos and direct
  GGUF URLs without an explicit backend preference. preferences.backend=
  "depth-anything" still forces it.
- Registered before LlamaCPPImporter so its GGUF bundles aren't claimed by
  the generic .gguf importer; the narrow name match means it cannot claim
  arbitrary llama GGUFs or the upstream safetensors PyTorch repos.
- Multi-quant repos pick the smallest quant by default (q4_k -> ... -> f32,
  depth stays >0.998 corr even at q4_k); quantizations preference overrides.
- Drops the now-redundant knownPrefOnlyBackends entry (importer-backed
  backends are not listed there, matching parakeet-cpp).
- Table-driven Ginkgo test covers detection, negative cases (llama GGUF,
  upstream safetensors), default/override/fallback quant pick, and direct
  URL import. 10/10 specs pass.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): check conn.Close error in grpc Depth client (errcheck)

The new Depth() client method used a bare `defer conn.Close()`. golangci-lint
runs with new-from-merge-base, so although the 39 sibling methods use the same
bare form (grandfathered), the newly added line trips errcheck. Drop the result
explicitly to satisfy the linter.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* fix(depth): bump depth-anything.cpp to v0.1.1 (embeddable CMake)

v0.1.0 (b515c31) used ${CMAKE_SOURCE_DIR} for its include dirs, which
points at the parent project when built via add_subdirectory() as this
backend does, so the container build failed with missing stb_image.h /
da_gguf_keys.h. v0.1.1 (2d42897) switches to project-relative paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* fix(depth): resolve gosec findings in the backend wrapper

The code-scanning gate flagged three new failure-level alerts in
godepthanythingcpp.go (gosec runs with -no-fail; GitHub gates on new alerts):

- G301: export dirs were created with 0o755. Tighten to 0o750 (no world
  access needed for backend-written export output).
- G304: writeDepthPNG creates req.GetDst(). That path is chosen by the
  LocalAI core as the intended output destination (same pattern every
  image backend uses), not attacker input, so annotate with #nosec G304
  and document why.

The remaining G103 "audit unsafe" notes on the unsafe.Slice C-buffer copies
are warning-level (the same purego interop whisper/parakeet use) and do not
gate the check, per the supertonic exclusion precedent in secscan.yaml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* fix(depth): bump depth-anything.cpp to v0.1.2 (CUDA cross-build arch)

v0.1.1 forced CMAKE_CUDA_ARCHITECTURES=native, which breaks the GPU-less
l4t/cublas CI builds (nvcc "Unsupported gpu architecture 'compute_'" on
CMake 3.22). v0.1.2 (442eea4) drops the override and lets ggml pick its
default cross-build arch list.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-16 16:28:28 +02:00
LocalAI [bot]
1ab61a0875 feat: generic chat_template_kwargs (model config + per-request metadata) (#10359)
* feat(config): add chat_template_kwargs model field + resolver

Adds the ChatTemplateKwargs model-config map and RequestMetadata carrier,
plus ResolveChatTemplateKwargs which layers the config map under coerced
request metadata. Foundation for generic jinja chat-template kwargs (issue #10329).

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): forward resolved chat_template_kwargs blob to backends

gRPCPredictOpts now merges per-request client metadata over the server-derived
enable_thinking/reasoning_effort (reaching all backends via the standalone keys)
and serialises the resolved chat_template_kwargs map into a JSON blob for
llama.cpp, written last so a client cannot clobber it. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(http): wire request metadata to config.RequestMetadata

The OpenAI request metadata field was parsed but unused; stamp it onto the
per-request ModelConfig so gRPCPredictOpts forwards it as chat_template_kwargs
overrides. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): generic chat_template_kwargs merge (drop per-key blocks)

Replace the per-key enable_thinking/reasoning_effort handling in both the
streaming and non-streaming chat paths with a single block that parses the
chat_template_kwargs JSON blob resolved by the Go layer and merges every key
into body_json. New jinja template levers (e.g. preserve_thinking) now need
no C++ change. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: document custom chat_template_kwargs (model + per-request)

Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(backend): pin reasoning_effort as a string in the chat_template_kwargs blob

Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(http): e2e guard pinning chat_template_kwargs forwarded to gRPC

Adds an ECHO_PREDICT_METADATA marker to the mock-backend that echoes the
received PredictOptions.Metadata, and an app_test.go spec that drives a real
/v1/chat/completions request (model chat_template_kwargs + per-request metadata
override) and asserts the exact metadata + chat_template_kwargs blob the REST
layer forwards to gRPC. Locks the REST->gRPC contract against regressions. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(config): grandfather chat_template_kwargs in registry coverage

chat_template_kwargs is a free-form map[string]any (like engine_args, already
on the list), not a scalar the config UI registry can surface, so it is exempt
from the registry-entry requirement. Fixes the TestAllFieldsHaveRegistryEntries
failure introduced by the new field. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-16 12:16:34 +02:00
LocalAI [bot]
f44034021e chore: ⬆️ Update leejet/stable-diffusion.cpp to 5a34bc7f6e0621dd2f899daa64476eac667d7ed3 (#10335)
* ⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(stablediffusion-ggml): adapt gosd.cpp to upstream sd_ctx_params_t API

The bump to 5a34bc7 restructured sd_ctx_params_t: the boolean CPU-offload
knobs (offload_params_to_cpu, keep_clip_on_cpu, keep_vae_on_cpu,
keep_control_net_on_cpu) were replaced by backend assignment specs
(backend/params_backend), and vae_decode_only / free_params_immediately
were dropped entirely. The build broke with "no member named ..." on
every arch.

Translate the legacy options we still accept from gallery configs into
the new backend assignment specs, mirroring prepare_backend_assignments()
in the upstream CLI, so offload_params_to_cpu / keep_*_on_cpu keep
working. vae_decode_only is parsed and ignored for config compatibility.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(stablediffusion-ggml): expose backend/params placement options

The upstream bump introduced new sd_ctx_params_t fields for device and
memory placement (backend, params_backend, rpc_servers, max_vram,
stream_layers) plus PuLID-Flux weights (pulid_weights_path). Wire them up
as backend options so models can be split across CPU/GPU/disk/RPC:

- backend: per-component compute placement (e.g. clip=cpu,vae=cuda0)
- params_backend: per-component weight storage incl. disk mmap
- max_vram / stream_layers: graph-cut segmented parameter offload budget
- rpc_servers: offload compute to remote RPC servers
- pulid_weights_path: PuLID-Flux identity injection

The legacy keep_*_on_cpu / offload_params_to_cpu booleans now seed and
compose with the explicit backend/params_backend specs, matching upstream
prepare_backend_assignments(). Option values are taken as everything after
the first ':' so colon-bearing values (rpc_servers host:port) survive
parsing. Documented the new options in the image-generation guide.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(stablediffusion-ggml): distributed RPC across ggml workers

Enable the ggml RPC backend (-DSD_RPC=ON) so image generation can be
sharded across remote rpc-server workers. The ggml rpc-server is
backend-agnostic, so this reuses the exact same worker pool as the
llama.cpp backend - one set of `local-ai worker llama-cpp-rpc` /
`p2p-llama-cpp-rpc` workers accelerates both text and image generation.

RPC servers are selected by precedence:
- the explicit `rpc_servers` option, else
- the LLAMACPP_GRPC_SERVERS env var, which LocalAI's p2p worker mode
  populates automatically with discovered workers (the backend inherits
  it from the parent process env), so distributed image generation needs
  no per-model configuration.

Documented manual and p2p setup in the image-generation guide.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-16 12:15:45 +02:00
LocalAI [bot]
6b9f1bd4b3 chore: ⬆️ Update antirez/ds4 to e34a8086693ba7ca5cfabd2b9028ee52f0bfac2e (#10350)
* ⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(ds4): add Homebrew include/lib prefix for Darwin grpc-proto build

The darwin/metal ds4 backend job runs for the first time on this bump
(it was skipped on prior ds4 PRs) and fails compiling backend.pb.cc with
'google/protobuf/runtime_version.h' file not found.

hw_grpc_proto links neither protobuf::libprotobuf nor gRPC::grpc++, so
the generated proto sources rely on default system include paths. That
works on Linux (/usr/include) but not on macOS, where Homebrew installs
under /opt/homebrew. Add the Homebrew prefix to include/link dirs on
Darwin, mirroring the llama-cpp backend that already builds on Darwin CI.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(ds4): install nlohmann-json on Darwin CI for ds4 backend

After the protobuf include-path fix the ds4 darwin build advances to
compiling dsml_renderer.cpp, which includes <nlohmann/json.hpp> and
#errors when absent. On Linux the header comes from apt nlohmann-json3-dev
in the build image; the macOS runner had no equivalent. Add the
header-only nlohmann-json formula to the shared Darwin backend brew
install/link list and Homebrew cache, alongside the existing deps.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(ds4): build proper OCI image tar for Darwin backend

The darwin packaging referenced scripts/build/oci-pack.sh, which was
never added to the tree, so it fell back to a plain 'tar' that omits
manifest.json. 'local-ai backends install' then rejects the tarball
with 'file manifest.json not found in tar'.

Use './local-ai util create-oci-image' (already built by the 'build'
prerequisite of the backends/ds4-darwin target), mirroring
llama-cpp-darwin.sh, to emit a real OCI image the installer accepts.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-16 09:59:50 +02:00
LocalAI [bot]
3d295adfa8 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 2f524850a1f67716bc0ba80ffa30ce39c5b8bd5f (#10336)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-06-16 09:04:35 +02:00