LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	f5e9caece1	kernel: reframed Blackwell kernel-gap map (research + profiles) Key corrections: (1) vLLM 24k is AGGREGATE; single-stream roofline ~3300 t/s (BF16) / 6600 (FP4). (2) GB10 is 1:1:2 BF16:INT8:FP4 - INT8 == BF16, only FP4 is 2x. (3) Measured: dense int8-MMQ at 21% of ceiling, MoE FP4-MMQ at ~5% - both EXIST, just untuned for Blackwell. Strategy: to MATCH vLLM, tune MMQ or build a Marlin-style W4A16 BF16 GEMM (FP4 NOT required); to BEAT, fix the existing FP4 MMA on sm_121 (build/miscompile, not greenfield). Dropped the tcgen05 grouped GEMM rewrite. Cheap next test: dense MXFP4 quant + existing FP4-MMA. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 07:21:56 +00:00
Ettore Di Giacinto	d2651c86d9	bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16 (4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer #2577/#3294, cutlass #3096. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 06:59:50 +00:00
Ettore Di Giacinto	19742aee64	bench(dense): FORCE_CUBLAS no-op for dense too (720.8 vs 721.8) - every flag lever exhausted Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel; no config/flag shortcut remains. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:59:27 +00:00
Ettore Di Giacinto	ce60737fc5	kernel(doc): dense scope resolved - two FP4 kernels (dense first, then grouped) Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both share the e2m1 block-scaled collective; dense is grouped-with-one-group. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:56:33 +00:00
Ettore Di Giacinto	37cbc089b0	bench(dense): Qwen3-32B dense parity - dense has the kernel gap too (PP 7.6-32x) vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765, vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision: the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just the MoE grouped GEMM (dense GEMM is the simpler first kernel to land). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:55:58 +00:00
Ettore Di Giacinto	b7b2e8291c	kernel(fp4-grouped-moe): scaffold the FP4 grouped-GEMM MoE dispatch (Lever 3) The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37% prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven). Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu} entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation phases + parity harness + the dense-path follow-up (#28). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:44:31 +00:00
Ettore Di Giacinto	cb28deda6b	bench(paged): decode profile overturns 'engine-addressable' - decode is 54.6% MoE GEMM too Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:27:35 +00:00
Ettore Di Giacinto	2a500c371f	bench(paged): fresh GB10 head-to-head vs vLLM - two distinct gaps Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix). Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the engine's domain (0004 pool + 0005 continuous batching target it). Baseline for the series to improve on. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:20:22 +00:00
Ettore Di Giacinto	48fbb9384f	docs(paged): refine 0003 plan - used-cell gather, per-ubatch rebuild, single-stream first Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:14:25 +00:00
Ettore Di Giacinto	145e45b6f2	docs(paged): exact executable plan for 0003 gather-read Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by one aligned index; n_kv compaction; gated so stock stays byte-identical) with the token-identical gate and the known risks (mask transpose layout, v_trans). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:12:18 +00:00
Ettore Di Giacinto	c4b4f3a3e4	docs(paged): series status 0001/0002 done+verified; honest parity note Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:05:14 +00:00
Ettore Di Giacinto	61ff738177	patch(paged) 0002: LLAMA_KV_PAGED block placement, Gate 0 token-identical find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy generation is token-identical to stock (verified on Qwen3-0.6B at the pin), branch confirmed firing. Default off. The placement substrate for the gather-read. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:04:28 +00:00
Ettore Di Giacinto	ce48cc0751	patch(paged) 0001: vendor PagedKVManager into llama.cpp src First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the CPU-verified vLLM-parity block manager) + CMake entry. No behavior change. Generated against the pinned LLAMA_VERSION; applies clean. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:55:22 +00:00
Ettore Di Giacinto	ba3fa5a633	build(paged): stacking patch-series scaffolding for llama.cpp paged attention Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one small, independently-buildable patch so the work rebases cleanly across llama.cpp bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix caching) + the regen workflow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:53:20 +00:00
Ettore Di Giacinto	62f0ae17e3	docs(paged): upstream survey - no FP4 MoE GEMM to patch in; phase 3 is from-scratch No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/draft); CUTLASS not a dep; no fork has one; activation-quant gather already fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months, maintainers deferring to cuTile). No tractable patch closes the 27x. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:44:11 +00:00
Ettore Di Giacinto	b14214620c	docs(paged): Lever-3 phase-1 nwarps tweak = dead end (constants coupled) static_assert(nwarps*tile_C::I == mmq_y) locks nwarps=8 for mmq_y=128; can't raise occupancy without co-scaling mmq_y (blows Blackwell smem). MMQ kernel is not freely tunable -> parity needs the tcgen05/CUTLASS rewrite, not knobs. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:32:02 +00:00
Ettore Di Giacinto	1449b806ab	docs(paged): Lever-3 + paged-attention implementation plans + upstream ggml issue draft Plan A (Lever 3): phased path to FP4 MoE GEMM parity — cheap tweaks, act-quant fusion, then the real lever (tcgen05/CUTLASS grouped GEMM), full-model FP4. Plan B (paged attention): on-demand pool, gather-read + Gate 0, continuous batching, prefix sharing; benchmark in memory-pressured/mixed-length regimes. Upstream issue draft: GB10 numbers, nsys profile, ruled-out config knobs, tcgen05 proposal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:28:28 +00:00
Ettore Di Giacinto	9f16a907be	docs(paged): Lever 3 profiled + Q4/MXFP4 findings, auto-ubatch shipped Prefill doesn't scale with bigger single prompts (attention O(N^2)); real gap is batched MoE prefill (B=32: 27x vs vLLM, ~22 effective TFLOP/s). nsys pins Lever 3 target: mul_mat_q<MXFP4> MoE GEMM 37% + un-fused act-quant 8%; native FP4 MMA already engaged, inefficiency is the per-expert thin-tile scheduler. Q4_K_M matches MXFP4 on decode (decode win is generic 4-bit); MXFP4's only edge is prefill. Auto-ubatch=2048 on Blackwell shipped (PR #10411). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 20:56:46 +00:00
Ettore Di Giacinto	aba0bfd24f	feat(backend): auto-default physical batch to 2048 on Blackwell GPUs On NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10/DGX Spark) a larger physical batch (n_ubatch) materially lifts MoE prefill throughput - measured on a GB10 with Qwen3-30B-A3B to lift the prefill ceiling and saturate at ~2048. When a model config leaves `batch:` unset, EffectiveBatchSize now picks 2048 on Blackwell instead of 512; explicit `batch:` always overrides. Detection is a shared, cached Go helper (xsysinfo.IsNVIDIABlackwell, nvidia-smi compute_cap >= 12). Logic is isolated in core/backend/hardware_defaults.go and applied at the common ModelOptions builder, so it covers the C++ llama.cpp backend too. Measured (GB10, Qwen3-Coder-30B-A3B MXFP4): prefill ub512 2994 -> ub2048 3316 t/s; saturates past 2048. Also recorded in the DGX gap plan: 4-bit quant alone captures the decode win (Q4_K_M 93.5 >= MXFP4 86.4 t/s), MXFP4's only edge is prefill via Blackwell FP4 tensor cores. Tests: hardware_defaults_internal_test.go; existing NBatch specs pinned to the no-Blackwell branch for determinism. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 20:46:45 +00:00
Ettore Di Giacinto	7aa61d4c32	docs(paged): DGX Blackwell gap analysis + lever plan (living doc) Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8 MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the lever plan. Measured wins: - Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48), near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap). - Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch. Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8 GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and the complete paged attention (on-demand alloc + gather-read + continuous batching + prefix sharing). Honest scope: each is multi-week kernel/systems work. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 20:15:14 +00:00
Ettore Di Giacinto	bbc84a9889	feat(paged): Gate 0 in-model - token-identical generation with paged KV placement Wire paged, non-contiguous fixed-size BLOCK placement into the real llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to the contiguous cache while its KV is physically scattered across permuted blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via LLAMA_KV_PAGED_DEBUG, not a silent fallback. This retires the correctness premise of paged attention IN THE MODEL (not just at the ggml-op level): attention is invariant to physical KV placement, because reads use per-cell pos/seq metadata for masking. The patch lives at patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f). Scope: storage/placement layer, single sequence. Remaining (P4): the gather-read compute path (attend only a seq's own blocks) for the throughput win, and the multi-sequence driver. README updated with repro + status. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:51:42 +00:00
Ettore Di Giacinto	3ed3279739	docs(paged): status + integration map for in-model Gate 0 Capture verified state (P0 manager parity, P1 ggml write/gather, P2 attention numerics 7.5e-08, P3 capacity 9.2x + prefix-sharing 11.3x) and the exact remaining work: wire build_attn_paged into llama-graph.cpp and validate token-identical generation on Qwen3-0.6B (Gate 0), then win-2 throughput. Records the integration seams (create_memory, find_slot, get_k/get_v, build_attn, mask) and the honest caveats (unified cache already shares a pool; vLLM's classic kernel is deprecated) so the next session starts warm. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:45:51 +00:00
Ettore Di Giacinto	ddace5fb6a	feat(paged): paged-bench - measure capacity & prefix-sharing wins Quantify the two multi-tenant wins that are properties of the host-side block model (vLLM-parity), independent of the in-model compute path: WIN 1 concurrency capacity @ 512-block budget contiguous (reserve n_ctx/seq): 4 sequences paged (on-demand blocks): 37 sequences --> 9.2x more concurrent sequences WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix) prefix-cache OFF: 2176 physical blocks prefix-cache ON: 192 physical blocks --> 11.3x less KV memory WIN 2 (throughput) is deliberately reported as PENDING: it requires the paged gather-read path wired into llama-graph.cpp (Gate 0) and is not measurable at the allocation layer. The win-1 baseline is per-sequence n_ctx reservation (stream mode); llama.cpp's unified cache already shares one pool, so the honest win there is on-demand sizing + prefix dedup. Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:44:41 +00:00
Ettore Di Giacinto	5a5d3df8c8	feat(paged): Phase 2 core - attention over paged KV matches reference Retire the central numeric risk from the design: feeding gather-to-scratch KV (a sequence whose blocks are non-contiguous in the shared pool, [2,1,5]) into ggml's standard attention ops produces correct attention. Path under test: set_rows write -> get_rows gather (K and V) -> mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared against an independent host-computed softmax attention over the same K/V/Q. Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4). This proves the paged read path is numerically sound on CPU with no new ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate Gate 0 (token-identical greedy generation in a real model). Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:35:35 +00:00
Ettore Di Giacinto	c6698dd4bf	feat(paged): Phase 1 - ggml paged write/gather mechanism (CPU) Validate the paged KV read/write path at the ggml-op level, driven by PagedKVManager: - write: ggml_set_rows(pool, k_src, slot_mapping) scatter K rows by slot - read: ggml_get_rows(pool, gather_idx) gather a seq's slots into contiguous scratch (the tensor an attention kernel consumes) The test forces a non-contiguous, out-of-order physical block layout (allocate seqA+seqB, free seqA, reallocate seqC -> blocks [2,1,5]) and proves gather(write(x)) == x plus cross-sequence isolation in the shared pool. This de-risks the central question (does slot-addressed paged storage round-trip correctly through ggml) before the llama-graph integration. Pool is statically allocated via ggml_backend_alloc_ctx_tensors, mirroring how llama.cpp allocates its KV cache. CPU backend, no new ggml op. Built against ggml from the vendored llama.cpp checkout. Phase 1 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:33:26 +00:00
Ettore Di Giacinto	edb1a11abc	feat(paged): vLLM-parity KV block manager (Phase 0, CPU-first prototype) Host-side paged-attention block manager ported faithfully from vLLM V1 (block_pool.py, kv_cache_utils.py, single_type_kv_cache_manager.py): - KVCacheBlock + intrusive LRU FreeBlockQueue (O(1) middle removal) - BlockPool: get_new_blocks / touch / free_blocks eviction ordering / cache_full_blocks / lazy eviction on reuse - PagedKVManager: on-demand allocate, block_table, slot arithmetic (slot = block_id*block_size + offset), free - Prefix caching: chained block hashing + find_longest_cache_hit (first-miss stop), enabling automatic cross-tenant prefix sharing Pure C++17, zero ggml/llama.cpp dependency, unit-tested to vLLM behavioral parity (4/4 suites green). Parity is on algorithm/behavior, not hash bytes. Phase 0 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Phases 1-5 (ggml storage, gather-to-scratch read path, Gate 0 correctness, benchmark wins, prefix-share serving) follow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:26:31 +00:00
LocalAI [bot]	4ad754eea3	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `b3dfb7858cfcb9166e92f366e5af87f19ebc94be` (#10395) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-19 00:03:37 +02:00
Richard Palethorpe	3fa7b2955c	feat(pii): NER tier engine — privacy-filter.cpp backend + NER-centric PII filter (#10360) Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see backup/pii-ner-tier-engine-prerebase). Net change: - privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan). TokenClassify moves off the patched llama.cpp path onto this backend. - PII filter reworked to be NER-centric (encoder/NER detection tier scanning whole conversations as one document), with a recreated bounded restricted-regex secret-matching pattern detector tier alongside it (per-model pii_detection.builtins / .patterns + core/services/routing/piipattern). - Detection labelled by source (ner vs pattern); backend trace / confidence / debug observability; analyze/redact exposed as a synchronous API. - Instance-wide default detector policy + per-usecase default-on; request filtering extended to completions, embeddings, edits & Ollama. - React UI: NER-centric PII editor, detector-models table, pattern/builtins editor, middleware default-policy UI. - Gallery: privacy-filter-multilingual token-classify model + NER install filter; token_classify known_usecase; batch sized to context for NER models. privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13 meta + image entries with a capabilities map) matching its CI matrix jobs, and an /import-model auto-detect importer (PrivacyFilterImporter, narrow privacy-filter GGUF detection) replacing the prior pref-only registration. Reconciled against master's independent evolution: - Dropped master's PIIPatternOverrides feature (global-pattern runtime overrides + /api/pii/patterns API + runtime_settings.json persistence). The per-model NER + pattern-detector design supersedes it; it was built on the global redactor pattern set this branch replaced.
- Reverted the llama.cpp Score carry-patch (0006-server-task-type-score): removed the patch and restored master's grpc-server.cpp Score RPC (direct llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's model_config validation forbidding score + chat/completion/embeddings on llama-cpp. token_classify is unaffected (it runs on the privacy-filter backend, not llama-cpp). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-06-18 11:45:22 +01:00
LocalAI [bot]	c133ca39dc	chore: ⬆️ Update ggml-org/llama.cpp to `f3e182816421c648188b5eab269853bf1531d950` (#10379) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-18 11:43:23 +02:00
LocalAI [bot]	5c2ae7857a	chore: ⬆️ Update antirez/ds4 to `80ebbc396aee40eedc1d829222f3362d10fa4c6c` (#10378) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-18 00:32:13 +02:00
LocalAI [bot]	4af360300f	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `71af16a6b7f6fb7315b346b4a51aad530599c3f5` (#10381) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-18 00:12:25 +02:00
LocalAI [bot]	95e7149c87	chore: ⬆️ Update ggml-org/llama.cpp to `74ade52741203e5c8f81eaf06a96cb1cfe15f2a3` (#10368) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-17 13:25:29 +02:00
LocalAI [bot]	fd26c8c753	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `064d23a6f816d50491d8c9b35a0cafe546eaf4b5` (#10367) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-17 13:25:14 +02:00
LocalAI [bot]	e60c094a7d	feat(ds4): SSD streaming + quality engine options, 128GB DeepSeek gallery models (#10374) feat(ds4): wire SSD streaming + quality engine options, add 128GB DeepSeek gallery models The ds4 backend zero-initialized ds4_engine_options and exposed none of the engine's tunable knobs, so SSD streaming (run a model larger than RAM by streaming routed MoE experts from the GGUF on SSD) and the quality/perf knobs were unreachable from LocalAI model YAMLs. Map ModelOptions.Options onto ds4_engine_options through a declarative table (kEngineOptSpecs + apply_engine_option) instead of per-field branches: the struct is fixed C with no reflection, so the field set is enumerated once and a future knob is a one-line table row. Two fields use ds4's own typed parsers (GiB budgets, cache-experts count-or-NGB). Bare flags (e.g. "ssd_streaming") mean true; path-type options (mtp_path, expert_profile_path, directional_steering_file) resolve relative to the model directory so a gallery entry can reference a companion file by bare filename. mtp_draft/mtp_margin are now validated rather than parsed with throwing std::stoi/std::stof. Add gallery entries for the 128 GB class: - deepseek-v4-flash-q2-q4 (~91 GB, mixed q2/q4, fits RAM, higher quality) - deepseek-v4-flash-q4-ssd (~153 GB full 4-bit, runs on 128 GB via SSD streaming) - deepseek-v4-flash-q2-mtp (~81 GB + MTP speculative draft weights) - deepseek-v4-pro-q2-ssd (~433 GB Pro, experimental SSD streaming) SSD streaming is Metal (Darwin) only; the options are inert on CUDA/CPU. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-17 10:30:06 +02:00
LocalAI [bot]	980ec4a311	chore: ⬆️ Update antirez/ds4 to `cafc134f78a5a1890d98808d3102f4313573a1bc` (#10369) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-17 09:28:19 +02:00
LocalAI [bot]	1ab61a0875	feat: generic chat_template_kwargs (model config + per-request metadata) (#10359) * feat(config): add chat_template_kwargs model field + resolver Adds the ChatTemplateKwargs model-config map and RequestMetadata carrier, plus ResolveChatTemplateKwargs which layers the config map under coerced request metadata. Foundation for generic jinja chat-template kwargs (issue #10329). Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend): forward resolved chat_template_kwargs blob to backends gRPCPredictOpts now merges per-request client metadata over the server-derived enable_thinking/reasoning_effort (reaching all backends via the standalone keys) and serialises the resolved chat_template_kwargs map into a JSON blob for llama.cpp, written last so a client cannot clobber it. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(http): wire request metadata to config.RequestMetadata The OpenAI request metadata field was parsed but unused; stamp it onto the per-request ModelConfig so gRPCPredictOpts forwards it as chat_template_kwargs overrides. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(llama-cpp): generic chat_template_kwargs merge (drop per-key blocks) Replace the per-key enable_thinking/reasoning_effort handling in both the streaming and non-streaming chat paths with a single block that parses the chat_template_kwargs JSON blob resolved by the Go layer and merges every key into body_json. New jinja template levers (e.g. preserve_thinking) now need no C++ change. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs: document custom chat_template_kwargs (model + per-request) Issue #10329.
Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(backend): pin reasoning_effort as a string in the chat_template_kwargs blob Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(http): e2e guard pinning chat_template_kwargs forwarded to gRPC Adds an ECHO_PREDICT_METADATA marker to the mock-backend that echoes the received PredictOptions.Metadata, and an app_test.go spec that drives a real /v1/chat/completions request (model chat_template_kwargs + per-request metadata override) and asserts the exact metadata + chat_template_kwargs blob the REST layer forwards to gRPC. Locks the REST->gRPC contract against regressions. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(config): grandfather chat_template_kwargs in registry coverage chat_template_kwargs is a free-form map[string]any (like engine_args, already on the list), not a scalar the config UI registry can surface, so it is exempt from the registry-entry requirement. Fixes the TestAllFieldsHaveRegistryEntries failure introduced by the new field. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-16 12:16:34 +02:00
LocalAI [bot]	6b9f1bd4b3	chore: ⬆️ Update antirez/ds4 to `e34a8086693ba7ca5cfabd2b9028ee52f0bfac2e` (#10350) * ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(ds4): add Homebrew include/lib prefix for Darwin grpc-proto build The darwin/metal ds4 backend job runs for the first time on this bump (it was skipped on prior ds4 PRs) and fails compiling backend.pb.cc with 'google/protobuf/runtime_version.h' file not found. hw_grpc_proto links neither protobuf::libprotobuf nor gRPC::grpc++, so the generated proto sources rely on default system include paths. That works on Linux (/usr/include) but not on macOS, where Homebrew installs under /opt/homebrew. Add the Homebrew prefix to include/link dirs on Darwin, mirroring the llama-cpp backend that already builds on Darwin CI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(ds4): install nlohmann-json on Darwin CI for ds4 backend After the protobuf include-path fix the ds4 darwin build advances to compiling dsml_renderer.cpp, which includes <nlohmann/json.hpp> and #errors when absent. On Linux the header comes from apt nlohmann-json3-dev in the build image; the macOS runner had no equivalent. Add the header-only nlohmann-json formula to the shared Darwin backend brew install/link list and Homebrew cache, alongside the existing deps. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(ds4): build proper OCI image tar for Darwin backend The darwin packaging referenced scripts/build/oci-pack.sh, which was never added to the tree, so it fell back to a plain 'tar' that omits manifest.json. 'local-ai backends install' then rejects the tarball with 'file manifest.json not found in tar'.
Use './local-ai util create-oci-image' (already built by the 'build' prerequisite of the backends/ds4-darwin target), mirroring llama-cpp-darwin.sh, to emit a real OCI image the installer accepts. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-16 09:59:50 +02:00
LocalAI [bot]	3d295adfa8	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `2f524850a1f67716bc0ba80ffa30ce39c5b8bd5f` (#10336) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-06-16 09:04:35 +02:00
LocalAI [bot]	4fa2064875	chore: ⬆️ Update ggml-org/llama.cpp to `7dad2f1a17d65b5e2034c277125bc9f97573a779` (#10337) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-16 08:22:26 +02:00
LocalAI [bot]	f648f07b13	chore: ⬆️ Update ggml-org/llama.cpp to `4988f6e866057afd130c1515ecef0c9bab9a15f8` (#10280) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-14 21:53:25 +02:00
LocalAI [bot]	61cde6fd77	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `5f917a64b391b7d31839845153a473a65f630458` (#10240) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-14 16:46:49 +02:00
Richard Palethorpe	085fc53bbc	fix(router): production-ready request router + auto-size batch for embedding/rerank (#10104) * fix(router): score classifier production-readiness Conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the model's n_batch which is now scaled to context so long probes can't crash the backend. Missing chat_message templates are a hard error at router build time. Router-facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve ModelConfig per call so a model installed post-startup doesn't bind a stub Backend="" config and silently fall into the loader's auto-iterate path. New 'vector_store' backend trace recorded inside localVectorStore on every Search/Insert — including the backend-load-failure path that previously vanished into an xlog.Warn — with outcome tagging (hit/miss/empty_store/backend_load_error/find_error/insert_error/ok). Companion cleanup drops misleading similarity:0 and input_tokens_count:0 from non-hit and text-mode traces. Gallery local-store-development aliases to 'local-store' so the master image satisfies pkg/model.LocalStoreBackend lookups from the embedding cache. Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key (the original bug); ModelTokenize nil-guard; non-fatal mitm proxy startup; PII 'route_local' renamed to 'allow' with docs/UI in sync; model-editor footer no longer eats the edit area on small screens; several config-editor template/dropdown/section fixes. Tests: e2e router specs (casual/code-hint + long-conversation trim), vector_store trace specs, lazy-factory specs, gallery dev-alias resolution, Playwright trace badge + scroll regression. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(backend): auto-size batch to context for embedding and rerank models Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch).
With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins. Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse. Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(gallery): raise arch-router scoring output cap via parallel:64 Scoring decodes the whole prompt+candidate in a single llama_decode and reads one logit row per candidate token. The vendored llama.cpp server caps causal output rows at n_parallel, so the default of 1 aborts with GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route labels. Set options: [parallel:64] on both arch-router quant entries to lift the cap; kv_unified (the grpc-server default) keeps the full context per sequence, so this does not split the KV cache. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-06-12 16:21:15 +02:00
LocalAI [bot]	a53f34e78f	chore: ⬆️ Update ggml-org/llama.cpp to `4c6595503fe45d5a39f88d194e270f64c7424677` (#10261) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-12 14:57:52 +02:00
LocalAI [bot]	892ce951ce	chore: ⬆️ Update antirez/ds4 to `d881f2a05e8ff6bec001315a36b794b4aa310173` (#10262) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-12 09:18:07 +02:00
LocalAI [bot]	ff09683d84	chore: ⬆️ Update ggml-org/llama.cpp to `ac4cddeb0dbd778f650bf568f6f08344a06abe3a` (#10239) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-11 18:33:38 +02:00
LocalAI [bot]	51a92b6093	chore: ⬆️ Update antirez/ds4 to `8384adf0f9fa0f3bb342dd925372de778b95b263` (#10242) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-11 00:10:34 +02:00
LocalAI [bot]	8b8506d01a	chore: ⬆️ Update ggml-org/llama.cpp to `039e20a2db9e87b2477c76cc04905f3e1acad77f` (#10223) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 12:22:03 +02:00
LocalAI [bot]	6910a0bb48	chore: ⬆️ Update antirez/ds4 to `91bafb5acd5a6cf00b1e55ef68bf40ddd207bee7` (#10234) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 12:08:19 +02:00
LocalAI [bot]	cffd03b522	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `e6f8112f3ba126eed3ff5b30cdd08085414a7516` (#10233) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 12:07:49 +02:00
LocalAI [bot]	da4ed05429	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `2768b6251548b78b6610e95edad13f888ad95982` (#10219) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 01:15:54 +02:00

585 Commits