Commit Graph

466 Commits

Author SHA1 Message Date
LocalAI [bot]
6e1dbae256 feat(llama-cpp): expose 12 missing common_params via options[] (#9814)
The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.

Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).
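
A minimal sketch of one such branch, following those patterns (the
direct-IO field name is an assumption here, not a quote of the code):

    } else if (key == "n_ubatch" || key == "ubatch") {
        try { params.n_ubatch = std::stoi(value); } catch (...) {}
    } else if (key == "direct_io" || key == "use_direct_io") {
        // same true/1/yes/on form the other bool options use
        params.use_direct_io = (value == "true" || value == "1" ||
                                value == "yes" || value == "on");
    }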

Top-level / batching / IO:
  - n_ubatch (alias ubatch) -- physical batch size; was previously
    force-aliased to n_batch at line 482, blocking embedding/rerank
    workloads that need independent control
  - threads_batch (alias n_threads_batch) -- main-model batch threads;
    mirrors the existing draft_threads_batch
  - direct_io (alias use_direct_io) -- O_DIRECT model loads
  - verbosity -- llama.cpp log threshold (line 479 had this commented
    out)
  - override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
    overrides for the main model; mirrors draft_override_tensor

Embedding / multimodal:
  - pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
    only auto-flipped to RANK for rerankers
  - embd_normalize (alias embedding_normalize) -- the embedding handler
    now reads params_base.embd_normalize instead of the literal 2 it
    previously hardcoded in Embedding()
  - mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
  - image_min_tokens / image_max_tokens -- per-image vision token budget

Reasoning surface (the three the audit focused on; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
  - reasoning_format -- none/auto/deepseek/deepseek-legacy parser
  - enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
  - prefill_assistant -- trailing-assistant-message prefill toggle

All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.

Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:34 +02:00
LocalAI [bot]
53bdb18d10 chore: ⬆️ Update ggml-org/llama.cpp to 7f3f843c31cd32dc4adc10b393342dfee071c332 (#9809)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): adapt to upstream COMMON_SPECULATIVE_TYPE_DRAFT rename

ggml-org/llama.cpp#22964 ("spec: update CLI arguments for better
consistency") renamed the speculative type enum values:
  COMMON_SPECULATIVE_TYPE_DRAFT  -> COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE
  COMMON_SPECULATIVE_TYPE_EAGLE3 -> COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3
and the registered name strings flipped from underscore- to dash-
separated form (e.g. ngram_simple -> ngram-simple), with the bare
draft/eagle3 aliases replaced by draft-simple/draft-eagle3.

This broke the build with the new LLAMA_VERSION on every variant
(vulkan/arm64, darwin and likely all the rest) at grpc-server.cpp:461.

Update the upstream branch of the speculative-type fallback to use the
new identifier (the LOCALAI_LEGACY_LLAMA_CPP_SPEC fork branch keeps the
old name), and normalize spec_type option tokens before passing them to
common_speculative_types_from_names so existing model configs that say
spec_type:draft / spec_type:ngram_simple keep working.
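
A minimal sketch of that normalization (the helper name and shape are
ours for illustration; the real code inlines it in the option parser):

    static std::string normalize_spec_token(std::string t) {
        std::replace(t.begin(), t.end(), '_', '-'); // ngram_simple -> ngram-simple
        if (t == "draft")  return "draft-simple";   // old bare aliases
        if (t == "eagle3") return "draft-eagle3";
        return t;
    }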

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:23 +02:00
LocalAI [bot]
ec49995190 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 949bb8f1d660fc1264c137a6f3dbd619375f6134 (#9807)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:15:32 +02:00
LocalAI [bot]
4430fae779 chore: ⬆️ Update antirez/ds4 to 0cba357ca1bc0e7510421cc26888e420ea942123 (#9806)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:14:23 +02:00
LocalAI [bot]
ddbbdf45b9 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403 (#9740)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:58:33 +02:00
LocalAI [bot]
a645c1f4aa chore: ⬆️ Update ggml-org/llama.cpp to a9883db8ee021cf16783016a60996d41820b5195 (#9796)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:40:31 +02:00
LocalAI [bot]
957619af53 chore: ⬆️ Update ikawrakow/ik_llama.cpp to f9a93c37e2fc021760c3c1aa99cf74c73b7591a7 (#9795)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:48 +02:00
LocalAI [bot]
0b81e36504 chore: ⬆️ Update antirez/ds4 to f8b4ed635d559b3a5b44bf2df6a77e21b3e9178f (#9794)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:09 +02:00
LocalAI [bot]
bc4cd3dd85 feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.

Adapt the grpc-server wrapper accordingly:

  * `common_params_speculative::type` (single enum) became `types`
    (`std::vector<common_speculative_type>`). Update both the
    "default to draft when a draft model is set" branch and the
    `spec_type`/`speculative_type` option parser. The parser now also
    tolerates comma-separated lists, mirroring the upstream
    `common_speculative_types_from_names` semantics (sketched below).
  * `common_params_speculative_draft::n_ctx` is gone (draft now shares
    the target context size). Keep the `draft_ctx_size` option name for
    backward compatibility and ignore the value rather than failing.
  * `server_context_impl::model` was renamed to `model_tgt`; update the
    two reranker / model-metadata call sites.
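
A minimal sketch of the comma-separated handling (variable names are
illustrative; the helper is upstream's):

    std::vector<std::string> names;                 // e.g. "draft,ngram_simple"
    std::stringstream ss(value);
    for (std::string tok; std::getline(ss, tok, ','); )
        if (!tok.empty()) names.push_back(tok);
    params.speculative.types = common_speculative_types_from_names(names);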

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported; see
     the sketch below)
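
A rough sketch of that re-implementation (error handling and the ggml
buffer-type lookup elided):

    std::stringstream ss(value);                    // "<regex>=<buft>,..."
    for (std::string pair; std::getline(ss, pair, ','); ) {
        auto eq = pair.find('=');
        if (eq == std::string::npos) continue;      // skip malformed entries
        std::string pattern = pair.substr(0, eq);   // tensor-name regex
        std::string buft    = pair.substr(eq + 1);  // buffer type, e.g. CUDA0
        // resolve buft to a ggml buffer type, then push the (regex, type)
        // pair onto speculative.draft.tensor_buft_overrides
    }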

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:

  * `ctx_server.impl->model_tgt` (fork still has `model`)
  * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
    (none of these sub-structs exist in the fork)
  * `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
    tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
  * `params.speculative.types` vector / `common_speculative_types_from_names`
    (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
   discriminations (the "default to draft when a draft model is set" branch
   and the `spec_type` / `speculative_type` option parser) fall back to the
   singular scalar form, and the entire new-option block (ngram_mod / map_k
   / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
   tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
   in the source tree — stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

   Both patches are idempotent and follow the existing sed/awk pattern in
   this script (KV cache types, `get_media_marker`, flat speculative
   renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                  // legacy: chain ends here
#else
    } else if (... "spec_ngram_mod_n_min") {  // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                  // closes last branch
#endif
    }                                  // closes for-loop

Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

  Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
  list of gfx targets e.g. gfx1100,gfx1101.  Stop.
  make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:37 +02:00
LocalAI [bot]
78722caedc chore: ⬆️ Update ikawrakow/ik_llama.cpp to eb570eb96689c235933b813693ca28ab9d3d26de (#9764)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-12 00:02:22 +02:00
LocalAI [bot]
621c612b2d ci(bump-deps): register ds4 + move version pin into the Makefile (#9761)
* ci(bump-deps): register ds4 + move version pin into the Makefile

The initial ds4 PR (#9758) put the upstream commit pin in
backend/cpp/ds4/prepare.sh as a shell variable. The auto-bump bot at
.github/bump_deps.sh greps for ^$VAR?= in a Makefile, so DS4_VERSION
was invisible to it - other backends (llama-cpp, ik-llama-cpp,
turboquant, voxtral, etc.) all pin in their Makefile.

This change:

- Moves DS4_VERSION?= and DS4_REPO?= to the top of
  backend/cpp/ds4/Makefile.
- Inlines the git init/fetch/checkout recipe into the 'ds4:' target
  (matches llama-cpp's 'llama.cpp:' target pattern). Directory acts
  as the target so make only re-clones when missing.
- Deletes the now-redundant prepare.sh.
- Adds antirez/ds4 + DS4_VERSION + main + backend/cpp/ds4/Makefile to
  the .github/workflows/bump_deps.yaml matrix so the daily bot opens
  PRs against this pin.
- Updates .agents/ds4-backend.md to point at the Makefile.

Verified:
  $ grep -m1 '^DS4_VERSION?=' backend/cpp/ds4/Makefile
  DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
  $ make -C backend/cpp/ds4 ds4   # clones into ds4/ at the pin
  $ make -C backend/cpp/ds4 ds4   # no-op on second invocation
  make: 'ds4' is up to date.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: route backend/cpp/ds4/ changes through changed-backends.js

scripts/changed-backends.js:inferBackendPath has an explicit branch per
cpp dockerfile suffix (ik-llama-cpp, turboquant, llama-cpp). Without a
matching branch the function returns null, the backend never lands in
the path map, and PR change-detection cannot map "backend/cpp/ds4/X
changed" -> "rebuild ds4 image".

This is why PR #9761 produced zero ds4 jobs even though it directly
edits backend/cpp/ds4/Makefile.

Adds the missing branch (Dockerfile.ds4 -> backend/cpp/ds4/), placed
before the llama-cpp branch (since both share the .cpp ancestry but
ds4 is more specific - same ordering rule documented in
.agents/adding-backends.md).

Verified with a local Node simulation of the script against this PR's
diff: the path map now contains 'ds4 -> backend/cpp/ds4/' and a
'backend/cpp/ds4/Makefile' change correctly triggers the ds4 backend
in the rebuild set.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(adding-backends): harden the two gotchas that bit ds4

Both omissions are silent at the time you ADD a backend - the failure
mode only appears later (the bump bot stays silent forever, or the path
filter shows up on the next PR that touches your backend with zero CI
jobs and looks broken for unrelated reasons). Expanding the
`scripts/changed-backends.js` paragraph from a one-liner to a fully
worked example, and adding a new sibling paragraph for the
`bump_deps.yaml` + Makefile-pin contract.

Both call out the specific mistakes from the ds4 timeline (#9758, #9761)
so future contributors can pattern-match on the cause.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:46:02 +02:00
LocalAI [bot]
d892e4af80 feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758)
* test(e2e-backends): allow BACKEND_BINARY for native-built backends

Adds an escape hatch for hardware-gated backends (e.g. ds4) where the
model is too large for Docker build context. When BACKEND_BINARY points
at a run.sh produced by 'make -C backend/cpp/<name> package', the suite
skips docker image extraction and drives the binary directly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(e2e-backends): validate BACKEND_BINARY basename + log actual source

Two follow-ups from the cbcf5148 code review:

- BACKEND_BINARY now requires a path whose basename is `run.sh`. Without
  this check, `filepath.Dir(binary)` silently discarded the filename, so
  pointing the env var at an arbitrary binary failed later with a
  confusing assertion that named a path the user never typed.
- The "Testing image=..." debug line printed an empty string when the
  binary path was used, hiding the actual source in CI logs. The line
  now reports whichever of BACKEND_IMAGE / BACKEND_BINARY is in effect
  as `src=...`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): scaffold ds4 backend dir

Adds prepare.sh, run.sh, and a .gitignore. CMakeLists, Makefile, and the
implementation arrive in follow-up commits.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add backend Makefile

Drives ds4's upstream Makefile to produce engine .o files (CUDA on Linux
when BUILD_TYPE=cublas, Metal on Darwin, otherwise CPU debug path), then
invokes CMake on our wrapper.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add CMakeLists for grpc-server

Generates protoc stubs from backend.proto, links grpc-server.cpp +
dsml_parser.cpp + dsml_renderer.cpp + kv_cache.cpp against pre-built
ds4 engine .o files. DS4_GPU=cuda|metal|cpu selects the backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): grpc-server skeleton + module stubs

The minimum that links: Backend service with Health + Free; other RPCs
default to UNIMPLEMENTED. Stub headers/sources for dsml_parser,
dsml_renderer, and kv_cache are in place so CMake links cleanly even
before those modules ship.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement LoadModel

Opens engine + creates session sized to ContextSize (default 32768).
Backend is compile-time: CPU when DS4_NO_GPU, Metal on __APPLE__, else
CUDA. MTP/speculative options are accepted via ModelOptions.Options[]
(mtp_path, mtp_draft, mtp_margin). kv_cache_dir option is captured into
g_kv_cache_dir for the cache module (Task 19 wires it in).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement TokenizeString

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Predict (plain text)

Tool calls + thinking-mode split arrive in Task 13 once dsml_parser is in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement PredictStream (plain text)

ChatDelta + reasoning/tool_calls split arrives in Task 14.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Status RPC

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add DSML streaming parser

Classifies raw model-emitted token text into CONTENT / REASONING /
TOOL_START / TOOL_ARGS / TOOL_END events. Markers it watches for are the
literal DSML strings rendered by ds4_server.c's prompt template
(<|DSML|tool_calls>, <|DSML|invoke name=...>, <think>, etc.) - these are
plain text the model emits, not special tokens.

Partial markers split across token chunks are buffered until a full marker
or a definitively-not-a-marker '<' is observed. RandomToolId() generates
the API-side tool call id (call_xxx) that exact-replay would key on.
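
A hypothetical sketch of the event surface this implies (field names are
illustrative, not quoted from dsml_parser.h):

    enum class DsmlEventType { CONTENT, REASONING, TOOL_START, TOOL_ARGS, TOOL_END };
    struct DsmlEvent {
        DsmlEventType type;
        std::string   text;       // content / reasoning / incremental args
        int           tool_index; // which tool call the chunk belongs to
        std::string   tool_id;    // RandomToolId()'s call_xxx, set on TOOL_START
        std::string   tool_name;  // invoke name="...", set on TOOL_START
    };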

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): split hex escapes in DSML markers + add cstring/cstdio includes

C++ \x hex escapes have no length cap. '\x9cD' was read as a single escape
producing byte 0xCD, eating the 'D'. The markers were never actually matching
the DSML text the model emits. Split each escape with adjacent string literal
concatenation so the byte sequence is exactly EF BD 9C 44 (|D) at runtime.
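
The fix in miniature (the marker text around the |D bytes is elided):

    // before: "\x9cD"  -- one escape, 0x9CD narrows to byte 0xCD, 'D' is gone
    // after:  adjacent literals cap the escape at one byte each
    static const char kPipeD[] = "\xef\xbd\x9c" "D";  // EF BD 9C 44 at runtime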

Also adds <cstring> and <cstdio> includes (libstdc++ 13 does not transitively
expose std::strlen / std::snprintf via <string>).

The local plan file (uncommitted) was also updated with the same fixes so
Task 16's dsml_renderer.cpp does not re-introduce the bug.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into Predict (ChatDelta)

Non-streaming Predict now emits one ChatDelta carrying content,
reasoning_content, and tool_calls[] parsed from the model's DSML output.
Reply.message still carries the raw model bytes for backends that prefer
the regex fallback path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into PredictStream

Per-token ChatDelta writes: content/reasoning_content go incrementally,
tool_calls emit TOOL_START as one delta (id + name) followed by
TOOL_ARGS deltas with incremental JSON. The Go-side aggregator
(pkg/functions/chat_deltas.go) reassembles them.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): chat template + reasoning_effort mapping

UseTokenizerTemplate=true + Messages -> ds4_chat_begin / append /
assistant_prefix. PredictOptions.Metadata['enable_thinking'] and
['reasoning_effort'] map to ds4_think_mode (DS4_THINK_HIGH default;
'max'/'xhigh' -> DS4_THINK_MAX; disabled -> DS4_THINK_NONE).
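
The mapping, sketched (variable names are illustrative):

    ds4_think_mode mode = DS4_THINK_HIGH;                 // default: thinking on
    if (!enable_thinking)                          mode = DS4_THINK_NONE;
    else if (effort == "max" || effort == "xhigh") mode = DS4_THINK_MAX;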

Tool-call rendering for assistant turns with tool_calls JSON arrives in
the next commit (dsml_renderer).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): render assistant tool_calls + tool results to DSML

Closes the round-trip: when an OpenAI client sends a multi-turn chat
where prior turns contain tool_calls or role=tool messages, build_prompt
serializes them back to the DSML shape the model was trained on. Mirrors
ds4_server.c's prompt renderer; uses nlohmann::json for parsing the
OpenAI tool_calls payload.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): disk KV cache module

Dir-based cache keyed by SHA1(rendered prompt prefix). File format:
'DS4G' magic + version + ctx_size + prefix_len + prefix + payload_bytes
+ ds4_session_save_payload output. NOT bit-compatible with ds4-server's
KVC files - that interop is a follow-up plan. LoadLongestPrefix walks
the dir picking the longest stored prefix that prefixes the incoming
prompt.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire KvCache into Predict/PredictStream

LoadModel reads 'kv_cache_dir' from ModelOptions.Options[], passes it to
g_kv_cache.SetDir. Each Predict/PredictStream computes a render text for
the request, tries LoadLongestPrefix to recover state, then Saves the
new state after generation. ds4_session_sync handles the live-cache
fast path internally, so the disk cache only matters for cold-starts
and cross-session reuse.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add package.sh

Linux: bundles libc + ld + libstdc++ + libgomp + GPU runtime libs into
package/lib so the FROM scratch image boots without a host libc.
Darwin is handled by scripts/build/ds4-darwin.sh which uses otool -L.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): rename namespace ds4_backend -> ds4cpp

ds4.h defines 'typedef enum {...} ds4_backend' which collides with our
C++ 'namespace ds4_backend' anywhere a TU includes both. kv_cache.h
includes ds4.h directly and surfaces the conflict immediately; other
TUs would hit it once gRPC dev headers are available.

Renames the C++ namespace to ds4cpp across all wrapper files and the
plan, leaving the upstream ds4 typedef untouched.
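
The collision in miniature (enum contents are illustrative):

    typedef enum { DS4_BACKEND_CPU } ds4_backend;  // upstream ds4.h
    // namespace ds4_backend {}  // error: redeclared as a different kind of entity
    namespace ds4cpp {}          // the rename sidesteps the clash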

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): add Dockerfile.ds4

Single-stage builder (CUDA devel image for cublas, ubuntu:24.04 for cpu)
-> FROM scratch with packaged grpc-server + bundled runtime libs.
nlohmann-json3-dev is required for dsml_renderer's JSON handling.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(make): wire backend/cpp/ds4 + ds4-darwin into root Makefile

BACKEND_DS4 entry + generate-docker-build-target eval + docker-build-ds4
in docker-build-backends + .NOTPARALLEL guards. Also adds the
backends/ds4-darwin target which delegates to scripts/build/ds4-darwin.sh
(landed in Task 24).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend-matrix entries for ds4 (cpu + cuda13, per-arch)

Two entries per build (amd64 + arm64) so backend-merge-jobs assembles a
multi-arch manifest. Skipping cuda12 - ds4 was validated against CUDA 13.
Darwin Metal is handled outside this matrix by backend_build_darwin.yml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add ds4 meta + image entries

cpu + cuda13 x latest + master. Darwin Metal builds publish under
ds4-darwin via the existing llama-cpp-darwin OCI pipeline.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(scripts/build): add ds4-darwin.sh

Native macOS/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh:
make grpc-server -> otool -L for dylib bundling -> OCI tar that
'local-ai backends install' consumes via the backends/ds4-darwin
Makefile target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(darwin): build ds4-darwin in backend_build_darwin

Adds a 'Build ds4 backend (Darwin Metal)' step that runs the
backends/ds4-darwin Makefile target on the macOS runner.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(import): auto-detect ds4 weights via DS4Importer

Adds core/gallery/importers/ds4.go which matches on the antirez/deepseek-v4-gguf
repo URI and the DeepSeek-V4-Flash-*.gguf filename pattern. Registered before
LlamaCPPImporter so ds4 weights route to backend: ds4 instead of falling
through to llama-cpp.

Also lists ds4 in /backends/known so the /import-model UI surfaces it as a
manual choice for users who want to force the backend on a non-canonical URI.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add deepseek-v4-flash-q2 (ds4 backend)

One-click install of the q2 weights with backend: ds4.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(.agents): add ds4-backend.md

Documents the backend shape, DSML state machine, thinking-mode mapping,
disk KV cache, build matrix (cpu/cuda13/Darwin), and the BACKEND_BINARY
hardware-validation path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): pass UBUNTU_VERSION + arch env vars to install-base-deps

The .docker/install-base-deps.sh script needs UBUNTU_VERSION (defaults to
2404), TARGETARCH, SKIP_DRIVERS, and APT_MIRROR/APT_PORTS_MIRROR exported
into the environment so it can pick the right cuda-keyring / cudss / nvpl
debs and apt mirrors. Dockerfile.ds4 was declaring some of the ARGs but not
re-exporting them via ENV. Mirrors Dockerfile.llama-cpp's pattern.

Without this fix 'make docker-build-ds4 BUILD_TYPE=cublas CUDA_MAJOR_VERSION=13'
failed at:
  /usr/local/sbin/install-base-deps: line 120: UBUNTU_VERSION: unbound variable

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add Metal image entries for ds4

Adds metal-ds4 + metal-ds4-development image entries pointing at
quay.io/go-skynet/local-ai-backends:{latest,master}-metal-darwin-arm64-ds4
(built by scripts/build/ds4-darwin.sh on macOS arm64 runners), plus the
'metal' and 'metal-darwin-arm64' capability mappings on the ds4 meta and
ds4-development variant.

Closes a gap from the initial Task 23 landing - the Darwin Metal build
script and CI workflow step were already wired (Tasks 24-25), but the
gallery had no image entry for users to install the Metal variant.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): use ubuntu:24.04 base for ds4 cuda13 matrix entries

The initial Task 22 matrix landing used base-image: 'nvidia/cuda:13.0.0-devel-ubuntu24.04'
which clashes with install-base-deps.sh's cuda-keyring step:

  E: Conflicting values set for option Signed-By regarding source
     https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/

The canonical pattern (llama-cpp, ik-llama-cpp, turboquant) uses plain
'ubuntu:24.04' + 'skip-drivers: false' so install-base-deps installs CUDA
from scratch via its own keyring setup. Adopting that here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): drop install-base-deps.sh dependency

The .docker/install-base-deps.sh pipeline is built around the llama-cpp
needs: NVIDIA keyring + cuda-toolkit apt + gRPC-from-source build at
/opt/grpc. For ds4 we don't need any of that:
- CUDA: nvidia/cuda:13.0.0-devel-ubuntu24.04 ships /usr/local/cuda
  ready to go; install-base-deps's keyring step then conflicts with
  the pre-installed Signed-By.
- gRPC: ds4's grpc-server.cpp only links against grpc++; system
  libgrpc++-dev (apt) is sufficient, no source build needed.

Replaced the install-base-deps invocation in Dockerfile.ds4 with a
direct 'apt-get install libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc
nlohmann-json3-dev cmake build-essential pkg-config git'. Matrix entries
back to nvidia/cuda base + skip-drivers=true so install-base-deps would
no-op even if some downstream tooling calls it.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): correct proto accessors + alias grpc::Status as GStatus

Two compile bugs caught by the docker build:

1. proto::Message uses snake_case accessors. The build_prompt loop called
   m.toolcalls() / m.toolcallid() - the protoc-generated names are
   m.tool_calls() / m.tool_call_id(). Plan-text bug propagated to the
   wrapper.

2. The Status RPC method shadowed the 'using grpc::Status' alias, so any
   later method declaration using Status as a return type failed to parse
   ('Status does not name a type' starting at LoadModel). Solution: alias
   grpc::Status as GStatus instead, with no 'using' clause that would
   conflict. All RPC method declarations and return-statement constructions
   now use GStatus.
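
A minimal repro of the clash, with a stand-in for grpc::Status:

    namespace grpc { struct Status {}; }
    struct BackendService {
        grpc::Status Status();         // method now hides the type name
        // Status LoadModel();         // error: 'Status' names the method above
        using GStatus = grpc::Status;  // the fix: an alias nothing shadows
        GStatus LoadModel();           // parses fine
    };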

A code reviewer had already flagged the Status-shadow concern as 'minor'
in the original Task 10 commit; it turned out to be a real compile blocker
under libstdc++ 13 once the surrounding methods were filled in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): preserve TOOL_ARGS content in dsml_parser Flush

When the model emitted a parameter value that arrived in the same buffer
as the surrounding tool_call markers (e.g. the buffered tail after a
literal '</think>' opened the model output), the parser deferred all
buffered bytes to Flush() because looks_like_prefix() always returns
true while buf starts with '<'. Flush() then drained the buffer as
plain CONTENT/REASONING regardless of parser state, so the bytes
between the parameter open and close markers were classified as
CONTENT instead of TOOL_ARGS.

Symptom: the model emitted

  <|DSML|parameter name="location" string="true">Paris, France</|DSML|parameter>

and the assembled tool_call arguments came out as {"location":""} -
the opener and closer were emitted into the args stream but the
"Paris, France" content went to the assistant message instead.

Fix:

1. Flush() now uses the same state-aware emit logic as DrainPlain:
   PARAM_VALUE bytes become TOOL_ARGS (json-escaped when string),
   THINK bytes become REASONING, TEXT bytes become CONTENT, and
   INVOKE / TOOL_CALLS structural whitespace is discarded.

2. looks_like_prefix() restricts its leading-'<' fallback to buffers
   that have not yet seen a '>'. Without that change, char-by-char
   feeds would discard the '<' of '<|DSML|invoke name="..."' once
   the marker prefix length was reached but the closing quote/'>'
   were still in flight.
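
A sketch of the rule-2 restriction (the marker-prefix check is a
hypothetical helper name):

    bool looks_like_prefix(const std::string & buf) {
        if (matches_marker_prefix(buf)) return true;   // real marker prefixes
        // leading-'<' fallback only until a '>' has been seen
        return !buf.empty() && buf[0] == '<' &&
               buf.find('>') == std::string::npos;
    }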

Verified with a standalone harness that runs the failing input three
ways (single Feed, split-after-'>', and char-by-char) and aggregates
TOOL_ARGS for tool index 0: all three now produce
{"location":"Paris, France"}.

Assisted-by: Claude:opus-4.7 [Read,Edit,Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): use ds4_session_sync + manual generation loop for KV persistence

ds4_engine_generate_argmax() is a self-contained helper that doesn't take or
update a ds4_session - it manages its own internal state. Our Predict and
PredictStream methods created g_session via ds4_session_create() but then
called ds4_engine_generate_argmax(), so g_session's KV state never advanced.
ds4_session_payload_bytes(g_session) returned 0 and the disk KV cache save
correctly rejected with 'session has no valid checkpoint to save'.

Switch both RPCs to the proper session API:
  ds4_session_sync(g_session, &prompt, ...)
  loop:
    int token = ds4_session_argmax(g_session)
    if token == eos: break
    emit(token)
    ds4_session_eval(g_session, token, ...)

After the loop the session has a real checkpoint and ds4_session_save_payload
writes the KV state to disk. Verified end-to-end on a DGX Spark GB10: three
.kv files (15-30 MB each) are written when BACKEND_TEST_OPTIONS sets
kv_cache_dir, and the e2e tool-call assertion still passes.

Also added stderr diagnostics to KvCache (enabled/disabled at SetDir; per-save
path + payload_bytes + result) so future failures are visible instead of
silent. The 'wrote ok' lines are low-volume - one per Predict/PredictStream
when the cache is enabled - and skipped entirely when the option is unset.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): use ds4_session_eval_speculative_argmax when MTP loaded

Wires MTP (Multi-Token Prediction) speculative decoding into the manual
generation loop in both Predict and PredictStream. When the upstream MTP
weights are loaded via 'mtp_path:' option AND we're on CUDA / Metal,
ds4_engine_mtp_draft_tokens() returns >0 and we switch the inner loop to
ds4_session_eval_speculative_argmax(), which can accept N>1 tokens per
verifier step. When MTP is not loaded (no option, CPU backend, or weights
absent), we fall through to the simple ds4_session_argmax + ds4_session_eval
path with no behavior change.

Validated on a DGX Spark GB10 with the optional MTP GGUF
(DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, ~3.6 GB). LoadModel logs
'ds4: MTP support model loaded ... (draft=2)' on stderr.

Caveat per upstream README: 'currently provides at most a slight speedup,
not a meaningful generation-speed win'. Wired now mainly to track the
upstream API; bigger speedups arrive when ds4 improves the speculative path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): honor PredictOptions sampling with DSML-aware override

Mirrors ds4_server.c:7102-7115 sampling-policy semantics on the LocalAI
gRPC side. The generation loop now consults compute_sample_params() per
token to pick the effective (temperature, top_k, top_p, min_p), based on:

  1. Request defaults: PredictOptions.temperature / .topk / .topp / .minp
  2. Thinking-mode override: when enable_thinking != false, force T=1.0,
     top_k=0, top_p=1.0, min_p=0.0 (creativity for the reasoning pass and
     the trailing content)
  3. DSML structural override: when DsmlParser::IsInDsmlStructural()
     returns true (we are between tool-call markers but NOT in a param
     value payload), force T=0.0 so protocol bytes parse cleanly

When the effective temperature is 0, we keep using ds4_session_argmax +
MTP speculative path (matches ds4-server's gate that only enables MTP for
greedy positions). When > 0, we call ds4_session_sample(s, T, ...) with
a per-thread RNG seeded from system_clock and fall back to single-token
ds4_session_eval.

New public method on DsmlParser: IsInDsmlStructural() encodes which states
need protocol-byte determinism. PARAM_VALUE is excluded (payload uses user
sampling); TEXT and THINK are excluded (no tool-call context to protect).

Verified on the DGX Spark GB10: the e2e suite still passes with all 5
specs including tools, and the Predict output now varies between runs
(creative sampling active) while the tool-call args remain a clean
'{"location":"Paris, France"}' because the parser-state check forces
greedy on the structural bytes.

UX note: thinking mode is ON by default (matching ds4-server). Users who
want deterministic output should set Metadata.enable_thinking = false.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add sha256 to deepseek-v4-flash-q2 entry

Per HF LFS metadata for antirez/deepseek-v4-gguf:
  size: 86720111200 bytes (~80.76 GiB)
  sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c

LocalAI's downloader verifies sha256 when present, so users who install
deepseek-v4-flash-q2 from the gallery get integrity-checked weights and
the partial-download issue (an 81 GB file is easy to truncate) becomes
recoverable instead of silently producing a broken backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:15:47 +02:00
LocalAI [bot]
b9e81dbfd4 chore: ⬆️ Update ggml-org/llama.cpp to 389ff61d77b5c71cec0cf92fe4e5d01ace80b797 (#9752)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-11 08:14:07 +02:00
LocalAI [bot]
a435f7cc69 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 23127139cb6fa314899c3b5f4935b88b3374c56c (#9748)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-10 21:32:28 +02:00
LocalAI [bot]
f6c9c20911 chore: ⬆️ Update ggml-org/llama.cpp to 2b2babd1243c67ca811c0a5852cedf92b1a20024 (#9747)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 21:17:38 +02:00
LocalAI [bot]
6cbf69dc29 chore: ⬆️ Update ggml-org/llama.cpp to 1e5ad35d560b90a8ac447d149c8f8447ae1fcaa0 (#9739)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 00:06:29 +02:00
LocalAI [bot]
a91e718473 chore: ⬆️ Update ggml-org/llama.cpp to 00d56b11c3477b99bc18562dc1d1834f0d961778 (#9733)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-09 12:05:11 +02:00
LocalAI [bot]
d1eef05852 chore: ⬆️ Update ikawrakow/ik_llama.cpp to ab0f22b819ac57b7e7484f69c00c10fc755d5c6c (#9734)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 11:18:59 +02:00
LocalAI [bot]
4542833cb4 chore: ⬆️ Update ggml-org/llama.cpp to 9f5f0e689c9e977e5f23a27e344aa36082f44738 (#9724)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 10:18:05 +02:00
LocalAI [bot]
14a3275329 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 98950267c67fd95937a54ebd6e3c66cf2679b710 (#9725)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 00:06:05 +02:00
LocalAI [bot]
3b84582567 chore: ⬆️ Update ggml-org/llama.cpp to 05ff59cb57860cc992fc6dcede32c696efea711c (#9714)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:44:17 +02:00
LocalAI [bot]
907929ce60 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 9a26522af234f8db079ae3735f35ab6c20fe2c66 (#9713)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:43:44 +02:00
LocalAI [bot]
151d6c9cf0 chore: ⬆️ Update ggml-org/llama.cpp to 2496f9c14965c39589f53eea31bdb6d762b1d360 (#9698)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:29:27 +02:00
LocalAI [bot]
659939db9b chore: ⬆️ Update ikawrakow/ik_llama.cpp to b93721902b4662f9b973b1c412006081c958d085 (#9697)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:29:12 +02:00
LocalAI [bot]
a315c321c1 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 69d8e4be47243e83b3d0d71e932bc7aa61c644dc (#9638)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-06 00:29:05 +02:00
LocalAI [bot]
d5ce823b83 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 8b56d813a9ed04fa7b7fe2588fddd845cf64eccb (#9677)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 23:46:09 +02:00
LocalAI [bot]
c9141098b6 chore: ⬆️ Update ggml-org/llama.cpp to bbeb89d76c41bc250f16e4a6fefcc9b530d6e3f3 (#9676)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 23:45:54 +02:00
LocalAI [bot]
1634eece6b chore: ⬆️ Update ikawrakow/ik_llama.cpp to 45dfd80371785731bc2ed05a76252497a4e7a282 (#9644)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 15:09:40 +02:00
LocalAI [bot]
b88ddce0f3 chore: ⬆️ Update ggml-org/llama.cpp to eff06702b2a52e1020ea009ebd86cb9f5acabab5 (#9637)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 09:52:28 +02:00
Russell Sim
18e039f305 fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds (#9626)
* fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds

399c1dec wired amdgpu-targets through the backend_build workflow_call
interface, intending the input's default value to cover matrix entries
that don't specify targets. However, GitHub Actions only applies a
workflow_call input default when the caller omits the input entirely.
When backend.yml passes `amdgpu-targets: ${{ matrix.amdgpu-targets }}`
and the matrix entry has no amdgpu-targets key, the expression evaluates
to an empty string, which is treated as an explicit value — bypassing
the default. The result is Docker receiving AMDGPU_TARGETS="" which in
turn causes Make's ?= default to be skipped (since the variable is
already set in the environment, even to empty), and cmake gets
-DAMDGPU_TARGETS= with no targets, so the HIP backend compiles for an
indeterminate target rather than the intended GPU list.

Fix this at three levels:

1. backend.yml: use a || fallback in the expression so that an undefined
   matrix.amdgpu-targets never reaches the reusable workflow as an empty
   string. The target list is the canonical default and lives here.

2. backend_build.yml: remove the now-misleading default value from the
   input declaration. The default never fired due to the above bug, so
   keeping it implied a guarantee that didn't exist.

3. backend/cpp/llama-cpp/Makefile: add an explicit $(error ...) guard
   after the ?= assignment so that if AMDGPU_TARGETS is empty (whether
   from environment or any future CI wiring mistake) the build fails
   immediately with a clear message rather than silently producing a
   binary compiled for an unknown GPU target.

Assisted-by: Claude Code:claude-sonnet-4-6
Signed-off-by: Russell Sim <rsl@simopolis.xyz>

* fix(build): plumb AMDGPU_TARGETS through to Docker builds

The docker-build-backend Makefile macro and Dockerfile.golang did not
pass AMDGPU_TARGETS to the inner make invocation, so hipblas builds
always used the backend Makefile's hardcoded default GPU targets
regardless of what was specified via environment or CI inputs.

Signed-off-by: Russell Sim <rsl@simopolis.xyz>

---------

Signed-off-by: Russell Sim <rsl@simopolis.xyz>
2026-05-02 15:53:14 +02:00
LocalAI [bot]
9c4c3f9d8f chore: ⬆️ Update ggml-org/llama.cpp to beb42fffa45eded44804a1fd4916146222371581 (#9624)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-01 02:02:56 +02:00
LocalAI [bot]
273416f54b chore: ⬆️ Update ikawrakow/ik_llama.cpp to a8aecbf15933295af96504f9a693998322185b5c (#9625)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-01 02:02:29 +02:00
Ettore Di Giacinto
c02a50f2ab feat(llama-cpp): bump to d775992 and adapt to spec params refactor (#9618)
Bumps backend/cpp/llama-cpp/Makefile LLAMA_VERSION from 665abc6 to
d775992, picking up upstream PR ggml-org/llama.cpp#22397 which splits
common_params_speculative into nested draft / ngram_simple / ngram_mod
sub-structs. Renames every grpc-server.cpp reference to match:

  speculative.mparams_dft.path  -> speculative.draft.mparams.path
  speculative.{n_max,n_min}     -> speculative.draft.{n_max,n_min}
  speculative.{p_min,p_split}   -> speculative.draft.{p_min,p_split}
  speculative.{n_gpu_layers,n_ctx} -> speculative.draft.{n_gpu_layers,n_ctx}
  speculative.ngram_size_n      -> speculative.ngram_simple.size_n
  speculative.ngram_size_m      -> speculative.ngram_simple.size_m
  speculative.ngram_min_hits    -> speculative.ngram_simple.min_hits

The "speculative.n_max" JSON key sent to the upstream server stays
unchanged — server-task.cpp still reads it and routes the value into
draft.n_max internally.

The turboquant fork (TheTom/llama-cpp-turboquant @ 11a241d) branched
before #22397 and still exposes the flat layout. Since turboquant
reuses the shared backend/cpp/llama-cpp/grpc-server.cpp, extend
patch-grpc-server.sh with an idempotent sed block that reverts the
ten field references back to the legacy flat names on the build copy
only — the original under backend/cpp/llama-cpp/ stays compiling
against vanilla upstream. Drop the block once the fork rebases.

ik-llama-cpp has its own grpc-server.cpp with no speculative refs
(0/2661 lines), so it is unaffected.

Validated locally with `make docker-build-llama-cpp` (avx, avx2,
avx512, fallback, grpc + rpc-server all built; image exported).


Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-30 08:44:43 +02:00
LocalAI [bot]
55afda22e3 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 453a027c17e4d63a7f16b871197a396240a65138 (#9608)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-29 00:18:19 +02:00
LocalAI [bot]
b69bacfcdc chore: ⬆️ Update ikawrakow/ik_llama.cpp to d6f3e4e28fbf75e6181e6ea32e734de9ce9304fd (#9585)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-28 08:43:51 +02:00
LocalAI [bot]
8e50066fa2 chore: ⬆️ Update ggml-org/llama.cpp to 665abc609740d397d30c0d8ef4157dbf900bd1a3 (#9584)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-28 08:43:33 +02:00
LocalAI [bot]
05e94bd9e7 chore: ⬆️ Update ggml-org/llama.cpp to f53577432541bb9edc1588c4ef45c66bf07e4468 (#9577)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-27 08:57:24 +02:00
LocalAI [bot]
d9cb0d6133 chore: ⬆️ Update ggml-org/llama.cpp to dcad77cc3b0865153f486327064fb0320a57a476 (#9572)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-26 12:38:35 +02:00
LocalAI [bot]
f5c268deac chore: ⬆️ Update TheTom/llama-cpp-turboquant to 11a241d0db78a68e0a5b99fe6f36de6683100f6a (#9571)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-26 12:38:25 +02:00
LocalAI [bot]
1c45227346 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3a945af45d45936341a45bbf7deda56776a4af26 (#9570)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-26 08:26:37 +02:00
LocalAI [bot]
806ea24ff4 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 67559e580b10e4e47e9a6fd6218873997976886d (#9497)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 14:03:46 +02:00
Ettore Di Giacinto
21eace40ec feat(llama-cpp): expose split_mode option for multi-GPU placement (#9560)
Adds split_mode (alias sm) to the llama.cpp backend options allowlist,
accepting none|layer|row|tensor. The tensor value targets the experimental
backend-agnostic tensor parallelism from ggml-org/llama.cpp#19378 and
requires a llama.cpp build that includes that PR, FlashAttention enabled,
KV-cache quantization disabled, and a manually set context size.


Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-25 14:02:57 +02:00
LocalAI [bot]
08e393f7db chore: ⬆️ Update ikawrakow/ik_llama.cpp to cb58a561f0c49f68b6d125cdfda037ed80433821 (#9549)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 08:59:48 +02:00
LocalAI [bot]
47cc3dc8d7 chore: ⬆️ Update ggml-org/llama.cpp to 361fe72acb7b9bd79059cc177cbeda99b35b5db9 (#9548)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 08:58:27 +02:00
Ettore Di Giacinto
c0920f3273 fix(ik-llama-cpp): patch clip.cpp for new ggml_quantize_chunk signature (#9531)
Bumps ik_llama.cpp pin to 16996aeab7. Upstream 286ce32...16996ae adds a
trailing `const struct quantize_user_data *` parameter to
`ggml_quantize_chunk` (PR ikawrakow/ik_llama.cpp#1677) but leaves
`examples/llava/clip.cpp` unchanged because their build has moved to
`examples/mtmd/`. LocalAI's prepare.sh still copies from
`examples/llava/`, so the dead 7-arg call reaches the grpc-server
compile and fails. Patch the call site to pass `nullptr` for the new
param.
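
Shape of the patched call (argument names abbreviated; only the trailing
parameter is new):

    ggml_quantize_chunk(type, src, dst, start, nrows, n_per_row, imatrix,
                        /*user_data=*/nullptr);  // new 8th parameter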

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
2026-04-24 13:07:26 +02:00
LocalAI [bot]
7c1934b183 chore: ⬆️ Update ggml-org/llama.cpp to 187a45637054881ecacf17f8e2f6f8f2ba7df1c7 (#9520)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-24 09:17:06 +02:00
Ettore Di Giacinto
ed648b3b4e fix(llama-cpp): include server-chat.cpp in grpc-server translation unit (#9511)
* fix(llama-cpp): include server-chat.cpp in grpc-server translation unit

Upstream llama.cpp refactor (ggml-org/llama.cpp#20690) moved the
OAI/Anthropic/Responses and transcription conversion helpers out of
server-common.cpp into a new server-chat.cpp, and server-task.cpp and
server-context.cpp now call those symbols (convert_transcriptions_to_chatcmpl,
server_chat_convert_responses_to_chatcmpl, server_chat_convert_anthropic_to_oai,
server_chat_msg_diff_to_json_oaicompat) via server-chat.h.

grpc-server.cpp builds as a single translation unit by #include-ing the
upstream .cpp files directly. Without including server-chat.cpp, the
declarations are satisfied at compile time via server-chat.h but the
link step fails with undefined references once LLAMA_VERSION crosses
the refactor commit (134d6e54).

Guard the include with __has_include so the same source stays buildable
on older LLAMA_VERSION pins that predate the refactor (where prepare.sh
won't copy server-chat.cpp into tools/grpc-server/).
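
The guard, sketched (the exact include path may differ in grpc-server.cpp):

    #if defined(__has_include)
    #  if __has_include("server-chat.cpp")
    #    include "server-chat.cpp"  // post-refactor pins: pull the new TU in
    #  endif
    #endif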

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(llama-cpp): bump LLAMA_VERSION to 0d0764dfd

Bump to ggml-org/llama.cpp@0d0764dfd2.
Paired with the preceding grpc-server server-chat.cpp include so the
refactor at 134d6e54 links cleanly. Supersedes PR #9494.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-23 14:59:39 +02:00
Ettore Di Giacinto
04f1a0285d fix(ik-llama-cpp): adapt to common_grammar struct in sampling.h (#9512)
Upstream ik_llama.cpp commit e0596bf6 ("Autoparser") changed
common_params_sampling::grammar from std::string to a common_grammar
struct (type + grammar), which broke our two direct accesses:

 - JSON ingest fed the field through json_value<common_grammar>(...),
   for which nlohmann has no from_json adapter.
 - JSON export emitted the struct directly, for which nlohmann has no
   to_json adapter.

Wrap the incoming JSON string in common_grammar{COMMON_GRAMMAR_TYPE_USER, ...}
and serialize via the inner .grammar member, mirroring upstream's
examples/server/server-context.cpp.
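
The two adapted sites, sketched (surrounding JSON variable names are
illustrative):

    // ingest: wrap the plain string instead of json_value<common_grammar>
    sparams.grammar = common_grammar{COMMON_GRAMMAR_TYPE_USER,
                                     json_value(data, "grammar", std::string())};
    // export: emit the inner string, not the struct
    out["grammar"] = sparams.grammar.grammar;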

Also bump IK_LLAMA_VERSION to 286ce324baed17c95faec77792eaa6bdb1c7a5f5
so the local-ai side lines up with the dependency bump in #9496.

Assisted-by: Claude-Code:claude-opus-4-7
2026-04-23 13:45:06 +02:00
orbisai0security
bbeacf140d fix: remove unsafe sprintf() in grpc-server.cpp (#9486)
fix: V-001 security vulnerability

Automated security fix generated by Orbis Security AI
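
Typical shape of such a fix (the actual call site is not quoted in this
message):

    char buf[64];
    snprintf(buf, sizeof(buf), "%d", value);  // bounded; replaces sprintf(buf, ...)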
2026-04-22 21:57:29 +02:00
LocalAI [bot]
cd7b035716 chore: ⬆️ Update ggml-org/llama.cpp to 5a4cd6741fc33227cdacb329f355ab21f8481de2 (#9479)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-22 08:58:19 +02:00