LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-05-17 13:10:23 -04:00

Author	SHA1	Message	Date
LocalAI [bot]	6e1dbae256	feat(llama-cpp): expose 12 missing common_params via options[] (#9814 ) The llama.cpp backend already accepts a free-form options: array in the model config that maps to common_params fields, but a coverage audit against upstream pin 7f3f843c flagged 12 user-visible knobs that were neither set via the typed proto fields nor reachable via options:. Wire them up under the existing if/else chain in params_parse, before the speculative section. Each new option follows the file's prevailing patterns (try/catch around numeric parses, the same true/1/yes/on bool form used elsewhere, hardware_concurrency() fallback for thread counts, mirror of draft_override_tensor for override_tensor). Top-level / batching / IO: - n_ubatch (alias ubatch) -- physical batch size; was previously force-aliased to n_batch at line 482, blocking embedding/rerank workloads that need independent control - threads_batch (alias n_threads_batch) -- main-model batch threads; mirrors the existing draft_threads_batch - direct_io (alias use_direct_io) -- O_DIRECT model loads - verbosity -- llama.cpp log threshold (line 479 had this commented out) - override_tensor (alias tensor_buft_overrides) -- per-tensor buffer overrides for the main model; mirrors draft_override_tensor Embedding / multimodal: - pooling_type (alias pooling) -- mean/cls/last/rank/none; previously only auto-flipped to RANK for rerankers - embd_normalize (alias embedding_normalize) -- and the embedding handler now reads params_base.embd_normalize instead of a hardcoded 2 at the previous embd_normalize literal in Embedding() - mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU - image_min_tokens / image_max_tokens -- per-image vision token budget Reasoning surface (the audit-focus three; LocalAI's existing ReasoningConfig.DisableReasoning only feeds the per-request chat_template_kwargs.enable_thinking and does not touch any of these): - reasoning_format -- none/auto/deepseek/deepseek-legacy parser - enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget - prefill_assistant -- trailing-assistant-message prefill toggle All 14 referenced fields exist on both the upstream pin and the turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard is needed. Docs: extend model-configuration.md with new "Reasoning Models", "Multimodal Backend Options", "Embedding & Reranking Backend Options", and "Other Backend Tuning Options" subsections; also refresh the Speculative Type Values table to show the new dash-separated canonical names alongside the underscore aliases LocalAI still accepts. Assisted-by: claude-code:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-14 08:53:34 +02:00
LocalAI [bot]	53bdb18d10	chore: ⬆️ Update ggml-org/llama.cpp to `7f3f843c31cd32dc4adc10b393342dfee071c332` (#9809 ) * ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(llama-cpp): adapt to upstream COMMON_SPECULATIVE_TYPE_DRAFT rename ggml-org/llama.cpp#22964 ("spec: update CLI arguments for better consistency") renamed the speculative type enum values: COMMON_SPECULATIVE_TYPE_DRAFT -> COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE COMMON_SPECULATIVE_TYPE_EAGLE3 -> COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3 and the registered name strings flipped from underscore- to dash- separated form (e.g. ngram_simple -> ngram-simple), with the bare draft/eagle3 aliases replaced by draft-simple/draft-eagle3. This broke the build with the new LLAMA_VERSION on every variant (vulkan/arm64, darwin and likely all the rest) at grpc-server.cpp:461. Update the upstream branch of the speculative-type fallback to use the new identifier (the LOCALAI_LEGACY_LLAMA_CPP_SPEC fork branch keeps the old name), and normalize spec_type option tokens before passing them to common_speculative_types_from_names so existing model configs that say spec_type:draft / spec_type:ngram_simple keep working. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7 --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-14 08:53:23 +02:00
LocalAI [bot]	bc4cd3dd85	feat(llama-cpp): bump to `1ec7ba0c`, adapt grpc-server, expose new spec-decoding options (#9765 ) * chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994 Picks up the upstream `spec : parallel drafting support` change (ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API and `server_context_impl`. Adapt the grpc-server wrapper accordingly: * `common_params_speculative::type` (single enum) became `types` (`std::vector<common_speculative_type>`). Update both the "default to draft when a draft model is set" branch and the `spec_type`/`speculative_type` option parser. The parser now also tolerates comma-separated lists, mirroring the upstream `common_speculative_types_from_names` semantics. * `common_params_speculative_draft::n_ctx` is gone (draft now shares the target context size). Keep the `draft_ctx_size` option name for backward compatibility and ignore the value rather than failing. * `server_context_impl::model` was renamed to `model_tgt`; update the two reranker / model-metadata call sites. Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp target locally. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(llama-cpp): expose new speculative-decoding option keys Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838) adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative families and beefs up the draft-model knobs. The previous bump only adapted the API; this exposes the new fields through the grpc-server options dictionary so model configs can drive them. New `options:` keys (all under `backend: llama-cpp`): ngram_mod (`ngram_mod` type): spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match ngram_map_k (`ngram_map_k` type): spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits ngram_map_k4v (`ngram_map_k4v` type): spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m / spec_ngram_map_k4v_min_hits ngram lookup caches (`ngram_cache` type): spec_lookup_cache_static / lookup_cache_static spec_lookup_cache_dynamic / lookup_cache_dynamic Draft-model tuning (active when `spec_type` is `draft`): draft_cache_type_k / spec_draft_cache_type_k draft_cache_type_v / spec_draft_cache_type_v draft_threads / spec_draft_threads draft_threads_batch / spec_draft_threads_batch draft_cpu_moe / spec_draft_cpu_moe (bool flag) draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU) draft_override_tensor / spec_draft_override_tensor (comma-separated <tensor regex>=<buffer type>; re-implements upstream's static parse_tensor_buffer_overrides since it isn't exported) `spec_type` already accepted comma-separated lists after the previous commit, matching upstream's `common_speculative_types_from_names`. Docs: refresh `docs/content/advanced/model-configuration.md` with per-family tables and a note about multi-type chaining. Builds locally with `make docker-build-llama-cpp` (linux/amd64 cpu-llama-cpp AVX variant). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile, which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh on the copy. The fork branched before the API refactor, so it errors out on: * `ctx_server.impl->model_tgt` (fork still has `model`) * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.` (none of these sub-structs exist in the fork) `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads, tensor_buft_overrides}` (fork uses the pre-#22397 flat layout) * `params.speculative.types` vector / `common_speculative_types_from_names` (fork has a scalar `type` and only the singular helper) Approach: 1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]` discriminations (the "default to draft when a draft model is set" branch and the `spec_type` / `speculative_type` option parser) fall back to the singular scalar form, and the entire new-option block (ngram_mod / map_k / map_k4v / ngram_cache / draft.{cache_type_, cpuparams, tensor_buft_overrides}) is preprocessed out. The macro is not defined in the source tree — stock llama-cpp builds get the full new API. 2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp: - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model` - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first `#include`, so the guarded blocks above drop out for the fork build. Both patches are idempotent and follow the existing sed/awk pattern in this script (KV cache types, `get_media_marker`, flat speculative renames). Stock llama-cpp's `grpc-server.cpp` is never touched. Drop both legacy patches once the turboquant fork rebases past ggml-org/llama.cpp#22397 / #22838. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(turboquant): close draft_ctx_size brace inside legacy guard The previous turboquant fix wrapped the new option-handler blocks in `#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard in the middle of an `else if` chain — the `} else if` openings of the new blocks were responsible for closing the previous block's brace. With the macro defined the new blocks vanish, draft_ctx_size's `{` loses its closer, the for-loop's `}` is consumed instead, and the file ends with a stray opening brace — clang reports it as `function-definition is not allowed here before '{'` on the next top-level `int main(...)` and `expected '}' at end of input`. Move the chain split inside the draft_ctx_size branch: } else if (... "draft_ctx_size") { // ... #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC } // legacy: chain ends here #else } else if (... "spec_ngram_mod_n_min") { // modern: chain continues ... } else if (... "draft_override_tensor") { ... } // closes last branch #endif } // closes for-loop Brace count is now balanced under both preprocessor branches (verified with `tr -cd '{' \| wc -c` against the patched and unpatched outputs). Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp `turboquant-avx` variant cleanly. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt Dockerfile.turboquant's `builder-prebuilt` stage was missing the `ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that `builder-fromsource` already has (and that `Dockerfile.llama-cpp` mirrors across both stages). When CI uses the prebuilt base image (quay.io/go-skynet/ci-cache:base-grpc-, the common path) the build-arg passed by the workflow never reaches the env inside the compile stage. backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on hipblas builds when AMDGPU_TARGETS is empty, and the turboquant Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the same check fires from turboquant-fallback under BUILD_TYPE=hipblas: Makefile:38: AMDGPU_TARGETS is empty — set it to a comma-separated list of gfx targets e.g. gfx1100,gfx1101. Stop. make: * [Makefile:66: turboquant-fallback] Error 2 The bug is latent on master because the docker layer cache stays warm across builds — the compile step rarely re-runs from scratch. The llama.cpp bump in this PR invalidates the cache, so the missing env var becomes load-bearing and the hipblas turboquant CI job fails. Mirror the existing pattern from Dockerfile.llama-cpp. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-12 17:22:37 +02:00
Ettore Di Giacinto	c02a50f2ab	feat(llama-cpp): bump to d775992 and adapt to spec params refactor (#9618 ) Bumps backend/cpp/llama-cpp/Makefile LLAMA_VERSION from 665abc6 to d775992, picking up upstream PR ggml-org/llama.cpp#22397 which splits common_params_speculative into nested draft / ngram_simple / ngram_mod sub-structs. Renames every grpc-server.cpp reference to match: speculative.mparams_dft.path -> speculative.draft.mparams.path speculative.{n_max,n_min} -> speculative.draft.{n_max,n_min} speculative.{p_min,p_split} -> speculative.draft.{p_min,p_split} speculative.{n_gpu_layers,n_ctx} -> speculative.draft.{n_gpu_layers,n_ctx} speculative.ngram_size_n -> speculative.ngram_simple.size_n speculative.ngram_size_m -> speculative.ngram_simple.size_m speculative.ngram_min_hits -> speculative.ngram_simple.min_hits The "speculative.n_max" JSON key sent to the upstream server stays unchanged — server-task.cpp still reads it and routes the value into draft.n_max internally. The turboquant fork (TheTom/llama-cpp-turboquant @ 11a241d) branched before #22397 and still exposes the flat layout. Since turboquant reuses the shared backend/cpp/llama-cpp/grpc-server.cpp, extend patch-grpc-server.sh with an idempotent sed block that reverts the ten field references back to the legacy flat names on the build copy only — the original under backend/cpp/llama-cpp/ stays compiling against vanilla upstream. Drop the block once the fork rebases. ik-llama-cpp has its own grpc-server.cpp with no speculative refs (0/2661 lines), so it is unaffected. Validated locally with `make docker-build-llama-cpp` (avx, avx2, avx512, fallback, grpc + rpc-server all built; image exported). Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-30 08:44:43 +02:00
Ettore Di Giacinto	21eace40ec	feat(llama-cpp): expose split_mode option for multi-GPU placement (#9560 ) Adds split_mode (alias sm) to the llama.cpp backend options allowlist, accepting none\|layer\|row\|tensor. The tensor value targets the experimental backend-agnostic tensor parallelism from ggml-org/llama.cpp#19378 and requires a llama.cpp build that includes that PR, FlashAttention enabled, KV-cache quantization disabled, and a manually set context size. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-25 14:02:57 +02:00
Ettore Di Giacinto	ed648b3b4e	fix(llama-cpp): include server-chat.cpp in grpc-server translation unit (#9511 ) * fix(llama-cpp): include server-chat.cpp in grpc-server translation unit Upstream llama.cpp refactor (ggml-org/llama.cpp#20690) moved the OAI/Anthropic/Responses and transcription conversion helpers out of server-common.cpp into a new server-chat.cpp, and server-task.cpp and server-context.cpp now call those symbols (convert_transcriptions_to_chatcmpl, server_chat_convert_responses_to_chatcmpl, server_chat_convert_anthropic_to_oai, server_chat_msg_diff_to_json_oaicompat) via server-chat.h. grpc-server.cpp builds as a single translation unit by #include-ing the upstream .cpp files directly. Without including server-chat.cpp, the declarations are satisfied at compile time via server-chat.h but the link step fails with undefined references once LLAMA_VERSION crosses the refactor commit (134d6e54). Guard the include with __has_include so the same source stays buildable on older LLAMA_VERSION pins that predate the refactor (where prepare.sh won't copy server-chat.cpp into tools/grpc-server/). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(llama-cpp): bump LLAMA_VERSION to 0d0764dfd Bump to ggml-org/llama.cpp@0d0764dfd2. Paired with the preceding grpc-server server-chat.cpp include so the refactor at 134d6e54 links cleanly. Supersedes PR #9494. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-23 14:59:39 +02:00
Ettore Di Giacinto	7809c5f5d0	fix(vision): propagate mtmd media marker from backend via ModelMetadata (#9412 ) Upstream llama.cpp (PR #21962) switched the server-side mtmd media marker to a random per-server string and removed the legacy "<__media__>" backward-compat replacement in mtmd_tokenizer. The Go layer still emitted the hardcoded "<__media__>", so on the non-tokenizer-template path the prompt arrived with a marker mtmd did not recognize and tokenization failed with "number of bitmaps (1) does not match number of markers (0)". Report the active media marker via ModelMetadataResponse.media_marker and substitute the sentinel "<__media__>" with it right before the gRPC call, after the backend has been loaded and probed. Also skip the Go-side multimodal templating entirely when UseTokenizerTemplate is true — llama.cpp's oaicompat_chat_params_parse already injects its own marker and StringContent is unused in that path. Backends that do not expose the field keep the legacy "<__media__>" behavior.	2026-04-18 20:30:13 +02:00
Ettore Di Giacinto	87e6de1989	feat: wire transcription for llama.cpp, add streaming support (#9353 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-14 16:13:40 +02:00
Ettore Di Giacinto	9748a1cbc6	fix(streaming): skip chat deltas for role-init elements to prevent first token duplication (#9299 ) When TASK_RESPONSE_TYPE_OAI_CHAT is used, the first streaming token produces a JSON array with two elements: a role-init chunk and the actual content chunk. The grpc-server loop called attach_chat_deltas for both elements with the same raw_result pointer, stamping the first token's ChatDelta.Content on both replies. The Go side accumulated both, emitting the first content token twice to SSE clients. Fix: in the array iteration loops in PredictStream, detect role-init elements (delta has "role" key) and skip attach_chat_deltas for them. Only content/reasoning elements get chat deltas attached. Reasoning models are unaffected because their first token goes into reasoning_content, not content.	2026-04-10 08:45:47 +02:00
Ettore Di Giacinto	2b05420f95	chore(llama.cpp): bump to 'd12cc3d1ca6bba741cd77887ac9c9ee18c8415c7' (#9282 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-09 08:12:05 +02:00
Ettore Di Giacinto	773489eeb1	fix(chat): do not retry if we had chatdeltas or tooldeltas from backend (#9244 ) * fix(chat): do not retry if we had chatdeltas or tooldeltas from backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: use oai compat for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: apply to non-streaming path too Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * map also other fields Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-06 10:52:23 +02:00
Ettore Di Giacinto	06fbe48b3f	feat(llama.cpp): wire speculative decoding settings (#9238 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-05 14:56:30 +02:00
Ettore Di Giacinto	53deeb1107	fix(reasoning): suppress partial tag tokens during autoparser warm-up The C++ PEG parser needs a few tokens to identify the reasoning format (e.g. "<\|channel>thought\n" for Gemma 4). During this warm-up, the gRPC layer was sending raw partial tag tokens to Go, which leaked into the reasoning field. - Clear reply.message in gRPC when autoparser is active but has no diffs yet, matching llama.cpp server behavior of only emitting classified output - Prefer C++ autoparser chat deltas for reasoning/content in all streaming paths, falling back to Go-side extraction for backends without autoparser (e.g. vLLM) - Override non-streaming no-tools result with chat delta content when available - Guard PrependThinkingTokenIfNeeded against partial tag prefixes during streaming accumulation - Reorder default thinking tokens so <\|channel>thought is checked before <\|think\|> (Gemma 4 templates contain both)	2026-04-04 20:45:57 +00:00
Ettore Di Giacinto	1ed6b9e5ed	fix(llama.cpp): correctly parse grpc header for bearer token auth Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-03 21:38:41 +00:00
Ettore Di Giacinto	59108fbe32	feat: add distributed mode (#9124 ) * feat: add distributed mode (experimental) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix data races, mutexes, transactions Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix events and tool stream in agent chat Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * use ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(cron): compute correctly time boundaries avoiding re-triggering Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not flood of healthy checks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not list obvious backends as text backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * tests fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop redundant healthcheck Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-30 00:47:27 +02:00
Ettore Di Giacinto	031a36c995	feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092 ) * feat: wire min_p Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: inferencing defaults Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(refactor): re-use iterative parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: generate automatically inference defaults from unsloth Instead of trying to re-invent the wheel and maintain here the inference defaults, prefer to consume unsloth ones, and contribute there as necessary. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: apply defaults also to models installed via gallery Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: be consistent and apply fallback to all endpoint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-22 00:57:15 +01:00
Ettore Di Giacinto	c3174f9543	chore(deps): bump llama-cpp to 'a0bbcdd9b6b83eeeda6f1216088f42c33d464e38' (#9079 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-20 08:12:21 +01:00
Richard Palethorpe	b24ca51287	fix(llama-cpp): Set enable_thinking in the correct place (#8973 ) Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-03-12 13:32:29 +01:00
Ettore Di Giacinto	b2f81bfa2e	feat(functions): add peg-based parsing and allow backends to return tool calls directly (#8838 ) * feat(functions): add peg-based parsing Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: support returning toolcalls directly from backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: do run PEG only if backend didn't send deltas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-08 22:21:57 +01:00
Ettore Di Giacinto	580517f9db	feat: pass-by metadata to predict options (#8795 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-05 22:50:10 +01:00
Ettore Di Giacinto	1c5dc83232	chore(deps): bump llama.cpp to 'ecbcb7ea9d3303097519723b264a8b5f1e977028' (#8672 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-02-28 00:33:56 +01:00
Richard Palethorpe	9e692967c3	fix(llama-cpp): Pass parameters when using embedded template (#8590 ) Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-02-17 18:50:05 +01:00
Austen	42cb7bda19	fix(llama-cpp): populate tensor_buft_override buffer so llama-cpp properly performs fit calculations (#8560 ) fix auto-fit for llama-cpp	2026-02-14 10:07:37 +01:00
Ettore Di Giacinto	48e08772f3	chore(llama.cpp): bump to 'f6b533d898ce84bae8d9fa8dfc6697ac087800bf' (#8275 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-29 00:22:25 +01:00
Ettore Di Giacinto	c0b21a921b	feat: detect thinking support from backend automatically if not explicitly set (#8167 ) detect thinking support from backend automatically if not explicitly set Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-23 00:38:28 +01:00
Ettore Di Giacinto	f6daaa7c35	chore(deps): Bump llama.cpp to '1c7cf94b22a9dc6b1d32422f72a627787a4783a3' (#8136 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-21 00:12:13 +01:00
Ettore Di Giacinto	f4b0a304d7	chore(llama.cpp): propagate errors during model load (#7937 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-09 07:52:49 +01:00
Ettore Di Giacinto	d16ec7aa9e	chore(deps): Bump llama.cpp to '480160d47297df43b43746294963476fc0a6e10f' (#7933 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-09 07:52:32 +01:00
Ettore Di Giacinto	5f6c941399	fix(llama.cpp/mmproj): fix loading mmproj in nested sub-dirs different from model path (#7832 ) fix(mmproj): fix loading mmproj in nested sub-dirs Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-02 20:17:30 +01:00
Ettore Di Giacinto	0a168830ea	chore(deps): Bump llama.cpp to '5b6c9bc0f3c8f55598b9999b65aff7ce4119bc15' and refactor usage of base params (#7706 ) * chore(deps): Bump llama.cpp to '5b6c9bc0f3c8f55598b9999b65aff7ce4119bc15' and refactor usage of base params Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: update AGENTS.md Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-12-24 00:28:27 +01:00
Ettore Di Giacinto	fc6057a952	chore(deps): bump llama.cpp to '0e1ccf15c7b6d05c720551b537857ecf6194d420' (#7684 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-12-22 09:50:42 +01:00
Ettore Di Giacinto	2387b266d8	chore(llama.cpp): Add Missing llama.cpp Options to gRPC Server (#7584 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-12-15 21:55:20 +01:00
Simon Redman	5de539ab07	fix(7355): Update llama-cpp grpc for v3 interface (#7566 ) * fix(7355): Update llama-cpp grpc for v3 interface Signed-off-by: Simon Redman <simon@ergotech.com> * feat(llama-gprc): Trim whitespace from servers list Signed-off-by: Simon Redman <simon@ergotech.com> * Trim trailing spaces in grpc-server.cpp Signed-off-by: Simon Redman <simon@ergotech.com> --------- Signed-off-by: Simon Redman <simon@ergotech.com>	2025-12-14 11:40:33 +01:00
Ettore Di Giacinto	0b130fb811	fix(llama.cpp): handle corner cases with tool array content (#7528 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-12-12 08:15:45 +01:00
Ettore Di Giacinto	74ee1463fe	chore(deps/llama-cpp): bump to '2fa51c19b028180b35d316e9ed06f5f0f7ada2c1' (#7484 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-12-09 15:41:37 +01:00
Ettore Di Giacinto	024aa6a55b	chore(deps): bump llama.cpp to 'bde188d60f58012ada0725c6dd5ba7c69fe4dd87' (#7434 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-12-05 00:17:35 +01:00
Ettore Di Giacinto	e3bcba5c45	chore: ⬆️ Update ggml-org/llama.cpp to `7f8ef50cce40e3e7e4526a3696cb45658190e69a` (#7402 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-12-01 07:50:40 +01:00
Ettore Di Giacinto	468ac608f3	chore(deps): bump llama.cpp to 'd82b7a7c1d73c0674698d9601b1bbb0200933f29' (#7392 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-11-29 08:58:07 +01:00
Ettore Di Giacinto	7a94d237c4	chore(deps): bump llama.cpp to '583cb83416467e8abf9b37349dcf1f6a0083745a (#7358 ) chore(deps): bump llama.cpp to '583cb83416467e8abf9b37349dcf1f6a0083745a' Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-11-26 08:23:21 +01:00
Ettore Di Giacinto	e88db7d142	fix(llama.cpp): handle corner cases with tool content (#7324 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-11-21 09:21:49 +01:00
Ettore Di Giacinto	d7f9f3ac93	feat: add support to logitbias and logprobs (#7283 ) * feat: add support to logprobs in results Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: add support to logitbias Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-11-16 13:27:36 +01:00
Ettore Di Giacinto	03e9f4b140	fix: handle tool errors (#7271 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-11-14 17:23:56 +01:00
Ettore Di Giacinto	7129409bf6	chore(deps): bump llama.cpp to `c4abcb2457217198efdd67d02675f5fddb7071c2` (#7266 ) * chore(deps): bump llama.cpp to '92bb442ad999a0d52df0af2730cd861012e8ac5c' Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * DEBUG Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Bump Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test/debug Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Revert "DEBUG" This reverts commit 2501ca3ff242076d623c13c86b3d6afcec426281. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-11-14 12:16:52 +01:00
Ettore Di Giacinto	3728552e94	feat: import models via URI (#7245 ) * feat: initial hook to install elements directly Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP: ui changes Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move HF api client to pkg Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add simple importer for gguf files Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add opcache Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wire importers to CLI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add omitempty to config fields Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add MLX importer Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small refactors to star to use HF for discovery Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Common preferences Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add support to bare HF repos Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(importer/llama.cpp): add support for mmproj files Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * add mmproj quants to common preferences Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix vlm usage in tokenizer mode with llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-11-12 20:48:56 +01:00
Mikhail Khludnev	04fe0b0da8	fix(reranker): llama-cpp sort score desc, crop top_n (#7211 ) Signed-off-by: Mikhail Khludnev <mkhl@apache.org>	2025-11-12 09:13:01 +01:00
Ettore Di Giacinto	679d43c2f5	feat: respect context and add request cancellation (#7187 ) * feat: respect context Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * workaround fasthttp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): allow to abort call Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Refactor Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: improving error Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Respect context also with MCP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tie to both contexts Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Make detection more robust Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-11-09 18:19:19 +01:00
Ettore Di Giacinto	02cc8cbcaa	feat(llama.cpp): consolidate options and respect tokenizer template when enabled (#7120 ) * feat(llama.cpp): expose env vars as options for consistency This allows to configure everything in the YAML file of the model rather than have global configurations Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(llama.cpp): respect usetokenizertemplate and use llama.cpp templating system to process messages Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Detect template exists if use tokenizer template is enabled Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Better recognization of chat Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixes to support tool calls while using templates from tokenizer Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop template guessing, fix passing tools to tokenizer Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Extract grammar and other options from chat template, add schema struct Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Automatically set use_jinja Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Cleanups, identify by default gguf models for chat Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-11-07 21:23:50 +01:00
Ettore Di Giacinto	424acd66ad	feat(llama.cpp): allow to set cache-ram and ctx_shift (#7009 ) * feat(llama.cpp): allow to set cache-ram and ctx_shift Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2025-11-02 17:33:29 +01:00
Chakib Benziane	32c0ab3a7f	fix: properly terminate llama.cpp kv_overrides array with empty key + updated doc (#6672 ) * fix: properly terminate kv_overrides array with empty key The llama model loading function expects KV overrides to be terminated with an empty key (key[0] == 0). Previously, the kv_overrides vector was not being properly terminated, causing an assertion failure. This commit ensures that after parsing all KV override strings, we add a final terminating entry with an empty key to satisfy the C-style array termination requirement. This fixes the assertion error and allows the model to load correctly with custom KV overrides. Fixes #6643 - Also included a reference to the usage of the `overrides` option in the advanced-usage section. Signed-off-by: blob42 <contact@blob42.xyz> * doc: document the `overrides` option --------- Signed-off-by: blob42 <contact@blob42.xyz>	2025-10-23 09:31:55 +02:00
Ettore Di Giacinto	cd1e1124ea	fix(llama.cpp): correctly set grammar triggers (#6432 ) * fix(llama.cpp): correctly set grammar triggers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Do not enable lazy by default Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-10-10 19:50:17 +02:00

1 2

59 Commits