Commit Graph

344 Commits

LocalAI [bot]
6e1dbae256 feat(llama-cpp): expose 12 missing common_params via options[] (#9814)
The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.

Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).

Top-level / batching / IO:
  - n_ubatch (alias ubatch) -- physical batch size; was previously
    force-aliased to n_batch at line 482, blocking embedding/rerank
    workloads that need independent control
  - threads_batch (alias n_threads_batch) -- main-model batch threads;
    mirrors the existing draft_threads_batch
  - direct_io (alias use_direct_io) -- O_DIRECT model loads
  - verbosity -- llama.cpp log threshold (line 479 had this commented
    out)
  - override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
    overrides for the main model; mirrors draft_override_tensor

Embedding / multimodal:
  - pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
    only auto-flipped to RANK for rerankers
  - embd_normalize (alias embedding_normalize) -- the embedding handler
    now reads params_base.embd_normalize instead of the hardcoded
    literal 2 it previously used in Embedding()
  - mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
  - image_min_tokens / image_max_tokens -- per-image vision token budget

Reasoning surface (the audit-focus three; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
  - reasoning_format -- none/auto/deepseek/deepseek-legacy parser
  - enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
  - prefill_assistant -- trailing-assistant-message prefill toggle

All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.

Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.
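
For illustration, a model config exercising a few of the new keys could
look like the sketch below. The key:value string form follows the
existing options: convention; the model name and values are
placeholders, not recommendations:

    name: reasoner-example         # placeholder model name
    backend: llama-cpp
    options:
    - n_ubatch:512                 # physical batch size, independent of n_batch
    - pooling_type:mean            # mean/cls/last/rank/none
    - embd_normalize:2             # same value as the old hardcoded default
    - reasoning_format:auto        # none/auto/deepseek/deepseek-legacy
    - reasoning_budget:-1          # -1/0/>0 thinking budget
    - prefill_assistant:true       # true/1/yes/on bool form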


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:34 +02:00
LocalAI [bot]
53bdb18d10 chore: ⬆️ Update ggml-org/llama.cpp to 7f3f843c31cd32dc4adc10b393342dfee071c332 (#9809)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): adapt to upstream COMMON_SPECULATIVE_TYPE_DRAFT rename

ggml-org/llama.cpp#22964 ("spec: update CLI arguments for better
consistency") renamed the speculative type enum values:
  COMMON_SPECULATIVE_TYPE_DRAFT  -> COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE
  COMMON_SPECULATIVE_TYPE_EAGLE3 -> COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3
and the registered name strings flipped from underscore- to dash-
separated form (e.g. ngram_simple -> ngram-simple), with the bare
draft/eagle3 aliases replaced by draft-simple/draft-eagle3.

This broke the build with the new LLAMA_VERSION on every variant
(vulkan/arm64, darwin and likely all the rest) at grpc-server.cpp:461.

Update the upstream branch of the speculative-type fallback to use the
new identifier (the LOCALAI_LEGACY_LLAMA_CPP_SPEC fork branch keeps the
old name), and normalize spec_type option tokens before passing them to
common_speculative_types_from_names so existing model configs that say
spec_type:draft / spec_type:ngram_simple keep working.
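
As an illustrative sketch, a pre-rename model config keeps working
unchanged thanks to the normalization:

    backend: llama-cpp
    options:
    - spec_type:draft        # legacy token, normalized to draft-simple
    # the new dash-separated upstream names are accepted as-is, e.g.
    # - spec_type:ngram-simple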

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:23 +02:00
LocalAI [bot]
a645c1f4aa chore: ⬆️ Update ggml-org/llama.cpp to a9883db8ee021cf16783016a60996d41820b5195 (#9796)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:40:31 +02:00
LocalAI [bot]
bc4cd3dd85 feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.

Adapt the grpc-server wrapper accordingly:

  * `common_params_speculative::type` (single enum) became `types`
    (`std::vector<common_speculative_type>`). Update both the
    "default to draft when a draft model is set" branch and the
    `spec_type`/`speculative_type` option parser. The parser now also
    tolerates comma-separated lists, mirroring the upstream
    `common_speculative_types_from_names` semantics.
  * `common_params_speculative_draft::n_ctx` is gone (draft now shares
    the target context size). Keep the `draft_ctx_size` option name for
    backward compatibility and ignore the value rather than failing.
  * `server_context_impl::model` was renamed to `model_tgt`; update the
    two reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.
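
A sketch of a config driving the new families (numeric values are
arbitrary placeholders, not tuned defaults):

    backend: llama-cpp
    options:
    - spec_type:ngram_mod,ngram_map_k   # comma-separated multi-type chain
    - spec_ngram_mod_n_min:2
    - spec_ngram_mod_n_max:6
    - spec_ngram_map_k_size_n:4
    - spec_ngram_map_k_min_hits:2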

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:

  * `ctx_server.impl->model_tgt` (fork still has `model`)
  * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
    (none of these sub-structs exist in the fork)
  * `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
    tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
  * `params.speculative.types` vector / `common_speculative_types_from_names`
    (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
   discriminations (the "default to draft when a draft model is set" branch
   and the `spec_type` / `speculative_type` option parser) fall back to the
   singular scalar form, and the entire new-option block (ngram_mod / map_k
   / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
   tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
   in the source tree — stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

   Both patches are idempotent and follow the existing sed/awk pattern in
   this script (KV cache types, `get_media_marker`, flat speculative
   renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                  // legacy: chain ends here
#else
    } else if (... "spec_ngram_mod_n_min") {  // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                  // closes last branch
#endif
    }                                  // closes for-loop

Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

  Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
  list of gfx targets e.g. gfx1100,gfx1101.  Stop.
  make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:37 +02:00
LocalAI [bot]
b9e81dbfd4 chore: ⬆️ Update ggml-org/llama.cpp to 389ff61d77b5c71cec0cf92fe4e5d01ace80b797 (#9752)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-11 08:14:07 +02:00
LocalAI [bot]
f6c9c20911 chore: ⬆️ Update ggml-org/llama.cpp to 2b2babd1243c67ca811c0a5852cedf92b1a20024 (#9747)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 21:17:38 +02:00
LocalAI [bot]
6cbf69dc29 chore: ⬆️ Update ggml-org/llama.cpp to 1e5ad35d560b90a8ac447d149c8f8447ae1fcaa0 (#9739)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 00:06:29 +02:00
LocalAI [bot]
a91e718473 chore: ⬆️ Update ggml-org/llama.cpp to 00d56b11c3477b99bc18562dc1d1834f0d961778 (#9733)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-09 12:05:11 +02:00
LocalAI [bot]
4542833cb4 chore: ⬆️ Update ggml-org/llama.cpp to 9f5f0e689c9e977e5f23a27e344aa36082f44738 (#9724)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 10:18:05 +02:00
LocalAI [bot]
3b84582567 chore: ⬆️ Update ggml-org/llama.cpp to 05ff59cb57860cc992fc6dcede32c696efea711c (#9714)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:44:17 +02:00
LocalAI [bot]
151d6c9cf0 chore: ⬆️ Update ggml-org/llama.cpp to 2496f9c14965c39589f53eea31bdb6d762b1d360 (#9698)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:29:27 +02:00
LocalAI [bot]
c9141098b6 chore: ⬆️ Update ggml-org/llama.cpp to bbeb89d76c41bc250f16e4a6fefcc9b530d6e3f3 (#9676)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 23:45:54 +02:00
LocalAI [bot]
b88ddce0f3 chore: ⬆️ Update ggml-org/llama.cpp to eff06702b2a52e1020ea009ebd86cb9f5acabab5 (#9637)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 09:52:28 +02:00
Russell Sim
18e039f305 fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds (#9626)
* fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds

399c1dec wired amdgpu-targets through the backend_build workflow_call
interface, intending the input's default value to cover matrix entries
that don't specify targets. However, GitHub Actions only applies a
workflow_call input default when the caller omits the input entirely.
When backend.yml passes `amdgpu-targets: ${{ matrix.amdgpu-targets }}`
and the matrix entry has no amdgpu-targets key, the expression evaluates
to an empty string, which is treated as an explicit value — bypassing
the default. The result is Docker receiving AMDGPU_TARGETS="" which in
turn causes Make's ?= default to be skipped (since the variable is
already set in the environment, even to empty), and cmake gets
-DAMDGPU_TARGETS= with no targets, so the HIP backend compiles for an
indeterminate target rather than the intended GPU list.

Fix this at three levels:

1. backend.yml: use a || fallback in the expression (sketched after this
   list) so that an undefined matrix.amdgpu-targets never reaches the
   reusable workflow as an empty string. The target list is the
   canonical default and lives here.

2. backend_build.yml: remove the now-misleading default value from the
   input declaration. The default never fired due to the above bug, so
   keeping it implied a guarantee that didn't exist.

3. backend/cpp/llama-cpp/Makefile: add an explicit $(error ...) guard
   after the ?= assignment so that if AMDGPU_TARGETS is empty (whether
   from environment or any future CI wiring mistake) the build fails
   immediately with a clear message rather than silently producing a
   binary compiled for an unknown GPU target.
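
A minimal sketch of both sides (the gfx list mirrors the example from
the Makefile error text and is illustrative, not the canonical default):

    # backend.yml (caller): || fallback so an unset matrix key never
    # arrives as an empty string
    with:
      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx1100,gfx1101' }}

    # backend_build.yml (reusable workflow): input declared without a
    # misleading default
    inputs:
      amdgpu-targets:
        type: string
        required: false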

Assisted-by: Claude Code:claude-sonnet-4-6
Signed-off-by: Russell Sim <rsl@simopolis.xyz>

* fix(build): plumb AMDGPU_TARGETS through to Docker builds

The docker-build-backend Makefile macro and Dockerfile.golang did not
pass AMDGPU_TARGETS to the inner make invocation, so hipblas builds
always used the backend Makefile's hardcoded default GPU targets
regardless of what was specified via environment or CI inputs.

Signed-off-by: Russell Sim <rsl@simopolis.xyz>

---------

Signed-off-by: Russell Sim <rsl@simopolis.xyz>
2026-05-02 15:53:14 +02:00
LocalAI [bot]
9c4c3f9d8f chore: ⬆️ Update ggml-org/llama.cpp to beb42fffa45eded44804a1fd4916146222371581 (#9624)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-01 02:02:56 +02:00
Ettore Di Giacinto
c02a50f2ab feat(llama-cpp): bump to d775992 and adapt to spec params refactor (#9618)
Bumps backend/cpp/llama-cpp/Makefile LLAMA_VERSION from 665abc6 to
d775992, picking up upstream PR ggml-org/llama.cpp#22397 which splits
common_params_speculative into nested draft / ngram_simple / ngram_mod
sub-structs. Renames every grpc-server.cpp reference to match:

  speculative.mparams_dft.path  -> speculative.draft.mparams.path
  speculative.{n_max,n_min}     -> speculative.draft.{n_max,n_min}
  speculative.{p_min,p_split}   -> speculative.draft.{p_min,p_split}
  speculative.{n_gpu_layers,n_ctx} -> speculative.draft.{n_gpu_layers,n_ctx}
  speculative.ngram_size_n      -> speculative.ngram_simple.size_n
  speculative.ngram_size_m      -> speculative.ngram_simple.size_m
  speculative.ngram_min_hits    -> speculative.ngram_simple.min_hits

The "speculative.n_max" JSON key sent to the upstream server stays
unchanged — server-task.cpp still reads it and routes the value into
draft.n_max internally.

The turboquant fork (TheTom/llama-cpp-turboquant @ 11a241d) branched
before #22397 and still exposes the flat layout. Since turboquant
reuses the shared backend/cpp/llama-cpp/grpc-server.cpp, extend
patch-grpc-server.sh with an idempotent sed block that reverts the
ten field references back to the legacy flat names on the build copy
only — the original under backend/cpp/llama-cpp/ stays compiling
against vanilla upstream. Drop the block once the fork rebases.

ik-llama-cpp has its own grpc-server.cpp with no speculative refs
(0/2661 lines), so it is unaffected.

Validated locally with `make docker-build-llama-cpp` (avx, avx2,
avx512, fallback, grpc + rpc-server all built; image exported).


Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-30 08:44:43 +02:00
LocalAI [bot]
8e50066fa2 chore: ⬆️ Update ggml-org/llama.cpp to 665abc609740d397d30c0d8ef4157dbf900bd1a3 (#9584)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-28 08:43:33 +02:00
LocalAI [bot]
05e94bd9e7 chore: ⬆️ Update ggml-org/llama.cpp to f53577432541bb9edc1588c4ef45c66bf07e4468 (#9577)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-27 08:57:24 +02:00
LocalAI [bot]
d9cb0d6133 chore: ⬆️ Update ggml-org/llama.cpp to dcad77cc3b0865153f486327064fb0320a57a476 (#9572)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-26 12:38:35 +02:00
Ettore Di Giacinto
21eace40ec feat(llama-cpp): expose split_mode option for multi-GPU placement (#9560)
Adds split_mode (alias sm) to the llama.cpp backend options allowlist,
accepting none|layer|row|tensor. The tensor value targets the experimental
backend-agnostic tensor parallelism from ggml-org/llama.cpp#19378 and
requires a llama.cpp build that includes that PR, FlashAttention enabled,
KV-cache quantization disabled, and a manually set context size.
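
For instance (model name is a placeholder; tensor additionally requires
the prerequisites above):

    name: multi-gpu-example   # placeholder
    backend: llama-cpp
    options:
    - split_mode:row          # one of none|layer|row|tensor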


Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-25 14:02:57 +02:00
LocalAI [bot]
47cc3dc8d7 chore: ⬆️ Update ggml-org/llama.cpp to 361fe72acb7b9bd79059cc177cbeda99b35b5db9 (#9548)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 08:58:27 +02:00
LocalAI [bot]
7c1934b183 chore: ⬆️ Update ggml-org/llama.cpp to 187a45637054881ecacf17f8e2f6f8f2ba7df1c7 (#9520)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-24 09:17:06 +02:00
Ettore Di Giacinto
ed648b3b4e fix(llama-cpp): include server-chat.cpp in grpc-server translation unit (#9511)
* fix(llama-cpp): include server-chat.cpp in grpc-server translation unit

Upstream llama.cpp refactor (ggml-org/llama.cpp#20690) moved the
OAI/Anthropic/Responses and transcription conversion helpers out of
server-common.cpp into a new server-chat.cpp, and server-task.cpp and
server-context.cpp now call those symbols (convert_transcriptions_to_chatcmpl,
server_chat_convert_responses_to_chatcmpl, server_chat_convert_anthropic_to_oai,
server_chat_msg_diff_to_json_oaicompat) via server-chat.h.

grpc-server.cpp builds as a single translation unit by #include-ing the
upstream .cpp files directly. Without including server-chat.cpp, the
declarations are satisfied at compile time via server-chat.h but the
link step fails with undefined references once LLAMA_VERSION crosses
the refactor commit (134d6e54).

Guard the include with __has_include so the same source stays buildable
on older LLAMA_VERSION pins that predate the refactor (where prepare.sh
won't copy server-chat.cpp into tools/grpc-server/).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(llama-cpp): bump LLAMA_VERSION to 0d0764dfd

Bump to ggml-org/llama.cpp@0d0764dfd2.
Paired with the preceding grpc-server server-chat.cpp include so the
refactor at 134d6e54 links cleanly. Supersedes PR #9494.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-23 14:59:39 +02:00
LocalAI [bot]
cd7b035716 chore: ⬆️ Update ggml-org/llama.cpp to 5a4cd6741fc33227cdacb329f355ab21f8481de2 (#9479)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-22 08:58:19 +02:00
LocalAI [bot]
8bb1e8f21f chore: ⬆️ Update ggml-org/llama.cpp to cf8b0dbda9ac0eac30ee33f87bc6702ead1c4664 (#9448)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-21 11:15:45 +02:00
LocalAI [bot]
babbbc6ec8 chore: ⬆️ Update ggml-org/llama.cpp to 4eac5b45095a4e8a1ff1cce4f6d030e0872fb4ad (#9429)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-19 23:39:19 +02:00
LocalAI [bot]
6e49dba27c chore: ⬆️ Update ggml-org/llama.cpp to 4f02d4733934179386cbc15b3454be26237940bb (#9415)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-19 09:26:05 +02:00
Keith Mattix II
8839a71c87 fix(rocm): add gfx1151 support and expose AMDGPU_TARGETS build-arg (#9410)
Add gfx1151 (AMD Strix Halo / Ryzen AI MAX) to the default AMDGPU_TARGETS
list in the llama-cpp backend Makefile. ROCm 7.2.1 ships with gfx1151
Tensile libraries, so this architecture should be included in default builds.

Also expose AMDGPU_TARGETS as an ARG/ENV in Dockerfile.llama-cpp so that
users building for non-default GPU architectures can override the target
list via --build-arg AMDGPU_TARGETS=<arch>. Previously, passing
-DAMDGPU_TARGETS=<arch> through CMAKE_ARGS was silently overridden by
the Makefile's own append of the default target list.

Fixes #9374

Signed-off-by: Keith Mattix <keithmattix2@gmail.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-18 20:39:40 +02:00
Ettore Di Giacinto
117f6430b8 fix(turboquant): resolve common.h by detecting llama-common vs common target (#9413)
The shared grpc-server CMakeLists hardcoded `llama-common`, the post-rename
target name in upstream llama.cpp. The turboquant fork branched before that
rename and still exposes the helpers library as `common`, so the name
silently degraded to a plain `-lllama-common` linker flag, the PUBLIC
include directory was never propagated, and tools/server/server-task.h
failed to find common.h during turboquant-<flavor> builds.
2026-04-18 20:30:28 +02:00
Ettore Di Giacinto
7809c5f5d0 fix(vision): propagate mtmd media marker from backend via ModelMetadata (#9412)
Upstream llama.cpp (PR #21962) switched the server-side mtmd media
marker to a random per-server string and removed the legacy
"<__media__>" backward-compat replacement in mtmd_tokenizer. The
Go layer still emitted the hardcoded "<__media__>", so on the
non-tokenizer-template path the prompt arrived with a marker mtmd
did not recognize and tokenization failed with "number of bitmaps
(1) does not match number of markers (0)".

Report the active media marker via ModelMetadataResponse.media_marker
and substitute the sentinel "<__media__>" with it right before the
gRPC call, after the backend has been loaded and probed. Also skip
the Go-side multimodal templating entirely when UseTokenizerTemplate
is true — llama.cpp's oaicompat_chat_params_parse already injects its
own marker and StringContent is unused in that path. Backends that do
not expose the field keep the legacy "<__media__>" behavior.
2026-04-18 20:30:13 +02:00
Ettore Di Giacinto
c49feb546f fix(llama-cpp): rename linked target common -> llama-common (#9408)
Upstream llama.cpp (45cac7ca) renamed the CMake library target
`common` to `llama-common`. Linking the old name caused
`target_include_directories(... PUBLIC .)` from the common/ dir
to not propagate, so `#include "common.h"` failed when building
grpc-server.
2026-04-18 00:42:05 +02:00
LocalAI [bot]
7dbd9c056a chore: ⬆️ Update ggml-org/llama.cpp to 4fbdabdc61c04d1262b581e1b8c0c3b119f688ff (#9381)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-17 08:13:04 +02:00
Ettore Di Giacinto
5837b14888 chore: ⬆️ Update TheTom/llama-cpp-turboquant to `45f8a066ed5f5bb38c695cec532f6cef9f4efa9d` (#9385)
chore: ⬆️ Update TheTom/llama-cpp-turboquant to `45f8a066ed5f5bb38c695cec532f6cef9f4efa9d`

Drop 0002-ggml-rpc-bump-op-count-to-97.patch; the fork now has
GGML_OP_COUNT == 97 and RPC_PROTO_PATCH_VERSION 2 upstream.

Fetch all tags in backend/cpp/llama-cpp/Makefile so tag-only commits
(the new turboquant pin is reachable only through the tag
feature-turboquant-kv-cache-b8821-45f8a06) can be checked out.
2026-04-17 08:12:21 +02:00
LocalAI [bot]
96cd561d9d chore: ⬆️ Update ggml-org/llama.cpp to b3d758750a268bf93f084ccfa3060fb9a203192a (#9370)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-16 01:12:39 +02:00
LocalAI [bot]
62862ca06b chore: ⬆️ Update ggml-org/llama.cpp to fae3a28070fe4026f87bd6a544aba1b2d1896566 (#9357)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-15 01:25:41 +02:00
Ettore Di Giacinto
87e6de1989 feat: wire transcription for llama.cpp, add streaming support (#9353)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-14 16:13:40 +02:00
LocalAI [bot]
906acba8db chore: ⬆️ Update ggml-org/llama.cpp to e97492369888f5311e4d1f3beb325a36bbed70e9 (#9347)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-14 08:54:25 +02:00
LocalAI [bot]
ea32b8953f chore: ⬆️ Update ggml-org/llama.cpp to 1e9d771e2c2f1113a5ebdd0dc15bafe57dce64be (#9330)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-13 09:42:18 +02:00
Ettore Di Giacinto
151ad271f2 feat(rocm): bump to 7.x (#9323)
feat(rocm): bump to 7.2.1

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-12 08:51:30 +02:00
LocalAI [bot]
6fbda277c5 chore: ⬆️ Update ggml-org/llama.cpp to ff5ef8278615a2462b79b50abdf3cc95cfb31c6f (#9319)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-11 23:15:23 +02:00
LocalAI [bot]
62a674ce12 chore: ⬆️ Update ggml-org/llama.cpp to e62fa13c2497b2cd1958cb496e9489e86bbd5182 (#9312)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-11 08:39:10 +02:00
LocalAI [bot]
d4cd6c284f chore: ⬆️ Update ggml-org/llama.cpp to d132f22fc92f36848f7ccf2fc9987cd0b0120825 (#9302)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-10 08:46:45 +02:00
Ettore Di Giacinto
9748a1cbc6 fix(streaming): skip chat deltas for role-init elements to prevent first token duplication (#9299)
When TASK_RESPONSE_TYPE_OAI_CHAT is used, the first streaming token
produces a JSON array with two elements: a role-init chunk and the
actual content chunk. The grpc-server loop called attach_chat_deltas
for both elements with the same raw_result pointer, stamping the first
token's ChatDelta.Content on both replies. The Go side accumulated both,
emitting the first content token twice to SSE clients.

Fix: in the array iteration loops in PredictStream, detect role-init
elements (delta has "role" key) and skip attach_chat_deltas for them.
Only content/reasoning elements get chat deltas attached.
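
Illustrative shape of that first parsed array (shown as YAML, field
names abbreviated; only the delta keys matter for the check):

    - delta:
        role: assistant    # role-init element: attach_chat_deltas skipped
    - delta:
        content: "Hello"   # content element: chat deltas attached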

Reasoning models are unaffected because their first token goes into
reasoning_content, not content.
2026-04-10 08:45:47 +02:00
Ettore Di Giacinto
2b05420f95 chore(llama.cpp): bump to 'd12cc3d1ca6bba741cd77887ac9c9ee18c8415c7' (#9282)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-09 08:12:05 +02:00
LocalAI [bot]
0526e60f8d chore: ⬆️ Update ggml-org/llama.cpp to 66c4f9ded01b29d9120255be1ed8d5835bcbb51d (#9269)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-08 08:27:38 +02:00
LocalAI [bot]
bccaba1f66 chore: ⬆️ Update ggml-org/llama.cpp to d0a6dfeb28a09831d904fc4d910ddb740da82834 (#9259)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-07 00:38:36 +02:00
LocalAI [bot]
0dda4fe6f0 chore: ⬆️ Update ggml-org/llama.cpp to 761797ffdf2ce3f118e82c663b1ad7d935fbd656 (#9243)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-06 10:52:38 +02:00
Ettore Di Giacinto
773489eeb1 fix(chat): do not retry if we had chatdeltas or tooldeltas from backend (#9244)
* fix(chat): do not retry if we had chatdeltas or tooldeltas from backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: use oai compat for llama.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: apply to non-streaming path too

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* map also other fields

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-06 10:52:23 +02:00
Ettore Di Giacinto
06fbe48b3f feat(llama.cpp): wire speculative decoding settings (#9238)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-05 14:56:30 +02:00
Ettore Di Giacinto
53deeb1107 fix(reasoning): suppress partial tag tokens during autoparser warm-up
The C++ PEG parser needs a few tokens to identify the reasoning format
(e.g. "<|channel>thought\n" for Gemma 4). During this warm-up, the gRPC
layer was sending raw partial tag tokens to Go, which leaked into the
reasoning field.

- Clear reply.message in gRPC when autoparser is active but has no diffs
  yet, matching llama.cpp server behavior of only emitting classified output
- Prefer C++ autoparser chat deltas for reasoning/content in all streaming
  paths, falling back to Go-side extraction for backends without autoparser
  (e.g. vLLM)
- Override non-streaming no-tools result with chat delta content when available
- Guard PrependThinkingTokenIfNeeded against partial tag prefixes during
  streaming accumulation
- Reorder default thinking tokens so <|channel>thought is checked before
  <|think|> (Gemma 4 templates contain both)
2026-04-04 20:45:57 +00:00