LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-06 15:56:06 -04:00

Author	SHA1	Message	Date
LocalAI [bot]	6942713d85	chore: ⬆️ Update leejet/stable-diffusion.cpp to `3a8788cb7d74f185d6b18688e9563015524ecaf5` (#9933 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-22 00:31:19 +02:00
LocalAI [bot]	0cf52c44d4	chore: ⬆️ Update ggml-org/whisper.cpp to `8443cf05e3fa8ce1b32348e1bcbcf8fc31f7f3ae` (#9929 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-21 23:24:01 +02:00
LocalAI [bot]	0d34cf7cbd	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `48a55f74e4c6e2aeda363dd386c1ac9170a0af71` (#9930 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-21 23:23:37 +02:00
LocalAI [bot]	959de86761	feat(llama-cpp): make server-side prompt cache work by default (#9925 ) Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) skip prefill on subsequent calls without any YAML changes. Reported in #9921. Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4) when slot count is auto, which unlocks `cache_idle_slots`. LocalAI hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`, which silently force-disables idle-slot saving at server init. The host prompt cache was allocated but never written across requests. Changes in backend/cpp/llama-cpp/grpc-server.cpp: - params.kv_unified: false -> true (single-slot path now benefits from the prompt cache; users can opt out with `kv_unified:false`) - params.n_ctx_checkpoints: 8 -> 32 (match upstream default) - params.cache_idle_slots = true initialized explicitly (upstream default) - params.checkpoint_every_nt = 8192 initialized explicitly (upstream default) - New option parsers: cache_idle_slots / idle_slots_cache, checkpoint_every_nt / checkpoint_every_n_tokens Docs: - features/text-generation.md: fix misleading `cache_ram` description (it's the host-side prompt cache, not the KV cache), document the kv_unified + cache_ram + cache_idle_slots interaction, add rows for the two newly-exposed options, and add a worked example for the agent/CLI workload from the issue. - advanced/model-configuration.md: mark the legacy `prompt_cache_path` / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the llama-cpp gRPC backend (they target upstream's CLI completion tool and are not consumed by grpc-server.cpp) and point readers at the new prompt-cache explainer. Closes #9921 Assisted-by: claude:opus-4.7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-21 16:31:48 +02:00
Richard Palethorpe	c68818a62e	fix(llama-cpp): terminate tensor_buft_overrides with sentinel (#9919 ) llama.cpp's model loader asserts back().pattern == nullptr on params.tensor_buft_overrides (and on params.kv_overrides.back().key[0] == 0) before binding them into llama_model_params. PR #8560 attempted to satisfy llama_params_fit's placeholder requirement by pre-filling params.tensor_buft_overrides up to llama_max_tensor_buft_overrides() before the option-parse loop. Any subsequent push_back from override_tensor / draft_cpu_moe / draft_n_cpu_moe / draft_override_tensor then appended real entries after the placeholders, leaving back() with a real pattern and tripping the assert. The draft override vector likewise had no terminator at all. Mirror upstream common/arg.cpp:645-658 instead: real entries are pushed during option parsing, and after parsing we pad the main vector up to ntbo (placeholders land at the end, so back() is always nullptr) and append a single {nullptr, nullptr} to the draft vector when it is non-empty. The existing kv_overrides terminator block already matches upstream and stays. Verified against ggml-org/llama.cpp@5cbaa5e: only tensor_buft_overrides (main + draft) and kv_overrides are sentinel-terminated common_params fields; everything else is size-driven std::vector. Assisted-by: claude-code:claude-opus-4-7 Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-21 12:55:06 +02:00
LocalAI [bot]	12e056e96d	chore: ⬆️ Update ggml-org/llama.cpp to `ad277572619fcfb6ddd38f4c6437283a4b2b8636` (#9915 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-21 09:07:31 +02:00
LocalAI [bot]	b2d68a53a2	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `11a1fea9e291f12ce2c803a9d7812c30ca806bcf` (#9914 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-20 22:04:06 +00:00
LocalAI [bot]	1ffd82a050	chore: ⬆️ Update antirez/ds4 to `2606543be7a8c125a32cee37f5d1d85dc78f2fcf` (#9909 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-20 21:22:26 +00:00
LocalAI [bot]	f515168dbe	chore(acestep-cpp): bump pin to ed53caf and adapt wrapper to new API (#9908 ) The new ace-step.cpp revision moves backend initialization inside each `_load` call and drops the separate `DiTGGMLConfig` argument from `dit_ggml_load` (config now lives in `DiTGGML::cfg`, populated from GGUF metadata at load time). Drop the now-removed `_init_backend` calls and replace `g_dit_cfg` accesses with `g_dit.cfg`. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-20 21:05:32 +00:00
LocalAI [bot]	ef6ca34513	chore: ⬆️ Update leejet/stable-diffusion.cpp to `5b0267e941cade15bd80089d89838795d9f4baa6` (#9907 ) Adapt the C++ wrapper to the new `generate_video()` signature: upstream now returns `bool` and writes frames/audio via out-parameters (`sd_image_t`, `sd_audio_t`). Also set `p->fps` on the params struct (new upstream field) and free the returned audio handle on both the success and error paths. Assisted-by: claude-code:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-20 20:53:19 +00:00
dependabot[bot]	9413c3767f	chore(deps): update transformers requirement from >=5.8.0 to >=5.8.1 in /backend/python/transformers (#9883 ) chore(deps): update transformers requirement Updates the requirements on [transformers](https://github.com/huggingface/transformers) to permit the latest version. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v5.8.0...v5.8.1) --- updated-dependencies: - dependency-name: transformers dependency-version: 5.8.1 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-20 22:16:02 +02:00
dependabot[bot]	3bf3cce232	chore(deps): bump sentence-transformers from 5.4.0 to 5.5.0 in /backend/python/transformers (#9888 ) chore(deps): bump sentence-transformers in /backend/python/transformers Bumps [sentence-transformers](https://github.com/huggingface/sentence-transformers) from 5.4.0 to 5.5.0. - [Release notes](https://github.com/huggingface/sentence-transformers/releases) - [Commits](https://github.com/huggingface/sentence-transformers/compare/v5.4.0...v5.5.0) --- updated-dependencies: - dependency-name: sentence-transformers dependency-version: 5.5.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-20 22:13:39 +02:00
LocalAI [bot]	06f8159035	chore: ⬆️ Update ggml-org/llama.cpp to `67ace021da905e27ecbdf1176b0eef578a5288c0` (#9897 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-20 22:05:58 +02:00
LocalAI [bot]	24e04d8e81	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `77413bc900f9a2bfd8a5407f184427bcc0825f6c` (#9899 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-20 01:02:53 +02:00
LocalAI [bot]	b9a49449ae	chore: ⬆️ Update ggml-org/whisper.cpp to `afa2ea544fb4b0448916b4a31ecd33c8685bd482` (#9898 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-20 01:02:25 +02:00
LocalAI [bot]	1879e11042	chore: ⬆️ Update antirez/ds4 to `599e49d253971451f710cb8323344e789906ed6c` (#9900 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-20 01:01:45 +02:00
LocalAI [bot]	4b02d23c0c	chore: ⬆️ Update ggml-org/llama.cpp to `5cbaa5e69e09bde3334cd8c355570553a0dca027` (#9876 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-19 08:06:16 +02:00
LocalAI [bot]	21140e96b2	chore: ⬆️ Update ggml-org/whisper.cpp to `47b9eb37a33c5031a1b667ace64477330b9f36c1` (#9877 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-19 08:05:56 +02:00
LocalAI [bot]	ca51606bfe	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `40aae0b6d86d50c0ee7011b3ce59a233203e430a` (#9875 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-19 08:01:41 +02:00
LocalAI [bot]	11cff1b309	chore: ⬆️ Update ggml-org/llama.cpp to `87589042cac2c390cec8d68fb2fad64e0a2a252a` (#9855 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-18 08:01:30 +02:00
LocalAI [bot]	3cba35ed32	chore: ⬆️ Update antirez/ds4 to `c9dd9499bfa57c1bbfbb4446eff963330ab5329b` (#9864 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-17 23:19:58 +02:00
LocalAI [bot]	265ae35231	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `c35189d83c91aad780aba62b89f2830cb2916223` (#9866 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-17 23:19:43 +02:00
LocalAI [bot]	6a48157a80	chore: ⬆️ Update leejet/stable-diffusion.cpp to `bd17f53b7386fb5f60e8587b75e73c4b2fed3426` (#9854 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-16 23:12:05 +02:00
LocalAI [bot]	41c838b2df	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `3e573cfea6e0a332eff822ffbdb1dd3b112e9051` (#9856 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-16 22:44:08 +02:00
LocalAI [bot]	21e793ad2a	chore: ⬆️ Update antirez/ds4 to `ef0a4905d05263df8e63689f2dd1efac618a752c` (#9857 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-16 22:43:46 +02:00
LocalAI [bot]	d77a9137d8	feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852 ) * feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge, 2026-05-16) to pick up Multi-Token Prediction support. No grpc-server.cpp changes are required: the existing `spec_type` option delegates to upstream's `common_speculative_types_from_names()`, which already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed by MTP is auto-derived inside `common_context_params_to_llama` from `params.speculative.need_n_rs_seq()`, and when no `draft_model` is set the upstream server builds the MTP context off the target model itself. Docs: extend the speculative-decoding section of the model-configuration guide with the new type, both load paths (MTP head embedded in the main GGUF vs. separate `mtp-.gguf` sibling), the PR's recommended `spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also notes that the upstream `-hf` auto-discovery of `mtp-.gguf` siblings is not wired through LocalAI's gRPC layer. Agent guide: short note explaining that new upstream spec types are picked up automatically and that MTP needs no gRPC plumbing. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by `convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and, when present and the user has not configured a `spec_type` explicitly, auto-append the upstream-recommended speculative-decoding tuple: - spec_type:draft-mtp - spec_n_max:6 - spec_p_min:0.75 The 0.75 p_min is pinned defensively because upstream marks the current default with a "change to 0.0f" TODO; locking it here keeps acceptance thresholds stable across future llama.cpp bumps. 
Detection runs in two places: - The model importer (`POST /models/import-uri`, the `/import-model` UI) range-fetches the GGUF header for HuggingFace / direct-URL imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and non-fatal error handling. OCI/Ollama URIs are skipped because the artifact is not directly streamable; the load-time hook covers them once the file is on disk. - The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local header on every model start and appends the same options if `spec_type` is not already set. Both paths share `ApplyMTPDefaults` and respect an explicit user-set `spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo specs cover the append, preserve-user-choice, legacy alias, and nil safety paths. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(importer): resolve huggingface:// URIs before MTP header probe `gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was handing it the raw `huggingface://...` URI directly (and similarly for any other custom downloader scheme). Live-test against `huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf` exposed this: the probe failed with `unsupported protocol scheme "huggingface"`, was caught by the non-fatal error path, and the MTP options were silently never applied to the generated YAML. Route every candidate URI through `downloader.URI.ResolveURL()` and require the resolved form to be HTTP(S). After the fix the probe successfully reads `<arch>.nextn_predict_layers=1` from the real HF GGUF and the emitted ConfigFile carries spec_type:draft-mtp, spec_n_max:6, spec_p_min:0.75 as intended. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-16 22:42:48 +02:00
LocalAI [bot]	00b8989886	chore: ⬆️ Update ggml-org/llama.cpp to `1348f67c58f561808136e8a152a9eddec168f221` (#9842 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-16 08:41:09 +02:00
LocalAI [bot]	43e0d397ca	chore: ⬆️ Update ggml-org/whisper.cpp to `968eebe77225d25e57a3f981da7c696310f0e881` (#9843 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-16 00:30:04 +02:00
LocalAI [bot]	a1a7a219ed	chore: ⬆️ Update antirez/ds4 to `950e8e6474a1c9fabe04e669d607606a7ef8824f` (#9844 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-15 23:46:29 +02:00
LocalAI [bot]	3937ec6527	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `5cc0d86c760e9858e4bed4418400bb39dbe025f2` (#9845 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-15 23:45:54 +02:00
LocalAI [bot]	1355b55794	chore: ⬆️ Update vllm-project/vllm cu130 wheel to `0.21.0` (#9846 ) ⬆️ Update vllm-project/vllm cu130 wheel Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-15 23:45:41 +02:00
LocalAI [bot]	4abf5befbb	chore: ⬆️ Update ggml-org/llama.cpp to `834a243664114487f99520370a7a7b00fc7a486f` (#9826 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-15 10:29:22 +02:00
LocalAI [bot]	195b910260	chore: ⬆️ Update leejet/stable-diffusion.cpp to `0b8296915c4094090cff6bd2e09a5e98288c3c7d` (#9827 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-15 10:19:52 +02:00
LocalAI [bot]	7bd1693ad0	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `0fcffdb64d21e57f0778f342415754156e01adfa` (#9828 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-15 10:08:46 +02:00
LocalAI [bot]	b5ac3a7373	chore: ⬆️ Update ggml-org/whisper.cpp to `46ca43d6399fdeada1b49fb2126ba373bd9ebc38` (#9829 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-15 10:08:24 +02:00
LocalAI [bot]	53de474ef5	chore: ⬆️ Update antirez/ds4 to `04b6fda2be395094cbf2d20d921e7a705a4166ef` (#9830 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-15 10:08:09 +02:00
LocalAI [bot]	6e1dbae256	feat(llama-cpp): expose 12 missing common_params via options[] (#9814 ) The llama.cpp backend already accepts a free-form options: array in the model config that maps to common_params fields, but a coverage audit against upstream pin 7f3f843c flagged 12 user-visible knobs that were neither set via the typed proto fields nor reachable via options:. Wire them up under the existing if/else chain in params_parse, before the speculative section. Each new option follows the file's prevailing patterns (try/catch around numeric parses, the same true/1/yes/on bool form used elsewhere, hardware_concurrency() fallback for thread counts, mirror of draft_override_tensor for override_tensor). Top-level / batching / IO: - n_ubatch (alias ubatch) -- physical batch size; was previously force-aliased to n_batch at line 482, blocking embedding/rerank workloads that need independent control - threads_batch (alias n_threads_batch) -- main-model batch threads; mirrors the existing draft_threads_batch - direct_io (alias use_direct_io) -- O_DIRECT model loads - verbosity -- llama.cpp log threshold (line 479 had this commented out) - override_tensor (alias tensor_buft_overrides) -- per-tensor buffer overrides for the main model; mirrors draft_override_tensor Embedding / multimodal: - pooling_type (alias pooling) -- mean/cls/last/rank/none; previously only auto-flipped to RANK for rerankers - embd_normalize (alias embedding_normalize) -- and the embedding handler now reads params_base.embd_normalize instead of a hardcoded 2 at the previous embd_normalize literal in Embedding() - mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU - image_min_tokens / image_max_tokens -- per-image vision token budget Reasoning surface (the audit-focus three; LocalAI's existing ReasoningConfig.DisableReasoning only feeds the per-request chat_template_kwargs.enable_thinking and does not touch any of these): - reasoning_format -- none/auto/deepseek/deepseek-legacy parser - 
enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget - prefill_assistant -- trailing-assistant-message prefill toggle All 14 referenced fields exist on both the upstream pin and the turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard is needed. Docs: extend model-configuration.md with new "Reasoning Models", "Multimodal Backend Options", "Embedding & Reranking Backend Options", and "Other Backend Tuning Options" subsections; also refresh the Speculative Type Values table to show the new dash-separated canonical names alongside the underscore aliases LocalAI still accepts. Assisted-by: claude-code:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-14 08:53:34 +02:00
LocalAI [bot]	53bdb18d10	chore: ⬆️ Update ggml-org/llama.cpp to `7f3f843c31cd32dc4adc10b393342dfee071c332` (#9809 ) * ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(llama-cpp): adapt to upstream COMMON_SPECULATIVE_TYPE_DRAFT rename ggml-org/llama.cpp#22964 ("spec: update CLI arguments for better consistency") renamed the speculative type enum values: COMMON_SPECULATIVE_TYPE_DRAFT -> COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE COMMON_SPECULATIVE_TYPE_EAGLE3 -> COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3 and the registered name strings flipped from underscore-separated to dash-separated form (e.g. ngram_simple -> ngram-simple), with the bare draft/eagle3 aliases replaced by draft-simple/draft-eagle3. This broke the build with the new LLAMA_VERSION on every variant (vulkan/arm64, darwin and likely all the rest) at grpc-server.cpp:461. Update the upstream branch of the speculative-type fallback to use the new identifier (the LOCALAI_LEGACY_LLAMA_CPP_SPEC fork branch keeps the old name), and normalize spec_type option tokens before passing them to common_speculative_types_from_names so existing model configs that say spec_type:draft / spec_type:ngram_simple keep working. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7 --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-14 08:53:23 +02:00
LocalAI [bot]	0353d3bd77	chore: ⬆️ Update ggml-org/whisper.cpp to `3e9b7d0fef3528ee2208da3cdb873a2c53d2ae2f` (#9808 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-14 00:20:14 +02:00
LocalAI [bot]	ec49995190	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `949bb8f1d660fc1264c137a6f3dbd619375f6134` (#9807 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-14 00:15:32 +02:00
LocalAI [bot]	4430fae779	chore: ⬆️ Update antirez/ds4 to `0cba357ca1bc0e7510421cc26888e420ea942123` (#9806 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-14 00:14:23 +02:00
LocalAI [bot]	ddbbdf45b9	chore: ⬆️ Update TheTom/llama-cpp-turboquant to `5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403` (#9740 ) ⬆️ Update TheTom/llama-cpp-turboquant Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-13 21:58:33 +02:00
Richard Palethorpe	0245b33eab	feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801 ) * feat(liquid-audio): add LFM2.5-Audio any-to-any backend + realtime_audio usecase Wires LiquidAI's LFM2.5-Audio-1.5B as a self-contained Realtime API model: single engine handles VAD, transcription, LLM, and TTS in one bidirectional stream — drop-in alternative to a VAD+STT+LLM+TTS pipeline. Backend - backend/python/liquid-audio/ — new Python gRPC backend wrapping the `liquid-audio` package. Modes: chat / asr / tts / s2s, voice presets, Load/Predict/PredictStream/AudioTranscription/TTS/VAD/AudioToAudioStream/Free and StartFineTune/FineTuneProgress/StopFineTune. Runtime monkey-patch on `liquid_audio.utils.snapshot_download` so absolute local paths from LocalAI's gallery resolve without a HF round-trip. soundfile in place of torchaudio.load/save (torchcodec drags NVIDIA NPP we don't bundle). - backend/backend.proto + pkg/grpc/{backend,client,server,base,embed,interface}.go — new AudioToAudioStream RPC mirroring AudioTransformStream (config/frame/control oneof in; typed event+pcm+meta out). - core/services/nodes/{health_mock,inflight}_test.go — add stubs for the new RPC to the test fakes. Config + capabilities - core/config/backend_capabilities.go — UsecaseRealtimeAudio, MethodAudioToAudioStream, UsecaseInfoMap entry, liquid-audio BackendCapability row. - core/config/model_config.go — FLAG_REALTIME_AUDIO bitmask, ModalityGroups membership in both speech-input and audio-output groups so a lone flag still reads as multimodal, GetAllModelConfigUsecases entry, GuessUsecases branch. Realtime endpoint - core/http/endpoints/openai/realtime.go — extract prepareRealtimeConfig() so the gate is unit-testable; accept realtime_audio models and self-fill empty pipeline slots with the model's own name (user-pinned slots win).
- core/http/endpoints/openai/realtime_gate_test.go — six specs covering nil cfg, empty pipeline, legacy pipeline, self-contained realtime_audio, user-pinned VAD slot, and partial legacy pipeline. UI + endpoints - core/http/routes/ui.go — /api/pipeline-models accepts either a legacy VAD+STT+LLM+TTS pipeline or a realtime_audio model; surfaces a self_contained flag so the Talk page can collapse the four cards. - core/http/routes/ui_api.go — realtime_audio in usecaseFilters. - core/http/routes/ui_pipeline_models_test.go — covers both code paths. - core/http/react-ui/src/pages/Talk.jsx — self-contained badge instead of the four-slot grid; rename Edit Pipeline → Edit Model Config; less pipeline-specific wording. - core/http/react-ui/src/pages/Models.jsx + locales/en/models.json — new realtime_audio filter button + i18n. - core/http/react-ui/src/utils/capabilities.js — CAP_REALTIME_AUDIO. - core/http/react-ui/src/pages/FineTune.jsx — voice + validation-dataset fields, surfaced when backend === liquid-audio, plumbed via extra_options on submit/export/import. Gallery + importer - gallery/liquid-audio.yaml — config template with known_usecases: [realtime_audio, chat, tts, transcript, vad]. - gallery/index.yaml — four model entries (realtime/chat/asr/tts) keyed by mode option. Fixed pre-existing `transcribe` typo on the asr entry (loader silently dropped the unknown string → entry never surfaced as a transcript model). - gallery/lfm.yaml — function block for the LFM2 Pythonic tool-call format `<|tool_call_start|>[name(k="v")]<|tool_call_end|>` matching common_chat_params_init_lfm2 in vendored llama.cpp. - core/gallery/importers/{liquid-audio,liquid-audio_test}.go — detector matches LFM2-Audio HF repos (excludes -gguf mirrors); mode/voice preferences plumbed through to options. - core/gallery/importers/importers.go — register LiquidAudioImporter before LlamaCPPImporter.
- pkg/functions/parse_lfm2_test.go — seven specs for the response/argument regex pair on the LFM2 pythonic format. Build matrix - .github/backend-matrix.yml — seven liquid-audio targets (cuda12, cuda13, l4t-cuda-13, hipblas, intel, cpu amd64, cpu arm64). Jetpack r36 cuda-12 is skipped (Ubuntu 22.04 / Python 3.10 incompatible with liquid-audio's 3.12 floor). - backend/index.yaml — anchor + 13 image entries. - Makefile — .NOTPARALLEL, prepare-test-extra, test-extra, docker-build-liquid-audio. Docs - .agents/plans/liquid-audio-integration.md — phased plan; PR-D (real any-to-any wiring via AudioToAudioStream), PR-E (mid-audio tool-call detector), PR-G (GGUF entries once upstream llama.cpp PR #18641 lands) remain. - .agents/api-endpoints-and-auth.md — expand the capability-surface checklist with every place a new FLAG_* needs to be registered. Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): function calling + history cap for any-to-any models Three pieces, all on the realtime_audio path that just landed: 1. liquid-audio backend (backend/python/liquid-audio/backend.py): - _build_chat_state grows a `tools_prelude` arg. - new _render_tools_prelude parses request.Tools (the OpenAI Chat Completions function array realtime.go already serialises) and emits an LFM2 `<|tool_list_start|>…<|tool_list_end|>` system turn ahead of the user history. Mirrors gallery/lfm.yaml's `function:` template so the model sees the same prompt shape whether served via llama-cpp or here. Without this the backend silently dropped tools — function calling was wired end-to-end on the Go side but the model never saw a tool list. 2. Realtime history cap (core/http/endpoints/openai/realtime.go): - Session grows MaxHistoryItems int; default picked by new defaultMaxHistoryItems(cfg) — 6 for realtime_audio models (LFM2.5 1.5B degrades quickly past a handful of turns), 0/unlimited for legacy pipelines composing larger LLMs.
- triggerResponse runs conv.Items through trimRealtimeItems before building conversationHistory. Helper walks the cut left if it would orphan a function_call_output, so tool result + call pairs stay intact. - realtime_gate_test.go: specs for defaultMaxHistoryItems and trimRealtimeItems (zero cap, under cap, over cap, tool-call pair preservation). 3. Talk page (core/http/react-ui/src/pages/Talk.jsx): - Reuses the chat page's MCP plumbing — useMCPClient hook, ClientMCPDropdown component, same auto-connect/disconnect effect pattern. No bespoke tool registry, no new REST endpoints; tools come from whichever MCP servers the user toggles on, exactly as on the chat page. - sendSessionUpdate now passes session.tools=getToolsForLLM(); the update re-fires when the active server set changes mid-session. - New response.function_call_arguments.done handler executes via the hook's executeTool (which round-trips through the MCP client SDK), then replies with conversation.item.create {type:function_call_output} + response.create so the model completes its turn with the tool output. Mirrors chat's client-side agentic loop, translated to the realtime wire shape. UI changes require a LocalAI image rebuild (Dockerfile:308-313 bakes react-ui/dist into the runtime image). Backend.py changes can be swapped live in /backends/<id>/backend.py + /backend/shutdown. Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): LocalAI Assistant ("Manage Mode") for the Talk page Mirrors the chat-page metadata.localai_assistant flow so users can ask the realtime model what's loaded / installed / configured. Tools are run server-side via the same in-process MCP holder that powers the chat modality — no transport switch, no proxy, no new wire protocol. 
Wire: - core/http/endpoints/openai/realtime.go: - RealtimeSessionOptions{LocalAIAssistant,IsAdmin}; isCurrentUserAdmin helper mirrors chat.go's requireAssistantAccess (no-op when auth disabled, else requires auth.RoleAdmin). - Session grows AssistantExecutor mcpTools.ToolExecutor. - runRealtimeSession, when opts.LocalAIAssistant is set: gate on admin, fail closed if DisableLocalAIAssistant or the holder has no tools, DiscoverTools and inject into session.Tools, prepend holder.SystemPrompt() to instructions. - Tool-call dispatch loop: when AssistantExecutor.IsTool(name), run ExecuteTool inproc, append a FunctionCallOutput to conv.Items, skip the function_call_arguments client emit (the client can't execute these — it doesn't know about them). After the loop, if any assistant tool ran, trigger another response so the model speaks the result. Mirrors chat's agentic loop, driven server-side rather than via client round-trip. - core/http/endpoints/openai/realtime_webrtc.go: RealtimeCallRequest gains `localai_assistant` (JSON omitempty). Handshake calls isCurrentUserAdmin and builds RealtimeSessionOptions. - core/http/react-ui/src/pages/Talk.jsx: admin-only "Manage Mode" checkbox under the Tools dropdown; passes localai_assistant: true to realtimeApi.call's body, captured in the connect callback's deps. Mirroring chat's pattern means the in-process MCP tools surface "just works" for the Talk page without exposing a Streamable-HTTP MCP endpoint (which was the alternative). Clients with their own MCP servers can still use the existing ClientMCPDropdown path in parallel; the realtime handler distinguishes them by AssistantExecutor.IsTool() at dispatch time. 
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): render Manage Mode tool calls in the Talk transcript Previously the realtime endpoint only emitted response.output_item.added for the FunctionCall item, and Talk.jsx's switch ignored the event — so server-side tool runs were invisible in the UI. The model would speak the result but the user had no way to see what tool was actually called. realtime.go: after executing an assistant tool inproc, emit a second output_item.added/.done pair for the FunctionCallOutput item. Mirrors the way the chat page displays tool_call + tool_result blocks. Talk.jsx: handle both response.output_item.added and .done. Render FunctionCall (with arguments) and FunctionCallOutput (pretty-printed JSON when possible) as two transcript entries — `tool_call` with the wrench icon, `tool_result` with the clipboard icon, both in mono-space secondary-colour. Resets streamingRef after the result so the next assistant text delta starts a fresh transcript entry instead of appending to the previous turn. Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * refactor(realtime): bound the Manage Mode tool-loop + preserve assistant tools Fallout from a review pass on the Manage Mode patches: - Bound the server-side agentic loop. triggerResponse used to recurse on executedAssistantTool with no cap — a model that kept calling tools would blow the goroutine stack. New maxAssistantToolTurns = 10 (mirrors useChat.js's maxToolTurns). Public triggerResponse is now a thin shim over triggerResponseAtTurn(toolTurn int); recursion increments the counter and stops at the cap with an xlog.Warn. - Preserve Manage Mode tools across client session.update. The handler used to blindly overwrite session.Tools, so toggling a client MCP server mid-session silently wiped the in-process admin tools. 
Session now caches the original AssistantTools slice at session creation and the session.update handler merges them back in (client names win on collision — the client is explicit). - strconv.ParseBool for the localai_assistant query param instead of hand-rolled "1" || "true". Mirrors LocalAIAssistantFromMetadata. - Talk.jsx: render both tool_call and tool_result on response.output_item.done instead of splitting them across .added and .done. The server's event pairing (added → done) stays correct; the UI just doesn't need to inspect both phases of the same item. One switch case instead of two, no behavioural change. Out of scope (noted for follow-ups): extract a shared assistant-tools helper between chat.go and realtime.go (duplication is small enough that two parallel implementations stay readable for now), and an i18n key for the Manage Mode helper text (Talk.jsx doesn't use i18n anywhere else yet). Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * ci(test-extra): wire liquid-audio backend smoke test The backend ships test.py + a `make test` target and is listed in backend-matrix.yml, so scripts/changed-backends.js already writes a `liquid-audio=true|false` output when files under backend/python/liquid-audio/ change. The workflow just wasn't reading it. - Expose the `liquid-audio` output on the detect-changes job - Add a tests-liquid-audio job that runs `make` + `make test` in backend/python/liquid-audio, gated on the per-backend detect flag The smoke covers Health() and LoadModel(mode:finetune); fine-tune mode short-circuits before any HuggingFace download (backend.py:192), so the job needs neither weights nor a GPU. The full-inference path remains gated on LIQUID_AUDIO_MODEL_ID, which CI doesn't set. 
The four new Go test files (core/gallery/importers/liquid-audio_test.go, core/http/endpoints/openai/realtime_gate_test.go, core/http/routes/ui_pipeline_models_test.go, pkg/functions/parse_lfm2_test.go) are already picked up by the existing test.yml workflow via `make test` → `ginkgo -r ./pkg/... ./core/...`; their packages all carry RunSpecs entries. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-13 21:57:27 +02:00
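The history-cap behaviour described in the commit above (trim to the most recent items, but walk the cut left so that a `function_call_output` is never separated from its call) can be illustrated with a minimal Go sketch. This is not the code from realtime.go: the `Item` type, its field names, and `trimItems` are hypothetical stand-ins, and the sketch assumes each tool output directly follows its matching call.

```go
package main

import "fmt"

// Item is a hypothetical stand-in for a realtime conversation item.
// The real LocalAI types are not shown in the commit message.
type Item struct {
	Type   string // "message", "function_call" or "function_call_output"
	CallID string
}

// trimItems keeps the trailing `limit` items; limit <= 0 means unlimited.
// If the first kept item would be a function_call_output, the cut moves
// left until it no longer starts on an output, so the call/result pair
// stays together. Assumption: an output directly follows its call.
func trimItems(items []Item, limit int) []Item {
	if limit <= 0 || len(items) <= limit {
		return items
	}
	cut := len(items) - limit
	for cut > 0 && items[cut].Type == "function_call_output" {
		cut--
	}
	return items[cut:]
}

func main() {
	history := []Item{
		{Type: "message"},
		{Type: "message"},
		{Type: "function_call", CallID: "c1"},
		{Type: "function_call_output", CallID: "c1"},
		{Type: "message"},
	}
	// A cap of 2 would start on the tool output, so the call is kept too.
	for _, it := range trimItems(history, 2) {
		fmt.Println(it.Type)
	}
}
```

In this sketch the result may exceed the cap by the items needed to keep a pair intact, which trades a slightly longer history for a prompt the model can still interpret.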
LocalAI [bot]	a645c1f4aa	chore: ⬆️ Update ggml-org/llama.cpp to `a9883db8ee021cf16783016a60996d41820b5195` (#9796 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-13 21:40:31 +02:00
LocalAI [bot]	957619af53	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `f9a93c37e2fc021760c3c1aa99cf74c73b7591a7` (#9795 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-13 00:40:48 +02:00
LocalAI [bot]	0b81e36504	chore: ⬆️ Update antirez/ds4 to `f8b4ed635d559b3a5b44bf2df6a77e21b3e9178f` (#9794 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-13 00:40:09 +02:00
LocalAI [bot]	602866a9d8	chore: ⬆️ Update ggml-org/whisper.cpp to `338cce1e58133261753243802a0e7a430118866d` (#9793 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-13 00:39:57 +02:00
LocalAI [bot]	bc4cd3dd85	feat(llama-cpp): bump to `1ec7ba0c`, adapt grpc-server, expose new spec-decoding options (#9765 ) * chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994 Picks up the upstream `spec : parallel drafting support` change (ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API and `server_context_impl`. Adapt the grpc-server wrapper accordingly: * `common_params_speculative::type` (single enum) became `types` (`std::vector<common_speculative_type>`). Update both the "default to draft when a draft model is set" branch and the `spec_type`/`speculative_type` option parser. The parser now also tolerates comma-separated lists, mirroring the upstream `common_speculative_types_from_names` semantics. * `common_params_speculative_draft::n_ctx` is gone (draft now shares the target context size). Keep the `draft_ctx_size` option name for backward compatibility and ignore the value rather than failing. * `server_context_impl::model` was renamed to `model_tgt`; update the two reranker / model-metadata call sites. Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp target locally. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(llama-cpp): expose new speculative-decoding option keys Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838) adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative families and beefs up the draft-model knobs. The previous bump only adapted the API; this exposes the new fields through the grpc-server options dictionary so model configs can drive them. 
New `options:` keys (all under `backend: llama-cpp`): ngram_mod (`ngram_mod` type): spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match ngram_map_k (`ngram_map_k` type): spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits ngram_map_k4v (`ngram_map_k4v` type): spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m / spec_ngram_map_k4v_min_hits ngram lookup caches (`ngram_cache` type): spec_lookup_cache_static / lookup_cache_static spec_lookup_cache_dynamic / lookup_cache_dynamic Draft-model tuning (active when `spec_type` is `draft`): draft_cache_type_k / spec_draft_cache_type_k draft_cache_type_v / spec_draft_cache_type_v draft_threads / spec_draft_threads draft_threads_batch / spec_draft_threads_batch draft_cpu_moe / spec_draft_cpu_moe (bool flag) draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU) draft_override_tensor / spec_draft_override_tensor (comma-separated <tensor regex>=<buffer type>; re-implements upstream's static parse_tensor_buffer_overrides since it isn't exported) `spec_type` already accepted comma-separated lists after the previous commit, matching upstream's `common_speculative_types_from_names`. Docs: refresh `docs/content/advanced/model-configuration.md` with per-family tables and a note about multi-type chaining. Builds locally with `make docker-build-llama-cpp` (linux/amd64 cpu-llama-cpp AVX variant). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile, which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh on the copy. 
The fork branched before the API refactor, so it errors out on: * `ctx_server.impl->model_tgt` (fork still has `model`) * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.` (none of these sub-structs exist in the fork) `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads, tensor_buft_overrides}` (fork uses the pre-#22397 flat layout) * `params.speculative.types` vector / `common_speculative_types_from_names` (fork has a scalar `type` and only the singular helper) Approach: 1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]` discriminations (the "default to draft when a draft model is set" branch and the `spec_type` / `speculative_type` option parser) fall back to the singular scalar form, and the entire new-option block (ngram_mod / map_k / map_k4v / ngram_cache / draft.{cache_type_, cpuparams, tensor_buft_overrides}) is preprocessed out. The macro is not defined in the source tree — stock llama-cpp builds get the full new API. 2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp: - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model` - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first `#include`, so the guarded blocks above drop out for the fork build. Both patches are idempotent and follow the existing sed/awk pattern in this script (KV cache types, `get_media_marker`, flat speculative renames). Stock llama-cpp's `grpc-server.cpp` is never touched. Drop both legacy patches once the turboquant fork rebases past ggml-org/llama.cpp#22397 / #22838. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(turboquant): close draft_ctx_size brace inside legacy guard The previous turboquant fix wrapped the new option-handler blocks in `#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... 
#endif` but placed the guard in the middle of an `else if` chain — the `} else if` openings of the new blocks were responsible for closing the previous block's brace. With the macro defined the new blocks vanish, draft_ctx_size's `{` loses its closer, the for-loop's `}` is consumed instead, and the file ends with a stray opening brace — clang reports it as `function-definition is not allowed here before '{'` on the next top-level `int main(...)` and `expected '}' at end of input`. Move the chain split inside the draft_ctx_size branch: } else if (... "draft_ctx_size") { // ... #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC } // legacy: chain ends here #else } else if (... "spec_ngram_mod_n_min") { // modern: chain continues ... } else if (... "draft_override_tensor") { ... } // closes last branch #endif } // closes for-loop Brace count is now balanced under both preprocessor branches (verified with `tr -cd '{' | wc -c` against the patched and unpatched outputs). Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp `turboquant-avx` variant cleanly. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt Dockerfile.turboquant's `builder-prebuilt` stage was missing the `ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that `builder-fromsource` already has (and that `Dockerfile.llama-cpp` mirrors across both stages). When CI uses the prebuilt base image (quay.io/go-skynet/ci-cache:base-grpc-, the common path) the build-arg passed by the workflow never reaches the env inside the compile stage. backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on hipblas builds when AMDGPU_TARGETS is empty, and the turboquant Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the same check fires from turboquant-fallback under BUILD_TYPE=hipblas: Makefile:38: AMDGPU_TARGETS is empty — set it to a comma-separated list of gfx targets e.g. gfx1100,gfx1101. 
Stop. make: *** [Makefile:66: turboquant-fallback] Error 2 The bug is latent on master because the docker layer cache stays warm across builds — the compile step rarely re-runs from scratch. The llama.cpp bump in this PR invalidates the cache, so the missing env var becomes load-bearing and the hipblas turboquant CI job fails. Mirror the existing pattern from Dockerfile.llama-cpp. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-12 17:22:37 +02:00
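The commit above notes that the `spec_type` option now tolerates comma-separated lists. The actual parser lives in the C++ grpc-server and follows upstream's `common_speculative_types_from_names`; the Go sketch below only illustrates the general idea of splitting and validating such a list. The function `parseSpecTypes` and the `knownSpecTypes` set are hypothetical, and the set contains only the type names mentioned in the commit message, so the real upstream list may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// knownSpecTypes holds only the type names that appear in the commit
// message. Assumption: the real upstream set may be larger or differ.
var knownSpecTypes = map[string]bool{
	"draft":         true,
	"ngram_mod":     true,
	"ngram_map_k":   true,
	"ngram_map_k4v": true,
	"ngram_cache":   true,
}

// parseSpecTypes splits a comma-separated spec_type value, trims
// surrounding whitespace, skips empty entries, and returns an error
// for any name that is not in knownSpecTypes.
func parseSpecTypes(value string) ([]string, error) {
	var types []string
	for _, part := range strings.Split(value, ",") {
		name := strings.TrimSpace(part)
		if name == "" {
			continue
		}
		if !knownSpecTypes[name] {
			return nil, fmt.Errorf("unknown speculative type %q", name)
		}
		types = append(types, name)
	}
	return types, nil
}

func main() {
	types, err := parseSpecTypes("draft, ngram_mod")
	fmt.Println(types, err)
}
```

Rejecting unknown names up front is a design choice of this sketch; whether the real parser fails or ignores unknown entries is not stated in the commit message.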
dependabot[bot]	61c9b187fa	chore(deps): update charset-normalizer requirement from >=3.4.0 to >=3.4.7 in /backend/python/vllm (#9779 ) chore(deps): update charset-normalizer requirement Updates the requirements on [charset-normalizer](https://github.com/jawah/charset_normalizer) to permit the latest version. - [Release notes](https://github.com/jawah/charset_normalizer/releases) - [Changelog](https://github.com/jawah/charset_normalizer/blob/master/CHANGELOG.md) - [Commits](https://github.com/jawah/charset_normalizer/compare/3.4.0...3.4.7) --- updated-dependencies: - dependency-name: charset-normalizer dependency-version: 3.4.7 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-12 09:22:23 +02:00
dependabot[bot]	abc2a51641	chore(deps): update transformers requirement from >=5.0.0 to >=5.8.0 in /backend/python/transformers (#9775 ) chore(deps): update transformers requirement Updates the requirements on [transformers](https://github.com/huggingface/transformers) to permit the latest version. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v5.0.0...v5.8.0) --- updated-dependencies: - dependency-name: transformers dependency-version: 5.8.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-12 09:21:05 +02:00

1301 Commits