506 Commits

Author SHA1 Message Date
LocalAI [bot]
4d01298048 chore: ⬆️ Update antirez/ds4 to e8e8779b261c10f36ad6270ba732c8f0be5b62e3 (#10024)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-27 15:16:43 +02:00
LocalAI [bot]
51bad74bf8 chore: ⬆️ Update ggml-org/llama.cpp to 0d18aaa9d1a8af3df9abccd828e22eeaac7f840b (#10022)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-27 00:29:14 +02:00
LocalAI [bot]
eed3ecff82 chore: ⬆️ Update ikawrakow/ik_llama.cpp to d2da6da05c73aeb658a3d1751f386c24e6963856 (#10020)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-27 00:28:32 +02:00
LocalAI [bot]
4aad97971c chore: ⬆️ Update ggml-org/llama.cpp to 35c9b1f39ebe5a7bb83986d64415a079218be78d (#9998)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): track upstream rename checkpoint_every_nt -> checkpoint_min_step

Upstream llama.cpp renamed common_params::checkpoint_every_nt to
checkpoint_min_step and changed its default from 8192 to 256. The semantics
also shifted: it used to enforce a fixed checkpoint cadence during prefill,
now it sets a minimum spacing between context checkpoints. Track the new
field name in grpc-server.cpp and accept the old option names as backward-
compatible aliases for users with existing configs.
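
A minimal sketch of the alias handling described above, not the actual grpc-server.cpp code. The struct, the function name, and the assumption that the new option key is spelled `checkpoint_min_step` are hypothetical; the two legacy keys are the option names introduced in #9925.

```cpp
#include <exception>
#include <string>

// Hypothetical stand-in for the single common_params field discussed above.
struct checkpoint_params {
    int checkpoint_min_step = 256; // new upstream default per the commit message
};

// Returns true when `key` is the (assumed) new option name or one of the
// legacy names. All three write to the same renamed field.
bool apply_checkpoint_option(checkpoint_params & params,
                             const std::string & key,
                             const std::string & value) {
    const bool is_new    = key == "checkpoint_min_step";
    const bool is_legacy = key == "checkpoint_every_nt" ||
                           key == "checkpoint_every_n_tokens";
    if (!is_new && !is_legacy) {
        return false;
    }
    try {
        params.checkpoint_min_step = std::stoi(value);
    } catch (const std::exception &) {
        // Unparseable value: keep whatever was set before.
    }
    return true;
}
```

Note that an existing config carrying `checkpoint_every_nt:8192` would, under this sketch, request a minimum spacing of 8192 tokens rather than a fixed cadence, since the upstream semantics changed along with the name.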

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-26 08:34:41 +02:00
LocalAI [bot]
5d544a7868 chore: ⬆️ Update ikawrakow/ik_llama.cpp to b4e1d916c5ec7e75ea3c124dd090425a99fc613f (#9995)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 23:57:17 +02:00
LocalAI [bot]
87e01aa290 chore: ⬆️ Update antirez/ds4 to ad0209f6a4b067574d2b4afe896c08c177156b31 (#9996)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 23:56:33 +02:00
Richard Palethorpe
6a80e23733 feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802)
Add a routing middleware stack and a cloud-proxy backend.

* cloud-proxy: a Go gRPC backend that forwards OpenAI- and
  Anthropic-shaped chat requests to upstream providers, with an
  optional translate mode (OpenAI request -> Anthropic /v1/messages
  -> OpenAI response) and full tool-calling support.

* routing: admission control, content-aware model routing
  (embedding cache + classifier + rerank + Arch-Router score),
  PII detection/redaction (regex + NER) with streaming filter and
  OpenAI/Anthropic adapters, and a per-user/per-key billing recorder
  backed by GORM or in-memory storage.

* middleware: UsageMiddleware records usage via the billing recorder,
  plus admission, route-model, usage-stamp and trace middlewares.

* observability: BackendTrace ring buffer stores full request bodies
  (capped), MITM proxy emits structured trace events, and router
  classifier decisions surface at /api/router/decide.

* gallery: Arch-Router-1.5B (Q4_K_M and Q8_0).

* UI: cloud-proxy model-editor fields, classifier system-prompt and
  score-normalization config, and a Traces page rendering request
  bodies.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-25 09:28:27 +02:00
LocalAI [bot]
1dcd1ae915 chore: ⬆️ Update ggml-org/llama.cpp to 549b9d84330c327e6791fa812a7d60c0cf63572e (#9974)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 09:22:56 +02:00
LocalAI [bot]
acad78a95a chore: ⬆️ Update ikawrakow/ik_llama.cpp to 9f7ba245ab41e118f03aa8dd5134d18a81159d02 (#9973)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 00:05:29 +02:00
LocalAI [bot]
c94d1e1f5b chore: ⬆️ Update antirez/ds4 to f91c12b50a1448527c435c028bfc70d1b00f6c33 (#9975)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 00:05:15 +02:00
LocalAI [bot]
a95f4e63e0 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 642c038ccdf3dd08e6d9ac6fdc3b1c311ebd8a02 (#9966)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:52:51 +02:00
LocalAI [bot]
dfd19a3f88 chore: ⬆️ Update ggml-org/llama.cpp to c0c7e147e7efa6c5858754b47259ba4880f8a906 (#9963)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:52:36 +02:00
LocalAI [bot]
63d84a5705 chore: ⬆️ Update antirez/ds4 to 444afce822057d87f14c4dec307dce24fd49b3ee (#9964)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:51:53 +02:00
LocalAI [bot]
e4cc1f11f3 chore: ⬆️ Update ggml-org/llama.cpp to 1acee6bf8939948f9bcbf4b14034e4b475f06069 (#9952)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 08:38:29 +02:00
LocalAI [bot]
d0a59be9de chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3d39cff8bffbd67296d6badd4076a1486a0715c (#9953)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 23:58:48 +02:00
LocalAI [bot]
4735345105 chore: ⬆️ Update ggml-org/llama.cpp to bb28c1fe246b72276ee1d00ce89306be7b865766 (#9934)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 09:49:33 +02:00
LocalAI [bot]
7384fd800b chore: ⬆️ Update antirez/ds4 to 8d576642c39b9a2d782a80159ba84ef5a81c0b81 (#9932)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 08:31:49 +02:00
LocalAI [bot]
0d34cf7cbd chore: ⬆️ Update ikawrakow/ik_llama.cpp to 48a55f74e4c6e2aeda363dd386c1ac9170a0af71 (#9930)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 23:23:37 +02:00
LocalAI [bot]
959de86761 feat(llama-cpp): make server-side prompt cache work by default (#9925)
Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt
cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible
CLIs, coding assistants) skip prefill on subsequent calls without any
YAML changes. Reported in #9921.

Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4)
when slot count is auto, which unlocks `cache_idle_slots`. LocalAI
hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`,
which silently force-disables idle-slot saving at server init. The host
prompt cache was allocated but never written across requests.

Changes in backend/cpp/llama-cpp/grpc-server.cpp:
- params.kv_unified: false -> true (single-slot path now benefits from
  the prompt cache; users can opt out with `kv_unified:false`)
- params.n_ctx_checkpoints: 8 -> 32 (match upstream default)
- params.cache_idle_slots = true initialized explicitly (upstream default)
- params.checkpoint_every_nt = 8192 initialized explicitly (upstream default)
- New option parsers: cache_idle_slots / idle_slots_cache,
  checkpoint_every_nt / checkpoint_every_n_tokens

Docs:
- features/text-generation.md: fix misleading `cache_ram` description
  (it's the host-side prompt cache, not the KV cache), document the
  kv_unified + cache_ram + cache_idle_slots interaction, add rows for
  the two newly-exposed options, and add a worked example for the
  agent/CLI workload from the issue.
- advanced/model-configuration.md: mark the legacy `prompt_cache_path`
  / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the
  llama-cpp gRPC backend (they target upstream's CLI completion tool
  and are not consumed by grpc-server.cpp) and point readers at the
  new prompt-cache explainer.

Closes #9921

Assisted-by: claude:opus-4.7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 16:31:48 +02:00
Richard Palethorpe
c68818a62e fix(llama-cpp): terminate tensor_buft_overrides with sentinel (#9919)
llama.cpp's model loader asserts that params.tensor_buft_overrides ends
with a null pattern (back().pattern == nullptr) and that
params.kv_overrides ends with an empty key (back().key[0] == 0) before
binding them into llama_model_params. PR #8560 attempted
to satisfy llama_params_fit's placeholder requirement by pre-filling
params.tensor_buft_overrides up to llama_max_tensor_buft_overrides()
*before* the option-parse loop. Any subsequent push_back from
override_tensor / draft_cpu_moe / draft_n_cpu_moe / draft_override_tensor
then appended real entries after the placeholders, leaving back() with
a real pattern and tripping the assert. The draft override vector
likewise had no terminator at all.

Mirror upstream common/arg.cpp:645-658 instead: real entries are
pushed during option parsing, and after parsing we pad the main vector
up to ntbo (placeholders land at the end, so back() is always nullptr)
and append a single {nullptr, nullptr} to the draft vector when it is
non-empty. The existing kv_overrides terminator block already matches
upstream and stays.

Verified against ggml-org/llama.cpp@5cbaa5e: only tensor_buft_overrides
(main + draft) and kv_overrides are sentinel-terminated common_params
fields; everything else is size-driven std::vector.
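
The ordering described above (real entries first, placeholders last) can be sketched as follows. This is an illustration under stated assumptions, not the upstream code: `override_entry` is a hypothetical stand-in for llama.cpp's override struct, and `ntbo` stands for the value returned by llama_max_tensor_buft_overrides().

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tensor buffer override entry.
struct override_entry {
    const char * pattern; // nullptr marks a placeholder / terminator
    const void * buft;    // buffer type handle, unused in this sketch
};

// Main vector: run after option parsing, so placeholders land at the end.
void pad_main_overrides(std::vector<override_entry> & v, std::size_t ntbo) {
    while (v.size() < ntbo) {
        v.push_back({nullptr, nullptr});
    }
}

// Draft vector: a single terminator, and only when real entries exist.
void terminate_draft_overrides(std::vector<override_entry> & v) {
    if (!v.empty()) {
        v.push_back({nullptr, nullptr});
    }
}
```

Calling `pad_main_overrides` before the parse loop instead would reproduce the bug: later push_back calls would append real patterns after the placeholders.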

Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-21 12:55:06 +02:00
LocalAI [bot]
12e056e96d chore: ⬆️ Update ggml-org/llama.cpp to ad277572619fcfb6ddd38f4c6437283a4b2b8636 (#9915)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 09:07:31 +02:00
LocalAI [bot]
b2d68a53a2 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 11a1fea9e291f12ce2c803a9d7812c30ca806bcf (#9914)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 22:04:06 +00:00
LocalAI [bot]
1ffd82a050 chore: ⬆️ Update antirez/ds4 to 2606543be7a8c125a32cee37f5d1d85dc78f2fcf (#9909)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 21:22:26 +00:00
LocalAI [bot]
06f8159035 chore: ⬆️ Update ggml-org/llama.cpp to 67ace021da905e27ecbdf1176b0eef578a5288c0 (#9897)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 22:05:58 +02:00
LocalAI [bot]
24e04d8e81 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 77413bc900f9a2bfd8a5407f184427bcc0825f6c (#9899)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 01:02:53 +02:00
LocalAI [bot]
1879e11042 chore: ⬆️ Update antirez/ds4 to 599e49d253971451f710cb8323344e789906ed6c (#9900)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 01:01:45 +02:00
LocalAI [bot]
4b02d23c0c chore: ⬆️ Update ggml-org/llama.cpp to 5cbaa5e69e09bde3334cd8c355570553a0dca027 (#9876)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-19 08:06:16 +02:00
LocalAI [bot]
ca51606bfe chore: ⬆️ Update ikawrakow/ik_llama.cpp to 40aae0b6d86d50c0ee7011b3ce59a233203e430a (#9875)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-19 08:01:41 +02:00
LocalAI [bot]
11cff1b309 chore: ⬆️ Update ggml-org/llama.cpp to 87589042cac2c390cec8d68fb2fad64e0a2a252a (#9855)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-18 08:01:30 +02:00
LocalAI [bot]
3cba35ed32 chore: ⬆️ Update antirez/ds4 to c9dd9499bfa57c1bbfbb4446eff963330ab5329b (#9864)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:19:58 +02:00
LocalAI [bot]
265ae35231 chore: ⬆️ Update ikawrakow/ik_llama.cpp to c35189d83c91aad780aba62b89f2830cb2916223 (#9866)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:19:43 +02:00
LocalAI [bot]
41c838b2df chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3e573cfea6e0a332eff822ffbdb1dd3b112e9051 (#9856)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:44:08 +02:00
LocalAI [bot]
21e793ad2a chore: ⬆️ Update antirez/ds4 to ef0a4905d05263df8e63689f2dd1efac618a752c (#9857)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:43:46 +02:00
LocalAI [bot]
d77a9137d8 feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852)
* feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type

Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge,
2026-05-16) to pick up Multi-Token Prediction support.

No grpc-server.cpp changes are required: the existing `spec_type` option
delegates to upstream's `common_speculative_types_from_names()`, which
already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed
by MTP is auto-derived inside `common_context_params_to_llama` from
`params.speculative.need_n_rs_seq()`, and when no `draft_model` is set
the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration
guide with the new type, both load paths (MTP head embedded in the main
GGUF vs. separate `mtp-*.gguf` sibling), the PR's recommended
`spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also
notes that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is
not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are
picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load

Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by
`convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and,
when present and the user has not configured a `spec_type` explicitly,
auto-append the upstream-recommended speculative-decoding tuple:

  - spec_type:draft-mtp
  - spec_n_max:6
  - spec_p_min:0.75

The 0.75 p_min is pinned defensively because upstream marks the current
default with a "change to 0.0f" TODO; locking it here keeps acceptance
thresholds stable across future llama.cpp bumps.

Detection runs in two places:

  - The model importer (`POST /models/import-uri`, the `/import-model`
    UI) range-fetches the GGUF header for HuggingFace / direct-URL
    imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and
    non-fatal error handling. OCI/Ollama URIs are skipped because the
    artifact is not directly streamable; the load-time hook covers them
    once the file is on disk.
  - The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local
    header on every model start and appends the same options if
    `spec_type` is not already set.

Both paths share `ApplyMTPDefaults` and respect an explicit user-set
`spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo
specs cover the append, preserve-user-choice, legacy alias, and nil
safety paths.
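
The append-unless-overridden rule can be sketched like this. The real `ApplyMTPDefaults` lives in LocalAI's Go code and is not shown here; the C++ below is only an illustration, and the function names, the flat `key:value` string representation, and the integer parameter are assumptions.

```cpp
#include <string>
#include <vector>

// True when `s` begins with `prefix`.
bool has_prefix(const std::string & s, const std::string & prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Appends the MTP tuple when the GGUF reports MTP layers and the user has
// not chosen a speculative type. Returns true when options were appended.
bool apply_mtp_defaults(std::vector<std::string> & options,
                        int nextn_predict_layers) {
    if (nextn_predict_layers <= 0) {
        return false; // no MTP head detected
    }
    for (const auto & opt : options) {
        if (has_prefix(opt, "spec_type:") ||
            has_prefix(opt, "speculative_type:")) {
            return false; // explicit user choice wins
        }
    }
    options.push_back("spec_type:draft-mtp");
    options.push_back("spec_n_max:6");
    options.push_back("spec_p_min:0.75");
    return true;
}
```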

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(importer): resolve huggingface:// URIs before MTP header probe

`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was
handing it the raw `huggingface://...` URI directly (and similarly for
any other custom downloader scheme). Live-test against
`huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf`
exposed this: the probe failed with `unsupported protocol scheme
"huggingface"`, was caught by the non-fatal error path, and the MTP
options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and
require the resolved form to be HTTP(S). After the fix the probe
successfully reads `<arch>.nextn_predict_layers=1` from the real HF
GGUF and the emitted ConfigFile carries spec_type:draft-mtp,
spec_n_max:6, spec_p_min:0.75 as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-16 22:42:48 +02:00
LocalAI [bot]
00b8989886 chore: ⬆️ Update ggml-org/llama.cpp to 1348f67c58f561808136e8a152a9eddec168f221 (#9842)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 08:41:09 +02:00
LocalAI [bot]
a1a7a219ed chore: ⬆️ Update antirez/ds4 to 950e8e6474a1c9fabe04e669d607606a7ef8824f (#9844)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:46:29 +02:00
LocalAI [bot]
3937ec6527 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 5cc0d86c760e9858e4bed4418400bb39dbe025f2 (#9845)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:45:54 +02:00
LocalAI [bot]
4abf5befbb chore: ⬆️ Update ggml-org/llama.cpp to 834a243664114487f99520370a7a7b00fc7a486f (#9826)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:29:22 +02:00
LocalAI [bot]
7bd1693ad0 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 0fcffdb64d21e57f0778f342415754156e01adfa (#9828)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:46 +02:00
LocalAI [bot]
53de474ef5 chore: ⬆️ Update antirez/ds4 to 04b6fda2be395094cbf2d20d921e7a705a4166ef (#9830)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:09 +02:00
LocalAI [bot]
6e1dbae256 feat(llama-cpp): expose 12 missing common_params via options[] (#9814)
The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.

Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).
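
A sketch of the parsing patterns named above, assuming helper names that do not appear in grpc-server.cpp (the real file inlines this logic in its if/else chain). Whether the real bool check is case-sensitive is not stated in this message; the sketch assumes it is.

```cpp
#include <exception>
#include <string>
#include <thread>

// The true/1/yes/on bool form; anything else counts as false.
bool parse_bool(const std::string & v) {
    return v == "true" || v == "1" || v == "yes" || v == "on";
}

// Numeric parse wrapped in try/catch so a bad value cannot abort loading.
int parse_int_or(const std::string & v, int fallback) {
    try {
        return std::stoi(v);
    } catch (const std::exception &) {
        return fallback;
    }
}

// Thread counts fall back to hardware_concurrency(), which may itself
// report 0 when the value cannot be determined, hence the final guard.
int parse_thread_count(const std::string & v) {
    const int n = parse_int_or(v, 0);
    if (n > 0) {
        return n;
    }
    const unsigned hw = std::thread::hardware_concurrency();
    return hw > 0 ? static_cast<int>(hw) : 1;
}
```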

Top-level / batching / IO:
  - n_ubatch (alias ubatch) -- physical batch size; was previously
    force-aliased to n_batch at line 482, blocking embedding/rerank
    workloads that need independent control
  - threads_batch (alias n_threads_batch) -- main-model batch threads;
    mirrors the existing draft_threads_batch
  - direct_io (alias use_direct_io) -- O_DIRECT model loads
  - verbosity -- llama.cpp log threshold (line 479 had this commented
    out)
  - override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
    overrides for the main model; mirrors draft_override_tensor

Embedding / multimodal:
  - pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
    only auto-flipped to RANK for rerankers
  - embd_normalize (alias embedding_normalize) -- embedding
    normalization mode; the embedding handler now reads
    params_base.embd_normalize instead of the literal 2 previously
    hardcoded in Embedding()
  - mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
  - image_min_tokens / image_max_tokens -- per-image vision token budget

Reasoning surface (the audit-focus three; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
  - reasoning_format -- none/auto/deepseek/deepseek-legacy parser
  - enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
  - prefill_assistant -- trailing-assistant-message prefill toggle

All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.

Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:34 +02:00
LocalAI [bot]
53bdb18d10 chore: ⬆️ Update ggml-org/llama.cpp to 7f3f843c31cd32dc4adc10b393342dfee071c332 (#9809)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): adapt to upstream COMMON_SPECULATIVE_TYPE_DRAFT rename

ggml-org/llama.cpp#22964 ("spec: update CLI arguments for better
consistency") renamed the speculative type enum values:
  COMMON_SPECULATIVE_TYPE_DRAFT  -> COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE
  COMMON_SPECULATIVE_TYPE_EAGLE3 -> COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3
and the registered name strings flipped from underscore- to dash-
separated form (e.g. ngram_simple -> ngram-simple), with the bare
draft/eagle3 aliases replaced by draft-simple/draft-eagle3.

This broke the build with the new LLAMA_VERSION on every variant
(vulkan/arm64, darwin and likely all the rest) at grpc-server.cpp:461.

Update the upstream branch of the speculative-type fallback to use the
new identifier (the LOCALAI_LEGACY_LLAMA_CPP_SPEC fork branch keeps the
old name), and normalize spec_type option tokens before passing them to
common_speculative_types_from_names so existing model configs that say
spec_type:draft / spec_type:ngram_simple keep working.
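
The token normalization can be sketched as below. The function name is hypothetical and the real implementation may differ; the mapping itself follows the renames listed above.

```cpp
#include <algorithm>
#include <string>

// Maps a legacy spec_type token to the post-#22964 canonical name.
std::string normalize_spec_type(std::string name) {
    // Underscore-separated names became dash-separated upstream.
    std::replace(name.begin(), name.end(), '_', '-');
    // The bare aliases were replaced by prefixed names.
    if (name == "draft") {
        return "draft-simple";
    }
    if (name == "eagle3") {
        return "draft-eagle3";
    }
    return name;
}
```

Names that are already canonical pass through unchanged, so the step is safe to apply to every token before calling common_speculative_types_from_names.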

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:23 +02:00
LocalAI [bot]
ec49995190 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 949bb8f1d660fc1264c137a6f3dbd619375f6134 (#9807)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:15:32 +02:00
LocalAI [bot]
4430fae779 chore: ⬆️ Update antirez/ds4 to 0cba357ca1bc0e7510421cc26888e420ea942123 (#9806)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:14:23 +02:00
LocalAI [bot]
ddbbdf45b9 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403 (#9740)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:58:33 +02:00
LocalAI [bot]
a645c1f4aa chore: ⬆️ Update ggml-org/llama.cpp to a9883db8ee021cf16783016a60996d41820b5195 (#9796)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:40:31 +02:00
LocalAI [bot]
957619af53 chore: ⬆️ Update ikawrakow/ik_llama.cpp to f9a93c37e2fc021760c3c1aa99cf74c73b7591a7 (#9795)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:48 +02:00
LocalAI [bot]
0b81e36504 chore: ⬆️ Update antirez/ds4 to f8b4ed635d559b3a5b44bf2df6a77e21b3e9178f (#9794)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:09 +02:00
LocalAI [bot]
bc4cd3dd85 feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.

Adapt the grpc-server wrapper accordingly:

  * `common_params_speculative::type` (single enum) became `types`
    (`std::vector<common_speculative_type>`). Update both the
    "default to draft when a draft model is set" branch and the
    `spec_type`/`speculative_type` option parser. The parser now also
    tolerates comma-separated lists, mirroring the upstream
    `common_speculative_types_from_names` semantics.
  * `common_params_speculative_draft::n_ctx` is gone (draft now shares
    the target context size). Keep the `draft_ctx_size` option name for
    backward compatibility and ignore the value rather than failing.
  * `server_context_impl::model` was renamed to `model_tgt`; update the
    two reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)
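
A sketch of splitting the `<tensor regex>=<buffer type>` list, under stated assumptions: the function name is hypothetical, malformed items are skipped here (the real parser may reject them instead), and the buffer-type name is kept as a string rather than resolved to a device buffer type.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Splits "pat1=BUF1,pat2=BUF2" into (pattern, buffer type name) pairs.
std::vector<std::pair<std::string, std::string>>
split_overrides(const std::string & value) {
    std::vector<std::pair<std::string, std::string>> out;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        const auto eq = item.find('=');
        // Skip items with no '=', an empty pattern, or an empty buffer type.
        if (eq == std::string::npos || eq == 0 || eq + 1 == item.size()) {
            continue;
        }
        out.emplace_back(item.substr(0, eq), item.substr(eq + 1));
    }
    return out;
}
```

Because the list is comma-separated, a regex that itself contains a comma cannot be expressed in this form.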

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:

  * `ctx_server.impl->model_tgt` (fork still has `model`)
  * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
    (none of these sub-structs exist in the fork)
  * `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
    tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
  * `params.speculative.types` vector / `common_speculative_types_from_names`
    (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
   discriminations (the "default to draft when a draft model is set" branch
   and the `spec_type` / `speculative_type` option parser) fall back to the
   singular scalar form, and the entire new-option block (ngram_mod / map_k
   / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
   tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
   in the source tree — stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

   Both patches are idempotent and follow the existing sed/awk pattern in
   this script (KV cache types, `get_media_marker`, flat speculative
   renames). Stock llama-cpp's `grpc-server.cpp` is never touched.
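
The feature switch from step 1 can be pictured with the sketch below.
Only the macro name comes from this commit; `spec_type`, `spec_params`
and `default_to_draft` are hypothetical stand-ins for the real
llama.cpp structures, reduced to the scalar-versus-vector difference.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the speculative decoding type enum.
enum class spec_type { none, draft, ngram_mod };

// Hypothetical stand-in for the speculative params struct.
struct spec_params {
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    spec_type type = spec_type::none;   // legacy fork: single scalar
#else
    std::vector<spec_type> types;       // modern API: list of types
#endif
    std::string draft_model_path;
};

// "Default to draft when a draft model is set", in both layouts.
inline void default_to_draft(spec_params & p) {
    if (p.draft_model_path.empty()) {
        return;  // no draft model configured, nothing to default
    }
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    if (p.type == spec_type::none) {
        p.type = spec_type::draft;
    }
#else
    if (p.types.empty()) {
        p.types.push_back(spec_type::draft);
    }
#endif
}
```

Because the macro is only injected into the fork's build copy, a stock
build compiles the `#else` branches and never sees the scalar field.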

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.
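
What "idempotent" means for the define-injection patch can be sketched
as follows. The real step is a sed/awk edit in patch-grpc-server.sh;
this C++ function (`inject_legacy_define`, a hypothetical name) only
mirrors the intended behaviour, and it assumes the first `#include`
found in the text is a real directive rather than part of a comment.

```cpp
#include <string>

// Hypothetical illustration: insert the legacy define before the first
// "#include", unless the define is already present.
std::string inject_legacy_define(const std::string & src) {
    const std::string marker = "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC";
    if (src.find(marker) != std::string::npos) {
        return src;  // already patched: second run is a no-op
    }
    const std::string::size_type pos = src.find("#include");
    if (pos == std::string::npos) {
        return src;  // nothing to anchor on, leave the text untouched
    }
    std::string out = src;
    out.insert(pos, marker + " 1\n");
    return out;
}
```

Applying the function twice yields the same text as applying it once,
which is the property that lets a patch script be re-run safely on an
existing build copy.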

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain, where the `} else if` opening of
each new block also closes the previous block's brace. With the macro
defined the new blocks vanish, so draft_ctx_size's `{` loses its closer
and the for-loop's `}` is consumed in its place. Every later brace then
closes one level too shallow and the file ends with one brace still
open, which clang reports as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                  // legacy: chain ends here
#else
    } else if (... "spec_ngram_mod_n_min") {  // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                  // closes last branch
#endif
    }                                  // closes for-loop

Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).
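
A small sketch of an equivalent balance check, for illustration only.
`braces_balanced` is a hypothetical helper, not part of the build; it
also counts braces inside string literals and comments, so it is only
meaningful on text where those do not contain stray braces.

```cpp
#include <string>

// Returns true when every '{' has a matching '}' and the nesting depth
// never drops below zero. Braces in literals/comments are counted too.
bool braces_balanced(const std::string & src) {
    long depth = 0;
    for (char c : src) {
        if (c == '{') {
            ++depth;
        } else if (c == '}') {
            if (--depth < 0) {
                return false;  // closer with no matching opener
            }
        }
    }
    return depth == 0;
}
```

Run against the preprocessed output of each branch, a check like this
would flag the earlier legacy layout, where the for-loop's opener was
left without a closer.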

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

  Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
  list of gfx targets e.g. gfx1100,gfx1101.  Stop.
  make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the Docker layer cache stays warm
across builds, so the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache; the missing env var
then becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:37 +02:00
LocalAI [bot]
78722caedc chore: ⬆️ Update ikawrakow/ik_llama.cpp to eb570eb96689c235933b813693ca28ab9d3d26de (#9764)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-12 00:02:22 +02:00