The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.
Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).
Top-level / batching / IO:
- n_ubatch (alias ubatch) -- physical batch size; was previously
force-aliased to n_batch at line 482, blocking embedding/rerank
workloads that need independent control
- threads_batch (alias n_threads_batch) -- main-model batch threads;
mirrors the existing draft_threads_batch
- direct_io (alias use_direct_io) -- O_DIRECT model loads
- verbosity -- llama.cpp log threshold (line 479 had this commented
out)
- override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
overrides for the main model; mirrors draft_override_tensor
Embedding / multimodal:
- pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
only auto-flipped to RANK for rerankers
- embd_normalize (alias embedding_normalize) -- and the embedding
handler now reads params_base.embd_normalize instead of a hardcoded
2 at the previous embd_normalize literal in Embedding()
- mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
- image_min_tokens / image_max_tokens -- per-image vision token budget
Reasoning surface (the audit-focus three; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
- reasoning_format -- none/auto/deepseek/deepseek-legacy parser
- enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
- prefill_assistant -- trailing-assistant-message prefill toggle
All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.
Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update ggml-org/llama.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(llama-cpp): adapt to upstream COMMON_SPECULATIVE_TYPE_DRAFT rename
ggml-org/llama.cpp#22964 ("spec: update CLI arguments for better
consistency") renamed the speculative type enum values:
COMMON_SPECULATIVE_TYPE_DRAFT -> COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE
COMMON_SPECULATIVE_TYPE_EAGLE3 -> COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3
and the registered name strings flipped from underscore- to dash-
separated form (e.g. ngram_simple -> ngram-simple), with the bare
draft/eagle3 aliases replaced by draft-simple/draft-eagle3.
This broke the build with the new LLAMA_VERSION on every variant
(vulkan/arm64, darwin and likely all the rest) at grpc-server.cpp:461.
Update the upstream branch of the speculative-type fallback to use the
new identifier (the LOCALAI_LEGACY_LLAMA_CPP_SPEC fork branch keeps the
old name), and normalize spec_type option tokens before passing them to
common_speculative_types_from_names so existing model configs that say
spec_type:draft / spec_type:ngram_simple keep working.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994
Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.
Adapt the grpc-server wrapper accordingly:
* `common_params_speculative::type` (single enum) became `types`
(`std::vector<common_speculative_type>`). Update both the
"default to draft when a draft model is set" branch and the
`spec_type`/`speculative_type` option parser. The parser now also
tolerates comma-separated lists, mirroring the upstream
`common_speculative_types_from_names` semantics.
* `common_params_speculative_draft::n_ctx` is gone (draft now shares
the target context size). Keep the `draft_ctx_size` option name for
backward compatibility and ignore the value rather than failing.
* `server_context_impl::model` was renamed to `model_tgt`; update the
two reranker / model-metadata call sites.
Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): expose new speculative-decoding option keys
Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.
New `options:` keys (all under `backend: llama-cpp`):
ngram_mod (`ngram_mod` type):
spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match
ngram_map_k (`ngram_map_k` type):
spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits
ngram_map_k4v (`ngram_map_k4v` type):
spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
spec_ngram_map_k4v_min_hits
ngram lookup caches (`ngram_cache` type):
spec_lookup_cache_static / lookup_cache_static
spec_lookup_cache_dynamic / lookup_cache_dynamic
Draft-model tuning (active when `spec_type` is `draft`):
draft_cache_type_k / spec_draft_cache_type_k
draft_cache_type_v / spec_draft_cache_type_v
draft_threads / spec_draft_threads
draft_threads_batch / spec_draft_threads_batch
draft_cpu_moe / spec_draft_cpu_moe (bool flag)
draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU)
draft_override_tensor / spec_draft_override_tensor
(comma-separated <tensor regex>=<buffer type>; re-implements upstream's
static parse_tensor_buffer_overrides since it isn't exported)
`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.
Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.
Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout
The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:
* `ctx_server.impl->model_tgt` (fork still has `model`)
* `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
(none of these sub-structs exist in the fork)
* `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
* `params.speculative.types` vector / `common_speculative_types_from_names`
(fork has a scalar `type` and only the singular helper)
Approach:
1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
`LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
discriminations (the "default to draft when a draft model is set" branch
and the `spec_type` / `speculative_type` option parser) fall back to the
singular scalar form, and the entire new-option block (ngram_mod / map_k
/ map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
in the source tree — stock llama-cpp builds get the full new API.
2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
- substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
- inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
`#include`, so the guarded blocks above drop out for the fork build.
Both patches are idempotent and follow the existing sed/awk pattern in
this script (KV cache types, `get_media_marker`, flat speculative
renames). Stock llama-cpp's `grpc-server.cpp` is never touched.
Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(turboquant): close draft_ctx_size brace inside legacy guard
The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.
Move the chain split inside the draft_ctx_size branch:
} else if (... "draft_ctx_size") {
// ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
} // legacy: chain ends here
#else
} else if (... "spec_ngram_mod_n_min") { // modern: chain continues
...
} else if (... "draft_override_tensor") {
...
} // closes last branch
#endif
} // closes for-loop
Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).
Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt
Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.
backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:
Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
list of gfx targets e.g. gfx1100,gfx1101. Stop.
make: *** [Makefile:66: turboquant-fallback] Error 2
The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.
Mirror the existing pattern from Dockerfile.llama-cpp.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Bumps backend/cpp/llama-cpp/Makefile LLAMA_VERSION from 665abc6 to
d775992, picking up upstream PR ggml-org/llama.cpp#22397 which splits
common_params_speculative into nested draft / ngram_simple / ngram_mod
sub-structs. Renames every grpc-server.cpp reference to match:
speculative.mparams_dft.path -> speculative.draft.mparams.path
speculative.{n_max,n_min} -> speculative.draft.{n_max,n_min}
speculative.{p_min,p_split} -> speculative.draft.{p_min,p_split}
speculative.{n_gpu_layers,n_ctx} -> speculative.draft.{n_gpu_layers,n_ctx}
speculative.ngram_size_n -> speculative.ngram_simple.size_n
speculative.ngram_size_m -> speculative.ngram_simple.size_m
speculative.ngram_min_hits -> speculative.ngram_simple.min_hits
The "speculative.n_max" JSON key sent to the upstream server stays
unchanged — server-task.cpp still reads it and routes the value into
draft.n_max internally.
The turboquant fork (TheTom/llama-cpp-turboquant @ 11a241d) branched
before #22397 and still exposes the flat layout. Since turboquant
reuses the shared backend/cpp/llama-cpp/grpc-server.cpp, extend
patch-grpc-server.sh with an idempotent sed block that reverts the
ten field references back to the legacy flat names on the build copy
only — the original under backend/cpp/llama-cpp/ stays compiling
against vanilla upstream. Drop the block once the fork rebases.
ik-llama-cpp has its own grpc-server.cpp with no speculative refs
(0/2661 lines), so it is unaffected.
Validated locally with `make docker-build-llama-cpp` (avx, avx2,
avx512, fallback, grpc + rpc-server all built; image exported).
Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Adds split_mode (alias sm) to the llama.cpp backend options allowlist,
accepting none|layer|row|tensor. The tensor value targets the experimental
backend-agnostic tensor parallelism from ggml-org/llama.cpp#19378 and
requires a llama.cpp build that includes that PR, FlashAttention enabled,
KV-cache quantization disabled, and a manually set context size.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(llama-cpp): include server-chat.cpp in grpc-server translation unit
Upstream llama.cpp refactor (ggml-org/llama.cpp#20690) moved the
OAI/Anthropic/Responses and transcription conversion helpers out of
server-common.cpp into a new server-chat.cpp, and server-task.cpp and
server-context.cpp now call those symbols (convert_transcriptions_to_chatcmpl,
server_chat_convert_responses_to_chatcmpl, server_chat_convert_anthropic_to_oai,
server_chat_msg_diff_to_json_oaicompat) via server-chat.h.
grpc-server.cpp builds as a single translation unit by #include-ing the
upstream .cpp files directly. Without including server-chat.cpp, the
declarations are satisfied at compile time via server-chat.h but the
link step fails with undefined references once LLAMA_VERSION crosses
the refactor commit (134d6e54).
Guard the include with __has_include so the same source stays buildable
on older LLAMA_VERSION pins that predate the refactor (where prepare.sh
won't copy server-chat.cpp into tools/grpc-server/).
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(llama-cpp): bump LLAMA_VERSION to 0d0764dfd
Bump to ggml-org/llama.cpp@0d0764dfd2.
Paired with the preceding grpc-server server-chat.cpp include so the
refactor at 134d6e54 links cleanly. Supersedes PR #9494.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Upstream llama.cpp (PR #21962) switched the server-side mtmd media
marker to a random per-server string and removed the legacy
"<__media__>" backward-compat replacement in mtmd_tokenizer. The
Go layer still emitted the hardcoded "<__media__>", so on the
non-tokenizer-template path the prompt arrived with a marker mtmd
did not recognize and tokenization failed with "number of bitmaps
(1) does not match number of markers (0)".
Report the active media marker via ModelMetadataResponse.media_marker
and substitute the sentinel "<__media__>" with it right before the
gRPC call, after the backend has been loaded and probed. Also skip
the Go-side multimodal templating entirely when UseTokenizerTemplate
is true — llama.cpp's oaicompat_chat_params_parse already injects its
own marker and StringContent is unused in that path. Backends that do
not expose the field keep the legacy "<__media__>" behavior.
When TASK_RESPONSE_TYPE_OAI_CHAT is used, the first streaming token
produces a JSON array with two elements: a role-init chunk and the
actual content chunk. The grpc-server loop called attach_chat_deltas
for both elements with the same raw_result pointer, stamping the first
token's ChatDelta.Content on both replies. The Go side accumulated both,
emitting the first content token twice to SSE clients.
Fix: in the array iteration loops in PredictStream, detect role-init
elements (delta has "role" key) and skip attach_chat_deltas for them.
Only content/reasoning elements get chat deltas attached.
Reasoning models are unaffected because their first token goes into
reasoning_content, not content.
* fix(chat): do not retry if we had chatdeltas or tooldeltas from backend
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: use oai compat for llama.cpp
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: apply to non-streaming path too
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* map also other fields
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The C++ PEG parser needs a few tokens to identify the reasoning format
(e.g. "<|channel>thought\n" for Gemma 4). During this warm-up, the gRPC
layer was sending raw partial tag tokens to Go, which leaked into the
reasoning field.
- Clear reply.message in gRPC when autoparser is active but has no diffs
yet, matching llama.cpp server behavior of only emitting classified output
- Prefer C++ autoparser chat deltas for reasoning/content in all streaming
paths, falling back to Go-side extraction for backends without autoparser
(e.g. vLLM)
- Override non-streaming no-tools result with chat delta content when available
- Guard PrependThinkingTokenIfNeeded against partial tag prefixes during
streaming accumulation
- Reorder default thinking tokens so <|channel>thought is checked before
<|think|> (Gemma 4 templates contain both)
* feat: add distributed mode (experimental)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix data races, mutexes, transactions
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix events and tool stream in agent chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* use ginkgo
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(cron): compute correctly time boundaries avoiding re-triggering
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* enhancements, refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* do not flood of healthy checks
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* do not list obvious backends as text backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* tests fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop redundant healthcheck
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* enhancements, refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: wire min_p
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: inferencing defaults
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(refactor): re-use iterative parser
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: generate automatically inference defaults from unsloth
Instead of trying to re-invent the wheel and maintain here the inference
defaults, prefer to consume unsloth ones, and contribute there as
necessary.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: apply defaults also to models installed via gallery
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: be consistent and apply fallback to all endpoint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(functions): add peg-based parsing
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: support returning toolcalls directly from backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: do run PEG only if backend didn't send deltas
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(deps): Bump llama.cpp to '5b6c9bc0f3c8f55598b9999b65aff7ce4119bc15' and refactor usage of base params
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: update AGENTS.md
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(7355): Update llama-cpp grpc for v3 interface
Signed-off-by: Simon Redman <simon@ergotech.com>
* feat(llama-gprc): Trim whitespace from servers list
Signed-off-by: Simon Redman <simon@ergotech.com>
* Trim trailing spaces in grpc-server.cpp
Signed-off-by: Simon Redman <simon@ergotech.com>
---------
Signed-off-by: Simon Redman <simon@ergotech.com>
* feat: add support to logprobs in results
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: add support to logitbias
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: initial hook to install elements directly
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP: ui changes
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Move HF api client to pkg
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add simple importer for gguf files
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add opcache
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* wire importers to CLI
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add omitempty to config fields
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add MLX importer
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Small refactors to star to use HF for discovery
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Common preferences
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add support to bare HF repos
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(importer/llama.cpp): add support for mmproj files
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* add mmproj quants to common preferences
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix vlm usage in tokenizer mode with llama.cpp
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: respect context
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* workaround fasthttp
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): allow to abort call
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Refactor
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: improving error
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Respect context also with MCP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Tie to both contexts
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Make detection more robust
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama.cpp): expose env vars as options for consistency
This allows to configure everything in the YAML file of the model rather
than have global configurations
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama.cpp): respect usetokenizertemplate and use llama.cpp templating system to process messages
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Detect template exists if use tokenizer template is enabled
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Better recognization of chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixes to support tool calls while using templates from tokenizer
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop template guessing, fix passing tools to tokenizer
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Extract grammar and other options from chat template, add schema struct
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Automatically set use_jinja
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Cleanups, identify by default gguf models for chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Update docs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: properly terminate kv_overrides array with empty key
The llama model loading function expects KV overrides to be terminated
with an empty key (key[0] == 0). Previously, the kv_overrides vector was
not being properly terminated, causing an assertion failure.
This commit ensures that after parsing all KV override strings, we add a
final terminating entry with an empty key to satisfy the C-style array
termination requirement. This fixes the assertion error and allows the
model to load correctly with custom KV overrides.
Fixes#6643
- Also included a reference to the usage of the `overrides` option in
the advanced-usage section.
Signed-off-by: blob42 <contact@blob42.xyz>
* doc: document the `overrides` option
---------
Signed-off-by: blob42 <contact@blob42.xyz>
* fix(llama.cpp): correctly set grammar triggers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Do not enable lazy by default
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>