31 Commits

Author SHA1 Message Date
LocalAI [bot]
621c612b2d ci(bump-deps): register ds4 + move version pin into the Makefile (#9761)
* ci(bump-deps): register ds4 + move version pin into the Makefile

The initial ds4 PR (#9758) put the upstream commit pin in
backend/cpp/ds4/prepare.sh as a shell variable. The auto-bump bot at
.github/bump_deps.sh greps for ^$VAR?= in a Makefile, so DS4_VERSION
was invisible to it - other backends (llama-cpp, ik-llama-cpp,
turboquant, voxtral, etc.) all pin in their Makefile.

This change:

- Moves DS4_VERSION?= and DS4_REPO?= to the top of
  backend/cpp/ds4/Makefile.
- Inlines the git init/fetch/checkout recipe into the 'ds4:' target
  (matches llama-cpp's 'llama.cpp:' target pattern). Directory acts
  as the target so make only re-clones when missing.
- Deletes the now-redundant prepare.sh.
- Adds antirez/ds4 + DS4_VERSION + main + backend/cpp/ds4/Makefile to
  the .github/workflows/bump_deps.yaml matrix so the daily bot opens
  PRs against this pin.
- Updates .agents/ds4-backend.md to point at the Makefile.

Verified:
  $ grep -m1 '^DS4_VERSION?=' backend/cpp/ds4/Makefile
  DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
  $ make -C backend/cpp/ds4 ds4   # clones into ds4/ at the pin
  $ make -C backend/cpp/ds4 ds4   # no-op on second invocation
  make: 'ds4' is up to date.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: route backend/cpp/ds4/ changes through changed-backends.js

scripts/changed-backends.js:inferBackendPath has an explicit branch per
cpp dockerfile suffix (ik-llama-cpp, turboquant, llama-cpp). Without a
matching branch the function returns null, the backend never lands in
the path map, and PR change-detection cannot map "backend/cpp/ds4/X
changed" -> "rebuild ds4 image".

This is why PR #9761 produced zero ds4 jobs even though it directly
edits backend/cpp/ds4/Makefile.

Adds the missing branch (Dockerfile.ds4 -> backend/cpp/ds4/), placed
before the llama-cpp branch (since both share the .cpp ancestry but
ds4 is more specific - same ordering rule documented in
.agents/adding-backends.md).

Verified with a local Node simulation of the script against this PR's
diff: the path map now contains 'ds4 -> backend/cpp/ds4/' and a
'backend/cpp/ds4/Makefile' change correctly triggers the ds4 backend
in the rebuild set.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(adding-backends): harden the two gotchas that bit ds4

Both omissions are silent at the time you ADD a backend - the failure
mode only appears later (the bump bot stays silent forever, or the path
filter shows up on the next PR that touches your backend with zero CI
jobs and looks broken for unrelated reasons). Expanding the
`scripts/changed-backends.js` paragraph from a one-liner to a fully
worked example, and adding a new sibling paragraph for the
`bump_deps.yaml` + Makefile-pin contract.

Both call out the specific mistakes from the ds4 timeline (#9758#9761) so future contributors can pattern-match on the cause.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:46:02 +02:00
LocalAI [bot]
d892e4af80 feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758)
* test(e2e-backends): allow BACKEND_BINARY for native-built backends

Adds an escape hatch for hardware-gated backends (e.g. ds4) where the
model is too large for Docker build context. When BACKEND_BINARY points
at a run.sh produced by 'make -C backend/cpp/<name> package', the suite
skips docker image extraction and drives the binary directly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(e2e-backends): validate BACKEND_BINARY basename + log actual source

Two follow-ups from the cbcf5148 code review:

- BACKEND_BINARY now requires a path whose basename is `run.sh`. Without
  this check, `filepath.Dir(binary)` silently discarded the filename, so
  pointing the env var at an arbitrary binary failed later with a
  confusing assertion that named a path the user never typed.
- The "Testing image=..." debug line printed an empty string when the
  binary path was used, hiding the actual source in CI logs. The line
  now reports whichever of BACKEND_IMAGE / BACKEND_BINARY is in effect
  as `src=...`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): scaffold ds4 backend dir

Adds prepare.sh, run.sh, and a .gitignore. CMakeLists, Makefile, and the
implementation arrive in follow-up commits.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add backend Makefile

Drives ds4's upstream Makefile to produce engine .o files (CUDA on Linux
when BUILD_TYPE=cublas, Metal on Darwin, otherwise CPU debug path), then
invokes CMake on our wrapper.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add CMakeLists for grpc-server

Generates protoc stubs from backend.proto, links grpc-server.cpp +
dsml_parser.cpp + dsml_renderer.cpp + kv_cache.cpp against pre-built
ds4 engine .o files. DS4_GPU=cuda|metal|cpu selects the backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): grpc-server skeleton + module stubs

The minimum that links: Backend service with Health + Free; other RPCs
default to UNIMPLEMENTED. Stub headers/sources for dsml_parser,
dsml_renderer, and kv_cache are in place so CMake links cleanly even
before those modules ship.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement LoadModel

Opens engine + creates session sized to ContextSize (default 32768).
Backend is compile-time: CPU when DS4_NO_GPU, Metal on __APPLE__, else
CUDA. MTP/speculative options are accepted via ModelOptions.Options[]
(mtp_path, mtp_draft, mtp_margin). kv_cache_dir option is captured into
g_kv_cache_dir for the cache module (Task 19 wires it in).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement TokenizeString

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Predict (plain text)

Tool calls + thinking-mode split arrive in Task 13 once dsml_parser is in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement PredictStream (plain text)

ChatDelta + reasoning/tool_calls split arrives in Task 14.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Status RPC

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add DSML streaming parser

Classifies raw model-emitted token text into CONTENT / REASONING /
TOOL_START / TOOL_ARGS / TOOL_END events. Markers it watches for are the
literal DSML strings rendered by ds4_server.c's prompt template
(<|DSML|tool_calls>, <|DSML|invoke name=...>, <think>, etc.) - these are
plain text the model emits, not special tokens.

Partial markers split across token chunks are buffered until a full marker
or a definitively-not-a-marker '<' is observed. RandomToolId() generates
the API-side tool call id (call_xxx) that exact-replay would key on.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): split hex escapes in DSML markers + add cstring/cstdio includes

C++ \x hex escapes have no length cap. '\x9cD' was read as a single escape
producing byte 0xCD, eating the 'D'. The markers were never actually matching
the DSML text the model emits. Split each escape with adjacent string literal
concatenation so the byte sequence is exactly EF BD 9C 44 (|D) at runtime.

Also adds <cstring> and <cstdio> includes (libstdc++ 13 does not transitively
expose std::strlen / std::snprintf via <string>).

The local plan file (uncommitted) was also updated with the same fixes so
Task 16's dsml_renderer.cpp does not re-introduce the bug.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into Predict (ChatDelta)

Non-streaming Predict now emits one ChatDelta carrying content,
reasoning_content, and tool_calls[] parsed from the model's DSML output.
Reply.message still carries the raw model bytes for backends that prefer
the regex fallback path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into PredictStream

Per-token ChatDelta writes: content/reasoning_content go incrementally,
tool_calls emit TOOL_START as one delta (id + name) followed by
TOOL_ARGS deltas with incremental JSON. The Go-side aggregator
(pkg/functions/chat_deltas.go) reassembles them.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): chat template + reasoning_effort mapping

UseTokenizerTemplate=true + Messages -> ds4_chat_begin / append /
assistant_prefix. PredictOptions.Metadata['enable_thinking'] and
['reasoning_effort'] map to ds4_think_mode (DS4_THINK_HIGH default;
'max'/'xhigh' -> DS4_THINK_MAX; disabled -> DS4_THINK_NONE).

Tool-call rendering for assistant turns with tool_calls JSON arrives in
the next commit (dsml_renderer).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): render assistant tool_calls + tool results to DSML

Closes the round-trip: when an OpenAI client sends a multi-turn chat
where prior turns contain tool_calls or role=tool messages, build_prompt
serializes them back to the DSML shape the model was trained on. Mirrors
ds4_server.c's prompt renderer; uses nlohmann::json for parsing the
OpenAI tool_calls payload.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): disk KV cache module

Dir-based cache keyed by SHA1(rendered prompt prefix). File format:
'DS4G' magic + version + ctx_size + prefix_len + prefix + payload_bytes
+ ds4_session_save_payload output. NOT bit-compatible with ds4-server's
KVC files - that interop is a follow-up plan. LoadLongestPrefix walks
the dir picking the longest stored prefix that prefixes the incoming
prompt.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire KvCache into Predict/PredictStream

LoadModel reads 'kv_cache_dir' from ModelOptions.Options[], passes it to
g_kv_cache.SetDir. Each Predict/PredictStream computes a render text for
the request, tries LoadLongestPrefix to recover state, then Saves the
new state after generation. ds4_session_sync handles the live-cache
fast path internally, so the disk cache only matters for cold-starts
and cross-session reuse.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add package.sh

Linux: bundles libc + ld + libstdc++ + libgomp + GPU runtime libs into
package/lib so the FROM scratch image boots without a host libc.
Darwin is handled by scripts/build/ds4-darwin.sh which uses otool -L.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): rename namespace ds4_backend -> ds4cpp

ds4.h defines 'typedef enum {...} ds4_backend' which collides with our
C++ 'namespace ds4_backend' anywhere a TU includes both. kv_cache.h
includes ds4.h directly and surfaces the conflict immediately; other
TUs would hit it once gRPC dev headers are available.

Renames the C++ namespace to ds4cpp across all wrapper files and the
plan, leaving the upstream ds4 typedef untouched.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): add Dockerfile.ds4

Single-stage builder (CUDA devel image for cublas, ubuntu:24.04 for cpu)
-> FROM scratch with packaged grpc-server + bundled runtime libs.
nlohmann-json3-dev is required for dsml_renderer's JSON handling.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(make): wire backend/cpp/ds4 + ds4-darwin into root Makefile

BACKEND_DS4 entry + generate-docker-build-target eval + docker-build-ds4
in docker-build-backends + .NOTPARALLEL guards. Also adds the
backends/ds4-darwin target which delegates to scripts/build/ds4-darwin.sh
(landed in Task 24).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend-matrix entries for ds4 (cpu + cuda13, per-arch)

Two entries per build (amd64 + arm64) so backend-merge-jobs assembles a
multi-arch manifest. Skipping cuda12 - ds4 was validated against CUDA 13.
Darwin Metal is handled outside this matrix by backend_build_darwin.yml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add ds4 meta + image entries

cpu + cuda13 x latest + master. Darwin Metal builds publish under
ds4-darwin via the existing llama-cpp-darwin OCI pipeline.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(scripts/build): add ds4-darwin.sh

Native macOS/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh:
make grpc-server -> otool -L for dylib bundling -> OCI tar that
'local-ai backends install' consumes via the backends/ds4-darwin
Makefile target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(darwin): build ds4-darwin in backend_build_darwin

Adds a 'Build ds4 backend (Darwin Metal)' step that runs the
backends/ds4-darwin Makefile target on the macOS runner.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(import): auto-detect ds4 weights via DS4Importer

Adds core/gallery/importers/ds4.go which matches on the antirez/deepseek-v4-gguf
repo URI and the DeepSeek-V4-Flash-*.gguf filename pattern. Registered before
LlamaCPPImporter so ds4 weights route to backend: ds4 instead of falling
through to llama-cpp.

Also lists ds4 in /backends/known so the /import-model UI surfaces it as a
manual choice for users who want to force the backend on a non-canonical URI.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add deepseek-v4-flash-q2 (ds4 backend)

One-click install of the q2 weights with backend: ds4.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(.agents): add ds4-backend.md

Documents the backend shape, DSML state machine, thinking-mode mapping,
disk KV cache, build matrix (cpu/cuda13/Darwin), and the BACKEND_BINARY
hardware-validation path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): pass UBUNTU_VERSION + arch env vars to install-base-deps

The .docker/install-base-deps.sh script needs UBUNTU_VERSION (defaults to
2404), TARGETARCH, SKIP_DRIVERS, and APT_MIRROR/APT_PORTS_MIRROR exported
into the environment so it can pick the right cuda-keyring / cudss / nvpl
debs and apt mirrors. Dockerfile.ds4 was declaring some of the ARGs but not
re-exporting them via ENV. Mirrors Dockerfile.llama-cpp's pattern.

Without this fix 'make docker-build-ds4 BUILD_TYPE=cublas CUDA_MAJOR_VERSION=13'
failed at:
  /usr/local/sbin/install-base-deps: line 120: UBUNTU_VERSION: unbound variable

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add Metal image entries for ds4

Adds metal-ds4 + metal-ds4-development image entries pointing at
quay.io/go-skynet/local-ai-backends:{latest,master}-metal-darwin-arm64-ds4
(built by scripts/build/ds4-darwin.sh on macOS arm64 runners), plus the
'metal' and 'metal-darwin-arm64' capability mappings on the ds4 meta and
ds4-development variant.

Closes a gap from the initial Task 23 landing - the Darwin Metal build
script and CI workflow step were already wired (Tasks 24-25), but the
gallery had no image entry for users to install the Metal variant.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): use ubuntu:24.04 base for ds4 cuda13 matrix entries

The initial Task 22 matrix landing used base-image: 'nvidia/cuda:13.0.0-devel-ubuntu24.04'
which clashes with install-base-deps.sh's cuda-keyring step:

  E: Conflicting values set for option Signed-By regarding source
     https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/

The canonical pattern (llama-cpp, ik-llama-cpp, turboquant) uses plain
'ubuntu:24.04' + 'skip-drivers: false' so install-base-deps installs CUDA
from scratch via its own keyring setup. Adopting that here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): drop install-base-deps.sh dependency

The .docker/install-base-deps.sh pipeline is built around the llama-cpp
needs: NVIDIA keyring + cuda-toolkit apt + gRPC-from-source build at
/opt/grpc. For ds4 we don't need any of that:
- CUDA: nvidia/cuda:13.0.0-devel-ubuntu24.04 ships /usr/local/cuda
  ready to go; install-base-deps's keyring step then conflicts with
  the pre-installed Signed-By.
- gRPC: ds4's grpc-server.cpp only links against grpc++; system
  libgrpc++-dev (apt) is sufficient, no source build needed.

Replaced the install-base-deps invocation in Dockerfile.ds4 with a
direct 'apt-get install libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc
nlohmann-json3-dev cmake build-essential pkg-config git'. Matrix entries
back to nvidia/cuda base + skip-drivers=true so install-base-deps would
no-op even if some downstream tooling calls it.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): correct proto accessors + alias grpc::Status as GStatus

Two compile bugs caught by the docker build:

1. proto::Message uses snake_case accessors. The build_prompt loop called
   m.toolcalls() / m.toolcallid() - the protoc-generated names are
   m.tool_calls() / m.tool_call_id(). Plan-text bug propagated to the
   wrapper.

2. The Status RPC method shadowed the 'using grpc::Status' alias, so any
   later method declaration using Status as a return type failed to parse
   ('Status does not name a type' starting at LoadModel). Solution: alias
   grpc::Status as GStatus instead, with no 'using' clause that would
   conflict. All RPC method declarations and return-statement constructions
   now use GStatus.

Pre-existing code reviewer flagged the Status-shadow concern as 'minor'
in the original Task 10 commit; it turned out to be a real compile blocker
under libstdc++ 13 once the surrounding methods were filled in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): preserve TOOL_ARGS content in dsml_parser Flush

When the model emitted a parameter value that arrived in the same buffer
as the surrounding tool_call markers (e.g. the buffered tail after a
literal '</think>' opened the model output), the parser deferred all
buffered bytes to Flush() because looks_like_prefix() always returns
true while buf starts with '<'. Flush() then drained the buffer as
plain CONTENT/REASONING regardless of parser state, so the bytes
between the parameter open and close markers were classified as
CONTENT instead of TOOL_ARGS.

Symptom: the model emitted

  <|DSML|parameter name="location" string="true">Paris, France</|DSML|parameter>

and the assembled tool_call arguments came out as {"location":""} -
the opener and closer were emitted into the args stream but the
"Paris, France" content went to the assistant message instead.

Fix:

1. Flush() now uses the same state-aware emit logic as DrainPlain:
   PARAM_VALUE bytes become TOOL_ARGS (json-escaped when string),
   THINK bytes become REASONING, TEXT bytes become CONTENT, and
   INVOKE / TOOL_CALLS structural whitespace is discarded.

2. looks_like_prefix() restricts its leading-'<' fallback to buffers
   that have not yet seen a '>'. Without that change, char-by-char
   feeds would discard the '<' of '<|DSML|invoke name="..."' once
   the marker prefix length was reached but the closing quote/'>'
   were still in flight.

Verified with a standalone harness that runs the failing input three
ways (single Feed, split-after-'>', and char-by-char) and aggregates
TOOL_ARGS for tool index 0: all three now produce
{"location":"Paris, France"}.

Assisted-by: Claude:opus-4.7 [Read,Edit,Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): use ds4_session_sync + manual generation loop for KV persistence

ds4_engine_generate_argmax() is a self-contained helper that doesn't take or
update a ds4_session - it manages its own internal state. Our Predict and
PredictStream methods created g_session via ds4_session_create() but then
called ds4_engine_generate_argmax(), so g_session's KV state never advanced.
ds4_session_payload_bytes(g_session) returned 0 and the disk KV cache save
correctly rejected with 'session has no valid checkpoint to save'.

Switch both RPCs to the proper session API:
  ds4_session_sync(g_session, &prompt, ...)
  loop:
    int token = ds4_session_argmax(g_session)
    if token == eos: break
    emit(token)
    ds4_session_eval(g_session, token, ...)

After the loop the session has a real checkpoint and ds4_session_save_payload
writes the KV state to disk. Verified end-to-end on a DGX Spark GB10: three
.kv files (15-30 MB each) are written when BACKEND_TEST_OPTIONS sets
kv_cache_dir, and the e2e tool-call assertion still passes.

Also added stderr diagnostics to KvCache (enabled/disabled at SetDir; per-save
path + payload_bytes + result) so future failures are visible instead of
silent. The 'wrote ok' lines are low-volume - one per Predict/PredictStream
when the cache is enabled - and skipped entirely when the option is unset.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): use ds4_session_eval_speculative_argmax when MTP loaded

Wires MTP (Multi-Token Prediction) speculative decoding into the manual
generation loop in both Predict and PredictStream. When the upstream MTP
weights are loaded via 'mtp_path:' option AND we're on CUDA / Metal,
ds4_engine_mtp_draft_tokens() returns >0 and we switch the inner loop to
ds4_session_eval_speculative_argmax(), which can accept N>1 tokens per
verifier step. When MTP is not loaded (no option, CPU backend, or weights
absent), we fall through to the simple ds4_session_argmax + ds4_session_eval
path with no behavior change.

Validated on a DGX Spark GB10 with the optional MTP GGUF
(DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, ~3.6 GB). LoadModel logs
'ds4: MTP support model loaded ... (draft=2)' on stderr.

Caveat per upstream README: 'currently provides at most a slight speedup,
not a meaningful generation-speed win'. Wired now mainly to track the
upstream API; bigger speedups arrive when ds4 improves the speculative path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): honor PredictOptions sampling with DSML-aware override

Mirrors ds4_server.c:7102-7115 sampling-policy semantics on the LocalAI
gRPC side. The generation loop now consults compute_sample_params() per
token to pick the effective (temperature, top_k, top_p, min_p), based on:

  1. Request defaults: PredictOptions.temperature / .topk / .topp / .minp
  2. Thinking-mode override: when enable_thinking != false, force T=1.0,
     top_k=0, top_p=1.0, min_p=0.0 (creativity for the reasoning pass and
     the trailing content)
  3. DSML structural override: when DsmlParser::IsInDsmlStructural()
     returns true (we are between tool-call markers but NOT in a param
     value payload), force T=0.0 so protocol bytes parse cleanly

When the effective temperature is 0, we keep using ds4_session_argmax +
MTP speculative path (matches ds4-server's gate that only enables MTP for
greedy positions). When > 0, we call ds4_session_sample(s, T, ...) with
a per-thread RNG seeded from system_clock and fall back to single-token
ds4_session_eval.

New public method on DsmlParser: IsInDsmlStructural() encodes which states
need protocol-byte determinism. PARAM_VALUE is excluded (payload uses user
sampling); TEXT and THINK are excluded (no tool-call context to protect).

Verified on the DGX Spark GB10: the e2e suite still passes with all 5
specs including tools, and the Predict output now varies between runs
(creative sampling active) while the tool-call args remain a clean
'{"location":"Paris, France"}' because the parser-state check forces
greedy on the structural bytes.

UX note: thinking mode is ON by default (matching ds4-server). Users who
want deterministic output should set Metadata.enable_thinking = false.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add sha256 to deepseek-v4-flash-q2 entry

Per HF LFS metadata for antirez/deepseek-v4-gguf:
  size: 86720111200 bytes (~80.76 GiB)
  sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c

LocalAI's downloader verifies sha256 when present, so users who install
deepseek-v4-flash-q2 from the gallery get integrity-checked weights and
the partial-download issue (an 81 GB file is easy to truncate) becomes
recoverable instead of silently producing a broken backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:15:47 +02:00
Ettore Di Giacinto
ea00199554 ci: tag every backend digest, including singletons
backend_build.yml pushes by canonical digest only (push-by-digest=true,
no tags applied at build time). User-facing tagging happens in
backend_merge.yml's `imagetools create` step. Before this commit,
scripts/changed-backends.js emitted a merge entry only for tag-suffixes
with 2+ legs, so every single-arch backend (CUDA/ROCm/Intel Python
images, vLLM, sglang, transformers, diffusers, ...) pushed its digest
untagged and stayed that way until quay's GC reaped it. Symptom: tag
releases shipped multi-arch backends tagged correctly, but no
v<X>-gpu-nvidia-cuda-12-vllm (or any singleton variant) ever appeared
in the registry.

Changes:

- scripts/changed-backends.js drops the `group.length < 2` skip and
  emits two merge matrices, one per arch class, so each downstream
  merge job can `needs:` only its corresponding build matrix.
- backend.yml splits backend-merge-jobs into multiarch and singlearch
  variants. The split preserves PR #9746's fix: slow singlearch CUDA
  builds (~6h) must not gate multiarch merges, or quay's GC reaps the
  multiarch per-arch digests before they're tagged.
- backend_pr.yml mirrors the split.
- backend_build.yml renames the digest artifact from
  `digests<suffix>-<platform-tag>` to
  `digests<suffix>--<platform-tag-or-"single">`. The `--` separator
  prevents the merge-side glob from over-matching sibling backends
  whose tag-suffix is a prefix of ours (e.g. -cpu-vllm vs
  -cpu-vllm-omni, -cpu-mlx vs -cpu-mlx-audio); the `single` placeholder
  keeps the name well-formed when platform-tag is empty.
- backend_merge.yml updates the download pattern to match.

Verified locally: a tag-push event now expands to 36 multiarch merge
entries (= 72 builds / 2 legs) and 199 singlearch merge entries (one
per singleton, including -gpu-nvidia-cuda-12-vllm at index 24).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 13:22:00 +00:00
LocalAI [bot]
35f6db8c76 ci: split backend-jobs into single-arch and multi-arch matrices (#9746)
Symptom (run 25612992409): backend-merge-jobs failed with
"quay.io/go-skynet/local-ai-backends@sha256:fdbd93ca...: not found"
even though the per-arch build for -cpu-llama-cpp pushed that exact
digest 14h31m earlier.

Root cause: backend-merge-jobs was gated on the WHOLE backend-jobs
matrix (`needs: backend-jobs`). The multi-arch -cpu-llama-cpp legs
finished within 30 min, but a single-arch CUDA-12-llama-cpp slot in
the same matrix queued for ~8h (max-parallel: 8 throttle) and then
took ~6h to build cold. By the time it freed the merge to run, quay's
GC had reaped the per-arch digests pushed by the fast multi-arch legs
the day before.

Fix: split the linux backend matrix in two.

  backend-jobs-multiarch  - entries with `platform-tag` set (paired
    per-arch legs that feed backend-merge-jobs).
  backend-jobs-singlearch - entries without `platform-tag` (heavy
    standalone builds: CUDA, ROCm, Intel oneAPI, vLLM, sglang, etc.).

backend-merge-jobs now `needs:` only backend-jobs-multiarch. The
multi-arch matrix completes in ~2-3h, well inside quay's GC window.
Heavy single-arch entries keep running independently with no merge
dependency.

scripts/changed-backends.js gains a splitByArch() helper that
partitions filtered entries by whether `platform-tag` is set, and
emits matrix-singlearch + matrix-multiarch + has-backends-singlearch
+ has-backends-multiarch outputs (replacing the previous combined
matrix / has-backends pair). Applied in both the full-matrix and
filtered-matrix code paths. Smoke test: 199 single-arch + 72 multi-
arch + 35 darwin = 271 total entries; 36 merge-matrix entries
(one per multi-arch backend pair). Matches expectation.

Local `make backends/<name>` is unaffected — the script's outputs
only feed CI workflow matrices.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 18:15:53 +02:00
Ettore Di Giacinto
733c254b32 ci: consolidate llama-cpp-darwin into the matrix-driven Darwin flow (#9731)
The bespoke llama-cpp-darwin + llama-cpp-darwin-publish top-level jobs
in backend.yml ran unconditionally on every backend.yml trigger
(push/cron), bypassing the path filter that all 34 other Darwin
backends already honor via backend-jobs-darwin -> backend_build_darwin.yml.

Move llama-cpp into the includeDarwin matrix:
- New entry in .github/backend-matrix.yml (lang=go, no build-type).
- backend_build_darwin.yml gains an `if: inputs.backend == 'llama-cpp'`
  build step that drives `make backends/llama-cpp-darwin`. The bespoke
  script (scripts/build/llama-cpp-darwin.sh) compiles three CMake
  variants from backend/cpp/llama-cpp and bundles dylibs via otool, so
  it doesn't fit the build-darwin-go-backend mold; the existing
  llama-cpp-aware ccache setup blocks already in this workflow are
  what motivated the consolidation in the first place.
- scripts/changed-backends.js's inferBackendPathDarwin gains a special
  case so llama-cpp on Darwin maps to backend/cpp/llama-cpp/ (the C++
  source tree) rather than the non-existent backend/go/llama-cpp/.
- Bumps Darwin go-version from 1.24.x -> 1.25.x in backend.yml and
  backend_pr.yml so llama-cpp keeps the Go toolchain it had under the
  bespoke job; the other 34 Darwin backends pick this up too with no
  known reason to pin 1.24.
- Removes ~80 lines of bespoke YAML from backend.yml.

The publish path is unchanged in shape - every Darwin backend now uses
the same crane-push leg from ubuntu-latest in
backend_build_darwin.yml; only the build target differs per backend.

After this commit, llama-cpp-darwin only rebuilds when
backend/cpp/llama-cpp/ is touched (verified locally) - same behavior
as every other Darwin backend.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 10:18:17 +02:00
LocalAI [bot]
cb68cd1cf4 ci: pilot per-arch split + manifest merge for faster-whisper and llama-cpp-quantization (#9727)
ci: pilot per-arch split for faster-whisper and llama-cpp-quantization

Convert two backends from QEMU-emulated multi-arch (linux/amd64,linux/arm64
on a single ubuntu-latest) to native per-arch + manifest-list merge:
- amd64 leg on ubuntu-latest
- arm64 leg on ubuntu-24.04-arm (native, ~5-10x faster than emulated)
- merge job assembles both digests under the final tag via
  docker buildx imagetools create

Backends piloted:
- -cpu-faster-whisper (small Python, fast baseline)
- -cpu-llama-cpp-quantization (heavier compile path, stress test)

Infrastructure changes that the rest of Phase 2 (Tasks 2.5+) will reuse:
- .github/backend-matrix.yml entries gain a `platform-tag` field
  ('amd64'/'arm64') for matrix entries that participate in the split.
  Other entries omit it; backend_build.yml already defaults missing
  values to '' (empty cache key suffix preserved as cache<suffix>-).
- backend.yml + backend_pr.yml forward `platform-tag` from matrix to
  the reusable backend_build.yml.
- scripts/changed-backends.js groups filtered entries by tag-suffix
  and emits a `merge-matrix` (plus `has-merges`) for groups of size>=2.
  Singletons aren't merged.
- backend.yml + backend_pr.yml gain a `backend-merge-jobs` job that
  consumes merge-matrix and calls backend_merge.yml after backend-jobs.
  PR variant is also event-gated so the no-op-on-PR merge job doesn't
  even start.

The other 34 multi-arch entries are unchanged in this PR -- Task 2.5
fans out the same shape to them once the pilot is observed green.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 00:04:42 +02:00
LocalAI [bot]
1f313cfdb0 ci: phase 1-3 of GHA free tier migration (path filter, multi-arch split prep, /mnt disk relief) (#9726)
* ci: extract free-disk-space composite action

Consolidate the apt-clean + dotnet/android/ghc/boost removal blocks from
backend_build.yml, image_build.yml, and test.yml into a single composite
action. The three callers had slightly different inline blocks; the
composite uses the more aggressive backend_build/image_build variant for
all three callers — test.yml jobs now also purge snapd, edge/firefox/
powershell/r-base-core, and sweep /opt/ghc + /usr/local/share/boost +
$AGENT_TOOLSDIRECTORY. Idempotent and skipped on self-hosted runners.

In test.yml, actions/checkout now runs before the composite action call
because the composite lives at ./.github/actions/free-disk-space and
requires a checked-out repo. The original ordering relied on
jlumbroso/free-disk-space@main being a remote action; this is the
minimum-invasive change to support a local composite.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: path-filter backend.yml master push

Run scripts/changed-backends.js on master pushes too (not just PRs) so
unrelated commits don't rebuild all ~210 backend container images. Tag
pushes still build the full matrix via FORCE_ALL.

Push events use the GitHub Compare API to diff event.before..event.after.
Edge cases (first push with zero base, API truncation beyond 300 files,
missing fields, network failure) fall back to "run everything" — better
safe than silently miss a backend.

The matrix literal moves from .github/workflows/backend.yml into a new
data-only file at .github/backend-matrix.yml (outside workflows/ so
actionlint doesn't try to parse it as a workflow). Both backend.yml and
backend_pr.yml now consume the dynamic matrix output uniformly via
fromJson(needs.generate-matrix.outputs.matrix); the script reads the
matrix from the new location.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: bound max-parallel on backend-jobs matrices

Cap to 8 concurrent jobs to avoid queue starvation on the shared GHA free
pool while migration is in flight. Lift after Phases 4-5 retire the
self-hosted runners. Also drops a leftover commented-out max-parallel
line that lived in backend.yml since the previous matrix shape.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: scope backend cache per arch, push by digest

Prepare backend_build.yml for the multi-arch split. The reusable
workflow now accepts a `platform-tag` input ("amd64" / "arm64") that
scopes the registry cache to cache<suffix>-<platform-tag> and (on push
events) pushes the resulting image by canonical digest only. Digests
are uploaded as artifacts named digests<suffix>-<platform-tag> for the
merge job (Task 2.2) to consume.

`platform-tag` is optional with empty default during the migration —
existing callers continue to work unchanged (their cache key just
becomes `cache<suffix>-`, an orphaned but valid key). Tasks 2.3+ will
update callers to pass an explicit "amd64" / "arm64" value. Phase 6
flips the input to required: true once every caller is wired.

PR builds keep their existing tag-based push to ci-tests but pick up
the per-arch cache key. Multi-arch PR builds remain emulated in this
commit; they migrate when the matrix entries split (Tasks 2.3+).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend_merge.yml reusable workflow

Joins per-arch digest artifacts (uploaded by backend_build.yml when
called with platform-tag) into a single tagged multi-arch manifest list
via `docker buildx imagetools create`. Called once per backend by
backend.yml after both per-arch build jobs succeed.

The workflow generates final tags identically to the previous monolithic
build job (same docker/metadata-action invocation), so consumers of
quay.io/go-skynet/local-ai-backends and localai/localai-backends see no
tag-shape change. Two imagetools calls (one per registry) reference the
same per-arch digests under different image names.

Not yet wired into backend.yml — Tasks 2.3+ rewrite individual matrix
entries to expand into per-arch + merge jobs that call this workflow.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: relocate Docker data-root to /mnt on hosted runners

GHA hosted ubuntu-latest runners ship a ~75 GB /mnt drive that's unused
by default. Stopping Docker, rsync'ing /var/lib/docker to /mnt, and
restarting with data-root pointing there yields ~100 GB of working
space (combined with the apt-clean from Task 1.1) — enough for ROCm
dev image + vLLM torch install + flash-attn intermediate layers.

This is the structural change that lets Phases 4 and 5 of the migration
plan move the bigger-runner and arc-runner-set jobs onto ubuntu-latest.

The composite action is no-op on self-hosted runners (where /mnt isn't
expected) and on non-X64 runners (Task 3.2 verifies the arm64 hosted
pool's /mnt shape separately before enabling). Wired into both
backend_build.yml and image_build.yml between free-disk-space and the
first Docker operation.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(setup-build-disk): chmod 1777 /mnt/docker-tmp

buildx CLI runs as the unprivileged 'runner' user and creates config
dirs under TMPDIR before binding them into the buildkit container.
/mnt is root-owned by default, so the original mkdir produced a
permission-denied when buildx tried to write there:

  ERROR: mkdir /mnt/docker-tmp/buildkitd-config2740457204: permission denied

Mirror /tmp's permission mode (1777 — world-writable with sticky bit)
on /mnt/docker-tmp so non-root processes can stage their config.

Caught by the first PR run (image-build hipblas job) on PR #9726.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: weekly full-matrix rebuild via cron

Path-filtering backend.yml master push (the previous commit's main
optimization) skips backends whose source didn't change. That broke
the DEPS_REFRESH cache-buster's coverage: the build-arg keyed on
%Y-W%V busts the install layer's cache on a new ISO week, but only
when the build actually runs. Untouched Python backends (torch,
transformers, vllm with no version pin) would otherwise ship stale
wheels indefinitely.

Add a Sunday 06:00 UTC cron that fires the full matrix. Schedule
events have no event.ref / event.before, so the script's changedFiles
== null fallback (scripts/changed-backends.js) emits the full matrix
automatically — no script change needed.

C++/Go backends with pinned deps cache-hit and complete fast, so the
weekly cost is dominated by Python re-resolves which is exactly what
we want.

workflow_dispatch added so a maintainer can trigger an ad-hoc
full-matrix rebuild without faking a tag push.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 23:43:41 +02:00
Andreas Egli
1d0de757c3 fix: add hipblaslt library (#9541)
Signed-off-by: Andreas Egli <github@kharan.ch>
2026-04-24 18:50:03 +02:00
Ettore Di Giacinto
95efb8a562 feat(backend): add turboquant llama.cpp-fork backend (#9355)
* feat(backend): add turboquant llama.cpp-fork backend

turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.

Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.

scripts/changed-backends.js gets a special-case so edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.

* feat(turboquant): carry upstream patches against fork API drift

turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.

Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. Simplifies the build flow too: instead of hopping through
llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now
drives the copied Makefile directly (clone -> patch -> build).

Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.

* docs: add turboquant backend section + clarify cache_type_k/v

Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.

Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.

* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion

The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.

* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e

The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types — turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time, meaning
nothing the LocalAI gRPC layer accepted was actually fork-specific.

Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.

Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.

Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.

* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3

The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which ships in coreutils and is already available everywhere
the rest of the wrapper Makefile runs.

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-15 01:25:04 +02:00
Ettore Di Giacinto
9ca03cf9cc feat(backends): add ik-llama-cpp (#9326)
* feat(backends): add ik-llama-cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: add grpc e2e suite, hook to CI, update README

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-12 13:51:28 +02:00
Ettore Di Giacinto
151ad271f2 feat(rocm): bump to 7.x (#9323)
feat(rocm): bump to 7.2.1

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-12 08:51:30 +02:00
Richard Palethorpe
ea6e850809 feat: Add Kokoros backend (#9212)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-08 19:23:16 +02:00
Richard Palethorpe
e0eb2fd734 chore(ci): Scope tests extras backend tests (#9170)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-30 17:46:07 +00:00
Ettore Di Giacinto
271cc79709 chore(backends): do not bundle cuda target directory (#7982)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-12 07:51:09 +01:00
Copilot
06323df457 Optimize GPU library copying to preserve symlinks and avoid duplicates (#7931)
* Initial plan

* Optimize library copying to preserve symlinks and avoid duplicates

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

* Address code review feedback: extract get_inode helper, use file type detection for sorting

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

* Simplify implementation by removing inode tracking

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

* Add clarifying comment about basename deduplication

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-01-08 22:26:48 +01:00
Copilot
fd53978a7b feat: package GPU libraries inside backend containers for unified base image (#7891)
* Initial plan

* Add GPU library packaging for isolated backend environments

- Create scripts/build/package-gpu-libs.sh for packaging CUDA, ROCm, SYCL, and Vulkan libraries
- Update llama-cpp, whisper, stablediffusion-ggml package.sh to include GPU libraries
- Update Dockerfile.python to package GPU libraries into Python backends
- Update libbackend.sh to set LD_LIBRARY_PATH for GPU library loading

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

* Address code review feedback: fix variable consistency and quoting

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

* Fix code review issues: improve glob handling and remove redundant variable

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

* Simplify main Dockerfile and workflow to use unified base image

- Remove GPU-specific driver installation from Dockerfile (CUDA, ROCm, Vulkan, Intel)
- Simplify image.yml workflow to build single unified base image for linux/amd64 and linux/arm64
- GPU libraries are now packaged in individual backend containers

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-01-07 15:48:51 +01:00
Richard Palethorpe
976c159fdb chore(ci): Build some Go based backends on Darwin (#6164)
* chore(ci): Build Go based backends on Darwin

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(stablediffusion-ggml): Fixes for building on Darwin

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(whisper): Build on Darwin

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2025-09-01 22:18:30 +02:00
Richard Palethorpe
e6ebfd3ba1 feat(whisper-cpp): Convert to Purego and add VAD (#6087)
* fix(ci): Avoid matching wrong backend with the same prefix

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(whisper): Use Purego and enable VAD

This replaces the Whisper CGO bindings with our own Purego based module
to make compilation easier.

In addition this allows VAD models to be loaded by Whisper. There is not
much benefit now except that the same backend can be used for VAD and
transcription. Depending on upstream we may also be able to use GPU for
VAD in the future, but presently it is disabled.

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-08-28 17:25:18 +02:00
Ettore Di Giacinto
4993df81c3 fix(metal-llama.cpp): add all libutf8_validity
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-08-26 10:19:55 +02:00
Ettore Di Giacinto
6971f71a6c Add mlx-vlm (#6119)
* Add mlx-vlm

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add to CI workflows

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add requirements-mps.txt

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Simplify

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-08-23 23:05:30 +02:00
Ettore Di Giacinto
1ba66d00f5 feat: bundle python inside backends (#6123)
* feat(backends): bundle python

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test ci

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* vllm on self-hosted

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add clang

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Try to fix it for Mac

* Relocate links only when is portable

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Make sure to call macosPortableEnv

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Use self-hosted for vllm

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* CI

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-08-23 22:36:39 +02:00
Ettore Di Giacinto
1d830ce7dd feat(mlx): add mlx backend (#6049)
* chore: allow to install with pip

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Make the backend to build and actually work

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* List models from system only

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add script to build darwin python backends

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Run protogen in libbackend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Detect if mps is available across python backends

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* CI: try to build backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Debug CI

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Index mlx-vlm

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Remove mlx-vlm

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop CI test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-08-22 08:42:29 +02:00
Richard Palethorpe
ebd1db2f09 chore(ci): Build modified backends on PR (#6086)
* chore(stablediffusion-ggml): rm redundant comment

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(ci): Build modified backends on PR

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2025-08-18 17:56:34 +02:00
Ettore Di Giacinto
4fbd639463 chore(ci): fixup builds for darwin and hipblas
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-08-15 15:58:02 +02:00
Ettore Di Giacinto
253b7537dc fix(llama-cpp/darwin): make sure to bundle libutf8 libs (#6060)
fix(darwin): make sure to bundle libutf8_validity

Plus some refactoring, use makefile

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-08-14 17:56:35 +02:00
Ettore Di Giacinto
294f7022f3 feat: do not bundle llama-cpp anymore (#5790)
* Build llama.cpp separately

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Start to try to attach some tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add git and small fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: correctly autoload external backends

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Try to run AIO tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Slightly update the Makefile helps

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Adapt auto-bumper

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Try to run linux test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add llama-cpp into build pipelines

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add default capability (for cpu)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop llama-cpp specific logic from the backend loader

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* drop grpc install in ci for tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Pass by backends path for tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Build protogen at start

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(tests): set backends path consistently

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Correctly configure the backends path

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Try to build for darwin

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* WIP

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Compile for metal on arm64/darwin

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Try to run build off from cross-arch

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add to the backend index nvidia-l4t and cpu's llama-cpp backends

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Build also darwin-x86 for llama-cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Disable arm64 builds temporary

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Test backend build on PR

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fixup build backend reusable workflow

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* pass by skip drivers

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Use crane

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Skip drivers

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* x86 darwin

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add packaging step for llama.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fix leftover from bark-cpp extraction

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Try to fix hipblas build

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 13:24:12 +02:00
Ettore Di Giacinto
8dcab2f9c7 chore(scripts): allow to specify quants (#5430)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-05-22 11:53:30 +02:00
Ettore Di Giacinto
e8128a339a chore(scripts): handle summarization errors (#4271)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-11-26 14:51:55 +01:00
Ettore Di Giacinto
62aa3bfdb2 fix(scripts): minor fixup to gallery scripts
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-07-13 11:36:20 +02:00
Ettore Di Giacinto
bb063ab78a feat(scripts): add scripts to help adding new models to the gallery (#2789)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-07-12 18:16:42 +02:00
Ettore Di Giacinto
683c306f90 ci(Makefile): adds tts in binary releases (#2695)
* ci(Makefile): run tts and stablediffusion in dist

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* re-add macos-13

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* rely on detection

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* move logic to a script

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* missing some libs still

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-07-05 23:19:24 +02:00