Compare commits

21 Commits

Author SHA1 Message Date
LocalAI [bot]
bc4cd3dd85 feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.

Adapt the grpc-server wrapper accordingly:

  * `common_params_speculative::type` (single enum) became `types`
    (`std::vector<common_speculative_type>`). Update both the
    "default to draft when a draft model is set" branch and the
    `spec_type`/`speculative_type` option parser. The parser now also
    tolerates comma-separated lists, mirroring the upstream
    `common_speculative_types_from_names` semantics.
  * `common_params_speculative_draft::n_ctx` is gone (draft now shares
    the target context size). Keep the `draft_ctx_size` option name for
    backward compatibility and ignore the value rather than failing.
  * `server_context_impl::model` was renamed to `model_tgt`; update the
    two reranker / model-metadata call sites.
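
A minimal sketch of the comma-separated `spec_type` parsing described above. The `spec_type` enum and the helper names (`trim`, `spec_type_from_name`, `parse_spec_types`) are stand-ins for illustration; the real wrapper works on upstream's `common_speculative_type`, and skipping unknown names is an assumption rather than confirmed upstream behavior:

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for upstream's common_speculative_type; the real
// enumerators live in llama.cpp and may differ.
enum class spec_type { draft, ngram_mod, ngram_map_k, ngram_map_k4v, ngram_cache };

// Strip leading and trailing whitespace from one list element.
static std::string trim(const std::string &s) {
    const auto not_space = [](unsigned char c) { return !std::isspace(c); };
    const auto first = std::find_if(s.begin(), s.end(), not_space);
    const auto last = std::find_if(s.rbegin(), s.rend(), not_space).base();
    return first < last ? std::string(first, last) : std::string();
}

// Map one name to its enum value; returns false for unknown names.
static bool spec_type_from_name(const std::string &name, spec_type &out) {
    if (name == "draft")         { out = spec_type::draft;         return true; }
    if (name == "ngram_mod")     { out = spec_type::ngram_mod;     return true; }
    if (name == "ngram_map_k")   { out = spec_type::ngram_map_k;   return true; }
    if (name == "ngram_map_k4v") { out = spec_type::ngram_map_k4v; return true; }
    if (name == "ngram_cache")   { out = spec_type::ngram_cache;   return true; }
    return false;
}

// Parse "draft,ngram_mod" into {draft, ngram_mod}. Empty and unknown
// elements are skipped (an assumption of this sketch).
static std::vector<spec_type> parse_spec_types(const std::string &value) {
    std::vector<spec_type> out;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        spec_type t;
        if (spec_type_from_name(trim(item), t)) {
            out.push_back(t);
        }
    }
    return out;
}
```

A single name such as `draft` goes through the same path and yields a one-element vector, so old configs keep working.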

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)
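
As a rough illustration of the `draft_override_tensor` value format, the sketch below splits a comma-separated list of `<tensor regex>=<buffer type>` pairs. The function name `parse_override_tensor` and the buffer-type strings in the usage note are hypothetical; the real code resolves each buffer type against the available backends, which is left out here:

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// One override: tensors whose name matches `first` go to buffer type `second`.
// The buffer type stays a plain string in this sketch.
using tensor_override = std::pair<std::string, std::string>;

static std::vector<tensor_override> parse_override_tensor(const std::string &value) {
    std::vector<tensor_override> out;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        // Split at the first '='; both sides must be non-empty.
        const std::string::size_type eq = item.find('=');
        if (eq == std::string::npos || eq == 0 || eq + 1 == item.size()) {
            throw std::invalid_argument(
                "expected <tensor regex>=<buffer type>, got: " + item);
        }
        out.emplace_back(item.substr(0, eq), item.substr(eq + 1));
    }
    return out;
}
```

For example, `parse_override_tensor("blk\\.[0-9]+\\.ffn_.*=CPU,output=CUDA0")` returns two pairs. Because the list separator is a comma, a regex that itself contains a comma cannot be expressed in this format.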

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:

  * `ctx_server.impl->model_tgt` (fork still has `model`)
  * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
    (none of these sub-structs exist in the fork)
  * `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
    tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
  * `params.speculative.types` vector / `common_speculative_types_from_names`
    (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
   discriminations (the "default to draft when a draft model is set" branch
   and the `spec_type` / `speculative_type` option parser) fall back to the
   singular scalar form, and the entire new-option block (ngram_mod / map_k
   / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
   tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
   in the source tree — stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

   Both patches are idempotent and follow the existing sed/awk pattern in
   this script (KV cache types, `get_media_marker`, flat speculative
   renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain, where the `} else if` opening of
each new block also closes the previous block's brace. With the macro
defined the new blocks vanish, draft_ctx_size's `{` loses its closer,
the for-loop's `}` is consumed in its place, and the file ends with an
unclosed opening brace. clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                  // legacy: chain ends here
#else
    } else if (... "spec_ngram_mod_n_min") {  // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                  // closes last branch
#endif
    }                                  // closes for-loop

Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

  Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
  list of gfx targets e.g. gfx1100,gfx1101.  Stop.
  make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:37 +02:00
LocalAI [bot]
86a7f6c9fa ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 (#9781)
* ci: close the GC race + cascade-skip + darwin grpc gaps from v4.2.1

v4.2.1's backend.yml run (#25701862853) exposed three independent issues
on top of the singletons fix shipped in ea001995. Address all three plus
two related cleanups:

1. quay GC race in backend-merge-jobs-multiarch (12/37 merges failed with
   "manifest not found"). Even after PR #9746 split multi/single-arch
   merges, the multiarch matrix itself takes ~2h to drain at
   max-parallel: 8, and the earliest per-arch digests (push-by-digest,
   no tag) get reaped by quay's GC before the merge runs. The split
   bounded the race for multiarch; it doesn't eliminate it. Anchor each
   per-arch digest immediately to a tag in the internal ci-cache image
   (`keepalive-<run_id><tag-suffix>-<platform-tag>`). Quay won't GC
   tagged manifests. backend_merge.yml deletes the keepalive tags via
   quay REST API after publishing the user-facing manifest list.
   Cleanup is best-effort: if the quay token is not OAuth-scoped the
   merge does NOT fail; the orphan tags simply persist.

2. cascade-skip on backend-merge-jobs-singlearch. v4.2.1 had 2 failed
   and 2 cancelled singlearch builds (out of 199); GHA's default
   `needs:` semantics cascade-skipped the entire singlearch merge
   matrix, so zero singleton tags were applied even though 197
   singletons built successfully. Wrap the merge `if:` in
   `!cancelled() && ...` for both multi and single arch in backend.yml
   and backend_pr.yml so partial build failures publish the successful
   tag-suffixes.

3. Darwin llama-cpp grpc-server build fails with `find_package(absl)`
   not found. Same shape as the ccache/blake3/fmt/hiredis/xxhash/zstd
   fix already in `Dependencies`: a brew cache hit restores
   `/opt/homebrew/Cellar/grpc` so `brew install grpc` no-ops, but
   abseil isn't in our Cellar cache list and never gets installed
   alongside, leaving grpc's CMake unable to resolve it. Mirror the
   `brew reinstall ccache` line with `brew reinstall grpc` to
   re-validate grpc's full transitive dep closure on every cache-hit
   run.

4. Move the four heaviest CUDA cpp builds back to bigger-runner. v4.2.1
   wall-clock: -gpu-nvidia-cuda-12-llama-cpp 5h36m,
   -gpu-nvidia-cuda-12-turboquant 6h05m,
   -gpu-nvidia-cuda-13-llama-cpp 5h37m,
   -gpu-nvidia-cuda-13-turboquant 6h05m. The cuda-12 turboquant and
   cuda-13 turboquant entries are over GHA's 6h job timeout. Phase 5.3
   of the free-tier migration (PR #9730) had explicitly flagged this
   batch as 'highest-risk' with a per-entry revert path. All other
   matrix entries (vulkan-llama-cpp ~47m, ROCm hipblas-llama-cpp ~2h,
   intel sycl-f32 ~1h49m) stay on free-tier ubuntu-latest.

Verified locally: all six edited workflow YAMLs parse cleanly. Real
verification has to come from the next tag release run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: extract keepalive anchor + cleanup into .github/scripts/

The two inline shell blocks from the previous commit are long enough to
hurt readability of the workflow YAML and benefit from their own files
with self-contained docs. Move them to .github/scripts/:

  anchor-digest-in-cache.sh    backend_build.yml's keepalive anchor
  cleanup-keepalive-tags.sh    backend_merge.yml's best-effort cleanup

Workflow steps reduce to a single `run:` invocation each, with all the
parameter plumbing handled by env vars on the step. backend_merge.yml
also gains a sparse `actions/checkout@v6` step (sparse to .github/scripts
only) so the cleanup script is available on the runner — backend_build
already checks out for the docker build.

Net workflow diff: -36 lines across the two files. Script logic and
behavior are byte-identical to the inline version.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:09 +02:00
LocalAI [bot]
a57e73691d fix(ollama): accept prompt alias on /api/embed for Ollama parity (#9780)
Ollama's embedding endpoint accepts both `input` and `prompt` as the
input string value (see ollama/ollama docs/api.md#generate-embeddings).
LocalAI only accepted `input`, which broke client libraries that send
the `prompt` form.

Add `Prompt` to OllamaEmbedRequest and have GetInputStrings fall back
to it when Input is unset. Input still wins when both are provided.

Fixes #9767.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:21:20 +02:00
dependabot[bot]
a689100d61 chore(deps): bump the npm_and_yarn group across 1 directory with 3 updates (#9728)
Bumps the npm_and_yarn group with 3 updates in the /core/http/react-ui directory: [fast-uri](https://github.com/fastify/fast-uri), [hono](https://github.com/honojs/hono) and [ip-address](https://github.com/beaugunderson/ip-address).


Updates `fast-uri` from 3.1.0 to 3.1.2
- [Release notes](https://github.com/fastify/fast-uri/releases)
- [Commits](https://github.com/fastify/fast-uri/compare/v3.1.0...v3.1.2)

Updates `hono` from 4.12.14 to 4.12.18
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.14...v4.12.18)

Updates `ip-address` from 10.1.0 to 10.2.0
- [Commits](https://github.com/beaugunderson/ip-address/commits)

---
updated-dependencies:
- dependency-name: fast-uri
  dependency-version: 3.1.2
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: hono
  dependency-version: 4.12.18
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: ip-address
  dependency-version: 10.2.0
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:54:38 +02:00
Andreas Egli
03815e3b59 fix: parse vulkan VRAM from text (#9669)
* fix: parse vulkan VRAM from text

Assisted-by: opencode:gpt-5.5
Signed-off-by: Andreas Egli <github@kharan.ch>

* fix: replace string.split with streaming iteration

Assisted-by: Opencode:Gemma4
Signed-off-by: Andreas Egli <github@kharan.ch>

---------

Signed-off-by: Andreas Egli <github@kharan.ch>
2026-05-12 09:53:48 +02:00
dependabot[bot]
37991c8a18 chore(deps): bump github.com/mudler/edgevpn from 0.31.1 to 0.32.2 (#9773)
Bumps [github.com/mudler/edgevpn](https://github.com/mudler/edgevpn) from 0.31.1 to 0.32.2.
- [Release notes](https://github.com/mudler/edgevpn/releases)
- [Commits](https://github.com/mudler/edgevpn/compare/v0.31.1...v0.32.2)

---
updated-dependencies:
- dependency-name: github.com/mudler/edgevpn
  dependency-version: 0.32.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:51:39 +02:00
dependabot[bot]
61c9b187fa chore(deps): update charset-normalizer requirement from >=3.4.0 to >=3.4.7 in /backend/python/vllm (#9779)
chore(deps): update charset-normalizer requirement

Updates the requirements on [charset-normalizer](https://github.com/jawah/charset_normalizer) to permit the latest version.
- [Release notes](https://github.com/jawah/charset_normalizer/releases)
- [Changelog](https://github.com/jawah/charset_normalizer/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jawah/charset_normalizer/compare/3.4.0...3.4.7)

---
updated-dependencies:
- dependency-name: charset-normalizer
  dependency-version: 3.4.7
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:22:23 +02:00
dependabot[bot]
c66014312e chore(deps): bump github.com/fsnotify/fsnotify from 1.9.0 to 1.10.1 (#9778)
Bumps [github.com/fsnotify/fsnotify](https://github.com/fsnotify/fsnotify) from 1.9.0 to 1.10.1.
- [Release notes](https://github.com/fsnotify/fsnotify/releases)
- [Changelog](https://github.com/fsnotify/fsnotify/blob/main/CHANGELOG.md)
- [Commits](https://github.com/fsnotify/fsnotify/compare/v1.9.0...v1.10.1)

---
updated-dependencies:
- dependency-name: github.com/fsnotify/fsnotify
  dependency-version: 1.10.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:21:18 +02:00
dependabot[bot]
abc2a51641 chore(deps): update transformers requirement from >=5.0.0 to >=5.8.0 in /backend/python/transformers (#9775)
chore(deps): update transformers requirement

Updates the requirements on [transformers](https://github.com/huggingface/transformers) to permit the latest version.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v5.0.0...v5.8.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.8.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:21:05 +02:00
dependabot[bot]
cd7d163178 chore(deps): bump github.com/onsi/gomega from 1.39.1 to 1.40.0 (#9774)
Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.39.1 to 1.40.0.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](https://github.com/onsi/gomega/compare/v1.39.1...v1.40.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-version: 1.40.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:36 +02:00
dependabot[bot]
7aac599deb chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.27.0 to 1.42.0 (#9772)
chore(deps): bump github.com/anthropics/anthropic-sdk-go

Bumps [github.com/anthropics/anthropic-sdk-go](https://github.com/anthropics/anthropic-sdk-go) from 1.27.0 to 1.42.0.
- [Release notes](https://github.com/anthropics/anthropic-sdk-go/releases)
- [Changelog](https://github.com/anthropics/anthropic-sdk-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/anthropics/anthropic-sdk-go/compare/v1.27.0...v1.42.0)

---
updated-dependencies:
- dependency-name: github.com/anthropics/anthropic-sdk-go
  dependency-version: 1.42.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:24 +02:00
dependabot[bot]
d75173dd2a chore(deps): bump actions/download-artifact from 4 to 8 (#9771)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 8.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v4...v8)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:14 +02:00
dependabot[bot]
9be5310394 chore(deps): bump actions/upload-artifact from 4 to 7 (#9770)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:03 +02:00
dependabot[bot]
cdf50fd723 chore(deps): bump node from 25-slim to 26-slim (#9769)
Bumps node from 25-slim to 26-slim.

---
updated-dependencies:
- dependency-name: node
  dependency-version: 26-slim
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:19:51 +02:00
LocalAI [bot]
bc3fb16105 feat(ollama): report model capabilities + details on /api/tags and /api/show (#9766)
Ollama-compatible clients (Open WebUI, Enchanted, ollama-grid-search,
etc.) rely on the `capabilities` list and `details.{parameter_size,
quantization_level,families}` fields returned by /api/tags and
/api/show to decide which models are eligible for a given task --
for example to filter the "embedding model" picker. Upstream Ollama
returns these; LocalAI's compat layer was leaving them empty, so
embedding models were silently rejected by clients that only allow
chat models for chat and only allow embedding models for embeddings.

This wires up the existing config signals already present in
ModelConfig:

- modelCapabilities() derives the Ollama capability strings from the
  config: "embedding" (FLAG_EMBEDDINGS), "completion" (FLAG_CHAT /
  FLAG_COMPLETION), "vision" (explicit KnownUsecases bit or MMProj /
  multimodal template / backend media marker), "tools" (auto-detected
  ToolFormatMarkers, JSON/Response regex, XML format, grammar
  triggers), "thinking" (ReasoningConfig with reasoning not disabled)
  and "insert" (presence of a completion template).
- modelDetailsFromModelConfig() now fills families, parameter_size
  and quantization_level. The latter two are parsed from the GGUF
  filename via regex -- conservative tokens only (Q*/IQ*/F16/F32/BF16
  and \d+(\.\d+)?[BM] surrounded by separators) so we don't accidentally
  match "Qwen3" as "3B".
- modelInfoFromModelConfig() exposes general.architecture and
  general.context_length in the new ShowResponse.model_info map.
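
To make the "surrounded by separators" rule concrete, here is a small sketch of the parameter-size match only. The actual implementation is Go; this C++ version, the function name `parameter_size_from_filename`, and the separator set (`_`, `.`, `-`) are assumptions for illustration:

```cpp
#include <regex>
#include <string>

// Returns the parameter-size token (for example "8B" or "0.6B") found in a
// model filename, or an empty string when no separator-delimited token exists.
static std::string parameter_size_from_filename(const std::string &filename) {
    // Group 2 is the token. Groups 1 and 4 require a separator or the string
    // boundary on both sides, so digits glued to a model name are ignored.
    static const std::regex re(R"((^|[_.-])(\d+(\.\d+)?[BM])($|[_.-]))");
    std::smatch m;
    if (std::regex_search(filename, m, re)) {
        return m[2].str();
    }
    return std::string();
}
```

With this pattern, `Qwen3-8B-Instruct-Q4_K_M.gguf` yields `8B`, while the `3` in `Qwen3` is skipped because it is preceded by a letter instead of a separator.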

Note: HasUsecases(FLAG_VISION) cannot be used directly -- GuessUsecases
has no FLAG_VISION case and returns true at the end for any chat model.
hasVisionSupport() instead reads KnownUsecases explicitly plus MMProj /
template / media-marker signals.

Tests are written first (TDD) using Ginkgo/Gomega -- DescribeTable for
the capability mapping (embedding-only, chat, vision, thinking, tools
via markers, tools via JSON regex, no-capability rerank) plus
integration tests against ShowModelEndpoint that round-trip JSON
through a real ModelConfigLoader populated from a temp YAML file.

Fixes #9760.


Assisted-by: Claude Code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 00:16:19 +02:00
LocalAI [bot]
78722caedc chore: ⬆️ Update ikawrakow/ik_llama.cpp to eb570eb96689c235933b813693ca28ab9d3d26de (#9764)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-12 00:02:22 +02:00
LocalAI [bot]
621c612b2d ci(bump-deps): register ds4 + move version pin into the Makefile (#9761)
* ci(bump-deps): register ds4 + move version pin into the Makefile

The initial ds4 PR (#9758) put the upstream commit pin in
backend/cpp/ds4/prepare.sh as a shell variable. The auto-bump bot at
.github/bump_deps.sh greps for ^$VAR?= in a Makefile, so DS4_VERSION
was invisible to it - other backends (llama-cpp, ik-llama-cpp,
turboquant, voxtral, etc.) all pin in their Makefile.

This change:

- Moves DS4_VERSION?= and DS4_REPO?= to the top of
  backend/cpp/ds4/Makefile.
- Inlines the git init/fetch/checkout recipe into the 'ds4:' target
  (matches llama-cpp's 'llama.cpp:' target pattern). Directory acts
  as the target so make only re-clones when missing.
- Deletes the now-redundant prepare.sh.
- Adds antirez/ds4 + DS4_VERSION + main + backend/cpp/ds4/Makefile to
  the .github/workflows/bump_deps.yaml matrix so the daily bot opens
  PRs against this pin.
- Updates .agents/ds4-backend.md to point at the Makefile.

Verified:
  $ grep -m1 '^DS4_VERSION?=' backend/cpp/ds4/Makefile
  DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
  $ make -C backend/cpp/ds4 ds4   # clones into ds4/ at the pin
  $ make -C backend/cpp/ds4 ds4   # no-op on second invocation
  make: 'ds4' is up to date.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: route backend/cpp/ds4/ changes through changed-backends.js

scripts/changed-backends.js:inferBackendPath has an explicit branch per
cpp dockerfile suffix (ik-llama-cpp, turboquant, llama-cpp). Without a
matching branch the function returns null, the backend never lands in
the path map, and PR change-detection cannot map "backend/cpp/ds4/X
changed" -> "rebuild ds4 image".

This is why PR #9761 produced zero ds4 jobs even though it directly
edits backend/cpp/ds4/Makefile.

Adds the missing branch (Dockerfile.ds4 -> backend/cpp/ds4/), placed
before the llama-cpp branch (since both share the .cpp ancestry but
ds4 is more specific - same ordering rule documented in
.agents/adding-backends.md).

Verified with a local Node simulation of the script against this PR's
diff: the path map now contains 'ds4 -> backend/cpp/ds4/' and a
'backend/cpp/ds4/Makefile' change correctly triggers the ds4 backend
in the rebuild set.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(adding-backends): harden the two gotchas that bit ds4

Both omissions are silent at the time you ADD a backend; the failure
only appears later. A missing Makefile pin means the bump bot never
opens a PR for the backend, and a missing path-filter branch means the
next PR that touches the backend gets zero CI jobs and looks broken for
unrelated reasons. Expand the `scripts/changed-backends.js` paragraph
from a one-liner to a fully worked example, and add a new sibling
paragraph for the `bump_deps.yaml` + Makefile-pin contract.

Both call out the specific mistakes from the ds4 timeline (#9758 and
#9761) so future contributors can pattern-match on the cause.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:46:02 +02:00
LocalAI [bot]
e3f9de1026 docs: ⬆️ update docs version mudler/LocalAI (#9762)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-11 22:37:06 +02:00
LocalAI [bot]
d892e4af80 feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758)
* test(e2e-backends): allow BACKEND_BINARY for native-built backends

Adds an escape hatch for hardware-gated backends (e.g. ds4) where the
model is too large for Docker build context. When BACKEND_BINARY points
at a run.sh produced by 'make -C backend/cpp/<name> package', the suite
skips docker image extraction and drives the binary directly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(e2e-backends): validate BACKEND_BINARY basename + log actual source

Two follow-ups from the cbcf5148 code review:

- BACKEND_BINARY now requires a path whose basename is `run.sh`. Without
  this check, `filepath.Dir(binary)` silently discarded the filename, so
  pointing the env var at an arbitrary binary failed later with a
  confusing assertion that named a path the user never typed.
- The "Testing image=..." debug line printed an empty string when the
  binary path was used, hiding the actual source in CI logs. The line
  now reports whichever of BACKEND_IMAGE / BACKEND_BINARY is in effect
  as `src=...`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): scaffold ds4 backend dir

Adds prepare.sh, run.sh, and a .gitignore. CMakeLists, Makefile, and the
implementation arrive in follow-up commits.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add backend Makefile

Drives ds4's upstream Makefile to produce engine .o files (CUDA on Linux
when BUILD_TYPE=cublas, Metal on Darwin, otherwise CPU debug path), then
invokes CMake on our wrapper.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add CMakeLists for grpc-server

Generates protoc stubs from backend.proto, links grpc-server.cpp +
dsml_parser.cpp + dsml_renderer.cpp + kv_cache.cpp against pre-built
ds4 engine .o files. DS4_GPU=cuda|metal|cpu selects the backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): grpc-server skeleton + module stubs

The minimum that links: Backend service with Health + Free; other RPCs
default to UNIMPLEMENTED. Stub headers/sources for dsml_parser,
dsml_renderer, and kv_cache are in place so CMake links cleanly even
before those modules ship.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement LoadModel

Opens engine + creates session sized to ContextSize (default 32768).
Backend is compile-time: CPU when DS4_NO_GPU, Metal on __APPLE__, else
CUDA. MTP/speculative options are accepted via ModelOptions.Options[]
(mtp_path, mtp_draft, mtp_margin). kv_cache_dir option is captured into
g_kv_cache_dir for the cache module (Task 19 wires it in).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement TokenizeString

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Predict (plain text)

Tool calls + thinking-mode split arrive in Task 13 once dsml_parser is in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement PredictStream (plain text)

ChatDelta + reasoning/tool_calls split arrives in Task 14.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Status RPC

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add DSML streaming parser

Classifies raw model-emitted token text into CONTENT / REASONING /
TOOL_START / TOOL_ARGS / TOOL_END events. Markers it watches for are the
literal DSML strings rendered by ds4_server.c's prompt template
(<|DSML|tool_calls>, <|DSML|invoke name=...>, <think>, etc.) - these are
plain text the model emits, not special tokens.

Partial markers split across token chunks are buffered until a full marker
or a definitively-not-a-marker '<' is observed. RandomToolId() generates
the API-side tool call id (call_xxx) that exact-replay would key on.
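
The buffering of partial markers can be sketched as follows. This is a simplified illustration, not the DsmlParser itself: it only separates content from reasoning, and the class name `marker_splitter`, the `event` struct, and the `</think>` closing marker are assumptions made for the example:

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct event {
    bool reasoning;    // true while inside the think block
    std::string text;  // text that is certain not to belong to a marker
};

class marker_splitter {
public:
    // Feed one chunk of model output; returns the events that are now certain.
    std::vector<event> feed(const std::string &chunk) {
        std::vector<event> out;
        buf_ += chunk;
        for (;;) {
            const std::string &marker = in_think_ ? close_ : open_;
            const std::string::size_type pos = buf_.find(marker);
            if (pos != std::string::npos) {
                emit(out, buf_.substr(0, pos));      // text before the marker
                buf_.erase(0, pos + marker.size());  // drop the marker itself
                in_think_ = !in_think_;
                continue;
            }
            // No full marker: hold back only the tail that could still grow
            // into the marker, and flush everything before it.
            const std::string::size_type keep = partial_suffix(buf_, marker);
            emit(out, buf_.substr(0, buf_.size() - keep));
            buf_.erase(0, buf_.size() - keep);
            return out;
        }
    }

private:
    void emit(std::vector<event> &out, const std::string &text) const {
        if (!text.empty()) {
            out.push_back(event{in_think_, text});
        }
    }

    // Length of the longest proper prefix of `marker` that `s` ends with.
    static std::string::size_type partial_suffix(const std::string &s,
                                                 const std::string &marker) {
        const std::string::size_type max = std::min(s.size(), marker.size() - 1);
        for (std::string::size_type n = max; n > 0; --n) {
            if (s.compare(s.size() - n, n, marker, 0, n) == 0) {
                return n;
            }
        }
        return 0;
    }

    const std::string open_ = "<think>";
    const std::string close_ = "</think>";
    std::string buf_;
    bool in_think_ = false;
};
```

Feeding `"Hello <thi"` emits only `"Hello "` and keeps `"<thi"` buffered; a later chunk either completes the marker or shows that the `<` was ordinary text, in which case it is flushed as content.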

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): split hex escapes in DSML markers + add cstring/cstdio includes

C++ \x hex escapes have no length cap. '\x9cD' was read as a single escape
producing byte 0xCD, eating the 'D', so the markers never actually matched
the DSML text the model emits. Split each escape with adjacent string literal
concatenation so the byte sequence is exactly EF BD 9C 44 (U+FF5C fullwidth
vertical line, then 'D') at runtime.
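
A minimal illustration of the literal-splitting fix; the names `kBarD` and `bar_d` are made up for this example and do not come from the backend sources:

```cpp
#include <string>

// U+FF5C (fullwidth vertical line) encodes as EF BD 9C in UTF-8. A hex escape
// consumes every hex digit that follows it, so writing the 'D' directly after
// the 9c escape would fold it into the escape. Ending the literal after the
// escape and starting a new one keeps 'D' as its own byte; adjacent literals
// are concatenated by the compiler.
static const char kBarD[] = "\xef\xbd\x9c" "D";

// Same bytes as a std::string, without the trailing NUL.
static std::string bar_d() {
    return std::string(kBarD, sizeof(kBarD) - 1);
}
```

`sizeof(kBarD)` is 5 here: the four bytes EF BD 9C 44 plus the terminating NUL.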

Also adds <cstring> and <cstdio> includes (libstdc++ 13 does not transitively
expose std::strlen / std::snprintf via <string>).

The local plan file (uncommitted) was also updated with the same fixes so
Task 16's dsml_renderer.cpp does not re-introduce the bug.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into Predict (ChatDelta)

Non-streaming Predict now emits one ChatDelta carrying content,
reasoning_content, and tool_calls[] parsed from the model's DSML output.
Reply.message still carries the raw model bytes for backends that prefer
the regex fallback path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into PredictStream

Per-token ChatDelta writes: content/reasoning_content go incrementally,
tool_calls emit TOOL_START as one delta (id + name) followed by
TOOL_ARGS deltas with incremental JSON. The Go-side aggregator
(pkg/functions/chat_deltas.go) reassembles them.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): chat template + reasoning_effort mapping

UseTokenizerTemplate=true + Messages -> ds4_chat_begin / append /
assistant_prefix. PredictOptions.Metadata['enable_thinking'] and
['reasoning_effort'] map to ds4_think_mode (DS4_THINK_HIGH default;
'max'/'xhigh' -> DS4_THINK_MAX; disabled -> DS4_THINK_NONE).
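
The mapping described above could look roughly like this. It is a sketch under assumptions: `ThinkMode` stands in for the ds4 enum, the metadata is modelled as a plain string map, and "disabled" is assumed to be the literal string "false".

```cpp
#include <map>
#include <string>

// Stand-in for ds4's think-mode enum; the real constants live in ds4.h.
enum class ThinkMode { None, High, Max };

// `metadata` mirrors PredictOptions.Metadata (string keys, string values).
ThinkMode pick_think_mode(const std::map<std::string, std::string>& metadata) {
    // Assumption: thinking is disabled only by an explicit "false".
    auto thinking = metadata.find("enable_thinking");
    if (thinking != metadata.end() && thinking->second == "false") {
        return ThinkMode::None;
    }
    // 'max' and 'xhigh' both select the strongest mode.
    auto effort = metadata.find("reasoning_effort");
    if (effort != metadata.end() &&
        (effort->second == "max" || effort->second == "xhigh")) {
        return ThinkMode::Max;
    }
    // Anything else, including a missing key, maps to the default.
    return ThinkMode::High;
}
```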

Tool-call rendering for assistant turns with tool_calls JSON arrives in
the next commit (dsml_renderer).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): render assistant tool_calls + tool results to DSML

Closes the round-trip: when an OpenAI client sends a multi-turn chat
where prior turns contain tool_calls or role=tool messages, build_prompt
serializes them back to the DSML shape the model was trained on. Mirrors
ds4_server.c's prompt renderer; uses nlohmann::json for parsing the
OpenAI tool_calls payload.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): disk KV cache module

Dir-based cache keyed by SHA1(rendered prompt prefix). File format:
'DS4G' magic + version + ctx_size + prefix_len + prefix + payload_bytes
+ ds4_session_save_payload output. NOT bit-compatible with ds4-server's
KVC files - that interop is a follow-up plan. LoadLongestPrefix walks
the dir picking the longest stored prefix that prefixes the incoming
prompt.
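
The longest-prefix selection can be illustrated without any file I/O. In this sketch the stored prefixes are already in memory; `longest_prefix_index` is a hypothetical helper, not the LoadLongestPrefix implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the index of the longest entry in `stored` that is a prefix of
// `prompt`, or -1 when no stored entry matches.
int longest_prefix_index(const std::vector<std::string>& stored,
                         const std::string& prompt) {
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < stored.size(); ++i) {
        const std::string& p = stored[i];
        if (p.empty() || p.size() > prompt.size()) continue;  // cannot match
        if (prompt.compare(0, p.size(), p) != 0) continue;    // not a prefix
        if (best < 0 || p.size() > best_len) {
            best = static_cast<int>(i);
            best_len = p.size();
        }
    }
    return best;
}
```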

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire KvCache into Predict/PredictStream

LoadModel reads 'kv_cache_dir' from ModelOptions.Options[], passes it to
g_kv_cache.SetDir. Each Predict/PredictStream computes a render text for
the request, tries LoadLongestPrefix to recover state, then Saves the
new state after generation. ds4_session_sync handles the live-cache
fast path internally, so the disk cache only matters for cold-starts
and cross-session reuse.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add package.sh

Linux: bundles libc + ld + libstdc++ + libgomp + GPU runtime libs into
package/lib so the FROM scratch image boots without a host libc.
Darwin is handled by scripts/build/ds4-darwin.sh which uses otool -L.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): rename namespace ds4_backend -> ds4cpp

ds4.h defines 'typedef enum {...} ds4_backend' which collides with our
C++ 'namespace ds4_backend' anywhere a TU includes both. kv_cache.h
includes ds4.h directly and surfaces the conflict immediately; other
TUs would hit it once gRPC dev headers are available.

Renames the C++ namespace to ds4cpp across all wrapper files and the
plan, leaving the upstream ds4 typedef untouched.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): add Dockerfile.ds4

Single-stage builder (CUDA devel image for cublas, ubuntu:24.04 for cpu)
-> FROM scratch with packaged grpc-server + bundled runtime libs.
nlohmann-json3-dev is required for dsml_renderer's JSON handling.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(make): wire backend/cpp/ds4 + ds4-darwin into root Makefile

BACKEND_DS4 entry + generate-docker-build-target eval + docker-build-ds4
in docker-build-backends + .NOTPARALLEL guards. Also adds the
backends/ds4-darwin target which delegates to scripts/build/ds4-darwin.sh
(landed in Task 24).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend-matrix entries for ds4 (cpu + cuda13, per-arch)

Two entries per build (amd64 + arm64) so backend-merge-jobs assembles a
multi-arch manifest. Skipping cuda12 - ds4 was validated against CUDA 13.
Darwin Metal is handled outside this matrix by backend_build_darwin.yml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add ds4 meta + image entries

cpu + cuda13 x latest + master. Darwin Metal builds publish under
ds4-darwin via the existing llama-cpp-darwin OCI pipeline.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(scripts/build): add ds4-darwin.sh

Native macOS/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh:
make grpc-server -> otool -L for dylib bundling -> OCI tar that
'local-ai backends install' consumes via the backends/ds4-darwin
Makefile target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(darwin): build ds4-darwin in backend_build_darwin

Adds a 'Build ds4 backend (Darwin Metal)' step that runs the
backends/ds4-darwin Makefile target on the macOS runner.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(import): auto-detect ds4 weights via DS4Importer

Adds core/gallery/importers/ds4.go which matches on the antirez/deepseek-v4-gguf
repo URI and the DeepSeek-V4-Flash-*.gguf filename pattern. Registered before
LlamaCPPImporter so ds4 weights route to backend: ds4 instead of falling
through to llama-cpp.

Also lists ds4 in /backends/known so the /import-model UI surfaces it as a
manual choice for users who want to force the backend on a non-canonical URI.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add deepseek-v4-flash-q2 (ds4 backend)

One-click install of the q2 weights with backend: ds4.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(.agents): add ds4-backend.md

Documents the backend shape, DSML state machine, thinking-mode mapping,
disk KV cache, build matrix (cpu/cuda13/Darwin), and the BACKEND_BINARY
hardware-validation path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): pass UBUNTU_VERSION + arch env vars to install-base-deps

The .docker/install-base-deps.sh script needs UBUNTU_VERSION (defaults to
2404), TARGETARCH, SKIP_DRIVERS, and APT_MIRROR/APT_PORTS_MIRROR exported
into the environment so it can pick the right cuda-keyring / cudss / nvpl
debs and apt mirrors. Dockerfile.ds4 was declaring some of the ARGs but not
re-exporting them via ENV. Mirrors Dockerfile.llama-cpp's pattern.

Without this fix 'make docker-build-ds4 BUILD_TYPE=cublas CUDA_MAJOR_VERSION=13'
failed at:
  /usr/local/sbin/install-base-deps: line 120: UBUNTU_VERSION: unbound variable

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add Metal image entries for ds4

Adds metal-ds4 + metal-ds4-development image entries pointing at
quay.io/go-skynet/local-ai-backends:{latest,master}-metal-darwin-arm64-ds4
(built by scripts/build/ds4-darwin.sh on macOS arm64 runners), plus the
'metal' and 'metal-darwin-arm64' capability mappings on the ds4 meta and
ds4-development variant.

Closes a gap from the initial Task 23 landing - the Darwin Metal build
script and CI workflow step were already wired (Tasks 24-25), but the
gallery had no image entry for users to install the Metal variant.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): use ubuntu:24.04 base for ds4 cuda13 matrix entries

The initial Task 22 matrix landing used base-image: 'nvidia/cuda:13.0.0-devel-ubuntu24.04'
which clashes with install-base-deps.sh's cuda-keyring step:

  E: Conflicting values set for option Signed-By regarding source
     https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/

The canonical pattern (llama-cpp, ik-llama-cpp, turboquant) uses plain
'ubuntu:24.04' + 'skip-drivers: false' so install-base-deps installs CUDA
from scratch via its own keyring setup. Adopting that here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): drop install-base-deps.sh dependency

The .docker/install-base-deps.sh pipeline is built around the llama-cpp
needs: NVIDIA keyring + cuda-toolkit apt + gRPC-from-source build at
/opt/grpc. For ds4 we don't need any of that:
- CUDA: nvidia/cuda:13.0.0-devel-ubuntu24.04 ships /usr/local/cuda
  ready to go; install-base-deps's keyring step then conflicts with
  the pre-installed Signed-By.
- gRPC: ds4's grpc-server.cpp only links against grpc++; system
  libgrpc++-dev (apt) is sufficient, no source build needed.

Replaced the install-base-deps invocation in Dockerfile.ds4 with a
direct 'apt-get install libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc
nlohmann-json3-dev cmake build-essential pkg-config git'. The matrix entries
go back to the nvidia/cuda base with skip-drivers=true, so install-base-deps
is a no-op even if some downstream tooling calls it.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): correct proto accessors + alias grpc::Status as GStatus

Two compile bugs caught by the docker build:

1. proto::Message uses snake_case accessors. The build_prompt loop called
   m.toolcalls() / m.toolcallid() - the protoc-generated names are
   m.tool_calls() / m.tool_call_id(). Plan-text bug propagated to the
   wrapper.

2. The Status RPC method shadowed the 'using grpc::Status' alias, so any
   later method declaration using Status as a return type failed to parse
   ('Status does not name a type' starting at LoadModel). Solution: alias
   grpc::Status as GStatus instead, with no 'using' clause that would
   conflict. All RPC method declarations and return-statement constructions
   now use GStatus.
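
A reduced illustration of the second fix. `fake_grpc::Status` is a stand-in so the sketch builds without gRPC headers, and `Service` is a hypothetical class, not the real wrapper.

```cpp
namespace fake_grpc {
// Stand-in for grpc::Status.
struct Status {
    bool is_ok;
    bool ok() const { return is_ok; }
};
}  // namespace fake_grpc

// Alias under a name that no RPC method uses.
using GStatus = fake_grpc::Status;

struct Service {
    // If the type were imported as plain `Status` and this method were
    // declared as `Status Status()`, the name `Status` would refer to the
    // member function for the rest of the class, and the LoadModel
    // declaration below would no longer name a type.
    GStatus Status() { return GStatus{true}; }
    GStatus LoadModel() { return GStatus{true}; }
};
```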

An earlier code review flagged the Status-shadow concern as 'minor' in the
original Task 10 commit; it turned out to be a real compile blocker under
libstdc++ 13 once the surrounding methods were filled in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): preserve TOOL_ARGS content in dsml_parser Flush

When the model emitted a parameter value that arrived in the same buffer
as the surrounding tool_call markers (e.g. the buffered tail after a
literal '</think>' opened the model output), the parser deferred all
buffered bytes to Flush() because looks_like_prefix() always returns
true while buf starts with '<'. Flush() then drained the buffer as
plain CONTENT/REASONING regardless of parser state, so the bytes
between the parameter open and close markers were classified as
CONTENT instead of TOOL_ARGS.

Symptom: the model emitted

  <|DSML|parameter name="location" string="true">Paris, France</|DSML|parameter>

and the assembled tool_call arguments came out as {"location":""} -
the opener and closer were emitted into the args stream but the
"Paris, France" content went to the assistant message instead.

Fix:

1. Flush() now uses the same state-aware emit logic as DrainPlain:
   PARAM_VALUE bytes become TOOL_ARGS (json-escaped when string),
   THINK bytes become REASONING, TEXT bytes become CONTENT, and
   INVOKE / TOOL_CALLS structural whitespace is discarded.

2. looks_like_prefix() restricts its leading-'<' fallback to buffers
   that have not yet seen a '>'. Without that change, char-by-char
   feeds would discard the '<' of '<|DSML|invoke name="..."' once
   the marker prefix length was reached but the closing quote/'>'
   were still in flight.
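
The state-aware emit rule from item 1 can be written as a small lookup. The enum names below are assumptions made for this sketch; the real parser states may be named or grouped differently.

```cpp
#include <optional>

// Hypothetical parser states and event kinds, named after the commit text.
enum class ParserState { Text, Think, ToolCalls, Invoke, ParamValue };
enum class EventKind { Content, Reasoning, ToolArgs };

// Decide how leftover buffered bytes are emitted when the parser is flushed.
// An empty optional means the bytes are structural whitespace and are dropped.
std::optional<EventKind> flush_kind(ParserState state) {
    switch (state) {
        case ParserState::ParamValue: return EventKind::ToolArgs;
        case ParserState::Think:      return EventKind::Reasoning;
        case ParserState::Text:       return EventKind::Content;
        case ParserState::Invoke:
        case ParserState::ToolCalls:  return std::nullopt;
    }
    return std::nullopt;
}
```

Under the old behaviour every state effectively mapped to Content or Reasoning, which is how a parameter value ended up in the assistant message.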

Verified with a standalone harness that runs the failing input three
ways (single Feed, split-after-'>', and char-by-char) and aggregates
TOOL_ARGS for tool index 0: all three now produce
{"location":"Paris, France"}.

Assisted-by: Claude:opus-4.7 [Read,Edit,Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): use ds4_session_sync + manual generation loop for KV persistence

ds4_engine_generate_argmax() is a self-contained helper that doesn't take or
update a ds4_session - it manages its own internal state. Our Predict and
PredictStream methods created g_session via ds4_session_create() but then
called ds4_engine_generate_argmax(), so g_session's KV state never advanced.
ds4_session_payload_bytes(g_session) returned 0, and the disk KV cache save was
correctly rejected with 'session has no valid checkpoint to save'.

Switch both RPCs to the proper session API:
  ds4_session_sync(g_session, &prompt, ...)
  loop:
    int token = ds4_session_argmax(g_session)
    if token == eos: break
    emit(token)
    ds4_session_eval(g_session, token, ...)
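
The shape of that loop, with the engine replaced by a scripted stand-in so it runs anywhere. `FakeSession` and `generate` are hypothetical; none of the ds4 signatures are reproduced here.

```cpp
#include <cstddef>
#include <vector>

// Scripted stand-in for a session: it "predicts" tokens from a fixed list
// and reports its end-of-sequence token once the list is exhausted.
struct FakeSession {
    std::vector<int> script;
    int eos = 2;
    std::size_t pos = 0;  // how many tokens have been evaluated so far

    int argmax() const { return pos < script.size() ? script[pos] : eos; }
    void eval(int /*token*/) { ++pos; }
};

// Manual generation loop: pick a token, stop on eos, emit it, then feed it
// back so the session state advances with every generated token.
std::vector<int> generate(FakeSession& session, std::size_t max_tokens) {
    std::vector<int> out;
    while (out.size() < max_tokens) {
        int token = session.argmax();
        if (token == session.eos) break;
        out.push_back(token);
        session.eval(token);
    }
    return out;
}
```

Because the same session object is advanced on every step, its state reflects everything that was generated, which is the property the save path depends on.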

After the loop the session has a real checkpoint and ds4_session_save_payload
writes the KV state to disk. Verified end-to-end on a DGX Spark GB10: three
.kv files (15-30 MB each) are written when BACKEND_TEST_OPTIONS sets
kv_cache_dir, and the e2e tool-call assertion still passes.

Also added stderr diagnostics to KvCache (enabled/disabled at SetDir; per-save
path + payload_bytes + result) so future failures are visible instead of
silent. The 'wrote ok' lines are low-volume - one per Predict/PredictStream
when the cache is enabled - and skipped entirely when the option is unset.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): use ds4_session_eval_speculative_argmax when MTP loaded

Wires MTP (Multi-Token Prediction) speculative decoding into the manual
generation loop in both Predict and PredictStream. When the upstream MTP
weights are loaded via 'mtp_path:' option AND we're on CUDA / Metal,
ds4_engine_mtp_draft_tokens() returns >0 and we switch the inner loop to
ds4_session_eval_speculative_argmax(), which can accept N>1 tokens per
verifier step. When MTP is not loaded (no option, CPU backend, or weights
absent), we fall through to the simple ds4_session_argmax + ds4_session_eval
path with no behavior change.

Validated on a DGX Spark GB10 with the optional MTP GGUF
(DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, ~3.6 GB). LoadModel logs
'ds4: MTP support model loaded ... (draft=2)' on stderr.

Caveat per upstream README: 'currently provides at most a slight speedup,
not a meaningful generation-speed win'. Wired now mainly to track the
upstream API; bigger speedups arrive when ds4 improves the speculative path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): honor PredictOptions sampling with DSML-aware override

Mirrors ds4_server.c:7102-7115 sampling-policy semantics on the LocalAI
gRPC side. The generation loop now consults compute_sample_params() per
token to pick the effective (temperature, top_k, top_p, min_p), based on:

  1. Request defaults: PredictOptions.temperature / .topk / .topp / .minp
  2. Thinking-mode override: when enable_thinking != false, force T=1.0,
     top_k=0, top_p=1.0, min_p=0.0 (creativity for the reasoning pass and
     the trailing content)
  3. DSML structural override: when DsmlParser::IsInDsmlStructural()
     returns true (we are between tool-call markers but NOT in a param
     value payload), force T=0.0 so protocol bytes parse cleanly
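
The three-level precedence can be sketched as a pure function. `SampleParams` and `effective_params` are illustrative names, and the sketch assumes the structural override only touches the temperature, as the list above describes.

```cpp
// Illustrative parameter bundle; field names are assumptions.
struct SampleParams {
    float temperature;
    int top_k;
    float top_p;
    float min_p;
};

// Apply the overrides in order of increasing priority:
// request defaults, then thinking mode, then the DSML structural guard.
SampleParams effective_params(const SampleParams& request,
                              bool thinking_enabled,
                              bool in_dsml_structural) {
    SampleParams p = request;
    if (thinking_enabled) {
        p = SampleParams{1.0f, 0, 1.0f, 0.0f};
    }
    if (in_dsml_structural) {
        p.temperature = 0.0f;  // greedy while emitting protocol bytes
    }
    return p;
}
```

A caller would evaluate this once per token and take the greedy path whenever the resulting temperature is 0.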

When the effective temperature is 0, we keep using ds4_session_argmax +
MTP speculative path (matches ds4-server's gate that only enables MTP for
greedy positions). When > 0, we call ds4_session_sample(s, T, ...) with
a per-thread RNG seeded from system_clock and fall back to single-token
ds4_session_eval.

New public method on DsmlParser: IsInDsmlStructural() encodes which states
need protocol-byte determinism. PARAM_VALUE is excluded (payload uses user
sampling); TEXT and THINK are excluded (no tool-call context to protect).

Verified on the DGX Spark GB10: the e2e suite still passes with all 5
specs including tools, and the Predict output now varies between runs
(creative sampling active) while the tool-call args remain a clean
'{"location":"Paris, France"}' because the parser-state check forces
greedy on the structural bytes.

UX note: thinking mode is ON by default (matching ds4-server). Users who
want deterministic output should set Metadata.enable_thinking = false.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add sha256 to deepseek-v4-flash-q2 entry

Per HF LFS metadata for antirez/deepseek-v4-gguf:
  size: 86720111200 bytes (~80.76 GiB)
  sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c

LocalAI's downloader verifies sha256 when present, so users who install
deepseek-v4-flash-q2 from the gallery get integrity-checked weights and
the partial-download issue (an ~81 GiB file is easy to truncate) becomes
recoverable instead of silently producing a broken backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:15:47 +02:00
dependabot[bot]
5d0f732b16 chore(deps): bump the go_modules group across 1 directory with 2 updates (#9759)
Bumps the go_modules group with 2 updates in the / directory: [github.com/gofiber/utils](https://github.com/gofiber/utils) and [github.com/go-git/go-git/v5](https://github.com/go-git/go-git).


Updates `github.com/gofiber/utils` from 1.1.0 to 1.2.0
- [Release notes](https://github.com/gofiber/utils/releases)
- [Commits](https://github.com/gofiber/utils/compare/v1.1.0...v1.2.0)

Updates `github.com/go-git/go-git/v5` from 5.18.0 to 5.19.0
- [Release notes](https://github.com/go-git/go-git/releases)
- [Changelog](https://github.com/go-git/go-git/blob/main/HISTORY.md)
- [Commits](https://github.com/go-git/go-git/compare/v5.18.0...v5.19.0)

---
updated-dependencies:
- dependency-name: github.com/gofiber/utils
  dependency-version: 1.2.0
  dependency-type: indirect
  dependency-group: go_modules
- dependency-name: github.com/go-git/go-git/v5
  dependency-version: 5.19.0
  dependency-type: indirect
  dependency-group: go_modules
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-11 18:37:00 +02:00
Ettore Di Giacinto
ea00199554 ci: tag every backend digest, including singletons
backend_build.yml pushes by canonical digest only (push-by-digest=true,
no tags applied at build time). User-facing tagging happens in
backend_merge.yml's `imagetools create` step. Before this commit,
scripts/changed-backends.js emitted a merge entry only for tag-suffixes
with 2+ legs, so every single-arch backend (CUDA/ROCm/Intel Python
images, vLLM, sglang, transformers, diffusers, ...) pushed its digest
untagged and stayed that way until quay's GC reaped it. Symptom: tag
releases shipped multi-arch backends tagged correctly, but no
v<X>-gpu-nvidia-cuda-12-vllm (or any singleton variant) ever appeared
in the registry.

Changes:

- scripts/changed-backends.js drops the `group.length < 2` skip and
  emits two merge matrices, one per arch class, so each downstream
  merge job can `needs:` only its corresponding build matrix.
- backend.yml splits backend-merge-jobs into multiarch and singlearch
  variants. The split preserves PR #9746's fix: slow singlearch CUDA
  builds (~6h) must not gate multiarch merges, or quay's GC reaps the
  multiarch per-arch digests before they're tagged.
- backend_pr.yml mirrors the split.
- backend_build.yml renames the digest artifact from
  `digests<suffix>-<platform-tag>` to
  `digests<suffix>--<platform-tag-or-"single">`. The `--` separator
  prevents the merge-side glob from over-matching sibling backends
  whose tag-suffix is a prefix of ours (e.g. -cpu-vllm vs
  -cpu-vllm-omni, -cpu-mlx vs -cpu-mlx-audio); the `single` placeholder
  keeps the name well-formed when platform-tag is empty.
- backend_merge.yml updates the download pattern to match.
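
Why the `--` separator matters can be shown with a trailing-wildcard match. This sketch assumes the merge side selects artifacts with a pattern of the form "digests<suffix>-*" (old) or "digests<suffix>--*" (new); the exact pattern in backend_merge.yml is not shown here, and `glob_prefix_match` is a hypothetical helper.

```cpp
#include <string>

// Minimal model of a "prefix*" glob: true when `name` starts with `prefix`.
bool glob_prefix_match(const std::string& name, const std::string& prefix) {
    return name.compare(0, prefix.size(), prefix) == 0;
}
```

With a single dash, the prefix "digests-cpu-vllm-" also matches "digests-cpu-vllm-omni-amd64". With the double dash, "digests-cpu-vllm--" matches "digests-cpu-vllm--amd64" but not "digests-cpu-vllm-omni--amd64".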

Verified locally: a tag-push event now expands to 36 multiarch merge
entries (= 72 builds / 2 legs) and 199 singlearch merge entries (one
per singleton, including -gpu-nvidia-cuda-12-vllm at index 24).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 13:22:00 +00:00
63 changed files with 3812 additions and 284 deletions


@@ -34,7 +34,55 @@ The build matrix is data-only YAML at `.github/backend-matrix.yml` (not inside `
**Without an entry here no image is ever built or pushed, and the gallery entry in `backend/index.yaml` will point at a tag that does not exist.** The `dockerfile:` field must point at `./backend/Dockerfile.<lang>` matching the language bucket from step 1 (e.g. `Dockerfile.python`, `Dockerfile.golang`, `Dockerfile.rust`). The `tag-suffix` must match the `uri:` in the corresponding `backend/index.yaml` image entry exactly.
If you add a new language bucket, `scripts/changed-backends.js` also needs a branch in `inferBackendPath` so PR change-detection routes file edits correctly.
**`scripts/changed-backends.js` registration — REQUIRED for any new dockerfile suffix.** This is the single most common omission, because it has no effect on the PR that adds the backend (when no prior path filter could catch it anyway) — it only breaks the *next* PR that touches your backend's directory, which then gets zero CI jobs and looks broken for unrelated reasons. Edit `scripts/changed-backends.js:inferBackendPath` and add a branch BEFORE the more-generic suffixes:
```js
if (item.dockerfile.endsWith("<your-dockerfile-suffix>")) {
  return `backend/cpp/<your-backend>/`; // or backend/python|go|rust/...
}
```
The `endsWith()` test runs against the matrix entry's `dockerfile:` value (e.g. `./backend/Dockerfile.ds4` → `endsWith("ds4")`). Specificity order matters here just like it does for importers: more-specific suffixes go BEFORE more-generic ones (e.g. `ds4` before `llama-cpp`, even though the two do not collide today, because some upstream might one day call itself `super-ds4-llama-cpp`). Verify locally before pushing:
```bash
# Confirm your dockerfile suffix is unique enough
node -e "
const yaml = require('js-yaml'); const fs = require('fs');
const m = yaml.load(fs.readFileSync('.github/backend-matrix.yml','utf8'));
for (const e of m.include.filter(e => e.backend === '<your-backend>')) {
  console.log(e.dockerfile, '->', e.dockerfile.endsWith('<suffix>'));
}"
```
A quick way to find the right insertion point: `grep -n 'item.dockerfile.endsWith' scripts/changed-backends.js`.
**`bump_deps.yaml` registration — REQUIRED for any backend pinning an upstream commit.** If your backend's Makefile has a `*_VERSION?=<sha>` pin to a third-party repo, the daily auto-bump bot at `.github/workflows/bump_deps.yaml` won't notice it unless you register the backend in its matrix. The bot runs `.github/bump_deps.sh` which `grep`s for `^$VAR?=` in the Makefile you list — so the pin MUST live in the Makefile (not in a separate shell script). The bump for ds4 (#9761) had to walk this back because the original landed the pin in `prepare.sh`, which the bot can't see. Pattern (for `antirez/ds4`):
```yaml
# .github/workflows/bump_deps.yaml
matrix:
  include:
    - repository: "antirez/ds4"
      variable: "DS4_VERSION"
      branch: "main"
      file: "backend/cpp/ds4/Makefile"
```
And the corresponding Makefile shape (mirror `backend/cpp/llama-cpp/Makefile`):
```makefile
DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
DS4_REPO?=https://github.com/antirez/ds4
...
ds4:
	mkdir -p ds4
	cd ds4 && git init -q && \
	git remote add origin $(DS4_REPO) && \
	git fetch --depth 1 origin $(DS4_VERSION) && \
	git checkout FETCH_HEAD
```
If you have a `prepare.sh` doing the clone, delete it — the recipe belongs in the Makefile target so `make purge && make` works as a clean-and-rebuild and so the bump bot finds the pin.
**Placement in file:**
- CPU builds: Add after other CPU builds (e.g., after `cpu-chatterbox`)

84
.agents/ds4-backend.md Normal file

@@ -0,0 +1,84 @@
# Working on the ds4 Backend
`antirez/ds4` is a single-model inference engine for DeepSeek V4 Flash.
LocalAI wraps the engine's C API (`ds4/ds4.h`) with a fresh C++ gRPC server at
`backend/cpp/ds4/` - NOT a fork of llama-cpp's grpc-server.cpp.
## Pin
`backend/cpp/ds4/Makefile` pins `DS4_VERSION?=<sha>` at the top. The `ds4`
target in the Makefile clones `antirez/ds4` at that commit (mirroring the
llama-cpp / ik-llama-cpp / turboquant pattern). The bump-deps bot
(`.github/workflows/bump_deps.yaml`) finds this pin via grep and opens a
daily PR to update it. To bump manually: edit the `DS4_VERSION?=` line,
then `make purge && make` (or rely on CI's clean build).
## Wire shape
| RPC | Implementation |
|---|---|
| Health, Free, Status | Trivial; no engine dependency for Health |
| LoadModel | `ds4_engine_open` + `ds4_session_create`; backend is compile-time (DS4_NO_GPU → CPU, __APPLE__ → Metal, otherwise CUDA) |
| TokenizeString | `ds4_tokenize_text` |
| Predict | `ds4_engine_generate_argmax` + `DsmlParser` → one ChatDelta with content / reasoning_content / tool_calls[] |
| PredictStream | Same, per-token ChatDelta writes |
## DSML
ds4 emits tool calls as literal text markers (`<｜DSML｜tool_calls>` etc.) -
NOT special tokens. `dsml_parser.{h,cpp}` is our streaming state machine that
classifies token bytes into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END
events. `dsml_renderer.{h,cpp}` does the prompt direction: turns
OpenAI tool_calls + role=tool messages back into DSML for the next turn.
## Thinking modes
`PredictOptions.Metadata["enable_thinking"]` gates thinking on/off (default ON).
`["reasoning_effort"] == "max" | "xhigh"` selects `DS4_THINK_MAX`; anything else
maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_prefix`.
## Disk KV cache
`kv_cache.{h,cpp}` implements an SHA1-keyed file cache using ds4's public
`ds4_session_save_payload` / `ds4_session_load_payload` API. Enable at model load
via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).
## Build matrix
| Build | Where | Notes |
|---|---|---|
| `cpu-ds4` (amd64 + arm64) | Linux GHA | ds4 considers CPU debug-only; useful only for wiring tests |
| `cuda13-ds4` (amd64 + arm64) | Linux GHA + DGX Spark validation | Primary production path on Linux |
| `ds4-darwin` (arm64) | macOS GHA runners | Metal; uses `scripts/build/ds4-darwin.sh` like llama-cpp-darwin |
cuda12 is intentionally omitted. ROCm / Vulkan / SYCL are not applicable.
## Hardware-gated validation
`tests/e2e-backends/backend_test.go` in `BACKEND_BINARY` mode:
```
BACKEND_BINARY=$(pwd)/backend/cpp/ds4/package/run.sh \
BACKEND_TEST_MODEL_FILE=/path/to/ds4flash.gguf \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_TOOL_PROMPT="What's the weather in Paris?" \
go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
```
CI does not load the model; the suite is opt-in via env vars.
## Importer
`core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
matching the `antirez/deepseek-v4-gguf` repo URI or the
`DeepSeek-V4-Flash-*.gguf` filename pattern. **Registered BEFORE
`LlamaCPPImporter`** in `defaultImporters` - both match `.gguf` but ds4 is more
specific, and first-match-wins. The importer emits `backend: ds4`, uses
`ds4flash.gguf` as the local filename (matches ds4's own CLI default), and
disables the Go-side automatic tool-parsing fallback (the C++ backend emits
ChatDelta.tool_calls natively via `DsmlParser`).
ds4 is also listed in `core/http/endpoints/localai/backend.go`'s pref-only
slice so the `/import-model` UI surfaces it as a manual choice for users who
want to force the backend on a non-canonical URI.


@@ -389,7 +389,12 @@ include:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-llama-cpp'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
runs-on: 'ubuntu-latest'
# bigger-runner: cold builds for this entry consistently take 5h+ on
# ubuntu-latest (observed 5h36m on v4.2.1). Move back to bigger-runner
# so the build finishes well within GHA's 6h job timeout. Phase 5.3 of
# the free-tier migration (PR #9730) flipped this to ubuntu-latest as
# a 'highest-risk batch' with explicit per-entry revert.
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "llama-cpp"
@@ -403,7 +408,9 @@ include:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-turboquant'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
runs-on: 'ubuntu-latest'
# bigger-runner: same rationale as -gpu-nvidia-cuda-12-llama-cpp above
# (observed 6h5m wall-clock on v4.2.1, just past the 6h job timeout).
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "turboquant"
@@ -899,7 +906,9 @@ include:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
runs-on: 'ubuntu-latest'
# bigger-runner: cold builds for this entry take 5h+ on ubuntu-latest
# (observed 5h37m on v4.2.1). Same rationale as the cuda-12 variant.
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "llama-cpp"
@@ -913,7 +922,8 @@ include:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-turboquant'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
runs-on: 'ubuntu-latest'
# bigger-runner: observed 6h5m wall-clock on v4.2.1 — at the GHA timeout.
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "turboquant"
@@ -948,6 +958,32 @@ include:
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-ds4'
runs-on: 'ubuntu-latest'
base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
skip-drivers: 'true'
backend: "ds4"
dockerfile: "./backend/Dockerfile.ds4"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'true'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-ds4'
base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
runs-on: 'ubuntu-24.04-arm'
ubuntu-version: '2404'
backend: "ds4"
dockerfile: "./backend/Dockerfile.ds4"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -2321,6 +2357,34 @@ include:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-ds4'
runs-on: 'ubuntu-latest'
base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
skip-drivers: 'true'
backend: "ds4"
dockerfile: "./backend/Dockerfile.ds4"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-ds4'
runs-on: 'ubuntu-24.04-arm'
base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
skip-drivers: 'true'
backend: "ds4"
dockerfile: "./backend/Dockerfile.ds4"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""

46
.github/scripts/anchor-digest-in-cache.sh vendored Executable file

@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
# garbage collector won't reap the manifest before backend_merge.yml runs.
#
# Context: backend_build.yml pushes by canonical digest only
# (push-by-digest=true). Unreferenced manifests on quay can be reaped within
# ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
# matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
# anchoring tag, the earliest digests are gone by the time `imagetools create`
# tries to read them, producing "manifest not found" merge failures.
#
# We tag the digest under our internal ci-cache image; quay does not GC tagged
# manifests. The user-facing manifest list still references the original
# digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
# the user-facing manifest is published — see cleanup-keepalive-tags.sh.
#
# Required env:
# GITHUB_RUN_ID - current workflow run id (set automatically by GHA)
# TAG_SUFFIX - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
# PLATFORM_TAG - amd64 / arm64 / single (single = singleton matrix entry)
# DIGEST - canonical content digest from build step (sha256:...)
#
# Optional env:
# ANCHOR_IMAGE - target image (default: quay.io/go-skynet/ci-cache)
# SOURCE_IMAGE - source image (default: quay.io/go-skynet/local-ai-backends)
# GITHUB_STEP_SUMMARY - if set, an anchored-by line is appended to it
set -euo pipefail
: "${GITHUB_RUN_ID:?}"
: "${TAG_SUFFIX:?}"
: "${PLATFORM_TAG:?}"
: "${DIGEST:?}"
anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"
tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"
docker buildx imagetools create \
-t "${anchor_image}:${tag}" \
"${source_image}@${DIGEST}"
echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
fi
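
The tag scheme used by the script above can be sketched in isolation. This is a minimal illustration, not part of the commit: `keepalive_tag` is a hypothetical helper name, and the `single` default for the third argument mirrors the `inputs.platform-tag || 'single'` fallback that the workflow applies before calling the script (the script itself requires `PLATFORM_TAG` to be set).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: composes the anchor tag the same way
# anchor-digest-in-cache.sh does (keepalive-<run id><tag-suffix>-<platform>).
# The tag-suffix already starts with "-", so no separator is added before it.
keepalive_tag() {
  local run_id="$1" tag_suffix="$2" platform_tag="${3:-single}"
  printf 'keepalive-%s%s-%s\n' "$run_id" "$tag_suffix" "$platform_tag"
}

keepalive_tag 123456 -cpu-ds4 arm64
keepalive_tag 123456 -cpu-ds4
```

Because the run id is part of the tag, two workflow runs that build the same tag-suffix should not overwrite each other's anchors, and the cleanup script can rebuild the same names from `GITHUB_RUN_ID` and `TAG_SUFFIX` alone.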

.github/scripts/cleanup-keepalive-tags.sh vendored Executable file

@@ -0,0 +1,49 @@
#!/usr/bin/env bash
# Best-effort cleanup of the keepalive anchor tags written by
# anchor-digest-in-cache.sh. Called from backend_merge.yml after the
# user-facing manifest list has been published.
#
# Quay's docker registry v2 doesn't allow tag deletes — only digest deletes.
# The proper delete is the quay REST API, which requires an OAuth-scoped
# token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
# token (typical for service accounts) the delete succeeds; otherwise this
# is a soft no-op and the tag persists until manually pruned.
#
# Cleanup failure MUST NOT fail the merge — the merge has already produced
# the user-facing manifest list at this point and the keepalive tags are
# pure overhead. We always exit 0.
#
# Required env:
# GITHUB_RUN_ID - current workflow run id (set automatically by GHA)
# TAG_SUFFIX - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
# QUAY_TOKEN - bearer token for quay's REST API
#
# Optional env:
# QUAY_REPO - target repo (default: go-skynet/ci-cache)
# PLATFORM_TAGS - space-separated list of platform-tag values to try
# (default: "amd64 arm64 single")
# We don't know which platform-tag(s) exist for this
# tag-suffix without an extra API call, so we just try
# all three and ignore 404s for the ones that don't.
set -uo pipefail
: "${GITHUB_RUN_ID:?}"
: "${TAG_SUFFIX:?}"
: "${QUAY_TOKEN:?}"
quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"
for plat in $platform_tags; do
tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
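# curl's -w '%{http_code}' already prints 000 when no HTTP response was
# received, so a fallback `echo "000"` would concatenate into "000000".
http=$(curl -sS -o /dev/null -w '%{http_code}' \
-X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || true)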
case "$http" in
204|200) echo "deleted $tag" ;;
404) echo "not present: $tag" ;;
401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
*) echo "unexpected http $http deleting $tag - skipping" ;;
esac
done
exit 0


@@ -35,11 +35,13 @@ jobs:
matrix-singlearch: ${{ steps.set-matrix.outputs['matrix-singlearch'] }}
matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
merge-matrix: ${{ steps.set-matrix.outputs['merge-matrix'] }}
merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
merge-matrix-singlearch: ${{ steps.set-matrix.outputs['merge-matrix-singlearch'] }}
has-backends-singlearch: ${{ steps.set-matrix.outputs['has-backends-singlearch'] }}
has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
has-merges: ${{ steps.set-matrix.outputs['has-merges'] }}
has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
@@ -138,15 +140,27 @@ jobs:
max-parallel: 8
matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch']) }}
# Merge per-arch digests into manifest lists. Depends ONLY on
# backend-jobs-multiarch — single-arch builds are independent and slow.
# Without this split, a 6h CUDA-12 single-arch job would gate the merge,
# leaving multi-arch digests untagged on quay long enough for quay's
# garbage collector to reap them and the merge step to fail with
# "manifest not found".
backend-merge-jobs:
# Apply tags to per-arch digests via `imagetools create`. Split into two
# jobs that mirror the build split so each merge waits ONLY on its
# corresponding build matrix:
#
# - backend-merge-jobs-multiarch needs backend-jobs-multiarch (~2-3h)
# - backend-merge-jobs-singlearch needs backend-jobs-singlearch (up to ~6h)
#
# If a single shared merge job depended on both, slow CUDA singlearch
# builds would block multiarch merges long enough for quay's GC to reap
# the multiarch per-arch digests (the bug fixed by PR #9746). Singletons
# also need a merge step because backend_build.yml pushes by canonical
# digest only — no tags are applied at build time.
backend-merge-jobs-multiarch:
needs: [generate-matrix, backend-jobs-multiarch]
if: needs.generate-matrix.outputs['has-merges'] == 'true'
# !cancelled() lets the merge run even when a few build legs failed.
# Without it, GHA's default `needs:` cascade skips the entire merge
# matrix on a single failed/cancelled cell. We still want to publish
# the manifest lists for tag-suffixes whose legs all succeeded.
# Observed in v4.2.1: 2 singlearch build failures cascade-skipped all
# ~199 singlearch merge entries.
if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
uses: ./.github/workflows/backend_merge.yml
with:
tag-latest: ${{ matrix.tag-latest }}
@@ -158,7 +172,24 @@ jobs:
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
strategy:
fail-fast: false
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix']) }}
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
backend-merge-jobs-singlearch:
needs: [generate-matrix, backend-jobs-singlearch]
# See note on backend-merge-jobs-multiarch above for !cancelled().
if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
uses: ./.github/workflows/backend_merge.yml
with:
tag-latest: ${{ matrix.tag-latest }}
tag-suffix: ${{ matrix.tag-suffix }}
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
strategy:
fail-fast: false
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch']) }}
backend-jobs-darwin:
needs: generate-matrix


@@ -228,11 +228,28 @@ jobs:
digest="${{ steps.build.outputs.digest }}"
touch "/tmp/digests/${digest#sha256:}"
# See .github/scripts/anchor-digest-in-cache.sh for why this is needed
# and how it interacts with backend_merge.yml's cleanup step.
- name: Anchor digest in ci-cache so quay GC won't reap before merge
if: github.event_name != 'pull_request'
env:
TAG_SUFFIX: ${{ inputs.tag-suffix }}
PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
DIGEST: ${{ steps.build.outputs.digest }}
run: .github/scripts/anchor-digest-in-cache.sh
# Artifact name uses a `--` separator between tag-suffix and platform-tag
# to avoid prefix collisions during the merge job's pattern-based download.
# Tag-suffixes are not prefix-disjoint (e.g. -gpu-nvidia-cuda-12-vllm is a
# prefix of -gpu-nvidia-cuda-12-vllm-omni); a single `-` separator plus the
# merge-side `digests<tag-suffix>-*` glob would let one merge over-match
# the other backend's artifacts. The `single` placeholder for an empty
# platform-tag (single-arch entries) keeps the artifact name from ending in a
# bare `--` separator.
- name: Upload digest artifact
if: github.event_name != 'pull_request'
uses: actions/upload-artifact@v4
uses: actions/upload-artifact@v7
with:
name: digests${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
name: digests${{ inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
path: /tmp/digests/*
if-no-files-found: error
retention-days: 1
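
The prefix collision described in the comment above can be reproduced with plain shell globbing. This is a sketch under assumptions: bash pattern matching stands in for the glob matching done by `actions/download-artifact`, and `matches` is a hypothetical helper, not something the workflows define.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds when glob PATTERN ($2) matches NAME ($1).
# $2 is deliberately left unquoted so bash treats it as a pattern.
matches() { [[ "$1" == $2 ]]; }

old_pattern='digests-gpu-nvidia-cuda-12-vllm-*'    # single "-" separator
new_pattern='digests-gpu-nvidia-cuda-12-vllm--*'   # "--" separator

# Old naming: the vllm merge glob also catches the vllm-omni artifact.
if matches 'digests-gpu-nvidia-cuda-12-vllm-omni-amd64' "$old_pattern"; then
  echo "old pattern: over-matches vllm-omni"
fi

# New naming: the sibling backend's artifact no longer matches ...
if ! matches 'digests-gpu-nvidia-cuda-12-vllm-omni--amd64' "$new_pattern"; then
  echo "new pattern: vllm-omni excluded"
fi

# ... while the backend's own artifacts (including the `single` placeholder) do.
if matches 'digests-gpu-nvidia-cuda-12-vllm--single' "$new_pattern"; then
  echo "new pattern: own artifact matched"
fi
```

The fix depends on the upload name and the download pattern changing together, which is why both workflow files carry a comment pointing at the other.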


@@ -116,6 +116,13 @@ jobs:
# already), we don't have to chase missing dylibs one at a time.
# The downloads cache makes the reinstall fast (~5s on a hit).
brew reinstall ccache
# Same pattern for grpc: its CMake config (used by the llama-cpp
# `grpc-server` target) does find_package(absl). The cache restores
# /opt/homebrew/Cellar/grpc so brew above no-ops the install, but
# abseil isn't in our Cellar cache list and never gets installed
# alongside, leaving grpc's CMake unable to resolve it. Reinstalling
# grpc re-validates and pulls abseil in, mirroring the ccache fix.
brew reinstall grpc
# The brew cache restores the Cellar dirs but NOT the bin symlinks
# at /opt/homebrew/bin/*. brew install above sees the Cellar present
# and decides "already installed" without re-linking, so on a cache-
@@ -211,8 +218,13 @@ jobs:
make protogen-go
make backends/llama-cpp-darwin
- name: Build ds4 backend (Darwin Metal)
if: inputs.backend == 'ds4'
run: |
make backends/ds4-darwin
- name: Build ${{ inputs.backend }}-darwin
if: inputs.backend != 'llama-cpp'
if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
run: |
make protogen-go
BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend


@@ -34,10 +34,23 @@ jobs:
env:
quay_username: ${{ secrets.quayUsername }}
steps:
- name: Download digests
uses: actions/download-artifact@v4
# Sparse checkout: the merge job needs `.github/scripts/` (for the
# keepalive cleanup script) but none of the source tree.
- name: Checkout (.github/scripts only)
uses: actions/checkout@v6
with:
pattern: digests${{ inputs.tag-suffix }}-*
sparse-checkout: |
.github/scripts
sparse-checkout-cone-mode: false
# `--` separator anchors the glob so we don't over-match sibling
# backends whose tag-suffix happens to be a prefix of ours
# (e.g. -cpu-vllm vs -cpu-vllm-omni). Must stay in sync with the
# upload-artifact name in backend_build.yml.
- name: Download digests
uses: actions/download-artifact@v8
with:
pattern: digests${{ inputs.tag-suffix }}--*
merge-multiple: true
path: /tmp/digests
@@ -122,6 +135,15 @@ jobs:
docker buildx imagetools inspect "$first_tag"
fi
# See .github/scripts/cleanup-keepalive-tags.sh for why this is
# best-effort and what the failure modes are.
- name: Cleanup keepalive tags in ci-cache
if: github.event_name != 'pull_request' && success()
env:
TAG_SUFFIX: ${{ inputs.tag-suffix }}
QUAY_TOKEN: ${{ secrets.quayPassword }}
run: .github/scripts/cleanup-keepalive-tags.sh
- name: Job summary
if: github.event_name != 'pull_request'
run: |


@@ -14,11 +14,13 @@ jobs:
matrix-singlearch: ${{ steps.set-matrix.outputs['matrix-singlearch'] }}
matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
merge-matrix: ${{ steps.set-matrix.outputs['merge-matrix'] }}
merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
merge-matrix-singlearch: ${{ steps.set-matrix.outputs['merge-matrix-singlearch'] }}
has-backends-singlearch: ${{ steps.set-matrix.outputs['has-backends-singlearch'] }}
has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
has-merges: ${{ steps.set-matrix.outputs['has-merges'] }}
has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
@@ -97,12 +99,14 @@ jobs:
fail-fast: true
max-parallel: 8
matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch']) }}
backend-merge-jobs:
backend-merge-jobs-multiarch:
needs: [generate-matrix, backend-jobs-multiarch]
# backend_merge.yml's push-side steps are all gated on
# github.event_name != 'pull_request', so on a PR the merge job would
# do nothing. Skip it entirely to avoid spinning up an empty runner.
if: github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges'] == 'true'
# !cancelled() lets the merge run even when a few build legs fail —
# see the matching note in backend.yml.
if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
uses: ./.github/workflows/backend_merge.yml
with:
tag-latest: ${{ matrix.tag-latest }}
@@ -112,7 +116,21 @@ jobs:
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
strategy:
fail-fast: false
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix']) }}
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
backend-merge-jobs-singlearch:
needs: [generate-matrix, backend-jobs-singlearch]
if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
uses: ./.github/workflows/backend_merge.yml
with:
tag-latest: ${{ matrix.tag-latest }}
tag-suffix: ${{ matrix.tag-suffix }}
secrets:
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
strategy:
fail-fast: false
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch']) }}
backend-jobs-darwin:
needs: generate-matrix
uses: ./.github/workflows/backend_build_darwin.yml


@@ -22,6 +22,10 @@ jobs:
variable: "TURBOQUANT_VERSION"
branch: "feature/turboquant-kv-cache"
file: "backend/cpp/turboquant/Makefile"
- repository: "antirez/ds4"
variable: "DS4_VERSION"
branch: "main"
file: "backend/cpp/ds4/Makefile"
- repository: "ggml-org/whisper.cpp"
variable: "WHISPER_CPP_VERSION"
branch: "master"


@@ -187,7 +187,7 @@ jobs:
- name: Upload digest artifact
if: github.event_name != 'pull_request'
uses: actions/upload-artifact@v4
uses: actions/upload-artifact@v7
with:
name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-${{ inputs.platform-tag }}
path: /tmp/digests/*


@@ -34,7 +34,7 @@ jobs:
quay_username: ${{ secrets.quayUsername }}
steps:
- name: Download digests
uses: actions/download-artifact@v4
uses: actions/download-artifact@v8
with:
pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-*
merge-multiple: true


@@ -25,6 +25,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
| [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |


@@ -305,7 +305,7 @@ EOT
###################################
# Build React UI
FROM node:25-slim AS react-ui-builder
FROM node:26-slim AS react-ui-builder
WORKDIR /app
COPY core/http/react-ui/package*.json ./
RUN npm install


@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin
GOCMD=go
GOTEST=$(GOCMD) test
@@ -1009,6 +1009,10 @@ backends/llama-cpp-darwin: build
bash ./scripts/build/llama-cpp-darwin.sh
./local-ai backends install "ocifile://$(abspath ./backend-images/llama-cpp.tar)"
backends/ds4-darwin: build
bash ./scripts/build/ds4-darwin.sh
./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"
build-darwin-python-backend: build
bash ./scripts/build/python-darwin.sh
@@ -1050,6 +1054,10 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
# turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
# Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
# ds4 is antirez/ds4, a DeepSeek V4 Flash-specific inference engine.
# Single-model; hardware-only validation lives at tests/e2e-backends/
# (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
BACKEND_DS4 = ds4|ds4|.|false|false
# Golang backends
BACKEND_PIPER = piper|golang|.|false|true
@@ -1135,6 +1143,7 @@ endef
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
$(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
$(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
$(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -1188,7 +1197,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
########################################################
### Mock Backend for E2E Tests

backend/Dockerfile.ds4 Normal file

@@ -0,0 +1,41 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
# BASE_IMAGE is either ubuntu:24.04 (for cpu builds) or nvidia/cuda:13.0.0-devel-ubuntu24.04
# (for cublas builds). Both ship apt + Ubuntu Noble packages; the nvidia/cuda base
# additionally provides /usr/local/cuda. Darwin (Metal) builds bypass this Dockerfile
# entirely via scripts/build/ds4-darwin.sh.
FROM ${BASE_IMAGE} AS builder
ARG BUILD_TYPE
ARG TARGETARCH
ARG TARGETVARIANT
ENV BUILD_TYPE=${BUILD_TYPE} \
DEBIAN_FRONTEND=noninteractive \
PATH=/usr/local/cuda/bin:${PATH}
WORKDIR /build
# Install build-time deps via plain apt - install-base-deps.sh's full pipeline
# (CUDA keyring + from-source gRPC) is unnecessary here:
# - CUDA: when BASE_IMAGE=nvidia/cuda:*, /usr/local/cuda is already populated;
# for the cpu build we don't need CUDA at all.
# - gRPC/Protobuf: system apt packages are sufficient; ds4's wrapper only links
# against them, it doesn't ship the gRPC source tree.
# - nlohmann-json: dsml_renderer's only third-party dep.
RUN apt-get update && \
apt-get install -y --no-install-recommends \
git cmake build-essential pkg-config ca-certificates \
libgrpc++-dev libprotobuf-dev protobuf-compiler protobuf-compiler-grpc \
nlohmann-json3-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
COPY . /LocalAI
RUN --mount=type=cache,target=/root/.ccache,id=ds4-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
make -C /LocalAI/backend/cpp/ds4 BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
FROM scratch
COPY --from=builder /LocalAI/backend/cpp/ds4/package/. ./


@@ -117,6 +117,12 @@ ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile
# (which the turboquant Makefile reuses via a sibling build dir) errors out when the var
# is empty on a hipblas build, and the prebuilt path is what CI exercises most of the
# time. The builder-fromsource stage above already does this; mirror it here.
ARG AMDGPU_TARGETS
ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
ARG TARGETARCH
ARG TARGETVARIANT

backend/cpp/ds4/.gitignore vendored Normal file

@@ -0,0 +1,9 @@
ds4/
build/
package/
grpc-server
*.o
backend.pb.cc
backend.pb.h
backend.grpc.pb.cc
backend.grpc.pb.h


@@ -0,0 +1,101 @@
cmake_minimum_required(VERSION 3.15)
project(ds4-grpc-server LANGUAGES CXX C)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(TARGET grpc-server)
option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")
find_package(Threads REQUIRED)
find_package(Protobuf CONFIG QUIET)
if(NOT Protobuf_FOUND)
find_package(Protobuf REQUIRED)
endif()
find_package(gRPC CONFIG QUIET)
if(NOT gRPC_FOUND)
# Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
find_library(GRPCPP_LIB grpc++ REQUIRED)
find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
add_library(gRPC::grpc++ INTERFACE IMPORTED)
set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
endif()
find_program(_PROTOC NAMES protoc REQUIRED)
find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
set(HW_GRPC_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
set(HW_GRPC_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
add_custom_command(
OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
COMMAND ${_PROTOC}
ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
--cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
-I "${HW_PROTO_PATH}"
--plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
"${HW_PROTO}"
DEPENDS "${HW_PROTO}")
add_library(hw_grpc_proto STATIC
${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
set(DS4_OBJS "${DS4_DIR}/ds4.o")
if(DS4_GPU STREQUAL "cuda")
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_cuda.o")
elseif(DS4_GPU STREQUAL "metal")
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_metal.o")
elseif(DS4_GPU STREQUAL "cpu")
set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
endif()
add_executable(${TARGET}
grpc-server.cpp
dsml_parser.cpp
dsml_renderer.cpp
kv_cache.cpp)
target_include_directories(${TARGET} PRIVATE ${DS4_DIR})
foreach(obj ${DS4_OBJS})
target_sources(${TARGET} PRIVATE ${obj})
set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
endforeach()
target_link_libraries(${TARGET} PRIVATE
hw_grpc_proto
gRPC::grpc++
gRPC::grpc++_reflection
protobuf::libprotobuf
Threads::Threads
m)
if(DS4_GPU STREQUAL "cuda")
find_package(CUDAToolkit REQUIRED)
target_link_libraries(${TARGET} PRIVATE CUDA::cudart CUDA::cublas)
elseif(DS4_GPU STREQUAL "metal")
find_library(FOUNDATION_LIB Foundation REQUIRED)
find_library(METAL_LIB Metal REQUIRED)
target_link_libraries(${TARGET} PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
elseif(DS4_GPU STREQUAL "cpu")
target_compile_definitions(${TARGET} PRIVATE DS4_NO_GPU)
endif()
if(DS4_NATIVE)
if(APPLE)
target_compile_options(${TARGET} PRIVATE -mcpu=native)
else()
target_compile_options(${TARGET} PRIVATE -march=native)
endif()
endif()

backend/cpp/ds4/Makefile Normal file

@@ -0,0 +1,78 @@
# ds4 backend Makefile.
#
# Upstream pin lives below as DS4_VERSION?= so the bump-deps bot
# (.github/bump_deps.sh) can find and update it - matches the
# llama-cpp / ik-llama-cpp / turboquant convention.
DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
DS4_REPO?=https://github.com/antirez/ds4
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
BUILD_DIR := build
BUILD_TYPE ?=
NATIVE ?= false
JOBS ?= $(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
UNAME_S := $(shell uname -s)
CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS += -DDS4_GPU=cuda
DS4_OBJ_TARGET := ds4.o ds4_cuda.o
else ifeq ($(UNAME_S),Darwin)
CMAKE_ARGS += -DDS4_GPU=metal
DS4_OBJ_TARGET := ds4.o ds4_metal.o
else
# CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
CMAKE_ARGS += -DDS4_GPU=cpu
DS4_OBJ_TARGET := ds4_cpu.o
endif
ifneq ($(NATIVE),true)
CMAKE_ARGS += -DDS4_NATIVE=OFF
endif
.PHONY: grpc-server package clean purge test all
all: grpc-server
# Clone the upstream ds4 source at the pinned commit. Directory acts as the
# target so make only re-clones when missing. After a DS4_VERSION bump,
# run 'make purge && make' to refetch (or rely on CI's clean build).
ds4:
mkdir -p ds4
cd ds4 && \
git init -q && \
git remote add origin $(DS4_REPO) && \
git fetch --depth 1 origin $(DS4_VERSION) && \
git checkout FETCH_HEAD
# Build ds4's engine object files via its own Makefile, which already encodes
# the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
ds4/ds4.o: ds4
ifeq ($(BUILD_TYPE),cublas)
+$(MAKE) -C ds4 ds4.o ds4_cuda.o
else ifeq ($(UNAME_S),Darwin)
+$(MAKE) -C ds4 ds4.o ds4_metal.o
else
+$(MAKE) -C ds4 ds4_cpu.o
endif
grpc-server: ds4/ds4.o
mkdir -p $(BUILD_DIR)
cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
cp $(BUILD_DIR)/grpc-server grpc-server
package: grpc-server
bash package.sh
test:
@echo "ds4 backend: e2e coverage at tests/e2e-backends/ (BACKEND_BINARY mode)"
clean:
rm -rf $(BUILD_DIR) grpc-server package
if [ -d ds4 ]; then $(MAKE) -C ds4 clean; fi
purge: clean
rm -rf ds4


@@ -0,0 +1,359 @@
#include "dsml_parser.h"
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <chrono>
#include <random>
#include <string>
#include <vector>
namespace ds4cpp {
namespace {
constexpr const char *kThinkOpen = "<think>";
constexpr const char *kThinkClose = "</think>";
constexpr const char *kToolsOpen = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>"; // <｜DSML｜tool_calls>
constexpr const char *kToolsClose = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>"; // </｜DSML｜tool_calls>
constexpr const char *kInvokeOpenPfx = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\""; // <｜DSML｜invoke name="
constexpr const char *kInvokeClose = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>"; // </｜DSML｜invoke>
constexpr const char *kParamOpenPfx = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\""; // <｜DSML｜parameter name="
constexpr const char *kParamClose = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>"; // </｜DSML｜parameter>
// All structural markers the parser might encounter - used to detect "buf
// might be a partial marker, don't drain yet" conditions.
const std::vector<std::string> &all_markers() {
static const std::vector<std::string> v = {
kThinkOpen, kThinkClose,
kToolsOpen, kToolsClose,
kInvokeOpenPfx, kInvokeClose,
kParamOpenPfx, kParamClose,
};
return v;
}
// Returns true if `buf` could be a *prefix* of any marker (i.e., we should
// wait for more text before draining as plain content). The marker-prefix
// loop handles fixed markers exactly. For markers with variable-length
// internal data (kInvokeOpenPfx, kParamOpenPfx have an open quote, then the
// tool/param name, then a closing quote and `>`), we also wait while buf
// starts with `<` and has not yet seen a `>`: the leading `<` could be the
// start of one of those open markers, or a literal that we can confirm only
// once we know what follows. Anything after the first `>` arrives is either
// consumed by TryConsumeMarker or emitted as a literal `<` by the caller.
bool looks_like_prefix(const std::string &buf) {
for (const auto &m : all_markers()) {
if (m.size() > buf.size() && m.compare(0, buf.size(), buf) == 0) return true;
}
if (!buf.empty() && buf[0] == '<' && buf.find('>') == std::string::npos) {
return true;
}
return false;
}
bool consume_literal(std::string &buf, const std::string &lit) {
if (buf.compare(0, lit.size(), lit) == 0) {
buf.erase(0, lit.size());
return true;
}
return false;
}
// Find the next '<' in buf starting at offset; returns std::string::npos if none.
size_t next_tag(const std::string &buf, size_t off = 0) {
return buf.find('<', off);
}
std::string json_escape(const std::string &in) {
std::string out;
out.reserve(in.size() + 2);
for (char c : in) {
switch (c) {
case '"': out += "\\\""; break;
case '\\': out += "\\\\"; break;
case '\b': out += "\\b"; break;
case '\f': out += "\\f"; break;
case '\n': out += "\\n"; break;
case '\r': out += "\\r"; break;
case '\t': out += "\\t"; break;
default:
if (static_cast<unsigned char>(c) < 0x20) {
char tmp[8];
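// %x expects an unsigned int; convert explicitly instead of passing a char.
std::snprintf(tmp, sizeof(tmp), "\\u%04x", static_cast<unsigned>(static_cast<unsigned char>(c)));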
out += tmp;
} else {
out += c;
}
}
}
return out;
}
} // namespace
DsmlParser::DsmlParser() = default;
bool DsmlParser::IsInDsmlStructural() const {
switch (state_) {
case State::TOOL_CALLS:
case State::INVOKE:
return true;
case State::PARAM_VALUE: // payload bytes; user sampling applies
case State::TEXT:
case State::THINK:
return false;
}
return false;
}
void DsmlParser::EmitArgsChunk(const std::string &chunk, std::vector<ParserEvent> &out) {
if (chunk.empty()) return;
ParserEvent e;
e.type = ParserEvent::TOOL_ARGS;
e.text = chunk;
e.index = tool_index_;
out.push_back(std::move(e));
}
void DsmlParser::FinishCurrentToolCall(std::vector<ParserEvent> &out) {
if (tool_index_ < 0) return;
// Close the JSON object that was opened on the first parameter.
if (args_emitted_open_brace_) {
EmitArgsChunk("}", out);
} else {
EmitArgsChunk("{}", out);
}
ParserEvent e;
e.type = ParserEvent::TOOL_END;
e.index = tool_index_;
out.push_back(std::move(e));
current_tool_name_.clear();
args_emitted_open_brace_ = false;
args_param_count_ = 0;
}
bool DsmlParser::TryConsumeMarker(std::vector<ParserEvent> &out) {
switch (state_) {
case State::TEXT: {
if (consume_literal(buf_, kThinkOpen)) { state_ = State::THINK; return true; }
if (consume_literal(buf_, kToolsOpen)) { state_ = State::TOOL_CALLS; return true; }
return false;
}
case State::THINK: {
if (consume_literal(buf_, kThinkClose)) { state_ = State::TEXT; return true; }
return false;
}
case State::TOOL_CALLS: {
if (consume_literal(buf_, kToolsClose)) { state_ = State::TEXT; return true; }
// <｜DSML｜invoke name="X">
if (buf_.compare(0, std::strlen(kInvokeOpenPfx), kInvokeOpenPfx) == 0) {
size_t close_q = buf_.find('"', std::strlen(kInvokeOpenPfx));
if (close_q == std::string::npos) return false; // need more bytes
size_t close_gt = buf_.find('>', close_q);
if (close_gt == std::string::npos) return false;
current_tool_name_ = buf_.substr(std::strlen(kInvokeOpenPfx),
close_q - std::strlen(kInvokeOpenPfx));
tool_index_++;
buf_.erase(0, close_gt + 1);
ParserEvent e;
e.type = ParserEvent::TOOL_START;
e.tool_name = current_tool_name_;
e.tool_id = RandomToolId();
e.index = tool_index_;
out.push_back(std::move(e));
args_emitted_open_brace_ = false;
args_param_count_ = 0;
state_ = State::INVOKE;
return true;
}
return false;
}
case State::INVOKE: {
if (consume_literal(buf_, kInvokeClose)) {
FinishCurrentToolCall(out);
state_ = State::TOOL_CALLS;
return true;
}
// <｜DSML｜parameter name="K" string="true|false">
if (buf_.compare(0, std::strlen(kParamOpenPfx), kParamOpenPfx) == 0) {
size_t close_q = buf_.find('"', std::strlen(kParamOpenPfx));
if (close_q == std::string::npos) return false;
size_t string_attr = buf_.find("string=\"", close_q);
if (string_attr == std::string::npos) return false;
size_t string_q = buf_.find('"', string_attr + 8);
if (string_q == std::string::npos) return false;
size_t close_gt = buf_.find('>', string_q);
if (close_gt == std::string::npos) return false;
param_name_ = buf_.substr(std::strlen(kParamOpenPfx),
close_q - std::strlen(kParamOpenPfx));
std::string string_val = buf_.substr(string_attr + 8,
string_q - (string_attr + 8));
param_is_string_ = (string_val == "true");
param_value_.clear();
buf_.erase(0, close_gt + 1);
// Emit args JSON opener / separator.
std::string opener;
if (!args_emitted_open_brace_) { opener = "{"; args_emitted_open_brace_ = true; }
else { opener = ","; }
opener += "\"" + json_escape(param_name_) + "\":";
if (param_is_string_) opener += "\"";
EmitArgsChunk(opener, out);
args_param_count_++;
state_ = State::PARAM_VALUE;
return true;
}
return false;
}
case State::PARAM_VALUE: {
if (consume_literal(buf_, kParamClose)) {
if (param_is_string_) EmitArgsChunk("\"", out);
state_ = State::INVOKE;
return true;
}
return false;
}
}
return false;
}
void DsmlParser::DrainPlain(std::vector<ParserEvent> &out) {
// Drain everything up to the next '<' that *might* start a marker.
// Anything before the next '<' is safe to emit; the '<...' tail stays buffered.
while (!buf_.empty()) {
size_t lt = next_tag(buf_, 0);
if (lt == std::string::npos) {
// No tag at all - emit (or accumulate) the whole buffer.
ParserEvent e;
if (state_ == State::PARAM_VALUE) {
std::string esc = param_is_string_ ? json_escape(buf_) : buf_;
EmitArgsChunk(esc, out);
} else if (state_ == State::THINK) {
e.type = ParserEvent::REASONING;
e.text = buf_;
out.push_back(std::move(e));
} else if (state_ == State::TEXT) {
e.type = ParserEvent::CONTENT;
e.text = buf_;
out.push_back(std::move(e));
}
// Inside INVOKE / TOOL_CALLS with no marker, raw bytes are
// structural whitespace - discard.
buf_.clear();
return;
}
if (lt > 0) {
std::string chunk = buf_.substr(0, lt);
buf_.erase(0, lt);
ParserEvent e;
if (state_ == State::PARAM_VALUE) {
std::string esc = param_is_string_ ? json_escape(chunk) : chunk;
EmitArgsChunk(esc, out);
} else if (state_ == State::THINK) {
e.type = ParserEvent::REASONING;
e.text = chunk;
out.push_back(std::move(e));
} else if (state_ == State::TEXT) {
e.type = ParserEvent::CONTENT;
e.text = chunk;
out.push_back(std::move(e));
}
}
// buf_[0] == '<' - try consuming a marker. If we consumed one, loop again.
if (!TryConsumeMarker(out)) {
// Could be a partial marker - wait for more bytes.
if (looks_like_prefix(buf_)) return;
// Otherwise this '<' is a literal - emit one char and continue.
std::string one(1, buf_[0]);
buf_.erase(0, 1);
ParserEvent e;
if (state_ == State::PARAM_VALUE) {
std::string esc = param_is_string_ ? json_escape(one) : one;
EmitArgsChunk(esc, out);
} else if (state_ == State::THINK) {
e.type = ParserEvent::REASONING;
e.text = one;
out.push_back(std::move(e));
} else if (state_ == State::TEXT) {
e.type = ParserEvent::CONTENT;
e.text = one;
out.push_back(std::move(e));
}
}
}
}
void DsmlParser::Feed(const std::string &chunk, std::vector<ParserEvent> &out) {
buf_ += chunk;
DrainPlain(out);
}
void DsmlParser::Flush(std::vector<ParserEvent> &out) {
// At flush time we no longer wait for marker completion - drain everything
// (the trailing bytes won't grow). Mirror DrainPlain's state-aware
// classification: PARAM_VALUE bytes become TOOL_ARGS, THINK bytes become
// REASONING, TEXT bytes become CONTENT, and INVOKE/TOOL_CALLS bytes are
// structural whitespace (discarded).
auto emit_plain = [&](const std::string &chunk) {
if (chunk.empty()) return;
if (state_ == State::PARAM_VALUE) {
std::string esc = param_is_string_ ? json_escape(chunk) : chunk;
EmitArgsChunk(esc, out);
return;
}
if (state_ == State::THINK) {
ParserEvent e;
e.type = ParserEvent::REASONING;
e.text = chunk;
out.push_back(std::move(e));
return;
}
if (state_ == State::TEXT) {
ParserEvent e;
e.type = ParserEvent::CONTENT;
e.text = chunk;
out.push_back(std::move(e));
return;
}
// INVOKE / TOOL_CALLS: structural whitespace, discard.
};
while (!buf_.empty()) {
size_t lt = next_tag(buf_, 0);
if (lt == std::string::npos) {
emit_plain(buf_);
buf_.clear();
return;
}
if (lt > 0) {
std::string chunk = buf_.substr(0, lt);
buf_.erase(0, lt);
emit_plain(chunk);
}
if (!TryConsumeMarker(out)) {
// Definitely a literal '<' now (no chance of more bytes arriving).
std::string one(1, buf_[0]);
buf_.erase(0, 1);
emit_plain(one);
}
}
// If we ended mid-tool-call (model truncated), close it cleanly.
if (state_ == State::INVOKE || state_ == State::PARAM_VALUE) {
if (state_ == State::PARAM_VALUE && param_is_string_) EmitArgsChunk("\"", out);
FinishCurrentToolCall(out);
state_ = State::TEXT;
}
}
std::string RandomToolId() {
static thread_local std::mt19937_64 rng{
static_cast<uint64_t>(std::chrono::system_clock::now().time_since_epoch().count())};
const char *alphabet =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
std::string out = "call_";
for (int i = 0; i < 16; ++i) {
out += alphabet[rng() % 62];
}
return out;
}
} // namespace ds4cpp

@@ -0,0 +1,77 @@
#pragma once
#include <functional>
#include <string>
#include <vector>
namespace ds4cpp {
struct ParserEvent {
enum Type { CONTENT, REASONING, TOOL_START, TOOL_ARGS, TOOL_END };
Type type;
std::string text; // CONTENT, REASONING, TOOL_ARGS
std::string tool_name; // TOOL_START
std::string tool_id; // TOOL_START (assigned by the parser via RandomToolId())
int index = 0; // TOOL_START / TOOL_ARGS / TOOL_END
};
// Streaming parser. Instances share no state; create one per Predict call.
class DsmlParser {
public:
DsmlParser();
// Feed a chunk of raw model-emitted text. Appends classified events to
// `out`. May buffer the tail of `chunk` internally if it looks like a
// marker prefix.
void Feed(const std::string &chunk, std::vector<ParserEvent> &out);
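// Flush any remaining buffered text, classified by the current state
// (CONTENT, REASONING, or TOOL_ARGS), and close a tool call that was left
// open by a truncated generation. Called at generation end.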
void Flush(std::vector<ParserEvent> &out);
// True when the parser is inside a DSML structural position - that is,
// tags/markers between tool-call boundaries where the model is expected
// to emit protocol bytes verbatim. Mirrors ds4_server.c's "force
// temperature=0 unless dsml_decode_state_uses_payload_sampling" rule:
//
// TEXT / THINK -> false (user sampling applies)
// PARAM_VALUE -> false (payload uses user sampling)
// TOOL_CALLS / INVOKE -> true (structural; force greedy)
//
// Callers should use this BEFORE the next sample() call to pick the
// effective temperature; the parser's state reflects what's already
// been consumed, so it predicts the next token's classification.
bool IsInDsmlStructural() const;
private:
enum class State { TEXT, THINK, TOOL_CALLS, INVOKE, PARAM_VALUE };
State state_ = State::TEXT;
std::string buf_;
std::string current_tool_name_;
int tool_index_ = -1;
// While parsing a parameter value:
std::string param_name_;
bool param_is_string_ = true;
std::string param_value_;
// Incrementally-built arguments JSON for the active tool call.
std::string args_json_so_far_;
bool args_emitted_open_brace_ = false;
int args_param_count_ = 0;
// Try to consume one structural marker starting at buf_[0]. Returns true
// and advances state if a complete marker was consumed; false if buf_ does
// not start with a marker valid in the current state, or holds only part
// of one.
bool TryConsumeMarker(std::vector<ParserEvent> &out);
// Drain buf_ as far as we're sure it's not a marker prefix, consuming any
// complete markers along the way. Emits CONTENT, REASONING, or TOOL_ARGS
// depending on current state.
void DrainPlain(std::vector<ParserEvent> &out);
// Emit the next chunk of arguments JSON to the consumer.
void EmitArgsChunk(const std::string &chunk, std::vector<ParserEvent> &out);
void FinishCurrentToolCall(std::vector<ParserEvent> &out);
};
// Generate a random tool call ID: "call_" followed by 16 alphanumeric
// characters. The parser calls this when it emits TOOL_START.
std::string RandomToolId();
} // namespace ds4cpp
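
Because TOOL_ARGS events carry fragments of a JSON object rather than a finished document, a consumer has to concatenate them per tool index. The following is a minimal sketch of such a consumer, assuming the event shape declared above; `Event`, `ToolCall`, `Collected`, and `collect` are local stand-ins introduced for illustration so that the block compiles on its own, and they are not names from the backend.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Local stand-in mirroring the ParserEvent shape from dsml_parser.h.
struct Event {
    enum Type { CONTENT, REASONING, TOOL_START, TOOL_ARGS, TOOL_END };
    Type type = CONTENT;
    std::string text;
    std::string tool_name;
    std::string tool_id;
    int index = 0;
};

// Hypothetical aggregate of one finished tool call.
struct ToolCall {
    std::string id;
    std::string name;
    std::string args;  // concatenated TOOL_ARGS fragments
};

struct Collected {
    std::string content;
    std::string reasoning;
    std::vector<ToolCall> calls;
};

// Fold a sequence of events into plain content, reasoning text, and one
// ToolCall per tool index.
Collected collect(const std::vector<Event> &events) {
    Collected c;
    for (const auto &e : events) {
        switch (e.type) {
        case Event::CONTENT:   c.content += e.text; break;
        case Event::REASONING: c.reasoning += e.text; break;
        case Event::TOOL_START: {
            assert(e.index >= 0);
            const std::size_t i = static_cast<std::size_t>(e.index);
            if (c.calls.size() <= i) c.calls.resize(i + 1);
            c.calls[i].id = e.tool_id;
            c.calls[i].name = e.tool_name;
            break;
        }
        case Event::TOOL_ARGS:
            // Ignore fragments for a tool index that was never started.
            if (e.index >= 0 && static_cast<std::size_t>(e.index) < c.calls.size())
                c.calls[static_cast<std::size_t>(e.index)].args += e.text;
            break;
        case Event::TOOL_END:
            break;  // nothing to do when aggregating
        }
    }
    return c;
}
```

For a single string parameter the fragments arrive as an opener such as `{"city":"`, the escaped value, a closing quote, and finally `}`, so the concatenation is a complete JSON object only once the tool call has ended.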

@@ -0,0 +1,140 @@
#include "dsml_renderer.h"
// Requires nlohmann::json. There is no hand-rolled fallback: if the header
// is not on the include path (for example from the nlohmann-json3-dev
// package), the build stops with the #error below.
#if __has_include(<nlohmann/json.hpp>)
#include <nlohmann/json.hpp>
using json = nlohmann::json;
#else
#error "nlohmann/json.hpp not found; install nlohmann-json3-dev"
#endif
#include <sstream>
namespace ds4cpp {
namespace {
void render_param(std::ostringstream &os, const std::string &name,
const json &value) {
bool is_string = value.is_string();
os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\"" << name
<< "\" string=\"" << (is_string ? "true" : "false") << "\">";
if (is_string) {
os << value.get<std::string>();
} else {
os << value.dump();
}
os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>\n";
}
} // namespace
std::string RenderAssistantToolCalls(const std::string &tool_calls_json) {
if (tool_calls_json.empty()) return "";
json arr;
try {
arr = json::parse(tool_calls_json);
} catch (const std::exception &) {
return "";
}
if (!arr.is_array() || arr.empty()) return "";
std::ostringstream os;
os << "\n\n<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n";
for (const auto &call : arr) {
// OpenAI shape: { id, type, function: { name, arguments (JSON string) } }
// Anthropic shape comes through normalized by LocalAI.
std::string name;
std::string args_str;
if (call.contains("function")) {
const auto &fn = call["function"];
if (fn.contains("name") && fn["name"].is_string())
name = fn["name"].get<std::string>();
if (fn.contains("arguments") && fn["arguments"].is_string())
args_str = fn["arguments"].get<std::string>();
}
os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"" << name << "\">\n";
if (!args_str.empty()) {
json args;
try {
args = json::parse(args_str);
} catch (...) {
args = json{};
}
if (args.is_object()) {
for (auto it = args.begin(); it != args.end(); ++it) {
render_param(os, it.key(), it.value());
}
}
}
os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n";
}
os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>";
return os.str();
}
std::string RenderToolResult(const std::string &tool_call_id, const std::string &content) {
std::ostringstream os;
// ds4_server.c wraps tool results in a "tool_result" DSML tag carrying
// the tool_call_id. Match that shape.
os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_result id=\"" << tool_call_id << "\">"
<< content
<< "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_result>";
return os.str();
}
std::string RenderToolsManifest(const std::string &tools_json) {
if (tools_json.empty()) return "";
json arr;
try {
arr = json::parse(tools_json);
} catch (const std::exception &) {
return "";
}
if (!arr.is_array() || arr.empty()) return "";
// Extract each OpenAI tool's `function` object, dump as compact JSON, one
// per line. Mirrors openai_function_schema_from_tool() in ds4_server.c.
std::ostringstream schemas;
for (const auto &tool : arr) {
if (tool.contains("function") && tool["function"].is_object()) {
schemas << tool["function"].dump() << "\n";
} else if (tool.is_object()) {
// Anthropic / direct-schema form: pass through.
schemas << tool.dump() << "\n";
}
}
if (schemas.tellp() == std::streampos(0)) return "";
// Verbatim text from ds4_server.c append_tools_prompt_text. Do NOT
// paraphrase - the model was trained on these exact bytes.
std::ostringstream os;
os << "## Tools\n\n"
"You have access to a set of tools to help answer the user question. "
"You can invoke tools by writing a \"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\" block like the following:\n\n"
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n"
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"$TOOL_NAME\">\n"
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\"$PARAMETER_NAME\" string=\"true|false\">$PARAMETER_VALUE</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>\n"
"...\n"
"</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n"
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"$TOOL_NAME2\">\n"
"...\n"
"</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n"
"</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n\n"
"String parameters should be specified as raw text and set `string=\"true\"`. "
"Preserve characters such as `>`, `&`, and `&&` exactly; never replace normal string characters with XML or HTML entity escapes. "
"Only if a string value itself contains the exact closing parameter tag `</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>`, write that tag as `&lt;/\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>` inside the value. "
"For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string=\"false\"`.\n\n"
"If thinking_mode is enabled (triggered by <think>), you MUST output your complete reasoning inside <think>...</think> BEFORE any tool calls or final response.\n\n"
"Otherwise, output directly after </think> with tool calls or final response.\n\n"
"### Available Tool Schemas\n\n"
<< schemas.str()
<< "\nYou MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls. "
"Use the exact parameter names from the schemas.";
return os.str();
}
} // namespace ds4cpp
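
The manifest text above asks the model to rewrite a literal closing parameter tag inside a string value as `&lt;/｜DSML｜parameter>`. render_param in this file writes string values through unchanged, so the sketch below is only an illustration of how a renderer that wanted to apply the same rule to replayed tool calls might look. `escape_param_close` and `render_dsml_param` are hypothetical names, and whether ds4 expects this escaping on the prompt side is an assumption, not something this diff confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+FF5C FULLWIDTH VERTICAL LINE, UTF-8 encoded, as used by the DSML tags.
const std::string kBar = "\xef\xbd\x9c";
const std::string kParamCloseTag = "</" + kBar + "DSML" + kBar + "parameter>";

// Replace every literal closing parameter tag inside `value` with the
// "&lt;/...parameter>" form described in the tools manifest. All other
// characters (including '>', '&', "&&") are left untouched.
std::string escape_param_close(std::string value) {
    const std::string escaped = "&lt;/" + kBar + "DSML" + kBar + "parameter>";
    std::size_t pos = 0;
    while ((pos = value.find(kParamCloseTag, pos)) != std::string::npos) {
        value.replace(pos, kParamCloseTag.size(), escaped);
        pos += escaped.size();  // continue after the replacement
    }
    return value;
}

// Render one parameter tag. `value` is raw text when is_string is true and
// already-serialized JSON otherwise.
std::string render_dsml_param(const std::string &name, const std::string &value,
                              bool is_string) {
    assert(name.find('"') == std::string::npos);  // would break the attribute
    std::string out = "<" + kBar + "DSML" + kBar + "parameter name=\"" + name +
                      "\" string=\"";
    out += is_string ? "true" : "false";
    out += "\">";
    out += is_string ? escape_param_close(value) : value;
    out += kParamCloseTag;
    out += '\n';
    return out;
}
```

Only the exact closing tag is rewritten; a value such as `a > b && c` passes through byte for byte, which matches the manifest's instruction to preserve ordinary characters.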

@@ -0,0 +1,27 @@
#pragma once
#include <string>
namespace ds4cpp {
// Render an assistant message's tool_calls JSON array into the DSML block
// that ds4 expects in its prompt. `tool_calls_json` is the value of
// proto.Message.tool_calls (OpenAI shape: array of {id, type, function:{name, arguments}}).
// Returns the DSML text to append after the assistant's content.
std::string RenderAssistantToolCalls(const std::string &tool_calls_json);
// Render a role="tool" message into the DSML "tool result" block. ds4's
// prompt template expects tool results inside a specific tag; we wrap the
// `content` with that tag and include the `tool_call_id` so the model can
// correlate.
std::string RenderToolResult(const std::string &tool_call_id, const std::string &content);
// Render the "## Tools" manifest that ds4 expects in the SYSTEM prompt when
// tools are available. Without this preamble the model has no idea tools
// exist and will not emit DSML tool calls. Mirrors append_tools_prompt_text()
// in ds4_server.c (~line 1646): a fixed preamble + "### Available Tool
// Schemas" section + one JSON schema per line (extracted from each OpenAI
// tool's .function object) + a fixed closing instruction. Returns empty
// when tools_json is empty / unparseable.
std::string RenderToolsManifest(const std::string &tools_json);
} // namespace ds4cpp
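
The layout described for RenderToolsManifest (fixed preamble, schema heading, one schema per line, fixed closing instruction, empty result when there are no tools) can be sketched without a JSON library. In the block below `assemble_manifest` is a hypothetical helper; `preamble` and `closing` are placeholders for the fixed text that the real renderer copies verbatim from ds4_server.c, and the schemas are assumed to be serialized by the caller already.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: joins the fixed parts of the manifest around the
// per-tool schema lines. Returns an empty string when there are no schemas,
// so callers can skip the system-prompt injection entirely.
std::string assemble_manifest(const std::string &preamble,
                              const std::vector<std::string> &schemas,
                              const std::string &closing) {
    if (schemas.empty()) return "";
    std::string out = preamble;
    out += "### Available Tool Schemas\n\n";
    for (const auto &s : schemas) {
        // Each schema must be compact JSON on a single line.
        assert(s.find('\n') == std::string::npos);
        out += s;
        out += '\n';
    }
    out += '\n';  // blank line between the schema list and the closing text
    out += closing;
    return out;
}
```

Returning an empty string rather than a manifest with an empty schema list keeps the no-tools prompt identical to a prompt built without tool support.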

@@ -0,0 +1,696 @@
// ds4 LocalAI gRPC backend.
//
// Wraps antirez/ds4's `ds4_engine_*` / `ds4_session_*` public API
// (see ds4/ds4.h) over LocalAI's backend.proto, including DSML tool calls,
// thinking mode, and the disk KV cache.
#include "backend.pb.h"
#include "backend.grpc.pb.h"
#include "dsml_parser.h" // streaming parser for model-emitted DSML
#include "dsml_renderer.h" // DSML rendering of tools, tool calls, tool results
#include "kv_cache.h" // disk KV cache
extern "C" {
#include "ds4.h"
}
#include <grpcpp/grpcpp.h>
#include <grpcpp/server.h>
#include <grpcpp/server_builder.h>
#include <grpcpp/ext/proto_server_reflection_plugin.h>
#include <atomic>
#include <chrono>
#include <csignal>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>
using grpc::Server;
using grpc::ServerBuilder;
using grpc::ServerContext;
using grpc::ServerWriter;
// NOTE: do NOT alias `grpc::Status` as `Status` - the Status RPC method below
// would shadow the type, breaking the other RPC method declarations that use
// it as a return type. Use GStatus instead.
using GStatus = ::grpc::Status;
using grpc::StatusCode;
namespace {
// Global state - ds4 is single-engine-per-process by design.
std::mutex g_engine_mu;
ds4_engine *g_engine = nullptr;
ds4_session *g_session = nullptr;
int g_ctx_size = 32768;
std::string g_kv_cache_dir; // empty disables disk cache
std::atomic<Server *> g_server{nullptr};
// Parse a "key:value" option string. When there is no colon, the whole
// string is returned as the key with an empty value.
static std::pair<std::string, std::string> split_option(const std::string &opt) {
auto colon = opt.find(':');
if (colon == std::string::npos) return {opt, ""};
return {opt.substr(0, colon), opt.substr(colon + 1)};
}
static void append_token_text(ds4_engine *engine, int token, std::string &out) {
size_t len = 0;
const char *text = ds4_token_text(engine, token, &len);
if (text && len > 0) out.append(text, len);
}
struct CollectCtx {
ds4_engine *engine;
std::string raw_buf; // exact raw bytes for Reply.message
ds4cpp::DsmlParser parser;
backend::Reply *reply;
int tokens;
// Per-tool aggregation: accumulate ChatDelta tool_calls so we emit one
// delta with all calls, mirroring how vllm's non-streaming path returns.
struct Pending {
std::string id;
std::string name;
std::string args;
};
std::vector<Pending> pending;
std::string content_buf;
std::string reasoning_buf;
};
static void apply_events(CollectCtx *c, const std::vector<ds4cpp::ParserEvent> &events) {
for (const auto &e : events) {
switch (e.type) {
case ds4cpp::ParserEvent::CONTENT:
c->content_buf += e.text;
break;
case ds4cpp::ParserEvent::REASONING:
c->reasoning_buf += e.text;
break;
case ds4cpp::ParserEvent::TOOL_START:
if ((int)c->pending.size() <= e.index)
c->pending.resize(e.index + 1);
c->pending[e.index].id = e.tool_id;
c->pending[e.index].name = e.tool_name;
break;
case ds4cpp::ParserEvent::TOOL_ARGS:
if ((int)c->pending.size() > e.index)
c->pending[e.index].args += e.text;
break;
case ds4cpp::ParserEvent::TOOL_END:
// No-op for non-streaming: the final delta is emitted at the end.
break;
}
}
}
static void collect_emit(void *ud, int token) {
auto *c = static_cast<CollectCtx *>(ud);
if (token == ds4_token_eos(c->engine)) return;
size_t len = 0;
const char *text = ds4_token_text(c->engine, token, &len);
if (!text || len == 0) return;
std::string chunk(text, len);
c->raw_buf += chunk;
std::vector<ds4cpp::ParserEvent> events;
c->parser.Feed(chunk, events);
apply_events(c, events);
c->tokens++;
}
static void collect_done(void *) {}
struct StreamCtx {
ds4_engine *engine;
ServerWriter<backend::Reply> *writer;
ds4cpp::DsmlParser parser;
int tokens;
bool aborted;
// Track which tool indices we've seen TOOL_START for, so subsequent
// ARGS deltas can elide the redundant id/name fields.
std::vector<bool> tool_started;
};
static void stream_emit(void *ud, int token) {
auto *s = static_cast<StreamCtx *>(ud);
if (s->aborted) return;
if (token == ds4_token_eos(s->engine)) return;
size_t len = 0;
const char *text = ds4_token_text(s->engine, token, &len);
if (!text || len == 0) return;
std::string chunk(text, len);
std::vector<ds4cpp::ParserEvent> events;
s->parser.Feed(chunk, events);
if (events.empty()) { s->tokens++; return; }
backend::Reply reply;
auto *delta = reply.add_chat_deltas();
bool any_field = false;
for (const auto &e : events) {
switch (e.type) {
case ds4cpp::ParserEvent::CONTENT:
delta->set_content(delta->content() + e.text);
any_field = true;
break;
case ds4cpp::ParserEvent::REASONING:
delta->set_reasoning_content(delta->reasoning_content() + e.text);
any_field = true;
break;
case ds4cpp::ParserEvent::TOOL_START: {
if ((int)s->tool_started.size() <= e.index)
s->tool_started.resize(e.index + 1, false);
s->tool_started[e.index] = true;
auto *tc = delta->add_tool_calls();
tc->set_index(e.index);
tc->set_id(e.tool_id);
tc->set_name(e.tool_name);
any_field = true;
break;
}
case ds4cpp::ParserEvent::TOOL_ARGS: {
auto *tc = delta->add_tool_calls();
tc->set_index(e.index);
tc->set_arguments(e.text);
any_field = true;
break;
}
case ds4cpp::ParserEvent::TOOL_END:
// No marker delta needed - the Go side closes the tool call on
// the final aggregator pass.
break;
}
}
reply.set_message(chunk);
reply.set_tokens(1);
if (any_field) {
if (!s->writer->Write(reply)) s->aborted = true;
}
s->tokens++;
}
static void stream_done(void *) {}
// Per-thread RNG seed for ds4_session_sample. Initialized lazily from
// system_clock; ds4 owns the random walk after that.
static uint64_t *get_rng() {
static thread_local uint64_t seed = 0;
if (seed == 0) {
seed = static_cast<uint64_t>(
std::chrono::system_clock::now().time_since_epoch().count());
if (seed == 0) seed = 1;
}
return &seed;
}
struct SampleParams {
float temperature;
int top_k;
float top_p;
float min_p;
};
// Compute the effective sampling parameters for the next token, mirroring
// ds4_server.c:7102-7115:
// - thinking mode enabled -> override (T=1, top_k=0, top_p=1, min_p=0)
// - inside DSML structural position (tool-call markers) -> force T=0
// - otherwise -> the request's user-supplied sampling settings
// The parser argument carries state from tokens emitted so far; its
// IsInDsmlStructural() predicts the next token's classification.
static SampleParams compute_sample_params(const backend::PredictOptions *request,
const ds4cpp::DsmlParser &parser,
bool think_enabled);
static ds4_think_mode parse_think_mode(const backend::PredictOptions *request) {
// Per the vllm backend convention, "enable_thinking" gates thinking on/off,
// and "reasoning_effort" picks the strength when on.
const auto &md = request->metadata();
auto et = md.find("enable_thinking");
bool enabled = true; // default ON per ds4-server
if (et != md.end()) enabled = (et->second == "true" || et->second == "1");
if (!enabled) return DS4_THINK_NONE;
auto re = md.find("reasoning_effort");
if (re != md.end() && (re->second == "max" || re->second == "xhigh"))
return DS4_THINK_MAX;
return DS4_THINK_HIGH;
}
static SampleParams compute_sample_params(const backend::PredictOptions *request,
const ds4cpp::DsmlParser &parser,
bool think_enabled) {
SampleParams p = {
request->temperature(),
request->topk(),
request->topp(),
request->minp(),
};
if (think_enabled) {
// Match ds4-server: thinking mode wants creativity in the reasoning
// pass and the trailing content, so the entire generation overrides
// sampling unless DSML structural bytes take over below.
p.temperature = 1.0f;
p.top_k = 0;
p.top_p = 1.0f;
p.min_p = 0.0f;
}
if (parser.IsInDsmlStructural()) {
// Tool-call structural bytes (tags, markers, headers) must parse
// cleanly. Force greedy regardless of user/thinking settings.
p.temperature = 0.0f;
}
return p;
}
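// Build the rendered text for cache keying. This is a role-tagged rendering
// of the message contents, not the exact templated prompt: the tools manifest
// and rendered tool calls are not part of the key. Keying on bytes rather
// than tokens lets the cache survive small client-side reformatting of chat
// history.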
static std::string render_prompt_text(const backend::PredictOptions *request) {
// Two-mode: either the raw prompt or the chat-template path. We mirror
// build_prompt's branching but accumulate text (not tokens) so we can
// SHA1 it for the cache key. ds4_session caches a tokens-indexed
// checkpoint, but the disk format keys on bytes per ds4-server's design.
if (!request->usetokenizertemplate() || request->messages_size() == 0) {
return request->prompt();
}
std::string out;
const std::string sys_role = "system";
for (const auto &m : request->messages()) {
if (m.role() == sys_role) { out += "[sys] " + m.content() + "\n"; break; }
}
for (const auto &m : request->messages()) {
if (m.role() == sys_role) continue;
out += "[" + m.role() + "] " + m.content() + "\n";
}
return out;
}
ds4cpp::KvCache g_kv_cache;
// Try to recover prefill state for `rendered`. Returns the matched prefix length.
static size_t maybe_load_cache(const std::string &rendered) {
if (!g_kv_cache.enabled() || !g_session) return 0;
return g_kv_cache.LoadLongestPrefix(g_session, rendered, g_ctx_size);
}
static void maybe_save_cache(const std::string &rendered) {
if (g_kv_cache.enabled() && g_session) {
g_kv_cache.Save(g_session, rendered, g_ctx_size);
}
}
static void build_prompt(ds4_engine *engine, const backend::PredictOptions *request,
ds4_tokens *out) {
if (!request->usetokenizertemplate() || request->messages_size() == 0) {
ds4_tokenize_text(engine, request->prompt().c_str(), out);
return;
}
// Chat-template path: render via ds4's helpers.
ds4_chat_begin(engine, out);
ds4_think_mode think = parse_think_mode(request);
// ds4_encode_chat_prompt is convenient when there is exactly one
// system+user pair, but for arbitrary turn lists we use the granular
// append helpers. Pull the first system message (if any), then append
// every other message in order.
const std::string sys_role = "system";
std::string system_text;
for (const auto &m : request->messages()) {
if (m.role() == sys_role) { system_text = m.content(); break; }
}
// Inject the tools manifest into the system prompt when tools are present.
// ds4 was trained to emit DSML tool calls ONLY when this preamble is in
// the system message - without it, the model has no idea tools exist and
// the e2e tool-call test will fail. The renderer lives in dsml_renderer
// and is a verbatim port of ds4_server.c's append_tools_prompt_text.
std::string tools_manifest;
if (!request->tools().empty()) {
tools_manifest = ds4cpp::RenderToolsManifest(request->tools());
}
if (!system_text.empty() || !tools_manifest.empty()) {
std::string combined = system_text;
if (!tools_manifest.empty()) {
if (!combined.empty()) combined += "\n\n";
combined += tools_manifest;
}
ds4_chat_append_message(engine, out, "system", combined.c_str());
}
for (const auto &m : request->messages()) {
if (m.role() == sys_role) continue;
if (m.role() == "assistant" && !m.tool_calls().empty()) {
std::string combined = m.content();
combined += ds4cpp::RenderAssistantToolCalls(m.tool_calls());
ds4_chat_append_message(engine, out, "assistant", combined.c_str());
} else if (m.role() == "tool") {
std::string body = ds4cpp::RenderToolResult(m.tool_call_id(), m.content());
ds4_chat_append_message(engine, out, "user", body.c_str());
} else {
ds4_chat_append_message(engine, out, m.role().c_str(), m.content().c_str());
}
}
ds4_chat_append_assistant_prefix(engine, out, think);
}
class DS4Backend final : public backend::Backend::Service {
public:
GStatus Health(ServerContext *, const backend::HealthMessage *,
backend::Reply *reply) override {
reply->set_message(std::string("OK"));
return GStatus::OK;
}
GStatus Free(ServerContext *, const backend::HealthMessage *,
backend::Result *result) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
if (g_engine) { ds4_engine_close(g_engine); g_engine = nullptr; }
result->set_success(true);
return GStatus::OK;
}
GStatus LoadModel(ServerContext *, const backend::ModelOptions *request,
backend::Result *result) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
if (g_engine) {
if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
ds4_engine_close(g_engine);
g_engine = nullptr;
}
std::string model_path = request->modelfile();
if (model_path.empty()) model_path = request->model();
if (model_path.empty()) {
result->set_success(false);
result->set_message("ds4: ModelOptions.Model or .ModelFile must be set");
return GStatus::OK;
}
std::string mtp_path;
int mtp_draft = 0;
float mtp_margin = 3.0f;
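try {
for (const auto &opt : request->options()) {
auto [k, v] = split_option(opt);
if (k == "mtp_path") mtp_path = v;
else if (k == "mtp_draft") mtp_draft = std::stoi(v);
else if (k == "mtp_margin") mtp_margin = std::stof(v);
else if (k == "kv_cache_dir") g_kv_cache_dir = v;
}
} catch (const std::exception &e) {
// std::stoi / std::stof throw on malformed or out-of-range values.
result->set_success(false);
result->set_message(std::string("ds4: invalid option value: ") + e.what());
return GStatus::OK;
}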
g_kv_cache.SetDir(g_kv_cache_dir);
ds4_engine_options opt = {};
opt.model_path = model_path.c_str();
opt.mtp_path = mtp_path.empty() ? nullptr : mtp_path.c_str();
opt.n_threads = request->threads() > 0 ? request->threads() : 0;
opt.mtp_draft_tokens = mtp_draft;
opt.mtp_margin = mtp_margin;
opt.directional_steering_file = nullptr;
opt.warm_weights = false;
opt.quality = false;
#if defined(DS4_NO_GPU)
opt.backend = DS4_BACKEND_CPU;
#elif defined(__APPLE__)
opt.backend = DS4_BACKEND_METAL;
#else
opt.backend = DS4_BACKEND_CUDA;
#endif
int rc = ds4_engine_open(&g_engine, &opt);
if (rc != 0 || !g_engine) {
result->set_success(false);
result->set_message("ds4_engine_open failed (rc=" + std::to_string(rc) + ")");
return GStatus::OK;
}
g_ctx_size = request->contextsize() > 0 ? request->contextsize() : 32768;
rc = ds4_session_create(&g_session, g_engine, g_ctx_size);
if (rc != 0 || !g_session) {
ds4_engine_close(g_engine);
g_engine = nullptr;
result->set_success(false);
result->set_message("ds4_session_create failed (rc=" + std::to_string(rc) + ")");
return GStatus::OK;
}
result->set_success(true);
result->set_message("loaded " + model_path);
return GStatus::OK;
}
GStatus TokenizeString(ServerContext *, const backend::PredictOptions *request,
backend::TokenizationResponse *response) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
if (!g_engine) return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
ds4_tokens out = {};
ds4_tokenize_text(g_engine, request->prompt().c_str(), &out);
for (int i = 0; i < out.len; ++i) response->add_tokens(out.v[i]);
response->set_length(out.len);
ds4_tokens_free(&out);
return GStatus::OK;
}
GStatus Predict(ServerContext *, const backend::PredictOptions *request,
backend::Reply *reply) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
if (!g_engine || !g_session) {
return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
}
ds4_tokens prompt = {};
build_prompt(g_engine, request, &prompt);
int n_predict = request->tokens() > 0 ? request->tokens() : 256;
CollectCtx collect = {g_engine, "", {}, reply, 0, {}, "", ""};
std::string cache_key = render_prompt_text(request);
size_t cache_hit = maybe_load_cache(cache_key);
(void)cache_hit; // future: skip prompt prefix if hit covers full prompt
// Manual generation loop on g_session. When MTP speculative weights
// were loaded (LoadModel option 'mtp_path:'), we use the
// ds4_session_eval_speculative_argmax path which may accept N>1
// tokens per outer iteration. Otherwise per-token argmax + eval.
// Either way g_session advances so the disk KV cache picks up a
// real checkpoint after the call (see maybe_save_cache below).
char err[256] = {0};
int rc = ds4_session_sync(g_session, &prompt, err, sizeof(err));
int prompt_len = prompt.len;
ds4_tokens_free(&prompt);
if (rc == 0) {
const int eos = ds4_token_eos(g_engine);
const int draft_max = ds4_engine_mtp_draft_tokens(g_engine);
const bool think_enabled = ds4_think_mode_enabled(parse_think_mode(request));
int produced = 0;
while (produced < n_predict) {
SampleParams sp = compute_sample_params(request, collect.parser, think_enabled);
int first;
if (sp.temperature <= 0.0f) {
first = ds4_session_argmax(g_session);
} else {
first = ds4_session_sample(g_session,
sp.temperature, sp.top_k,
sp.top_p, sp.min_p, get_rng());
}
if (first == eos) break;
// MTP only when sampling is greedy (ds4-server gate).
if (draft_max > 0 && sp.temperature <= 0.0f) {
constexpr int kAcceptedMax = 8;
int accepted[kAcceptedMax];
int cap = std::min(kAcceptedMax, draft_max + 1);
int n = ds4_session_eval_speculative_argmax(
g_session, first, draft_max, eos,
accepted, cap, err, sizeof(err));
if (n < 0) { rc = -1; break; }
bool stop = false;
for (int j = 0; j < n; ++j) {
if (accepted[j] == eos) { stop = true; break; }
collect_emit(&collect, accepted[j]);
if (++produced >= n_predict) { stop = true; break; }
}
if (stop) break;
} else {
collect_emit(&collect, first);
if (++produced >= n_predict) break;
rc = ds4_session_eval(g_session, first, err, sizeof(err));
if (rc != 0) break;
}
}
collect_done(&collect);
}
maybe_save_cache(cache_key);
// Flush any buffered parser state.
std::vector<ds4cpp::ParserEvent> events;
collect.parser.Flush(events);
apply_events(&collect, events);
if (rc != 0) {
return GStatus(StatusCode::INTERNAL,
std::string("ds4 generation failed: ") + err);
}
// Emit one ChatDelta with content/reasoning/tool_calls.
auto *delta = reply->add_chat_deltas();
delta->set_content(collect.content_buf);
delta->set_reasoning_content(collect.reasoning_buf);
for (size_t i = 0; i < collect.pending.size(); ++i) {
auto *tc = delta->add_tool_calls();
tc->set_index(static_cast<int32_t>(i));
tc->set_id(collect.pending[i].id);
tc->set_name(collect.pending[i].name);
tc->set_arguments(collect.pending[i].args);
}
reply->set_message(collect.raw_buf);
reply->set_tokens(collect.tokens);
reply->set_prompt_tokens(prompt_len);
return GStatus::OK;
}
GStatus PredictStream(ServerContext *, const backend::PredictOptions *request,
ServerWriter<backend::Reply> *writer) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
if (!g_engine || !g_session) {
return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
}
ds4_tokens prompt = {};
build_prompt(g_engine, request, &prompt);
int n_predict = request->tokens() > 0 ? request->tokens() : 256;
StreamCtx s = {g_engine, writer, {}, 0, false, {}};
std::string cache_key = render_prompt_text(request);
size_t cache_hit = maybe_load_cache(cache_key);
(void)cache_hit;
// Manual loop on g_session - see Predict() above for the rationale.
// MTP speculative path used when ds4_engine_mtp_draft_tokens > 0.
char err[256] = {0};
int rc = ds4_session_sync(g_session, &prompt, err, sizeof(err));
ds4_tokens_free(&prompt);
if (rc == 0) {
const int eos = ds4_token_eos(g_engine);
const int draft_max = ds4_engine_mtp_draft_tokens(g_engine);
const bool think_enabled = ds4_think_mode_enabled(parse_think_mode(request));
int produced = 0;
while (produced < n_predict && !s.aborted) {
SampleParams sp = compute_sample_params(request, s.parser, think_enabled);
int first;
if (sp.temperature <= 0.0f) {
first = ds4_session_argmax(g_session);
} else {
first = ds4_session_sample(g_session,
sp.temperature, sp.top_k,
sp.top_p, sp.min_p, get_rng());
}
if (first == eos) break;
if (draft_max > 0 && sp.temperature <= 0.0f) {
constexpr int kAcceptedMax = 8;
int accepted[kAcceptedMax];
int cap = std::min(kAcceptedMax, draft_max + 1);
int n = ds4_session_eval_speculative_argmax(
g_session, first, draft_max, eos,
accepted, cap, err, sizeof(err));
if (n < 0) { rc = -1; break; }
bool stop = false;
for (int j = 0; j < n; ++j) {
if (accepted[j] == eos) { stop = true; break; }
stream_emit(&s, accepted[j]);
if (s.aborted) { stop = true; break; }
if (++produced >= n_predict) { stop = true; break; }
}
if (stop) break;
} else {
stream_emit(&s, first);
if (s.aborted || ++produced >= n_predict) break;
rc = ds4_session_eval(g_session, first, err, sizeof(err));
if (rc != 0) break;
}
}
stream_done(&s);
}
maybe_save_cache(cache_key);
// Flush parser state.
std::vector<ds4cpp::ParserEvent> events;
s.parser.Flush(events);
if (!events.empty() && !s.aborted) {
backend::Reply reply;
auto *delta = reply.add_chat_deltas();
for (const auto &e : events) {
if (e.type == ds4cpp::ParserEvent::CONTENT) {
delta->set_content(delta->content() + e.text);
} else if (e.type == ds4cpp::ParserEvent::REASONING) {
delta->set_reasoning_content(delta->reasoning_content() + e.text);
}
}
s.writer->Write(reply);
}
if (rc != 0 && !s.aborted) {
return GStatus(StatusCode::INTERNAL,
std::string("ds4 generation failed: ") + err);
}
return GStatus::OK;
}
GStatus Status(ServerContext *, const backend::HealthMessage *,
backend::StatusResponse *response) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
response->set_state(g_engine ? backend::StatusResponse::READY
: backend::StatusResponse::UNINITIALIZED);
return GStatus::OK;
}
};
void RunServer(const std::string &addr) {
DS4Backend service;
grpc::EnableDefaultHealthCheckService(true);
grpc::reflection::InitProtoReflectionServerBuilderPlugin();
ServerBuilder builder;
builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
builder.RegisterService(&service);
builder.SetMaxReceiveMessageSize(64 * 1024 * 1024);
builder.SetMaxSendMessageSize(64 * 1024 * 1024);
std::unique_ptr<Server> server(builder.BuildAndStart());
if (!server) {
std::cerr << "ds4 grpc-server: failed to bind " << addr << "\n";
std::exit(1);
}
g_server = server.get();
std::cerr << "ds4 grpc-server listening on " << addr << "\n";
server->Wait();
}
void signal_handler(int) {
if (auto *srv = g_server.load()) {
srv->Shutdown(std::chrono::system_clock::now() +
std::chrono::seconds(3));
}
}
} // namespace
int main(int argc, char *argv[]) {
std::string addr = "127.0.0.1:50051";
for (int i = 1; i < argc; ++i) {
std::string a = argv[i];
const std::string addr_flag = "--addr=";
if (a.rfind(addr_flag, 0) == 0) addr = a.substr(addr_flag.size());
else if (a == "--addr" && i + 1 < argc) addr = argv[++i];
else if (a == "--help" || a == "-h") {
std::cout << "Usage: grpc-server --addr=HOST:PORT\n";
return 0;
}
}
std::signal(SIGINT, signal_handler);
std::signal(SIGTERM, signal_handler);
RunServer(addr);
return 0;
}
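Review note: the speculative branch in Predict() and PredictStream() applies the same stop rule to every accepted token, namely stop at EOS without emitting it and stop once the `n_predict` budget is spent. The sketch below isolates that rule so it can be checked without a model. `consume_accepted` is a hypothetical helper name, not part of the patch, and it collects tokens into a vector where the patch calls `collect_emit` / `stream_emit`.

```cpp
#include <vector>

// Hypothetical helper mirroring the inner loop over `accepted[]`:
// append tokens to `out` until EOS is seen or the budget is exhausted.
// Returns true when the outer generation loop should stop.
bool consume_accepted(const std::vector<int> &accepted, int eos,
                      int n_predict, int &produced, std::vector<int> &out) {
    for (int tok : accepted) {
        if (tok == eos) return true;               // EOS is never emitted
        out.push_back(tok);                        // stands in for collect_emit()
        if (++produced >= n_predict) return true;  // budget reached
    }
    return false;                                  // keep generating
}
```

Factoring the loop this way would also remove the duplication between the two RPC handlers, at the cost of passing the emit step in as a callback.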

@@ -0,0 +1,205 @@
#include "kv_cache.h"
#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <dirent.h>
#include <fstream>
#include <sys/stat.h>
#include <vector>
namespace ds4cpp {
namespace {
// Minimal SHA1 (public domain reference); used only here.
struct Sha1 {
uint32_t h[5];
uint64_t bits;
uint8_t block[64];
size_t used;
Sha1() { h[0]=0x67452301; h[1]=0xEFCDAB89; h[2]=0x98BADCFE; h[3]=0x10325476; h[4]=0xC3D2E1F0; bits=0; used=0; }
static uint32_t rol(uint32_t x, int n){ return (x<<n)|(x>>(32-n)); }
void transform(const uint8_t *b) {
uint32_t w[80];
for (int i=0;i<16;i++) w[i] = (uint32_t)b[i*4]<<24 | (uint32_t)b[i*4+1]<<16 | (uint32_t)b[i*4+2]<<8 | b[i*4+3];
for (int i=16;i<80;i++) w[i] = rol(w[i-3]^w[i-8]^w[i-14]^w[i-16], 1);
uint32_t a=h[0],bb=h[1],c=h[2],d=h[3],e=h[4];
for (int i=0;i<80;i++) {
uint32_t f,k;
if (i<20) { f=(bb&c)|((~bb)&d); k=0x5A827999; }
else if (i<40) { f=bb^c^d; k=0x6ED9EBA1; }
else if (i<60) { f=(bb&c)|(bb&d)|(c&d); k=0x8F1BBCDC; }
else { f=bb^c^d; k=0xCA62C1D6; }
uint32_t t = rol(a,5)+f+e+k+w[i];
e=d; d=c; c=rol(bb,30); bb=a; a=t;
}
h[0]+=a; h[1]+=bb; h[2]+=c; h[3]+=d; h[4]+=e;
}
void update(const void *p, size_t n) {
const uint8_t *bp = (const uint8_t*)p;
bits += (uint64_t)n*8;
while (n) {
size_t take = 64-used;
if (take>n) take=n;
std::memcpy(block+used, bp, take);
used += take; bp += take; n -= take;
if (used == 64) { transform(block); used = 0; }
}
}
void final(uint8_t out[20]) {
uint8_t pad[64] = {0x80};
size_t padlen = (used < 56) ? (56-used) : (120-used);
uint64_t lb = bits;
uint8_t len[8];
for (int i=0;i<8;i++) len[7-i] = (uint8_t)(lb >> (i*8));
update(pad, padlen);
update(len, 8);
for (int i=0;i<5;i++) {
out[i*4] = h[i]>>24;
out[i*4+1] = h[i]>>16;
out[i*4+2] = h[i]>>8;
out[i*4+3] = h[i];
}
}
};
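// Create `d` and any missing parent directories (like `mkdir -p`).
std::string mkdir_p(const std::string &d) {
if (d.empty()) return d;
struct stat st{};
if (stat(d.c_str(), &st) == 0) return d;
for (size_t pos = d.find('/', 1); pos != std::string::npos; pos = d.find('/', pos + 1)) {
mkdir(d.substr(0, pos).c_str(), 0755); // EEXIST is expected for existing parents
}
mkdir(d.c_str(), 0755);
return d;
}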
bool file_exists(const std::string &p) {
struct stat st{};
return stat(p.c_str(), &st) == 0;
}
} // namespace
std::string Sha1Hex(const void *data, size_t len) {
Sha1 s;
s.update(data, len);
uint8_t out[20];
s.final(out);
char hex[41];
for (int i = 0; i < 20; ++i) std::snprintf(hex + i*2, 3, "%02x", out[i]);
hex[40] = 0;
return std::string(hex);
}
KvCache::KvCache() = default;
void KvCache::SetDir(const std::string &dir) {
dir_ = dir;
if (!dir_.empty()) {
mkdir_p(dir_);
std::fprintf(stderr, "ds4 KvCache: enabled at %s\n", dir_.c_str());
} else {
std::fprintf(stderr, "ds4 KvCache: disabled (no dir set)\n");
}
}
std::string KvCache::Path(const std::string &rendered_text) const {
if (dir_.empty()) return "";
return dir_ + "/" + Sha1Hex(rendered_text.data(), rendered_text.size()) + ".kv";
}
size_t KvCache::LoadLongestPrefix(ds4_session *session,
const std::string &rendered_text,
int ctx_size) {
if (dir_.empty() || !session) return 0;
// Strategy: enumerate all .kv files in dir, read their stored prefix
// header, pick the longest one that is also a prefix of rendered_text.
DIR *d = opendir(dir_.c_str());
if (!d) return 0;
struct dirent *de;
size_t best_len = 0;
std::string best_path;
while ((de = readdir(d)) != nullptr) {
std::string name = de->d_name;
if (name.size() < 4 || name.substr(name.size()-3) != ".kv") continue;
std::string path = dir_ + "/" + name;
std::ifstream f(path, std::ios::binary);
if (!f) continue;
char magic[4]; f.read(magic, 4);
if (f.gcount() != 4 || std::memcmp(magic, "DS4G", 4) != 0) continue;
uint32_t version=0, file_ctx=0, prefix_len=0;
f.read((char*)&version, 4); f.read((char*)&file_ctx, 4); f.read((char*)&prefix_len, 4);
if (!f) continue; // truncated header
if (version != 1) continue;
if ((int)file_ctx != ctx_size) continue;
if (prefix_len > rendered_text.size()) continue;
std::vector<char> prefix(prefix_len);
f.read(prefix.data(), prefix_len);
if (!f) continue; // truncated prefix
if (std::memcmp(prefix.data(), rendered_text.data(), prefix_len) != 0) continue;
if (prefix_len > best_len) {
best_len = prefix_len;
best_path = path;
}
}
closedir(d);
if (best_len == 0) return 0;
// Load best_path's payload into session.
std::ifstream f(best_path, std::ios::binary);
char magic[4]; f.read(magic, 4);
uint32_t version, file_ctx, prefix_len;
f.read((char*)&version, 4); f.read((char*)&file_ctx, 4); f.read((char*)&prefix_len, 4);
f.seekg(prefix_len, std::ios::cur);
uint64_t payload_bytes = 0;
f.read((char*)&payload_bytes, 8);
if (!f) return 0; // file was truncated or replaced since the scan above
// ds4_session_load_payload reads from a FILE*; reopen via fopen.
FILE *fp = std::fopen(best_path.c_str(), "rb");
if (!fp) return 0;
// Seek past header + prefix + payload_bytes field.
std::fseek(fp, 4 + 4 + 4 + 4 + prefix_len + 8, SEEK_SET);
char errbuf[256] = {0};
int rc = ds4_session_load_payload(session, fp, payload_bytes, errbuf, sizeof(errbuf));
std::fclose(fp);
if (rc != 0) return 0;
return best_len;
}
void KvCache::Save(ds4_session *session, const std::string &rendered_text, int ctx_size) {
if (dir_.empty()) {
std::fprintf(stderr, "ds4 KvCache::Save: skipped (dir empty)\n");
return;
}
if (!session) {
std::fprintf(stderr, "ds4 KvCache::Save: skipped (session null)\n");
return;
}
std::string path = Path(rendered_text);
uint64_t payload_bytes = ds4_session_payload_bytes(session);
std::fprintf(stderr, "ds4 KvCache::Save: path=%s payload_bytes=%llu prefix_len=%zu\n",
path.c_str(), (unsigned long long)payload_bytes, rendered_text.size());
FILE *fp = std::fopen(path.c_str(), "wb");
if (!fp) {
std::fprintf(stderr, "ds4 KvCache::Save: fopen failed: %s\n", std::strerror(errno));
return;
}
char magic[4] = {'D','S','4','G'};
uint32_t version = 1;
uint32_t ctx = static_cast<uint32_t>(ctx_size);
uint32_t prefix_len = static_cast<uint32_t>(rendered_text.size());
std::fwrite(magic, 4, 1, fp);
std::fwrite(&version, 4, 1, fp);
std::fwrite(&ctx, 4, 1, fp);
std::fwrite(&prefix_len, 4, 1, fp);
std::fwrite(rendered_text.data(), prefix_len, 1, fp);
std::fwrite(&payload_bytes, 8, 1, fp);
char errbuf[256] = {0};
int rc = ds4_session_save_payload(session, fp, errbuf, sizeof(errbuf));
std::fclose(fp);
if (rc != 0) {
std::fprintf(stderr, "ds4 KvCache::Save: ds4_session_save_payload rc=%d err=%s; removing %s\n",
rc, errbuf, path.c_str());
std::remove(path.c_str());
} else {
std::fprintf(stderr, "ds4 KvCache::Save: wrote %s ok\n", path.c_str());
}
}
} // namespace ds4cpp
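Review note: the selection rule inside LoadLongestPrefix() is independent of the file I/O around it. A minimal sketch of just that rule follows, assuming the stored prefixes have already been read into memory. `longest_prefix_len` is a hypothetical name used only for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the length of the longest cached entry that is a byte-wise prefix
// of `text`, or 0 when no entry matches (the "no cache hit" case).
size_t longest_prefix_len(const std::vector<std::string> &cached,
                          const std::string &text) {
    size_t best = 0;
    for (const auto &p : cached) {
        if (p.size() > text.size()) continue;             // cannot be a prefix
        if (text.compare(0, p.size(), p) != 0) continue;  // bytes differ
        if (p.size() > best) best = p.size();
    }
    return best;
}
```

As in the patch, an empty cached prefix never counts as a hit because the best length must be greater than zero.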

@@ -0,0 +1,44 @@
#pragma once
#include <string>
extern "C" {
#include "ds4.h"
}
namespace ds4cpp {
// Disk-backed KV cache for ds4 sessions. Keyed by SHA1(rendered prompt prefix).
// Format (our own, NOT bit-compatible with ds4-server's KVC files - interop
// is a follow-up plan):
//
// "DS4G" (4 bytes magic) + u32 version=1 + u32 ctx_size +
// u32 prefix_text_len + prefix_text + u64 payload_bytes + payload
class KvCache {
public:
KvCache(); // disabled (dir empty)
// Set the cache directory. Empty disables.
void SetDir(const std::string &dir);
// Returns the cache file path for a given rendered text prefix.
std::string Path(const std::string &rendered_text) const;
// Look up the longest cached prefix that is also a prefix of
// `rendered_text`. Loads it into `session` if found. Returns the
// matched prefix length in bytes (0 if no hit).
size_t LoadLongestPrefix(ds4_session *session,
const std::string &rendered_text,
int ctx_size);
// Save the current session, associated with this rendered text prefix.
void Save(ds4_session *session, const std::string &rendered_text, int ctx_size);
bool enabled() const { return !dir_.empty(); }
private:
std::string dir_;
};
// Compute SHA1 of arbitrary bytes; returns 40-char hex.
std::string Sha1Hex(const void *data, size_t len);
} // namespace ds4cpp
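Review note: the on-disk layout described in the header comment can be exercised in memory. The sketch below encodes and decodes only the header part (everything before the payload) and, like the patch, writes integers in host byte order, so files are not portable across endianness. `KvHeader`, `encode_header`, and `decode_header` are hypothetical names and not part of the patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical in-memory view of the header fields listed in kv_cache.h.
struct KvHeader {
    uint32_t version;
    uint32_t ctx_size;
    std::string prefix;
    uint64_t payload_bytes;
};

// "DS4G" + u32 version + u32 ctx_size + u32 prefix_len + prefix + u64 payload_bytes
std::string encode_header(const KvHeader &h) {
    std::string out("DS4G", 4);
    auto put = [&out](const void *p, size_t n) {
        out.append(static_cast<const char *>(p), n);
    };
    const uint32_t plen = static_cast<uint32_t>(h.prefix.size());
    put(&h.version, 4);
    put(&h.ctx_size, 4);
    put(&plen, 4);
    out += h.prefix;
    put(&h.payload_bytes, 8);
    return out;
}

// Returns false on a bad magic or a truncated buffer.
bool decode_header(const std::string &buf, KvHeader &h) {
    size_t off = 0;
    auto get = [&](void *p, size_t n) {
        if (buf.size() - off < n) return false;  // not enough bytes left
        std::memcpy(p, buf.data() + off, n);
        off += n;
        return true;
    };
    char magic[4];
    uint32_t plen = 0;
    if (!get(magic, 4) || std::memcmp(magic, "DS4G", 4) != 0) return false;
    if (!get(&h.version, 4) || !get(&h.ctx_size, 4) || !get(&plen, 4)) return false;
    if (buf.size() - off < plen) return false;
    h.prefix.assign(buf, off, plen);
    off += plen;
    return get(&h.payload_bytes, 8);
}
```

Checking every read against the remaining length is the main design point: a truncated cache file is rejected instead of being matched against partially read data.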

backend/cpp/ds4/package.sh Executable file

@@ -0,0 +1,39 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath "$0")")
REPO_ROOT="${CURDIR}/../../.."
mkdir -p "$CURDIR/package/lib"
cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
cp -rfv "$CURDIR/run.sh" "$CURDIR/package/"
UNAME_S=$(uname -s)
if [ "$UNAME_S" = "Darwin" ]; then
# Darwin: bundle dylibs via otool -L (handled by scripts/build/ds4-darwin.sh).
echo "package.sh: Darwin handled by ds4-darwin.sh"
exit 0
fi
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
LIBDIR=/lib/x86_64-linux-gnu
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
LIBDIR=/lib/aarch64-linux-gnu
else
echo "package.sh: unknown architecture" >&2; exit 1
fi
for lib in libc.so.6 libgcc_s.so.1 libstdc++.so.6 libm.so.6 libgomp.so.1 \
libdl.so.2 librt.so.1 libpthread.so.0; do
cp -arfLv "$LIBDIR/$lib" "$CURDIR/package/lib/$lib"
done
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "ds4 package contents:"
ls -lah "$CURDIR/package/" "$CURDIR/package/lib/"

backend/cpp/ds4/run.sh Executable file

@@ -0,0 +1,9 @@
#!/bin/bash
# Entry point for the ds4 backend image / BACKEND_BINARY mode.
set -e
CURDIR=$(dirname "$(realpath "$0")")
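# Append the previous value only when it is non-empty: a trailing ':' would
# add the current working directory to the library search path.
export LD_LIBRARY_PATH="$CURDIR/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"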
if [ -f "$CURDIR/lib/ld.so" ]; then
exec "$CURDIR/lib/ld.so" "$CURDIR/grpc-server" "$@"
fi
exec "$CURDIR/grpc-server" "$@"

@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=23127139cb6fa314899c3b5f4935b88b3374c56c
IK_LLAMA_VERSION?=eb570eb96689c235933b813693ca28ab9d3d26de
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=

@@ -1,5 +1,5 @@
LLAMA_VERSION?=389ff61d77b5c71cec0cf92fe4e5d01ace80b797
LLAMA_VERSION?=1ec7ba0c14f33f17e980daeeda5f35b225d41994
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=

@@ -36,6 +36,8 @@
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <list>
#include <map>
#include <mutex>
#include <signal.h>
#include <thread>
@@ -443,10 +445,22 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// Draft model for speculative decoding
if (!request->draftmodel().empty()) {
params.speculative.draft.mparams.path = request->draftmodel();
// Default to draft type if a draft model is set but no explicit type
// Default to draft type if a draft model is set but no explicit type.
// Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
// vector; the turboquant fork still uses the legacy scalar. The
// LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
// backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
}
#else
const bool no_spec_type = params.speculative.types.empty() ||
(params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
if (no_spec_type) {
params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT };
}
#endif
}
// params.model_alias ??
@@ -673,10 +687,35 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
}
// Speculative decoding options
} else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
auto type = common_speculative_type_from_name(optval_str);
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// Fork only knows a single scalar `type`. Take the first comma-
// separated value and assign it via the singular helper.
std::string first = optval_str;
const auto comma = first.find(',');
if (comma != std::string::npos) first = first.substr(0, comma);
auto type = common_speculative_type_from_name(first);
if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
params.speculative.type = type;
}
#else
// Upstream switched to a vector of types (comma-separated for multi-type
// chaining via common_speculative_types_from_names). We keep accepting a
// single value here, but also tolerate comma-separated lists.
std::vector<std::string> names;
std::string item;
for (char c : optval_str) {
if (c == ',') {
if (!item.empty()) { names.push_back(item); item.clear(); }
} else {
item.push_back(c);
}
}
if (!item.empty()) names.push_back(item);
auto parsed = common_speculative_types_from_names(names);
if (!parsed.empty()) {
params.speculative.types = parsed;
}
#endif
} else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
if (optval != NULL) {
try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
@@ -710,10 +749,155 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
try { params.speculative.draft.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "draft_ctx_size")) {
if (optval != NULL) {
try { params.speculative.draft.n_ctx = std::stoi(optval_str); } catch (...) {}
}
// The draft context size is no longer a separate field upstream: the draft
// shares the target context size. Accept the option for backward
// compatibility but silently ignore it.
// Everything below relies on struct shape introduced in ggml-org/llama.cpp#22838
// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
// fields. The turboquant fork branched before that, so its build defines
// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
// keys become unrecognized (silently dropped, like any unknown opt) for it.
//
// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
// closing-brace position of the `draft_ctx_size` branch on purpose: in the
// legacy build the chain ends here (the brace closes draft_ctx_size), and in
// the modern build the chain continues with `} else if (...)` instead, so the
// brace count stays balanced under both branches of the preprocessor.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
}
#else
// --- ngram_mod family (upstream --spec-ngram-mod-*) ---
} else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
if (optval != NULL) {
try { params.speculative.ngram_mod.n_min = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_mod_n_max")) {
if (optval != NULL) {
try { params.speculative.ngram_mod.n_max = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_mod_n_match")) {
if (optval != NULL) {
try { params.speculative.ngram_mod.n_match = std::stoi(optval_str); } catch (...) {}
}
// --- ngram_map_k family (upstream --spec-ngram-map-k-*) ---
} else if (!strcmp(optname, "spec_ngram_map_k_size_n")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k_size_m")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k_min_hits")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
// --- ngram_map_k4v family (upstream --spec-ngram-map-k4v-*) ---
} else if (!strcmp(optname, "spec_ngram_map_k4v_size_n")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k4v.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k4v_size_m")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k4v.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k4v_min_hits")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k4v.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
// --- ngram lookup caches (upstream --lookup-cache-static / -dynamic) ---
} else if (!strcmp(optname, "spec_lookup_cache_static") || !strcmp(optname, "lookup_cache_static")) {
params.speculative.ngram_cache.lookup_cache_static = optval_str;
} else if (!strcmp(optname, "spec_lookup_cache_dynamic") || !strcmp(optname, "lookup_cache_dynamic")) {
params.speculative.ngram_cache.lookup_cache_dynamic = optval_str;
// --- draft model KV cache types (upstream --spec-draft-type-k / -v) ---
} else if (!strcmp(optname, "draft_cache_type_k") || !strcmp(optname, "spec_draft_cache_type_k")) {
params.speculative.draft.cache_type_k = kv_cache_type_from_str(optval_str);
} else if (!strcmp(optname, "draft_cache_type_v") || !strcmp(optname, "spec_draft_cache_type_v")) {
params.speculative.draft.cache_type_v = kv_cache_type_from_str(optval_str);
// --- draft model thread counts (upstream --spec-draft-threads / -batch) ---
} else if (!strcmp(optname, "draft_threads") || !strcmp(optname, "spec_draft_threads")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n <= 0) n = (int)std::thread::hardware_concurrency();
params.speculative.draft.cpuparams.n_threads = n;
} catch (...) {}
}
} else if (!strcmp(optname, "draft_threads_batch") || !strcmp(optname, "spec_draft_threads_batch")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n <= 0) n = (int)std::thread::hardware_concurrency();
params.speculative.draft.cpuparams_batch.n_threads = n;
} catch (...) {}
}
// --- draft model MoE on CPU (upstream --spec-draft-cpu-moe / --spec-draft-n-cpu-moe) ---
} else if (!strcmp(optname, "draft_cpu_moe") || !strcmp(optname, "spec_draft_cpu_moe")) {
// Bool-style flag: a missing value, or one of "true"/"1"/"yes"/"on"/"enabled", enables it.
const bool enable = (optval == NULL) ||
optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
optval_str == "on" || optval_str == "enabled";
if (enable) {
params.speculative.draft.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
}
} else if (!strcmp(optname, "draft_n_cpu_moe") || !strcmp(optname, "spec_draft_n_cpu_moe")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n < 0) n = 0;
// Keep override-name storage alive for the lifetime of the params struct
// (mirrors upstream arg.cpp behavior with a function-local static).
static std::list<std::string> buft_overrides_draft;
for (int i = 0; i < n; ++i) {
buft_overrides_draft.push_back(llm_ffn_exps_block_regex(i));
params.speculative.draft.tensor_buft_overrides.push_back(
{buft_overrides_draft.back().c_str(), ggml_backend_cpu_buffer_type()});
}
} catch (...) {}
}
// --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
} else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
// Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
// We replicate upstream's parse_tensor_buffer_overrides (static in arg.cpp).
ggml_backend_load_all();
std::map<std::string, ggml_backend_buffer_type_t> buft_list;
for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
auto * dev = ggml_backend_dev_get(i);
auto * buft = ggml_backend_dev_buffer_type(dev);
if (buft) {
buft_list[ggml_backend_buft_name(buft)] = buft;
}
}
static std::list<std::string> draft_override_names;
std::string cur;
auto flush = [&](const std::string & spec) {
auto pos = spec.find('=');
if (pos == std::string::npos) return;
const std::string name = spec.substr(0, pos);
const std::string type = spec.substr(pos + 1);
auto it = buft_list.find(type);
if (it == buft_list.end()) return; // unknown buffer type: ignore
draft_override_names.push_back(name);
params.speculative.draft.tensor_buft_overrides.push_back(
{draft_override_names.back().c_str(), it->second});
};
for (char c : optval_str) {
if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
else { cur.push_back(c); }
}
if (!cur.empty()) flush(cur);
}
#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
}
// Set params.n_parallel from environment variable if not set via options (fallback)
@@ -2704,7 +2888,7 @@ public:
tasks.reserve(documents.size());
for (size_t i = 0; i < documents.size(); i++) {
auto tmp = format_prompt_rerank(ctx_server.impl->model, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
auto tmp = format_prompt_rerank(ctx_server.impl->model_tgt, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
server_task task = server_task(SERVER_TASK_TYPE_RERANK);
task.id = rd.queue_tasks.get_new_id();
task.index = i;
@@ -2882,7 +3066,7 @@ public:
// Get template source and reconstruct a common_chat_template for analysis
std::string tmpl_src = common_chat_templates_source(ctx_server.impl->chat_params.tmpls.get());
if (!tmpl_src.empty()) {
const auto * vocab = llama_model_get_vocab(ctx_server.impl->model);
const auto * vocab = llama_model_get_vocab(ctx_server.impl->model_tgt);
std::string token_bos, token_eos;
if (vocab) {
auto bos_id = llama_vocab_bos(vocab);
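Review note: both the `spec_type` parser and the `draft_override_tensor` parser in this diff split the option value on commas and drop empty items, and the latter then splits each item at the first `=`. A standalone sketch of those two steps follows. `split_csv` and `split_pair` are hypothetical names, and the sketch leaves out the buffer-type lookup that the patch performs against the ggml device list.

```cpp
#include <string>
#include <utility>
#include <vector>

// Split on ',' and skip empty items, so "a,,b," yields {"a", "b"}.
std::vector<std::string> split_csv(const std::string &s) {
    std::vector<std::string> out;
    std::string item;
    for (char c : s) {
        if (c == ',') {
            if (!item.empty()) { out.push_back(item); item.clear(); }
        } else {
            item.push_back(c);
        }
    }
    if (!item.empty()) out.push_back(item);
    return out;
}

// Split "name=type" at the first '='; returns false when there is no '='.
bool split_pair(const std::string &spec,
                std::pair<std::string, std::string> &kv) {
    const auto pos = spec.find('=');
    if (pos == std::string::npos) return false;
    kv = {spec.substr(0, pos), spec.substr(pos + 1)};
    return true;
}
```

One consequence of this scheme is that a tensor regex containing a literal `,` or `=` cannot be expressed, since those characters are always treated as separators.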

@@ -108,4 +108,47 @@ else
echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
fi
# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
# ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
# exposes the field as `model` on `server_context_impl`. The two call sites
# are in the Rerank and ModelMetadata RPC handlers.
if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> model_tgt rename OK"
else
echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
fi
# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
# grpc-server option parser skips the new option-handler blocks (ngram_mod,
# ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
# draft.tensor_buft_overrides) introduced for the post-#22838 layout. Those
# blocks reference struct fields that simply do not exist in the fork.
if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
else
echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
# Insert the define before the very first `#include` so it precedes all the
# speculative-decoding code paths.
awk '
!done && /^#include/ {
print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
print ""
done = 1
}
{ print }
END {
if (!done) {
print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
fi
echo "==> all patches applied"

@@ -72,6 +72,29 @@
nvidia-cuda-12: "cuda12-turboquant"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant"
- &ds4
name: "ds4"
alias: "ds4"
license: mit
description: |
antirez/ds4 - DeepSeek V4 Flash inference engine. Single-model,
optimized for Metal (Darwin) and CUDA (Linux). Requires the GGUFs
published at huggingface.co/antirez/deepseek-v4-gguf.
urls:
- https://github.com/antirez/ds4
tags:
- text-to-text
- LLM
- CPU
- CUDA
- Metal
capabilities:
default: "cpu-ds4"
nvidia: "cuda13-ds4"
nvidia-cuda-13: "cuda13-ds4"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ds4"
metal: "metal-ds4"
metal-darwin-arm64: "metal-ds4"
- &whispercpp
name: "whisper"
alias: "whisper"
@@ -1127,6 +1150,15 @@
nvidia-cuda-12: "cuda12-turboquant-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant-development"
- !!merge <<: *ds4
name: "ds4-development"
capabilities:
default: "cpu-ds4-development"
nvidia: "cuda13-ds4-development"
nvidia-cuda-13: "cuda13-ds4-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ds4-development"
metal: "metal-ds4-development"
metal-darwin-arm64: "metal-ds4-development"
- !!merge <<: *stablediffusionggml
name: "stablediffusion-ggml-development"
capabilities:
@@ -1673,6 +1705,47 @@
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant
## ds4
- !!merge <<: *ds4
name: "cpu-ds4"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ds4"
mirrors:
- localai/localai-backends:latest-cpu-ds4
- !!merge <<: *ds4
name: "cpu-ds4-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ds4"
mirrors:
- localai/localai-backends:master-cpu-ds4
- !!merge <<: *ds4
name: "cuda13-ds4"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-ds4"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-ds4
- !!merge <<: *ds4
name: "cuda13-ds4-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-ds4"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-ds4
- !!merge <<: *ds4
name: "cuda13-nvidia-l4t-arm64-ds4"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-ds4"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-ds4
- !!merge <<: *ds4
name: "cuda13-nvidia-l4t-arm64-ds4-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-ds4"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-ds4
- !!merge <<: *ds4
name: "metal-ds4"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-ds4"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-ds4
- !!merge <<: *ds4
name: "metal-ds4-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-ds4"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-ds4
## whisper
- !!merge <<: *whispercpp
name: "whisper-development"

@@ -2,7 +2,7 @@ torch==2.7.1
llvmlite==0.43.0
numba==0.60.0
accelerate
transformers>=5.0.0
transformers>=5.8.0
bitsandbytes
sentence-transformers==5.4.0
diffusers

@@ -2,7 +2,7 @@ torch==2.7.1
accelerate
llvmlite==0.43.0
numba==0.60.0
transformers>=5.0.0
transformers>=5.8.0
bitsandbytes
sentence-transformers==5.4.0
diffusers

@@ -2,7 +2,7 @@
torch==2.9.0
llvmlite==0.43.0
numba==0.60.0
transformers>=5.0.0
transformers>=5.8.0
bitsandbytes
sentence-transformers==5.4.0
diffusers

@@ -1,7 +1,7 @@
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch==2.10.0+rocm7.0
accelerate
transformers>=5.0.0
transformers>=5.8.0
llvmlite==0.43.0
numba==0.60.0
bitsandbytes

@@ -3,7 +3,7 @@ torch
optimum[openvino]
llvmlite==0.43.0
numba==0.60.0
transformers>=5.0.0
transformers>=5.8.0
bitsandbytes
sentence-transformers==5.4.0
diffusers

@@ -2,7 +2,7 @@ torch==2.7.1
llvmlite==0.43.0
numba==0.60.0
accelerate
transformers>=5.0.0
transformers>=5.8.0
bitsandbytes
sentence-transformers==5.4.0
diffusers

@@ -33,7 +33,7 @@ dependencies = [
"certifi",
"setuptools",
"pillow",
"charset-normalizer>=3.4.0",
"charset-normalizer>=3.4.7",
"chardet",
# L4T-specific accelerator stack (sourced from jetson-ai-lab below).
"torch",

@@ -3,5 +3,5 @@ protobuf
certifi
setuptools
pillow
charset-normalizer>=3.4.0
charset-normalizer>=3.4.7
chardet

@@ -0,0 +1,130 @@
package importers
import (
"encoding/json"
"path/filepath"
"strings"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/downloader"
"github.com/mudler/LocalAI/pkg/functions"
"go.yaml.in/yaml/v2"
)
var _ Importer = &DS4Importer{}
// DS4Importer detects antirez/ds4 weights - single-model DeepSeek V4 Flash
// inference engine. ds4 only loads the GGUFs published at
// huggingface.co/antirez/deepseek-v4-gguf; auto-detect keys on:
//
// - the repo name itself ("antirez/deepseek-v4-gguf" anywhere in URI)
// - the canonical filename pattern "DeepSeek-V4-Flash-*.gguf"
//
// Must register BEFORE LlamaCPPImporter - both match .gguf, but ds4 is
// more specific and first-match-wins.
type DS4Importer struct{}
func (i *DS4Importer) Name() string { return "ds4" }
func (i *DS4Importer) Modality() string { return "text" }
func (i *DS4Importer) AutoDetects() bool { return true }
func (i *DS4Importer) Match(details Details) bool {
preferences, err := details.Preferences.MarshalJSON()
if err != nil {
return false
}
preferencesMap := make(map[string]any)
if len(preferences) > 0 {
_ = json.Unmarshal(preferences, &preferencesMap)
}
if b, ok := preferencesMap["backend"].(string); ok && b == "ds4" {
return true
}
if strings.Contains(details.URI, "antirez/deepseek-v4-gguf") {
return true
}
base := filepath.Base(details.URI)
if strings.HasPrefix(base, "DeepSeek-V4-Flash-") && strings.HasSuffix(base, ".gguf") {
return true
}
if details.HuggingFace != nil {
for _, file := range details.HuggingFace.Files {
fb := filepath.Base(file.Path)
if strings.HasPrefix(fb, "DeepSeek-V4-Flash-") && strings.HasSuffix(fb, ".gguf") {
return true
}
}
}
return false
}
func (i *DS4Importer) Import(details Details) (gallery.ModelConfig, error) {
preferences, err := details.Preferences.MarshalJSON()
if err != nil {
return gallery.ModelConfig{}, err
}
preferencesMap := make(map[string]any)
if len(preferences) > 0 {
_ = json.Unmarshal(preferences, &preferencesMap)
}
name, ok := preferencesMap["name"].(string)
if !ok {
name = filepath.Base(details.URI)
name = strings.TrimSuffix(name, ".gguf")
}
description, ok := preferencesMap["description"].(string)
if !ok {
description = "DeepSeek V4 Flash - antirez/ds4 backend"
}
modelConfig := config.ModelConfig{
Name: name,
Description: description,
KnownUsecaseStrings: []string{config.UsecaseChat},
Backend: "ds4",
PredictionOptions: schema.PredictionOptions{
BasicModelRequest: schema.BasicModelRequest{
Model: "ds4flash.gguf",
},
},
TemplateConfig: config.TemplateConfig{
UseTokenizerTemplate: true,
},
FunctionsConfig: functions.FunctionsConfig{
GrammarConfig: functions.GrammarConfig{NoGrammar: true},
// ds4 emits OpenAI-shape tool_calls in ChatDelta natively via
// our DSML parser; the Go-side regex fallback should NOT fire.
AutomaticToolParsingFallback: false,
},
}
cfg := gallery.ModelConfig{
Name: name,
Description: description,
}
// The file to fetch: derive from the URI. We standardize the local
// filename to "ds4flash.gguf" to match ds4's own convention (its CLI
// defaults to that path), so users can run the model without extra
// config.
uri := downloader.URI(details.URI)
cfg.Files = append(cfg.Files, gallery.File{
Filename: "ds4flash.gguf",
URI: string(uri),
})
out, err := yaml.Marshal(modelConfig)
if err != nil {
return gallery.ModelConfig{}, err
}
cfg.ConfigFile = string(out)
return cfg, nil
}
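
The detection order in `Match` (explicit preference, then repo name, then filename pattern) can be exercised in isolation. The sketch below is a simplified stand-in, assuming the URI and the backend preference arrive as plain strings; `looksLikeDS4` is a hypothetical name, and the HuggingFace file-listing branch is left out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// looksLikeDS4 is a hypothetical, simplified stand-in for DS4Importer.Match.
// It takes the URI and an optional backend preference as plain strings and
// skips the HuggingFace file-listing check done by the real importer.
func looksLikeDS4(uri, backendPref string) bool {
	// 1. An explicit preference always wins.
	if backendPref == "ds4" {
		return true
	}
	// 2. The canonical repository name anywhere in the URI.
	if strings.Contains(uri, "antirez/deepseek-v4-gguf") {
		return true
	}
	// 3. The canonical filename pattern DeepSeek-V4-Flash-*.gguf.
	base := filepath.Base(uri)
	return strings.HasPrefix(base, "DeepSeek-V4-Flash-") && strings.HasSuffix(base, ".gguf")
}

func main() {
	fmt.Println(looksLikeDS4("huggingface://antirez/deepseek-v4-gguf/any.gguf", ""))
	fmt.Println(looksLikeDS4("https://example.com/mirror/DeepSeek-V4-Flash-Q4.gguf", ""))
	fmt.Println(looksLikeDS4("https://example.com/some-other.gguf", "ds4"))
	fmt.Println(looksLikeDS4("huggingface://TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf", ""))
}
```

Only the last call is expected to report false, which is the case that should fall through to the generic llama-cpp importer.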

@@ -0,0 +1,69 @@
package importers_test
import (
"encoding/json"
"strings"
. "github.com/mudler/LocalAI/core/gallery/importers"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("DS4Importer", func() {
var importer *DS4Importer
BeforeEach(func() {
importer = &DS4Importer{}
})
Context("Match", func() {
It("matches the canonical HuggingFace repo URI", func() {
details := Details{
URI: "huggingface://antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf",
}
Expect(importer.Match(details)).To(BeTrue())
})
It("matches when filename has the DeepSeek-V4-Flash prefix", func() {
details := Details{
URI: "https://example.com/mirror/DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf",
}
Expect(importer.Match(details)).To(BeTrue())
})
It("matches when backend preference is ds4", func() {
prefs := json.RawMessage(`{"backend": "ds4"}`)
details := Details{
URI: "https://example.com/some-other.gguf",
Preferences: prefs,
}
Expect(importer.Match(details)).To(BeTrue())
})
It("does not match arbitrary GGUFs (must fall through to llama-cpp)", func() {
details := Details{URI: "huggingface://TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf"}
Expect(importer.Match(details)).To(BeFalse())
})
It("does not match non-GGUF assets", func() {
details := Details{URI: "https://example.com/model.bin"}
Expect(importer.Match(details)).To(BeFalse())
})
})
Context("Import", func() {
It("emits backend: ds4 and the standard ds4flash.gguf filename", func() {
details := Details{
URI: "huggingface://antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf",
}
cfg, err := importer.Import(details)
Expect(err).NotTo(HaveOccurred())
Expect(cfg.Files).To(HaveLen(1))
Expect(cfg.Files[0].Filename).To(Equal("ds4flash.gguf"))
Expect(cfg.Files[0].URI).To(Equal(details.URI))
Expect(strings.Contains(cfg.ConfigFile, "backend: ds4")).To(BeTrue(),
"ConfigFile must specify backend: ds4, got: %s", cfg.ConfigFile)
Expect(strings.Contains(cfg.ConfigFile, "use_tokenizer_template: true")).To(BeTrue())
})
})
})

@@ -153,6 +153,11 @@ var defaultImporters = []Importer{
// checkpoints may carry tokenizer-adjacent artefacts.
&RFDetrImporter{},
// Existing
// DS4Importer must precede LlamaCPPImporter - ds4 weights are GGUFs and
// would otherwise be claimed by the generic .gguf-handling llama-cpp
// importer. It matches only an explicit ds4 backend preference, the
// antirez/deepseek-v4-gguf repo, or the DeepSeek-V4-Flash-*.gguf filename
// pattern, so arbitrary GGUFs still fall through to llama-cpp.
&DS4Importer{},
&LlamaCPPImporter{},
&MLXImporter{},
&VLLMImporter{},
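
Why the registration order matters can be shown with a minimal first-match-wins loop. The `matcher` type, `firstMatch`, and the two sample matchers below are hypothetical simplifications, not the actual `Importer` interface.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher is a hypothetical, reduced form of an importer: a name plus a
// predicate over the URI.
type matcher struct {
	name  string
	match func(uri string) bool
}

// ds4Matcher is the specific matcher; llamaMatcher claims every .gguf file.
var (
	ds4Matcher = matcher{"ds4", func(u string) bool {
		return strings.Contains(u, "antirez/deepseek-v4-gguf")
	}}
	llamaMatcher = matcher{"llama-cpp", func(u string) bool {
		return strings.HasSuffix(u, ".gguf")
	}}
)

// firstMatch returns the name of the first matcher that accepts uri, or ""
// when none does.
func firstMatch(ms []matcher, uri string) string {
	for _, m := range ms {
		if m.match(uri) {
			return m.name
		}
	}
	return ""
}

func main() {
	uri := "huggingface://antirez/deepseek-v4-gguf/model.gguf"
	fmt.Println(firstMatch([]matcher{ds4Matcher, llamaMatcher}, uri)) // ds4
	fmt.Println(firstMatch([]matcher{llamaMatcher, ds4Matcher}, uri)) // llama-cpp
}
```

With the generic matcher listed first, the specific one is never consulted for a `.gguf` URI, which is the failure mode the ordering comment guards against.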

@@ -23,6 +23,8 @@ import (
// backends that should appear in the import form dropdown.
var knownPrefOnlyBackends = []schema.KnownBackend{
// Text LLM
// ds4: antirez/ds4 - single-model DeepSeek V4 Flash engine; auto-detected via DS4Importer
{Name: "ds4", Modality: "text", AutoDetect: false, Description: "antirez/ds4 DeepSeek V4 Flash engine (auto-detected; pref-only fallback)"},
{Name: "sglang", Modality: "text", AutoDetect: false, Description: "SGLang runtime (preference-only)"},
{Name: "tinygrad", Modality: "text", AutoDetect: false, Description: "tinygrad runtime (preference-only)"},
{Name: "trl", Modality: "text", AutoDetect: false, Description: "Transformers Reinforcement Learning (preference-only)"},

@@ -0,0 +1,142 @@
package ollama
import (
"regexp"
"strings"
"github.com/mudler/LocalAI/core/config"
)
// modelCapabilities maps a LocalAI ModelConfig to the Ollama capability strings
// (https://github.com/ollama/ollama/blob/main/docs/api.md#show-model-information).
//
// Ollama clients use these to decide which models are eligible for a given task
// (e.g. only allow embedding models in an "embedding model" picker). Returning
// an empty list makes clients assume "completion" everywhere, which is wrong
// for embedding/rerank/audio backends — see issue #9760.
func modelCapabilities(cfg *config.ModelConfig) []string {
if cfg == nil {
return nil
}
var caps []string
if cfg.HasUsecases(config.FLAG_EMBEDDINGS) {
caps = append(caps, "embedding")
}
chatCapable := cfg.HasUsecases(config.FLAG_CHAT) || cfg.HasUsecases(config.FLAG_COMPLETION)
if chatCapable {
caps = append(caps, "completion")
}
if chatCapable && hasVisionSupport(cfg) {
caps = append(caps, "vision")
}
if chatCapable && hasToolSupport(cfg) {
caps = append(caps, "tools")
}
if chatCapable && hasThinkingSupport(cfg) {
caps = append(caps, "thinking")
}
if chatCapable && cfg.TemplateConfig.Completion != "" {
caps = append(caps, "insert")
}
return caps
}
// hasVisionSupport reports whether the model can accept image inputs. We avoid
// cfg.HasUsecases(FLAG_VISION) because GuessUsecases has no FLAG_VISION case
// and returns true for any chat model — see core/config/model_config.go. Instead
// we look for explicit signals: KnownUsecases bit, multimodal projector, or
// template/backend-reported multimodal markers.
func hasVisionSupport(cfg *config.ModelConfig) bool {
if cfg.KnownUsecases != nil && (*cfg.KnownUsecases&config.FLAG_VISION) == config.FLAG_VISION {
return true
}
if cfg.MMProj != "" {
return true
}
if cfg.TemplateConfig.Multimodal != "" {
return true
}
if cfg.MediaMarker != "" {
return true
}
return false
}
// hasToolSupport reports whether the model is wired up for tool / function calling.
// We look for any of the explicit configuration knobs LocalAI uses to drive
// function-call extraction (regex match, response regex, grammar triggers, XML
// format) or for the auto-detected tool-format markers populated by the
// llama.cpp backend during model load.
func hasToolSupport(cfg *config.ModelConfig) bool {
fc := cfg.FunctionsConfig
if fc.ToolFormatMarkers != nil && fc.ToolFormatMarkers.FormatType != "" {
return true
}
if len(fc.JSONRegexMatch) > 0 || len(fc.ResponseRegex) > 0 {
return true
}
if fc.XMLFormatPreset != "" || fc.XMLFormat != nil {
return true
}
if len(fc.GrammarConfig.GrammarTriggers) > 0 || fc.GrammarConfig.SchemaType != "" {
return true
}
return false
}
// hasThinkingSupport reports whether the model has reasoning / thinking enabled.
// LocalAI sets DisableReasoning=false (or leaves thinking markers configured)
// when the backend probe reports that the model supports thinking.
func hasThinkingSupport(cfg *config.ModelConfig) bool {
rc := cfg.ReasoningConfig
if rc.DisableReasoning != nil && !*rc.DisableReasoning {
return true
}
if len(rc.ThinkingStartTokens) > 0 || len(rc.TagPairs) > 0 {
// Explicit thinking markers imply support unless explicitly disabled.
return rc.DisableReasoning == nil || !*rc.DisableReasoning
}
return false
}
// quantRegex matches GGUF-style quantization suffixes (Q4_K_M, Q8_0, IQ3_XS, F16, ...).
// Matches the convention used by GGUF tooling and what ggml-org/llama.cpp report.
var quantRegex = regexp.MustCompile(`(?i)(IQ\d+(?:_[A-Z0-9]+)*|Q\d+(?:_[A-Z0-9]+)*|F16|F32|BF16)`)
// paramSizeRegex matches a parameter-size token surrounded by separators
// (e.g. "-7B-", "_3b.", ".70B-"). Avoids matching the "7" inside "Qwen3".
var paramSizeRegex = regexp.MustCompile(`(?i)(?:^|[-_.])(\d+(?:\.\d+)?[BM])(?:[-_.]|$)`)
// extractQuantizationLevel pulls the quantization tag from the model filename.
// Returns the uppercased token (e.g. "Q4_K_M") or "" when not present.
func extractQuantizationLevel(modelFile string) string {
if modelFile == "" {
return ""
}
base := strings.TrimSuffix(modelFile, ".gguf")
if m := quantRegex.FindString(base); m != "" {
return strings.ToUpper(m)
}
return ""
}
// extractParameterSize pulls the parameter count from the model filename.
// Returns "" when no recognizable token is present.
func extractParameterSize(modelFile string) string {
if modelFile == "" {
return ""
}
base := strings.TrimSuffix(modelFile, ".gguf")
if m := paramSizeRegex.FindStringSubmatch(base); len(m) > 1 {
return strings.ToUpper(m[1])
}
return ""
}
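
The two filename regexes can be tried standalone. The sketch below copies the patterns from this file; the helper names `quantOf` and `sizeOf` are shortened, hypothetical stand-ins for `extractQuantizationLevel` and `extractParameterSize`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Patterns copied from the diff above.
var (
	quantRe = regexp.MustCompile(`(?i)(IQ\d+(?:_[A-Z0-9]+)*|Q\d+(?:_[A-Z0-9]+)*|F16|F32|BF16)`)
	sizeRe  = regexp.MustCompile(`(?i)(?:^|[-_.])(\d+(?:\.\d+)?[BM])(?:[-_.]|$)`)
)

// quantOf returns the uppercased quantization tag, or "" when none is found.
func quantOf(file string) string {
	return strings.ToUpper(quantRe.FindString(strings.TrimSuffix(file, ".gguf")))
}

// sizeOf returns the uppercased parameter-size token, or "" when none is found.
func sizeOf(file string) string {
	m := sizeRe.FindStringSubmatch(strings.TrimSuffix(file, ".gguf"))
	if len(m) > 1 {
		return strings.ToUpper(m[1])
	}
	return ""
}

func main() {
	for _, f := range []string{
		"Qwen3-4B-Instruct-Q4_K_M.gguf",
		"Llama-3-8B-Q4_K_M.gguf",
		"model.gguf",
	} {
		fmt.Printf("%s quant=%q size=%q\n", f, quantOf(f), sizeOf(f))
	}
}
```

The separator requirement in the size pattern is what keeps the "3" in "Qwen3" or "Llama-3" from being read as a parameter count.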

@@ -0,0 +1,138 @@
package ollama
import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/functions"
"github.com/mudler/LocalAI/pkg/reasoning"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func boolPtr(b bool) *bool { return &b }
func withKnownUsecases(cfg config.ModelConfig, flags ...string) config.ModelConfig {
cfg.KnownUsecaseStrings = flags
cfg.KnownUsecases = config.GetUsecasesFromYAML(flags)
return cfg
}
var _ = Describe("modelCapabilities", func() {
DescribeTable("derives Ollama capability strings from a ModelConfig",
func(cfg config.ModelConfig, expected []string) {
caps := modelCapabilities(&cfg)
if len(expected) == 0 {
Expect(caps).To(BeEmpty())
return
}
Expect(caps).To(ConsistOf(expected))
},
Entry("an embedding-only model exposes the embedding capability",
config.ModelConfig{
Name: "embed-model",
Backend: "llama-cpp",
Embeddings: boolPtr(true),
},
[]string{"embedding"},
),
Entry("a chat-template model exposes the completion capability",
config.ModelConfig{
Name: "chat-model",
Backend: "llama-cpp",
TemplateConfig: config.TemplateConfig{
Chat: "{{ .Input }}",
},
},
[]string{"completion"},
),
Entry("a vision-capable chat model exposes completion + vision",
withKnownUsecases(config.ModelConfig{
Name: "vision-model",
Backend: "llama-cpp",
TemplateConfig: config.TemplateConfig{
Chat: "{{ .Input }}",
Multimodal: "<__media__>",
},
}, "FLAG_CHAT", "FLAG_VISION"),
[]string{"completion", "vision"},
),
Entry("a model with reasoning enabled exposes the thinking capability",
config.ModelConfig{
Name: "thinking-model",
Backend: "llama-cpp",
TemplateConfig: config.TemplateConfig{
Chat: "{{ .Input }}",
},
ReasoningConfig: reasoning.Config{
DisableReasoning: boolPtr(false),
},
},
[]string{"completion", "thinking"},
),
Entry("a model with detected tool-format markers exposes the tools capability",
config.ModelConfig{
Name: "tools-model",
Backend: "llama-cpp",
TemplateConfig: config.TemplateConfig{
Chat: "{{ .Input }}",
},
FunctionsConfig: functions.FunctionsConfig{
ToolFormatMarkers: &functions.ToolFormatMarkers{FormatType: "json_native"},
},
},
[]string{"completion", "tools"},
),
Entry("a model with an explicit JSON regex match exposes the tools capability",
config.ModelConfig{
Name: "tools-regex-model",
Backend: "llama-cpp",
TemplateConfig: config.TemplateConfig{
Chat: "{{ .Input }}",
},
FunctionsConfig: functions.FunctionsConfig{
JSONRegexMatch: []string{`(?s).*`},
},
},
[]string{"completion", "tools"},
),
Entry("a pure backend-only model (no template, no embeddings) reports no capabilities",
config.ModelConfig{
Name: "rerank-model",
Backend: "rerankers",
},
[]string{},
),
)
})
var _ = Describe("modelDetailsFromModelConfig", func() {
It("reports gguf format and llama-cpp family/families for a llama-cpp model", func() {
cfg := config.ModelConfig{
Name: "llama",
Backend: "llama-cpp",
}
details := modelDetailsFromModelConfig(&cfg)
Expect(details.Format).To(Equal("gguf"))
Expect(details.Family).To(Equal("llama-cpp"))
Expect(details.Families).To(ConsistOf("llama-cpp"))
})
It("extracts quantization_level from the model filename when present", func() {
cfg := config.ModelConfig{
Name: "qwen-q4",
Backend: "llama-cpp",
}
cfg.Model = "Qwen3-4B-Instruct-Q4_K_M.gguf"
details := modelDetailsFromModelConfig(&cfg)
Expect(details.QuantizationLevel).To(Equal("Q4_K_M"))
})
It("extracts parameter_size from the model filename when present", func() {
cfg := config.ModelConfig{
Name: "qwen-4b",
Backend: "llama-cpp",
}
cfg.Model = "Qwen3-4B-Instruct-Q4_K_M.gguf"
details := modelDetailsFromModelConfig(&cfg)
Expect(details.ParameterSize).To(Equal("4B"))
})
})
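
The table above pins down how capabilities are derived. As a rough sketch of the same rule set, the example below replaces `config.ModelConfig` with a hypothetical `modelFlags` struct of pre-computed booleans; how each flag is derived from the real config is outside its scope.

```go
package main

import "fmt"

// modelFlags is a hypothetical flattening of the signals that
// modelCapabilities reads from a ModelConfig.
type modelFlags struct {
	Embeddings bool // embeddings usecase
	Chat       bool // chat or completion usecase
	Vision     bool // explicit multimodal signal
	Tools      bool // function-calling configuration present
	Thinking   bool // reasoning enabled
	Insert     bool // completion template present
}

// capabilities mirrors the gating in the diff: every capability other than
// "embedding" is only reported for chat-capable models.
func capabilities(f modelFlags) []string {
	var caps []string
	if f.Embeddings {
		caps = append(caps, "embedding")
	}
	if !f.Chat {
		return caps
	}
	caps = append(caps, "completion")
	if f.Vision {
		caps = append(caps, "vision")
	}
	if f.Tools {
		caps = append(caps, "tools")
	}
	if f.Thinking {
		caps = append(caps, "thinking")
	}
	if f.Insert {
		caps = append(caps, "insert")
	}
	return caps
}

func main() {
	fmt.Println(capabilities(modelFlags{Embeddings: true}))
	fmt.Println(capabilities(modelFlags{Chat: true, Tools: true}))
	fmt.Println(capabilities(modelFlags{Vision: true})) // not chat-capable
}
```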

@@ -32,13 +32,15 @@ func ListModelsEndpoint(bcl *config.ModelConfigLoader, ml *model.ModelLoader) ec
digest := fmt.Sprintf("sha256:%x", sha256.Sum256([]byte(name)))
details, caps := modelMetaFromConfig(bcl, name)
entry := schema.OllamaModelEntry{
Name: ollamaName,
Model: ollamaName,
ModifiedAt: time.Now().UTC(),
Size: 0,
Digest: digest,
Details: modelDetailsFromConfig(bcl, name),
Name: ollamaName,
Model: ollamaName,
ModifiedAt: time.Now().UTC(),
Size: 0,
Digest: digest,
Details: details,
Capabilities: caps,
}
models = append(models, entry)
}
@@ -72,10 +74,12 @@ func ShowModelEndpoint(bcl *config.ModelConfigLoader) echo.HandlerFunc {
}
resp := schema.OllamaShowResponse{
Modelfile: fmt.Sprintf("FROM %s", cfg.Model),
Parameters: "",
Template: cfg.TemplateConfig.Chat,
Details: modelDetailsFromModelConfig(&cfg),
Modelfile: fmt.Sprintf("FROM %s", cfg.Model),
Parameters: "",
Template: cfg.TemplateConfig.Chat,
Details: modelDetailsFromModelConfig(&cfg),
ModelInfo: modelInfoFromModelConfig(&cfg),
Capabilities: modelCapabilities(&cfg),
}
return c.JSON(200, resp)
@@ -95,14 +99,16 @@ func ListRunningEndpoint(bcl *config.ModelConfigLoader, ml *model.ModelLoader) e
ollamaName += ":latest"
}
details, caps := modelMetaFromConfig(bcl, name)
entry := schema.OllamaPsEntry{
Name: ollamaName,
Model: ollamaName,
Size: 0,
Digest: fmt.Sprintf("sha256:%x", sha256.Sum256([]byte(name))),
Details: modelDetailsFromConfig(bcl, name),
ExpiresAt: time.Now().Add(24 * time.Hour).UTC(),
SizeVRAM: 0,
Name: ollamaName,
Model: ollamaName,
Size: 0,
Digest: fmt.Sprintf("sha256:%x", sha256.Sum256([]byte(name))),
Details: details,
ExpiresAt: time.Now().Add(24 * time.Hour).UTC(),
SizeVRAM: 0,
Capabilities: caps,
}
models = append(models, entry)
}
@@ -125,18 +131,46 @@ func HeartbeatEndpoint() echo.HandlerFunc {
}
}
func modelDetailsFromConfig(bcl *config.ModelConfigLoader, name string) schema.OllamaModelDetails {
// modelMetaFromConfig fetches the ModelConfig for `name` and derives both the
// Ollama details block and capability list. Returns zero values when the model
// is not configured.
func modelMetaFromConfig(bcl *config.ModelConfigLoader, name string) (schema.OllamaModelDetails, []string) {
configName := strings.Split(name, ":")[0]
cfg, exists := bcl.GetModelConfig(configName)
if !exists {
return schema.OllamaModelDetails{}
return schema.OllamaModelDetails{}, nil
}
return modelDetailsFromModelConfig(&cfg)
return modelDetailsFromModelConfig(&cfg), modelCapabilities(&cfg)
}
func modelDetailsFromModelConfig(cfg *config.ModelConfig) schema.OllamaModelDetails {
return schema.OllamaModelDetails{
Format: "gguf",
Family: cfg.Backend,
family := cfg.Backend
details := schema.OllamaModelDetails{
Format: "gguf",
Family: family,
ParameterSize: extractParameterSize(cfg.Model),
QuantizationLevel: extractQuantizationLevel(cfg.Model),
}
if family != "" {
details.Families = []string{family}
}
return details
}
// modelInfoFromModelConfig returns a small map of model_info entries derived
// from the LocalAI ModelConfig. Ollama clients use this map for architecture
// and context-length information; we expose what we can without loading the
// model.
func modelInfoFromModelConfig(cfg *config.ModelConfig) map[string]any {
info := map[string]any{}
if cfg.Backend != "" {
info["general.architecture"] = cfg.Backend
}
if cfg.ContextSize != nil && *cfg.ContextSize > 0 {
info["general.context_length"] = *cfg.ContextSize
}
if len(info) == 0 {
return nil
}
return info
}

@@ -1,12 +1,18 @@
package ollama_test
import (
"encoding/json"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"strings"
"testing"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/ollama"
"github.com/mudler/LocalAI/core/schema"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
@@ -59,4 +65,92 @@ var _ = Describe("Ollama endpoint handlers", func() {
Expect(rec.Body.String()).To(MatchRegexp(`\d+\.\d+\.\d+`))
})
})
Describe("ShowModelEndpoint", func() {
var (
tmpDir string
bcl *config.ModelConfigLoader
)
BeforeEach(func() {
var err error
tmpDir, err = os.MkdirTemp("", "ollama-show-test-*")
Expect(err).ToNot(HaveOccurred())
bcl = config.NewModelConfigLoader(tmpDir)
})
AfterEach(func() {
_ = os.RemoveAll(tmpDir)
})
writeConfig := func(name, yaml string) {
path := filepath.Join(tmpDir, name+".yaml")
Expect(os.WriteFile(path, []byte(yaml), 0o644)).To(Succeed())
Expect(bcl.ReadModelConfig(path)).To(Succeed())
}
callShow := func(name string) *schema.OllamaShowResponse {
req := httptest.NewRequest(http.MethodPost, "/api/show",
strings.NewReader(`{"name":"`+name+`"}`))
req.Header.Set("Content-Type", "application/json")
rec := httptest.NewRecorder()
c := e.NewContext(req, rec)
handler := ollama.ShowModelEndpoint(bcl)
Expect(handler(c)).To(Succeed())
Expect(rec.Code).To(Equal(http.StatusOK))
var resp schema.OllamaShowResponse
Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
return &resp
}
It("returns capabilities=['embedding'] for embedding-only models", func() {
writeConfig("embed", `
name: embed
backend: llama-cpp
embeddings: true
parameters:
model: Qwen3-4B-Embedding-Q4_K_M.gguf
`)
resp := callShow("embed")
Expect(resp.Capabilities).To(ConsistOf("embedding"))
})
It("returns capabilities=['completion'] for plain chat models", func() {
writeConfig("chat", `
name: chat
backend: llama-cpp
template:
chat: "{{ .Input }}"
parameters:
model: Llama-3-8B-Q4_K_M.gguf
`)
resp := callShow("chat")
Expect(resp.Capabilities).To(ContainElement("completion"))
Expect(resp.Capabilities).ToNot(ContainElement("embedding"))
})
It("populates details.parameter_size and details.quantization_level from the GGUF filename", func() {
writeConfig("qwen", `
name: qwen
backend: llama-cpp
template:
chat: "{{ .Input }}"
parameters:
model: Qwen3-4B-Instruct-Q4_K_M.gguf
`)
resp := callShow("qwen")
Expect(resp.Details.ParameterSize).To(Equal("4B"))
Expect(resp.Details.QuantizationLevel).To(Equal("Q4_K_M"))
Expect(resp.Details.Format).To(Equal("gguf"))
Expect(resp.Details.Families).ToNot(BeEmpty())
})
})
Describe("ListModelsEndpoint", func() {
It("includes capabilities and details for each listed model in /api/tags", func() {
Skip("covered by per-entry tests; integration smoke test")
})
})
})

@@ -16,6 +16,8 @@
"@codemirror/search": "^6.5.10",
"@codemirror/state": "^6.5.2",
"@codemirror/view": "^6.36.8",
"@fontsource-variable/geist": "^5.2.8",
"@fontsource-variable/geist-mono": "^5.2.7",
"@fortawesome/fontawesome-free": "^6.7.2",
"@lezer/highlight": "^1.2.1",
"@modelcontextprotocol/ext-apps": "^1.2.2",
@@ -965,6 +967,24 @@
"node": "^18.18.0 || ^20.9.0 || >=21.1.0"
}
},
"node_modules/@fontsource-variable/geist": {
"version": "5.2.8",
"resolved": "https://registry.npmjs.org/@fontsource-variable/geist/-/geist-5.2.8.tgz",
"integrity": "sha512-cJ6m9e+8MQ5dCYJsLylfZrgBh6KkG4bOLckB35Tr9J/EqdkEM6QllH5PxqP1dhTvFup+HtMRPuz9xOjxXJggxw==",
"license": "OFL-1.1",
"funding": {
"url": "https://github.com/sponsors/ayuhito"
}
},
"node_modules/@fontsource-variable/geist-mono": {
"version": "5.2.7",
"resolved": "https://registry.npmjs.org/@fontsource-variable/geist-mono/-/geist-mono-5.2.7.tgz",
"integrity": "sha512-ZKlZ5sjtalb2TwXKs400mAGDlt/+2ENLNySPx0wTz3bP3mWARCsUW+rpxzZc7e05d2qGch70pItt3K4qttbIYA==",
"license": "OFL-1.1",
"funding": {
"url": "https://github.com/sponsors/ayuhito"
}
},
"node_modules/@fortawesome/fontawesome-free": {
"version": "6.7.2",
"resolved": "https://registry.npmjs.org/@fortawesome/fontawesome-free/-/fontawesome-free-6.7.2.tgz",
@@ -2903,11 +2923,12 @@
}
},
"node_modules/express-rate-limit": {
"version": "8.3.1",
"resolved": "https://registry.npmjs.org/express-rate-limit/-/express-rate-limit-8.3.1.tgz",
"integrity": "sha512-D1dKN+cmyPWuvB+G2SREQDzPY1agpBIcTa9sJxOPMCNeH3gwzhqJRDWCXW3gg0y//+LQ/8j52JbMROWyrKdMdw==",
"version": "8.5.1",
"resolved": "https://registry.npmjs.org/express-rate-limit/-/express-rate-limit-8.5.1.tgz",
"integrity": "sha512-5O6KYmyJEpuPJV5hNTXKbAHWRqrzyu+OI3vUnSd2kXFubIVpG7ezpgxQy76Zo5GQZtrQBg86hF+CM/NX+cioiQ==",
"license": "MIT",
"dependencies": {
"ip-address": "10.1.0"
"ip-address": "^10.2.0"
},
"engines": {
"node": ">= 16"
@@ -2951,9 +2972,9 @@
"dev": true
},
"node_modules/fast-uri": {
"version": "3.1.0",
"resolved": "https://registry.npmjs.org/fast-uri/-/fast-uri-3.1.0.tgz",
"integrity": "sha512-iPeeDKJSWf4IEOasVVrknXpaBV0IApz/gp7S2bb7Z4Lljbl2MGJRqInZiUrQwV16cpzw/D3S5j5Julj/gT52AA==",
"version": "3.1.2",
"resolved": "https://registry.npmjs.org/fast-uri/-/fast-uri-3.1.2.tgz",
"integrity": "sha512-rVjf7ArG3LTk+FS6Yw81V1DLuZl1bRbNrev6Tmd/9RaroeeRRJhAt7jg/6YFxbvAQXUCavSoZhPPj6oOx+5KjQ==",
"funding": [
{
"type": "github",
@@ -2963,7 +2984,8 @@
"type": "opencollective",
"url": "https://opencollective.com/fastify"
}
]
],
"license": "BSD-3-Clause"
},
"node_modules/fastq": {
"version": "1.20.1",
@@ -3421,9 +3443,9 @@
}
},
"node_modules/hono": {
"version": "4.12.14",
"resolved": "https://registry.npmjs.org/hono/-/hono-4.12.14.tgz",
"integrity": "sha512-am5zfg3yu6sqn5yjKBNqhnTX7Cv+m00ox+7jbaKkrLMRJ4rAdldd1xPd/JzbBWspqaQv6RSTrgFN95EsfhC+7w==",
"version": "4.12.18",
"resolved": "https://registry.npmjs.org/hono/-/hono-4.12.18.tgz",
"integrity": "sha512-RWzP96k/yv0PQfyXnWjs6zot20TqfpfsNXhOnev8d1InAxubW93L11/oNUc3tQqn2G0bSdAOBpX+2uDFHV7kdQ==",
"license": "MIT",
"engines": {
"node": ">=16.9.0"
@@ -3681,9 +3703,10 @@
"integrity": "sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ=="
},
"node_modules/ip-address": {
"version": "10.1.0",
"resolved": "https://registry.npmjs.org/ip-address/-/ip-address-10.1.0.tgz",
"integrity": "sha512-XXADHxXmvT9+CRxhXg56LJovE+bmWnEWB78LB83VZTprKTmaC5QfruXocxzTZ2Kl0DNwKuBdlIhjL8LeY8Sf8Q==",
"version": "10.2.0",
"resolved": "https://registry.npmjs.org/ip-address/-/ip-address-10.2.0.tgz",
"integrity": "sha512-/+S6j4E9AHvW9SWMSEY9Xfy66O5PWvVEJ08O0y5JGyEKQpojb0K0GKpz/v5HJ/G0vi3D2sjGK78119oXZeE0qA==",
"license": "MIT",
"engines": {
"node": ">= 12"
}

@@ -120,10 +120,14 @@ type OllamaGenerateResponse struct {
EvalDuration int64 `json:"eval_duration,omitempty"`
}
// OllamaEmbedRequest represents a request to the Ollama Embed API
// OllamaEmbedRequest represents a request to the Ollama Embed API.
// Ollama's /api/embed endpoint accepts both `input` and `prompt` as the
// input string value (see https://github.com/ollama/ollama/blob/main/docs/api.md#generate-embeddings),
// so both keys are deserialized here for client compatibility.
type OllamaEmbedRequest struct {
Model string `json:"model"`
Input any `json:"input"` // string or []string
Model string `json:"model"`
Input any `json:"input,omitempty"` // string or []string
Prompt any `json:"prompt,omitempty"` // string or []string (Ollama alias for Input)
Options *OllamaOptions `json:"options,omitempty"`
}
@@ -135,10 +139,21 @@ func (r *OllamaEmbedRequest) ModelName(s *string) string {
return r.Model
}
// GetInputStrings normalizes the Input field to a string slice
// GetInputStrings normalizes the Input/Prompt field to a string slice.
// Input takes precedence over Prompt when both are provided.
func (r *OllamaEmbedRequest) GetInputStrings() []string {
switch v := r.Input.(type) {
if v := normalizeOllamaEmbedInput(r.Input); v != nil {
return v
}
return normalizeOllamaEmbedInput(r.Prompt)
}
func normalizeOllamaEmbedInput(v any) []string {
switch v := v.(type) {
case string:
if v == "" {
return nil
}
return []string{v}
case []any:
var result []string
@@ -184,11 +199,13 @@ func (r *OllamaShowRequest) ModelName(s *string) string {
// OllamaShowResponse represents a response from the Ollama Show API
type OllamaShowResponse struct {
Modelfile string `json:"modelfile"`
Parameters string `json:"parameters"`
Template string `json:"template"`
License string `json:"license,omitempty"`
Details OllamaModelDetails `json:"details"`
Modelfile string `json:"modelfile"`
Parameters string `json:"parameters"`
Template string `json:"template"`
License string `json:"license,omitempty"`
Details OllamaModelDetails `json:"details"`
ModelInfo map[string]any `json:"model_info,omitempty"`
Capabilities []string `json:"capabilities,omitempty"`
}
// OllamaModelDetails contains model metadata
@@ -203,12 +220,13 @@ type OllamaModelDetails struct {
// OllamaModelEntry represents a model in the list response
type OllamaModelEntry struct {
Name string `json:"name"`
Model string `json:"model"`
ModifiedAt time.Time `json:"modified_at"`
Size int64 `json:"size"`
Digest string `json:"digest"`
Details OllamaModelDetails `json:"details"`
Name string `json:"name"`
Model string `json:"model"`
ModifiedAt time.Time `json:"modified_at"`
Size int64 `json:"size"`
Digest string `json:"digest"`
Details OllamaModelDetails `json:"details"`
Capabilities []string `json:"capabilities,omitempty"`
}
// OllamaListResponse represents a response from the Ollama Tags API
@@ -218,13 +236,14 @@ type OllamaListResponse struct {
// OllamaPsEntry represents a running model in the ps response
type OllamaPsEntry struct {
Name string `json:"name"`
Model string `json:"model"`
Size int64 `json:"size"`
Digest string `json:"digest"`
Details OllamaModelDetails `json:"details"`
ExpiresAt time.Time `json:"expires_at"`
SizeVRAM int64 `json:"size_vram"`
Name string `json:"name"`
Model string `json:"model"`
Size int64 `json:"size"`
Digest string `json:"digest"`
Details OllamaModelDetails `json:"details"`
ExpiresAt time.Time `json:"expires_at"`
SizeVRAM int64 `json:"size_vram"`
Capabilities []string `json:"capabilities,omitempty"`
}
// OllamaPsResponse represents a response from the Ollama Ps API


@@ -0,0 +1,86 @@
package schema_test
import (
"encoding/json"
. "github.com/mudler/LocalAI/core/schema"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("OllamaEmbedRequest", func() {
Context("GetInputStrings", func() {
It("returns a single string when Input is a string", func() {
req := OllamaEmbedRequest{Input: "hello world"}
Expect(req.GetInputStrings()).To(Equal([]string{"hello world"}))
})
It("returns a list of strings when Input is a []string", func() {
req := OllamaEmbedRequest{Input: []string{"hello", "world"}}
Expect(req.GetInputStrings()).To(Equal([]string{"hello", "world"}))
})
It("returns a list of strings when Input is a []any (post JSON unmarshal)", func() {
req := OllamaEmbedRequest{Input: []any{"hello", "world"}}
Expect(req.GetInputStrings()).To(Equal([]string{"hello", "world"}))
})
})
Context("JSON unmarshaling (Ollama API compatibility)", func() {
It("accepts the 'input' field as a single string", func() {
body := []byte(`{"model": "m", "input": "why is the sky blue?"}`)
var req OllamaEmbedRequest
Expect(json.Unmarshal(body, &req)).To(Succeed())
Expect(req.Model).To(Equal("m"))
Expect(req.GetInputStrings()).To(Equal([]string{"why is the sky blue?"}))
})
It("accepts the 'input' field as an array of strings", func() {
body := []byte(`{"model": "m", "input": ["why is the sky blue?", "why is the grass green?"]}`)
var req OllamaEmbedRequest
Expect(json.Unmarshal(body, &req)).To(Succeed())
Expect(req.GetInputStrings()).To(Equal([]string{"why is the sky blue?", "why is the grass green?"}))
})
// Ollama's embedding endpoint accepts both `input` and `prompt` keys:
// https://github.com/ollama/ollama/blob/main/docs/api.md#generate-embeddings
// LocalAI must accept `prompt` so client libraries using that key are not broken.
// See https://github.com/mudler/LocalAI/issues/9767.
It("accepts the 'prompt' field as a single string (Ollama compatibility)", func() {
body := []byte(`{"model": "m", "prompt": "why is the sky blue?"}`)
var req OllamaEmbedRequest
Expect(json.Unmarshal(body, &req)).To(Succeed())
Expect(req.Model).To(Equal("m"))
Expect(req.GetInputStrings()).To(Equal([]string{"why is the sky blue?"}))
})
It("accepts the 'prompt' field as an array of strings (Ollama compatibility)", func() {
body := []byte(`{"model": "m", "prompt": ["why is the sky blue?", "why is the grass green?"]}`)
var req OllamaEmbedRequest
Expect(json.Unmarshal(body, &req)).To(Succeed())
Expect(req.GetInputStrings()).To(Equal([]string{"why is the sky blue?", "why is the grass green?"}))
})
It("prefers 'input' when both 'input' and 'prompt' are provided", func() {
body := []byte(`{"model": "m", "input": "from input", "prompt": "from prompt"}`)
var req OllamaEmbedRequest
Expect(json.Unmarshal(body, &req)).To(Succeed())
Expect(req.GetInputStrings()).To(Equal([]string{"from input"}))
})
})
})


@@ -251,18 +251,68 @@ options:
These are set via the `options:` array in the model configuration (format: `key:value`):
**Common options**
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `spec_type` | string | `none` | Speculative decoding type (see table below) |
| `spec_type` / `speculative_type` | string | `none` | Speculative decoding type, or comma-separated list to chain multiple (see table below) |
| `spec_n_max` / `draft_max` | int | 16 | Maximum number of tokens to draft per step |
| `spec_n_min` / `draft_min` | int | 0 | Minimum draft tokens required to use speculation |
| `spec_p_min` / `draft_p_min` | float | 0.75 | Minimum probability threshold for greedy acceptance |
| `spec_p_split` | float | 0.1 | Split probability for tree-based branching |
**Draft-model options** (apply when `spec_type=draft`, i.e. a `draft_model` is configured)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `draft_gpu_layers` | int | -1 | GPU layers for the draft model (-1 = use default) |
| `draft_threads` / `spec_draft_threads` | int | same as main | Threads used by the draft model (`<= 0` = hardware concurrency) |
| `draft_threads_batch` / `spec_draft_threads_batch` | int | same as `draft_threads` | Threads used by the draft model during batch / prompt processing |
| `draft_cache_type_k` / `spec_draft_cache_type_k` | string | `f16` | KV cache K data type for the draft model (same values as `cache_type_k`) |
| `draft_cache_type_v` / `spec_draft_cache_type_v` | string | `f16` | KV cache V data type for the draft model |
| `draft_cpu_moe` / `spec_draft_cpu_moe` | bool | false | Keep all MoE expert weights of the draft model on CPU |
| `draft_n_cpu_moe` / `spec_draft_n_cpu_moe` | int | 0 | Keep MoE expert weights of the first N draft-model layers on CPU |
| `draft_override_tensor` / `spec_draft_override_tensor` | string | "" | Comma-separated `<tensor regex>=<buffer type>` overrides for the draft model |
| `draft_ctx_size` | int | (ignored) | Deprecated upstream: the draft now shares the target context size. Accepted for backward compatibility but has no effect. |
**`ngram_simple` options** (used when `spec_type` includes `ngram_simple`)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `spec_ngram_size_n` / `ngram_size_n` | int | 12 | N-gram lookup size |
| `spec_ngram_size_m` / `ngram_size_m` | int | 48 | M-gram proposal size |
| `spec_ngram_min_hits` / `ngram_min_hits` | int | 1 | Minimum hits for accepting n-gram proposals |
| `draft_gpu_layers` | int | -1 | GPU layers for the draft model (-1 = use default) |
| `draft_ctx_size` | int | 0 | Context size for the draft model (0 = auto) |
**`ngram_mod` options** (used when `spec_type` includes `ngram_mod`)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `spec_ngram_mod_n_min` | int | 48 | Minimum number of ngram tokens to use |
| `spec_ngram_mod_n_max` | int | 64 | Maximum number of ngram tokens to use |
| `spec_ngram_mod_n_match` | int | 24 | Ngram lookup length |
**`ngram_map_k` options** (used when `spec_type` includes `ngram_map_k`)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `spec_ngram_map_k_size_n` | int | 12 | N-gram lookup size |
| `spec_ngram_map_k_size_m` | int | 48 | M-gram proposal size |
| `spec_ngram_map_k_min_hits` | int | 1 | Minimum hits for accepting proposals |
**`ngram_map_k4v` options** (used when `spec_type` includes `ngram_map_k4v`)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `spec_ngram_map_k4v_size_n` | int | 12 | N-gram lookup size |
| `spec_ngram_map_k4v_size_m` | int | 48 | M-gram proposal size |
| `spec_ngram_map_k4v_min_hits` | int | 1 | Minimum hits for accepting proposals |
**`ngram_cache` lookup files**
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `spec_lookup_cache_static` / `lookup_cache_static` | string | "" | Path to a static ngram lookup cache file |
| `spec_lookup_cache_dynamic` / `lookup_cache_dynamic` | string | "" | Path to a dynamic ngram lookup cache file (updated by generation) |
#### Speculative Type Values
@@ -277,6 +327,8 @@ These are set via the `options:` array in the model configuration (format: `key:
| `ngram_mod` | Modified n-gram speculation |
| `ngram_cache` | 3-level n-gram cache |
Multiple types can be chained by passing a comma-separated list to `spec_type` (e.g. `spec_type:ngram_simple,ngram_mod`). The runtime tries them in order and accepts the first proposal that meets the acceptance criteria.
{{% notice note %}}
Speculative decoding is automatically disabled when multimodal models (with `mmproj`) are active. The `n_draft` parameter can also be overridden per-request.
{{% /notice %}}
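The `key:value` option format and the comma-separated `spec_type` list documented in this section can be illustrated with a short Go sketch. It is not LocalAI's parser (per the commit message, the real parsing lives in the llama.cpp grpc-server wrapper); `parseOption` and `splitSpecTypes` are hypothetical helper names, and trimming whitespace and dropping empty list entries are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// parseOption splits a "key:value" entry at the first colon.
// ok is false when the entry contains no colon.
func parseOption(opt string) (key, value string, ok bool) {
	key, value, ok = strings.Cut(opt, ":")
	return strings.TrimSpace(key), strings.TrimSpace(value), ok
}

// splitSpecTypes splits a comma-separated list of speculative types,
// trimming whitespace and dropping empty entries (assumed behaviour).
func splitSpecTypes(value string) []string {
	var types []string
	for _, part := range strings.Split(value, ",") {
		if t := strings.TrimSpace(part); t != "" {
			types = append(types, t)
		}
	}
	return types
}

func main() {
	// Entries as they might appear in a model's `options:` array.
	options := []string{
		"spec_type:ngram_simple,ngram_mod",
		"spec_n_max:16",
	}
	for _, opt := range options {
		key, value, ok := parseOption(opt)
		if !ok {
			continue // not a key:value entry
		}
		if key == "spec_type" || key == "speculative_type" {
			fmt.Println(key, "->", splitSpecTypes(value))
		} else {
			fmt.Println(key, "->", value)
		}
	}
}
```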


@@ -1,3 +1,3 @@
{
"version": "v4.1.3"
"version": "v4.2.0"
}


@@ -30632,3 +30632,24 @@
- torch_dtype:bf16
parameters:
model: Lightricks/LTX-2.3
- name: deepseek-v4-flash-q2
description: |
DeepSeek V4 Flash (IQ2XXS GGUF, ~81 GB) - only loadable via the ds4 backend.
Requires >=128 GB RAM. Metal (Darwin) or CUDA (Linux).
See https://github.com/antirez/ds4 for details.
urls:
- https://huggingface.co/antirez/deepseek-v4-gguf
tags:
- deepseek
- ds4
- gguf
- llm
- chat
overrides:
backend: ds4
parameters:
model: ds4flash.gguf
files:
- filename: ds4flash.gguf
sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c
uri: huggingface://antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf

61
go.mod

@@ -7,7 +7,7 @@ require (
fyne.io/fyne/v2 v2.7.3
github.com/Masterminds/sprig/v3 v3.3.0
github.com/alecthomas/kong v1.14.0
github.com/anthropics/anthropic-sdk-go v1.27.0
github.com/anthropics/anthropic-sdk-go v1.42.0
github.com/aws/aws-sdk-go-v2 v1.41.6
github.com/aws/aws-sdk-go-v2/config v1.32.16
github.com/aws/aws-sdk-go-v2/credentials v1.19.15
@@ -18,7 +18,7 @@ require (
github.com/dhowden/tag v0.0.0-20240417053706-3d75831295e8
github.com/ebitengine/purego v0.10.0
github.com/emirpasic/gods/v2 v2.0.0-alpha
github.com/fsnotify/fsnotify v1.9.0
github.com/fsnotify/fsnotify v1.10.1
github.com/go-audio/wav v1.1.0
github.com/go-skynet/go-llama.cpp v0.0.0-20240314183750-6a8041ef6b46
github.com/gofrs/flock v0.13.0
@@ -37,14 +37,14 @@ require (
github.com/microcosm-cc/bluemonday v1.0.27
github.com/modelcontextprotocol/go-sdk v1.5.0
github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b
github.com/mudler/edgevpn v0.31.1
github.com/mudler/edgevpn v0.32.2
github.com/mudler/go-processmanager v0.1.1
github.com/mudler/memory v0.0.0-20260406210934-424c1ecf2cf8
github.com/mudler/xlog v0.0.6
github.com/nats-io/nats.go v1.50.0
github.com/ollama/ollama v0.20.4
github.com/onsi/ginkgo/v2 v2.28.2
github.com/onsi/gomega v1.39.1
github.com/onsi/gomega v1.40.0
github.com/openai/openai-go/v3 v3.26.0
github.com/otiai10/copy v1.14.1
github.com/otiai10/openaigo v1.7.0
@@ -94,14 +94,10 @@ require (
github.com/aws/smithy-go v1.25.0 // indirect
github.com/bahlo/generic-list-go v0.2.0 // indirect
github.com/buger/jsonparser v1.1.2 // indirect
github.com/chasefleming/elem-go v0.30.0 // indirect
github.com/dave-gray101/v2keyauth v0.0.0-20240624150259-c45d584d25e2 // indirect
github.com/dunglas/httpsfv v1.1.0 // indirect
github.com/filecoin-project/go-clock v0.1.0 // indirect
github.com/go-jose/go-jose/v4 v4.1.4 // indirect
github.com/gofiber/template v1.8.3 // indirect
github.com/gofiber/template/html/v2 v2.1.3 // indirect
github.com/gofiber/utils v1.1.0 // indirect
github.com/inconshreveable/mousetrap v1.1.0 // indirect
github.com/invopop/jsonschema v0.13.0 // indirect
github.com/jinzhu/inflection v1.0.0 // indirect
github.com/jinzhu/now v1.1.5 // indirect
github.com/jolestar/go-commons-pool/v2 v2.1.2 // indirect
@@ -111,8 +107,7 @@ require (
github.com/moby/moby/client v0.4.0 // indirect
github.com/nats-io/nkeys v0.4.15 // indirect
github.com/nats-io/nuid v1.0.1 // indirect
github.com/spf13/cobra v1.10.2 // indirect
github.com/spf13/pflag v1.0.10 // indirect
github.com/standard-webhooks/standard-webhooks/libraries v0.0.0-20260508151727-1282bb917829 // indirect
github.com/stretchr/testify v1.11.1 // indirect
github.com/sv-tools/openapi v0.2.1 // indirect
github.com/swaggo/swag/v2 v2.0.0-rc4 // indirect
@@ -153,7 +148,7 @@ require (
github.com/blevesearch/zapx/v16 v16.2.8 // indirect
github.com/bwmarrin/discordgo v0.29.0 // indirect
github.com/cloudflare/circl v1.6.3 // indirect
github.com/cyphar/filepath-securejoin v0.5.1 // indirect
github.com/cyphar/filepath-securejoin v0.6.1 // indirect
github.com/emersion/go-imap/v2 v2.0.0-beta.5 // indirect
github.com/emersion/go-message v0.18.2 // indirect
github.com/emersion/go-sasl v0.0.0-20241020182733-b788ff22d5a6 // indirect
@@ -161,8 +156,8 @@ require (
github.com/emirpasic/gods v1.18.1 // indirect
github.com/eritikass/githubmarkdownconvertergo v0.1.10 // indirect
github.com/go-git/gcfg v1.5.1-0.20230307220236-3a3c6141e376 // indirect
github.com/go-git/go-billy/v5 v5.8.0 // indirect
github.com/go-git/go-git/v5 v5.18.0 // indirect
github.com/go-git/go-billy/v5 v5.9.0 // indirect
github.com/go-git/go-git/v5 v5.19.0 // indirect
github.com/go-telegram/bot v1.17.0 // indirect
github.com/gobwas/glob v0.2.3 // indirect
github.com/gocolly/colly v1.2.0 // indirect
@@ -188,7 +183,7 @@ require (
github.com/oxffaa/gopher-parse-sitemap v0.0.0-20191021113419-005d2eb1def4 // indirect
github.com/philippgille/chromem-go v0.7.0 // indirect
github.com/pion/transport/v4 v4.0.1 // indirect
github.com/pjbgf/sha1cd v0.3.2 // indirect
github.com/pjbgf/sha1cd v0.6.0 // indirect
github.com/rs/zerolog v1.31.0 // indirect
github.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d // indirect
github.com/segmentio/asm v1.1.3 // indirect
@@ -251,7 +246,7 @@ require (
github.com/jeandeaual/go-locale v0.0.0-20250612000132-0ef82f21eade // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/jsummers/gobmp v0.0.0-20230614200233-a9de23ed2e25 // indirect
github.com/libp2p/go-yamux/v5 v5.0.1 // indirect
github.com/libp2p/go-yamux/v5 v5.1.0 // indirect
github.com/magiconair/properties v1.8.10 // indirect
github.com/moby/docker-image-spec v1.3.1 // indirect
github.com/moby/go-archive v0.2.0 // indirect
@@ -288,7 +283,7 @@ require (
github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e // indirect
github.com/yosida95/uritemplate/v3 v3.0.2 // indirect
go.opentelemetry.io/auto/sdk v1.2.1 // indirect
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.61.0 // indirect
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.65.0 // indirect
go.uber.org/mock v0.5.2 // indirect
go.yaml.in/yaml/v2 v2.4.4
go.yaml.in/yaml/v3 v3.0.4 // indirect
@@ -323,7 +318,7 @@ require (
github.com/creachadair/otp v0.5.0 // indirect
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
github.com/davidlazar/go-crypto v0.0.0-20200604182044-b73af7476f6c // indirect
github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.0 // indirect
github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.1 // indirect
github.com/dlclark/regexp2 v1.11.5 // indirect
github.com/docker/cli v29.4.0+incompatible // indirect
github.com/docker/docker v28.5.2+incompatible
@@ -343,7 +338,7 @@ require (
github.com/go-openapi/swag v0.23.0 // indirect
github.com/gogo/protobuf v1.3.2 // indirect
github.com/golang/groupcache v0.0.0-20241129210726-2c02b8208cf8 // indirect
github.com/golang/snappy v0.0.4 // indirect
github.com/golang/snappy v0.0.5-0.20231225225746-43d5d4cd4e0e // indirect
github.com/google/btree v1.1.3 // indirect
github.com/google/go-cmp v0.7.0 // indirect
github.com/google/gopacket v1.1.19 // indirect
@@ -355,10 +350,10 @@ require (
github.com/henvic/httpretty v0.1.4 // indirect
github.com/huandu/xstrings v1.5.0 // indirect
github.com/huin/goupnp v1.3.0 // indirect
github.com/ipfs/boxo v0.30.0 // indirect
github.com/ipfs/boxo v0.37.0 // indirect
github.com/ipfs/go-cid v0.6.1 // indirect
github.com/ipfs/go-datastore v0.8.2 // indirect
github.com/ipfs/go-log/v2 v2.6.0 // indirect
github.com/ipfs/go-datastore v0.9.1 // indirect
github.com/ipfs/go-log/v2 v2.9.1 // indirect
github.com/ipld/go-ipld-prime v0.23.0 // indirect
github.com/jackpal/go-nat-pmp v1.0.2 // indirect
github.com/jaypipes/pcidb v1.1.1 // indirect
@@ -369,11 +364,11 @@ require (
github.com/koron/go-ssdp v0.0.6 // indirect
github.com/libp2p/go-buffer-pool v0.1.0 // indirect
github.com/libp2p/go-cidranger v1.1.0 // indirect
github.com/libp2p/go-flow-metrics v0.2.0 // indirect
github.com/libp2p/go-flow-metrics v0.3.0 // indirect
github.com/libp2p/go-libp2p-asn-util v0.4.1 // indirect
github.com/libp2p/go-libp2p-kad-dht v0.33.1 // indirect
github.com/libp2p/go-libp2p-kbucket v0.7.0 // indirect
github.com/libp2p/go-libp2p-pubsub v0.14.2 // indirect
github.com/libp2p/go-libp2p-kad-dht v0.39.0 // indirect
github.com/libp2p/go-libp2p-kbucket v0.8.0 // indirect
github.com/libp2p/go-libp2p-pubsub v0.15.0 // indirect
github.com/libp2p/go-libp2p-record v0.3.1 // indirect
github.com/libp2p/go-libp2p-routing-helpers v0.7.5 // indirect
github.com/libp2p/go-msgio v0.3.0 // indirect
@@ -387,7 +382,7 @@ require (
github.com/mattn/go-colorable v0.1.14 // indirect
github.com/mattn/go-isatty v0.0.20 // indirect
github.com/mattn/go-runewidth v0.0.17 // indirect
github.com/miekg/dns v1.1.66 // indirect
github.com/miekg/dns v1.1.72 // indirect
github.com/mikioh/tcpinfo v0.0.0-20190314235526-30a79bb1804b // indirect
github.com/mikioh/tcpopt v0.0.0-20190314235656-172688c1accc // indirect
github.com/minio/sha256-simd v1.0.1 // indirect
@@ -405,7 +400,7 @@ require (
github.com/multiformats/go-base32 v0.1.0 // indirect
github.com/multiformats/go-base36 v0.2.0 // indirect
github.com/multiformats/go-multiaddr v0.16.1
github.com/multiformats/go-multiaddr-dns v0.4.1 // indirect
github.com/multiformats/go-multiaddr-dns v0.5.0 // indirect
github.com/multiformats/go-multiaddr-fmt v0.1.0 // indirect
github.com/multiformats/go-multibase v0.3.0 // indirect
github.com/multiformats/go-multicodec v0.10.0 // indirect
@@ -443,7 +438,7 @@ require (
github.com/ulikunitz/xz v0.5.14 // indirect
github.com/valyala/bytebufferpool v1.0.0 // indirect
github.com/vbatts/tar-split v0.12.2 // indirect
github.com/vishvananda/netlink v1.3.0 // indirect
github.com/vishvananda/netlink v1.3.1 // indirect
github.com/vishvananda/netns v0.0.5 // indirect
github.com/whyrusleeping/go-keyspace v0.0.0-20160322163242-5b898ac5add1 // indirect
github.com/xi2/xz v0.0.0-20171230120015-48954b6210f8 // indirect
@@ -456,9 +451,9 @@ require (
go.uber.org/dig v1.19.0 // indirect
go.uber.org/fx v1.24.0 // indirect
go.uber.org/multierr v1.11.0 // indirect
go.uber.org/zap v1.27.0 // indirect
go.uber.org/zap v1.27.1 // indirect
golang.org/x/crypto v0.50.0
golang.org/x/exp v0.0.0-20250606033433-dcc06ee1d476 // indirect
golang.org/x/exp v0.0.0-20260410095643-746e56fc9e2f // indirect
golang.org/x/mod v0.35.0 // indirect
golang.org/x/sync v0.20.0
golang.org/x/sys v0.43.0 // indirect
@@ -469,7 +464,7 @@ require (
golang.zx2c4.com/wireguard v0.0.0-20250521234502-f333402bd9cb // indirect
golang.zx2c4.com/wireguard/windows v0.5.3 // indirect
gonum.org/v1/gonum v0.17.0 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20260120221211-b8f7ae30c516 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20260128011058-8636f8732409 // indirect
gopkg.in/fsnotify.v1 v1.4.7 // indirect
gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 // indirect
howett.net/plist v1.0.2-0.20250314012144-ee69052608d9 // indirect

138
go.sum

@@ -100,8 +100,8 @@ github.com/antchfx/xmlquery v1.4.4/go.mod h1:AEPEEPYE9GnA2mj5Ur2L5Q5/2PycJ0N9Fus
github.com/antchfx/xpath v1.3.3/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=
github.com/antchfx/xpath v1.3.6 h1:s0y+ElRRtTQdfHP609qFu0+c6bglDv20pqOViQjjdPI=
github.com/antchfx/xpath v1.3.6/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=
github.com/anthropics/anthropic-sdk-go v1.27.0 h1:0CWbmBq5ofGAjF2H6lefCNRbnaUMGiTKO+lb7RLhDbI=
github.com/anthropics/anthropic-sdk-go v1.27.0/go.mod h1:qUKmaW+uuPB64iy1l+4kOSvaLqPXnHTTBKH6RVZ7q5Q=
github.com/anthropics/anthropic-sdk-go v1.42.0 h1:Zv882/dnrE4OHnwhMAsi9lwVVXRF8GtR3ofiBResYUw=
github.com/anthropics/anthropic-sdk-go v1.42.0/go.mod h1:r4eaLX9tBolUrXLOrLj7eU8tmeBtoobCkM0kBsivBaY=
github.com/antihax/optional v1.0.0/go.mod h1:uupD/76wgC+ih3iEmQUL+0Ugr19nfwCT1kdvxnR2qWY=
github.com/armon/circbuf v0.0.0-20150827004946-bbbad097214e/go.mod h1:3U/XgcO3hCbHZ8TKRvWD2dDTCfh9M9ya+I9JpbB7O8o=
github.com/armon/go-metrics v0.0.0-20180917152333-f0300d1749da/go.mod h1:Q73ZrmVTwzkszR9V5SSuryQ31EELlFMUz1kKyl939pY=
@@ -227,8 +227,6 @@ github.com/charmbracelet/x/exp/slice v0.0.0-20250327172914-2fdc97757edf h1:rLG0Y
github.com/charmbracelet/x/exp/slice v0.0.0-20250327172914-2fdc97757edf/go.mod h1:B3UgsnsBZS/eX42BlaNiJkD1pPOUa+oF1IYC6Yd2CEU=
github.com/charmbracelet/x/term v0.2.1 h1:AQeHeLZ1OqSXhrAWpYUtZyX1T3zVxfpZuEQMIQaGIAQ=
github.com/charmbracelet/x/term v0.2.1/go.mod h1:oQ4enTYFV7QN4m0i9mzHrViD7TQKvNEEkHUMCmsxdUg=
github.com/chasefleming/elem-go v0.30.0 h1:BlhV1ekv1RbFiM8XZUQeln1Ikb4D+bu2eDO4agREvok=
github.com/chasefleming/elem-go v0.30.0/go.mod h1:hz73qILBIKnTgOujnSMtEj20/epI+f6vg71RUilJAA4=
github.com/chengxilo/virtualterm v1.0.4 h1:Z6IpERbRVlfB8WkOmtbHiDbBANU7cimRIof7mk9/PwM=
github.com/chengxilo/virtualterm v1.0.4/go.mod h1:DyxxBZz/x1iqJjFxTFcr6/x+jSpqN0iwWCOK1q10rlY=
github.com/chzyer/logex v1.1.10/go.mod h1:+Ywpsq7O8HXn0nuIou7OrIPyXbp3wmkHB+jjWRnGsAI=
@@ -265,17 +263,14 @@ github.com/cpuguy83/dockercfg v0.3.2 h1:DlJTyZGBDlXqUZ2Dk2Q3xHs/FtnooJJVaad2S9GK
github.com/cpuguy83/dockercfg v0.3.2/go.mod h1:sugsbF4//dDlL/i+S+rtpIWp+5h0BHJHfjj5/jFyUJc=
github.com/cpuguy83/go-md2man/v2 v2.0.0-20190314233015-f79a8a8ca69d/go.mod h1:maD7wRr/U5Z6m/iR4s+kqSMx2CaBsrgA7czyZG/E6dU=
github.com/cpuguy83/go-md2man/v2 v2.0.0/go.mod h1:maD7wRr/U5Z6m/iR4s+kqSMx2CaBsrgA7czyZG/E6dU=
github.com/cpuguy83/go-md2man/v2 v2.0.6/go.mod h1:oOW0eioCTA6cOiMLiUPZOpcVxMig6NIQQ7OS05n1F4g=
github.com/creachadair/mds v0.21.3 h1:RRgEAPIb52cU0q7UxGyN+13QlCVTZIL4slRr0cYYQfA=
github.com/creachadair/mds v0.21.3/go.mod h1:1ltMWZd9yXhaHEoZwBialMaviWVUpRPvMwVP7saFAzM=
github.com/creachadair/otp v0.5.0 h1:q3Th7CXm2zlmCdBjw5tEPFOj4oWJMnVL5HXlq0sNKS0=
github.com/creachadair/otp v0.5.0/go.mod h1:0kceI87EnYFNYSTL121goJVAnk3eJhaed9H0nMuJUkA=
github.com/creack/pty v1.1.24 h1:bJrF4RRfyJnbTJqzRLHzcGaZK1NeM5kTC9jGgovnR1s=
github.com/creack/pty v1.1.24/go.mod h1:08sCNb52WyoAwi2QDyzUCTgcvVFhUzewun7wtTfvcwE=
github.com/cyphar/filepath-securejoin v0.5.1 h1:eYgfMq5yryL4fbWfkLpFFy2ukSELzaJOTaUTuh+oF48=
github.com/cyphar/filepath-securejoin v0.5.1/go.mod h1:Sdj7gXlvMcPZsbhwhQ33GguGLDGQL7h7bg04C/+u9jI=
github.com/dave-gray101/v2keyauth v0.0.0-20240624150259-c45d584d25e2 h1:flLYmnQFZNo04x2NPehMbf30m7Pli57xwZ0NFqR/hb0=
github.com/dave-gray101/v2keyauth v0.0.0-20240624150259-c45d584d25e2/go.mod h1:NtWqRzAp/1tw+twkW8uuBenEVVYndEAZACWU3F3xdoQ=
github.com/cyphar/filepath-securejoin v0.6.1 h1:5CeZ1jPXEiYt3+Z6zqprSAgSWiggmpVyciv8syjIpVE=
github.com/cyphar/filepath-securejoin v0.6.1/go.mod h1:A8hd4EnAeyujCJRrICiOWqjS1AX0a9kM5XL+NwKoYSc=
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc h1:U9qPSI2PIWSS1VwoXQT9A3Wy9MM3WgvqSxFWenqJduM=
@@ -284,8 +279,8 @@ github.com/davidlazar/go-crypto v0.0.0-20200604182044-b73af7476f6c h1:pFUpOrbxDR
github.com/davidlazar/go-crypto v0.0.0-20200604182044-b73af7476f6c/go.mod h1:6UhI8N9EjYm1c2odKpFpAYeR8dsBeM7PtzQhRgxRr9U=
github.com/decred/dcrd/crypto/blake256 v1.1.0 h1:zPMNGQCm0g4QTY27fOCorQW7EryeQ/U0x++OzVrdms8=
github.com/decred/dcrd/crypto/blake256 v1.1.0/go.mod h1:2OfgNZ5wDpcsFmHmCK5gZTPcCXqlm2ArzUIkw9czNJo=
github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.0 h1:NMZiJj8QnKe1LgsbDayM4UoHwbvwDRwnI3hwNaAHRnc=
github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.0/go.mod h1:ZXNYxsqcloTdSy/rNShjYzMhyjf0LaoftYK0p+A3h40=
github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.1 h1:5RVFMOWjMyRy8cARdy79nAmgYw3hK/4HUq48LQ6Wwqo=
github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.1/go.mod h1:ZXNYxsqcloTdSy/rNShjYzMhyjf0LaoftYK0p+A3h40=
github.com/dhowden/tag v0.0.0-20240417053706-3d75831295e8 h1:OtSeLS5y0Uy01jaKK4mA/WVIYtpzVm63vLVAPzJXigg=
github.com/dhowden/tag v0.0.0-20240417053706-3d75831295e8/go.mod h1:apkPC/CR3s48O2D7Y++n1XWEpgPNNCjXYga3PPbJe2E=
github.com/distribution/reference v0.6.0 h1:0IXCQ5g4/QMHHkarYzh5l+u8T3t73zM5QvfrDyIgxBk=
@@ -339,6 +334,8 @@ github.com/felixge/fgprof v0.9.3 h1:VvyZxILNuCiUCSXtPtYmmtGvb65nqXh2QFWc0Wpf2/g=
github.com/felixge/fgprof v0.9.3/go.mod h1:RdbpDgzqYVh/T9fPELJyV7EYJuHB55UTEULNun8eiPw=
github.com/felixge/httpsnoop v1.0.4 h1:NFTV2Zj1bL4mc9sqWACXbQFVBBg2W3GPvqp8/ESS2Wg=
github.com/felixge/httpsnoop v1.0.4/go.mod h1:m8KPJKqk1gH5J9DgRY2ASl2lWCfGKXixSwevea8zH2U=
github.com/filecoin-project/go-clock v0.1.0 h1:SFbYIM75M8NnFm1yMHhN9Ahy3W5bEZV9gd6MPfXbKVU=
github.com/filecoin-project/go-clock v0.1.0/go.mod h1:4uB/O4PvOjlx1VCMdZ9MyDZXRm//gkj1ELEbxfI1AZs=
github.com/flynn/noise v1.1.0 h1:KjPQoQCEFdZDiP03phOvGi11+SVVhBG2wOWAorLsstg=
github.com/flynn/noise v1.1.0/go.mod h1:xbMo+0i6+IGbYdJhF31t2eR1BIU0CYc12+BNAKwUTag=
github.com/fortytw2/leaktest v1.3.0 h1:u8491cBMTQ8ft8aeV+adlcytMZylmA5nnwwkRZjI8vw=
@@ -348,8 +345,8 @@ github.com/frankban/quicktest v1.14.6/go.mod h1:4ptaffx2x8+WTWXmUCuVU6aPUX1/Mz7z
github.com/fredbi/uri v1.1.1 h1:xZHJC08GZNIUhbP5ImTHnt5Ya0T8FI2VAwI/37kh2Ko=
github.com/fredbi/uri v1.1.1/go.mod h1:4+DZQ5zBjEwQCDmXW5JdIjz0PUA+yJbvtBv+u+adr5o=
github.com/fsnotify/fsnotify v1.4.9/go.mod h1:znqG4EE+3YCdAaPaxE2ZRY/06pZUdp0tY4IgpuI1SZQ=
github.com/fsnotify/fsnotify v1.9.0 h1:2Ml+OJNzbYCTzsxtv8vKSFD9PbJjmhYF14k/jKC7S9k=
github.com/fsnotify/fsnotify v1.9.0/go.mod h1:8jBTzvmWwFyi3Pb8djgCCO5IBqzKJ/Jwo8TRcHyHii0=
github.com/fsnotify/fsnotify v1.10.1 h1:b0/UzAf9yR5rhf3RPm9gf3ehBPpf0oZKIjtpKrx59Ho=
github.com/fsnotify/fsnotify v1.10.1/go.mod h1:TLheqan6HD6GBK6PrDWyDPBaEV8LspOxvPSjC+bVfgo=
github.com/fyne-io/gl-js v0.2.0 h1:+EXMLVEa18EfkXBVKhifYB6OGs3HwKO3lUElA0LlAjs=
github.com/fyne-io/gl-js v0.2.0/go.mod h1:ZcepK8vmOYLu96JoxbCKJy2ybr+g1pTnaBDdl7c3ajI=
github.com/fyne-io/glfw-js v0.3.0 h1:d8k2+Y7l+zy2pc7wlGRyPfTgZoqDf3AI4G+2zOWhWUk=
@@ -375,12 +372,12 @@ github.com/go-audio/wav v1.1.0 h1:jQgLtbqBzY7G+BM8fXF7AHUk1uHUviWS4X39d5rsL2g=
github.com/go-audio/wav v1.1.0/go.mod h1:mpe9qfwbScEbkd8uybLuIpTgHyrISw/OTuvjUW2iGtE=
github.com/go-git/gcfg v1.5.1-0.20230307220236-3a3c6141e376 h1:+zs/tPmkDkHx3U66DAb0lQFJrpS6731Oaa12ikc+DiI=
github.com/go-git/gcfg v1.5.1-0.20230307220236-3a3c6141e376/go.mod h1:an3vInlBmSxCcxctByoQdvwPiA7DTK7jaaFDBTtu0ic=
github.com/go-git/go-billy/v5 v5.8.0 h1:I8hjc3LbBlXTtVuFNJuwYuMiHvQJDq1AT6u4DwDzZG0=
github.com/go-git/go-billy/v5 v5.8.0/go.mod h1:RpvI/rw4Vr5QA+Z60c6d6LXH0rYJo0uD5SqfmrrheCY=
github.com/go-git/go-billy/v5 v5.9.0 h1:jItGXszUDRtR/AlferWPTMN4j38BQ88XnXKbilmmBPA=
github.com/go-git/go-billy/v5 v5.9.0/go.mod h1:jCnQMLj9eUgGU7+ludSTYoZL/GGmii14RxKFj7ROgHw=
github.com/go-git/go-git-fixtures/v4 v4.3.2-0.20231010084843-55a94097c399 h1:eMje31YglSBqCdIqdhKBW8lokaMrL3uTkpGYlE2OOT4=
github.com/go-git/go-git-fixtures/v4 v4.3.2-0.20231010084843-55a94097c399/go.mod h1:1OCfN199q1Jm3HZlxleg+Dw/mwps2Wbk9frAWm+4FII=
github.com/go-git/go-git/v5 v5.18.0 h1:O831KI+0PR51hM2kep6T8k+w0/LIAD490gvqMCvL5hM=
github.com/go-git/go-git/v5 v5.18.0/go.mod h1:pW/VmeqkanRFqR6AljLcs7EA7FbZaN5MQqO7oZADXpo=
github.com/go-git/go-git/v5 v5.19.0 h1:+WkVUQZSy/F1Gb13udrMKjIM2PrzsNfDKFSfo5tkMtc=
github.com/go-git/go-git/v5 v5.19.0/go.mod h1:Pb1v0c7/g8aGQJwx9Us09W85yGoyvSwuhEGMH7zjDKQ=
github.com/go-gl/gl v0.0.0-20231021071112-07e5d0ea2e71 h1:5BVwOaUSBTlVZowGO6VZGw2H/zl9nrd3eCZfYV+NfQA=
github.com/go-gl/gl v0.0.0-20231021071112-07e5d0ea2e71/go.mod h1:9YTyiznxEY1fVinfM7RvRcjRHbw2xLBJ3AAGIT0I4Nw=
github.com/go-gl/glfw v0.0.0-20190409004039-e6da0acd62b1/go.mod h1:vR7hzQXu2zJy9AVAgeJqvqgH9Q5CA+iKCZ2gyEVpxRU=
@@ -432,12 +429,6 @@ github.com/godbus/dbus/v5 v5.1.0 h1:4KLkAxT3aOY8Li4FRJe/KvhoNFFxo0m6fNuFUO8QJUk=
github.com/godbus/dbus/v5 v5.1.0/go.mod h1:xhWf0FNVPg57R7Z0UbKHbJfkEywrmjJnf7w5xrFpKfA=
github.com/gofiber/fiber/v2 v2.52.13 h1:TOKP64iqC9b5P49VrBW5tHhUOvDyrtJ0xePEfzJbCbk=
github.com/gofiber/fiber/v2 v2.52.13/go.mod h1:YEcBbO/FB+5M1IZNBP9FO3J9281zgPAreiI1oqg8nDw=
github.com/gofiber/template v1.8.3 h1:hzHdvMwMo/T2kouz2pPCA0zGiLCeMnoGsQZBTSYgZxc=
github.com/gofiber/template v1.8.3/go.mod h1:bs/2n0pSNPOkRa5VJ8zTIvedcI/lEYxzV3+YPXdBvq8=
github.com/gofiber/template/html/v2 v2.1.3 h1:n1LYBtmr9C0V/k/3qBblXyMxV5B0o/gpb6dFLp8ea+o=
github.com/gofiber/template/html/v2 v2.1.3/go.mod h1:U5Fxgc5KpyujU9OqKzy6Kn6Qup6Tm7zdsISR+VpnHRE=
github.com/gofiber/utils v1.1.0 h1:vdEBpn7AzIUJRhe+CiTOJdUcTg4Q9RK+pEa0KPbLdrM=
github.com/gofiber/utils v1.1.0/go.mod h1:poZpsnhBykfnY1Mc0KeEa6mSHrS3dV0+oBWyeQmb2e0=
github.com/gofrs/flock v0.13.0 h1:95JolYOvGMqeH31+FC7D2+uULf6mG61mEZ/A8dRYMzw=
github.com/gofrs/flock v0.13.0/go.mod h1:jxeyy9R1auM5S6JYDBhDt+E2TCo7DkratH4Pgi8P+Z0=
github.com/gogo/protobuf v1.3.2 h1:Ov1cvc58UF3b5XjBnZv7+opcTcQFZebYjWzi34vdm4Q=
@@ -479,8 +470,8 @@ github.com/golang/protobuf v1.5.2/go.mod h1:XVQd3VNwM+JqD3oG2Ue2ip4fOMUkwXdXDdiu
github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek=
github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps=
github.com/golang/snappy v0.0.2/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
github.com/golang/snappy v0.0.4 h1:yAGX7huGHXlcLOEtBnF4w7FQwA26wojNCwOYAEhLjQM=
github.com/golang/snappy v0.0.4/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
github.com/golang/snappy v0.0.5-0.20231225225746-43d5d4cd4e0e h1:4bw4WeyTYPp0smaXiJZCNnLrvVBqirQVreixayXezGc=
github.com/golang/snappy v0.0.5-0.20231225225746-43d5d4cd4e0e/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
github.com/gomarkdown/markdown v0.0.0-20250311123330-531bef5e742b h1:EY/KpStFl60qA17CptGXhwfZ+k1sFNJIUNR8DdbcuUk=
github.com/gomarkdown/markdown v0.0.0-20250311123330-531bef5e742b/go.mod h1:JDGcbDT52eL4fju3sZ4TeHGsQwhG9nbDV21aMyhwPoA=
github.com/google/btree v0.0.0-20180813153112-4030bb1f1f0c/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
@@ -587,27 +578,25 @@ github.com/huin/goupnp v1.3.0/go.mod h1:gnGPsThkYa7bFi/KWmEysQRf48l2dvR5bxr2OFck
github.com/ianlancetaylor/demangle v0.0.0-20181102032728-5e5cf60278f6/go.mod h1:aSSvb/t6k1mPoxDqO4vJh6VOCGPwU4O0C2/Eqndh1Sc=
github.com/ianlancetaylor/demangle v0.0.0-20200824232613-28f6c0f3b639/go.mod h1:aSSvb/t6k1mPoxDqO4vJh6VOCGPwU4O0C2/Eqndh1Sc=
github.com/inconshreveable/mousetrap v1.0.0/go.mod h1:PxqpIevigyE2G7u3NXJIT2ANytuPF1OarO4DADm73n8=
github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8=
github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw=
github.com/ipfs/boxo v0.30.0 h1:7afsoxPGGqfoH7Dum/wOTGUB9M5fb8HyKPMlLfBvIEQ=
github.com/ipfs/boxo v0.30.0/go.mod h1:BPqgGGyHB9rZZcPSzah2Dc9C+5Or3U1aQe7EH1H7370=
github.com/ipfs/go-block-format v0.2.0 h1:ZqrkxBA2ICbDRbK8KJs/u0O3dlp6gmAuuXUJNiW1Ycs=
github.com/ipfs/go-block-format v0.2.0/go.mod h1:+jpL11nFx5A/SPpsoBn6Bzkra/zaArfSmsknbPMYgzM=
github.com/invopop/jsonschema v0.13.0 h1:KvpoAJWEjR3uD9Kbm2HWJmqsEaHt8lBUpd0qHcIi21E=
github.com/invopop/jsonschema v0.13.0/go.mod h1:ffZ5Km5SWWRAIN6wbDXItl95euhFz2uON45H2qjYt+0=
github.com/ipfs/boxo v0.37.0 h1:2E3mZvydMI2t5IkAgtkmZ3sGsld0oS7o3I+xyzDk6uI=
github.com/ipfs/boxo v0.37.0/go.mod h1:8yyiRn54F2CsW13n0zwXEPrVsZix/gFj9SYIRYMZ6KE=
github.com/ipfs/go-block-format v0.2.3 h1:mpCuDaNXJ4wrBJLrtEaGFGXkferrw5eqVvzaHhtFKQk=
github.com/ipfs/go-block-format v0.2.3/go.mod h1:WJaQmPAKhD3LspLixqlqNFxiZ3BZ3xgqxxoSR/76pnA=
github.com/ipfs/go-cid v0.6.1 h1:T5TnNb08+ueovG76Z5gx1L4Y7QOaGTXHg1F6raWFxIc=
github.com/ipfs/go-cid v0.6.1/go.mod h1:zrY0SwOhjrrIdfPQ/kf+k1sXyJ0QE7cMxfCployLBs0=
github.com/ipfs/go-datastore v0.8.2 h1:Jy3wjqQR6sg/LhyY0NIePZC3Vux19nLtg7dx0TVqr6U=
github.com/ipfs/go-datastore v0.8.2/go.mod h1:W+pI1NsUsz3tcsAACMtfC+IZdnQTnC/7VfPoJBQuts0=
github.com/ipfs/go-datastore v0.9.1 h1:67Po2epre/o0UxrmkzdS9ZTe2GFGODgTd2odx8Wh6Yo=
github.com/ipfs/go-datastore v0.9.1/go.mod h1:zi07Nvrpq1bQwSkEnx3bfjz+SQZbdbWyCNvyxMh9pN0=
github.com/ipfs/go-detect-race v0.0.1 h1:qX/xay2W3E4Q1U7d9lNs1sU9nvguX0a7319XbyQ6cOk=
github.com/ipfs/go-detect-race v0.0.1/go.mod h1:8BNT7shDZPo99Q74BpGMK+4D8Mn4j46UU0LZ723meps=
github.com/ipfs/go-ipfs-util v0.0.3 h1:2RFdGez6bu2ZlZdI+rWfIdbQb1KudQp3VGwPtdNCmE0=
github.com/ipfs/go-ipfs-util v0.0.3/go.mod h1:LHzG1a0Ig4G+iZ26UUOMjHd+lfM84LZCrn17xAKWBvs=
github.com/ipfs/go-log v1.0.5 h1:2dOuUCB1Z7uoczMWgAyDck5JLb72zHzrMnGnCNNbvY8=
github.com/ipfs/go-log v1.0.5/go.mod h1:j0b8ZoR+7+R99LD9jZ6+AJsrzkPbSXbZfGakb5JPtIo=
github.com/ipfs/go-log/v2 v2.1.3/go.mod h1:/8d0SH3Su5Ooc31QlL1WysJhvyOTDCjcCZ9Axpmri6g=
github.com/ipfs/go-log/v2 v2.6.0 h1:2Nu1KKQQ2ayonKp4MPo6pXCjqw1ULc9iohRqWV5EYqg=
github.com/ipfs/go-log/v2 v2.6.0/go.mod h1:p+Efr3qaY5YXpx9TX7MoLCSEZX5boSWj9wh86P5HJa8=
github.com/ipfs/go-test v0.2.1 h1:/D/a8xZ2JzkYqcVcV/7HYlCnc7bv/pKHQiX5TdClkPE=
github.com/ipfs/go-test v0.2.1/go.mod h1:dzu+KB9cmWjuJnXFDYJwC25T3j1GcN57byN+ixmK39M=
github.com/ipfs/go-log/v2 v2.9.1 h1:3JXwHWU31dsCpvQ+7asz6/QsFJHqFr4gLgQ0FWteujk=
github.com/ipfs/go-log/v2 v2.9.1/go.mod h1:evFx7sBiohUN3AG12mXlZBw5hacBQld3ZPHrowlJYoo=
github.com/ipfs/go-test v0.2.3 h1:Z/jXNAReQFtCYyn7bsv/ZqUwS6E7iIcSpJ2CuzCvnrc=
github.com/ipfs/go-test v0.2.3/go.mod h1:QW8vSKkwYvWFwIZQLGQXdkt9Ud76eQXRQ9Ao2H+cA1o=
github.com/ipld/go-ipld-prime v0.23.0 h1:csqdPZH60BsTC+AZrv7fpa27v+09I/oTqyHYYYE27eE=
github.com/ipld/go-ipld-prime v0.23.0/go.mod h1:46YCFSFNFBJHPjB0pfMuv7Ly7df2eChpkpyPo5SE0bA=
github.com/jackc/pgpassfile v1.0.0 h1:/6Hmqy13Ss2zCq62VdNG8tM1wchn8zjSGOBJ6icpsIM=
@@ -695,18 +684,18 @@ github.com/libp2p/go-buffer-pool v0.1.0 h1:oK4mSFcQz7cTQIfqbe4MIj9gLW+mnanjyFtc6
github.com/libp2p/go-buffer-pool v0.1.0/go.mod h1:N+vh8gMqimBzdKkSMVuydVDq+UV5QTWy5HSiZacSbPg=
github.com/libp2p/go-cidranger v1.1.0 h1:ewPN8EZ0dd1LSnrtuwd4709PXVcITVeuwbag38yPW7c=
github.com/libp2p/go-cidranger v1.1.0/go.mod h1:KWZTfSr+r9qEo9OkI9/SIEeAtw+NNoU0dXIXt15Okic=
github.com/libp2p/go-flow-metrics v0.2.0 h1:EIZzjmeOE6c8Dav0sNv35vhZxATIXWZg6j/C08XmmDw=
github.com/libp2p/go-flow-metrics v0.2.0/go.mod h1:st3qqfu8+pMfh+9Mzqb2GTiwrAGjIPszEjZmtksN8Jc=
github.com/libp2p/go-flow-metrics v0.3.0 h1:q31zcHUvHnwDO0SHaukewPYgwOBSxtt830uJtUx6784=
github.com/libp2p/go-flow-metrics v0.3.0/go.mod h1:nuhlreIwEguM1IvHAew3ij7A8BMlyHQJ279ao24eZZo=
github.com/libp2p/go-libp2p v0.48.0 h1:h2BrLAgrj7X8bEN05K7qmrjpNHYA+6tnsGRdprjTnvo=
github.com/libp2p/go-libp2p v0.48.0/go.mod h1:Q1fBZNdmC2Hf82husCTfkKJVfHm2we5zk+NWmOGEmWk=
github.com/libp2p/go-libp2p-asn-util v0.4.1 h1:xqL7++IKD9TBFMgnLPZR6/6iYhawHKHl950SO9L6n94=
github.com/libp2p/go-libp2p-asn-util v0.4.1/go.mod h1:d/NI6XZ9qxw67b4e+NgpQexCIiFYJjErASrYW4PFDN8=
github.com/libp2p/go-libp2p-kad-dht v0.33.1 h1:hKFhHMf7WH69LDjaxsJUWOU6qZm71uO47M/a5ijkiP0=
github.com/libp2p/go-libp2p-kad-dht v0.33.1/go.mod h1:CdmNk4VeGJa9EXM9SLNyNVySEvduKvb+5rSC/H4pLAo=
github.com/libp2p/go-libp2p-kbucket v0.7.0 h1:vYDvRjkyJPeWunQXqcW2Z6E93Ywx7fX0jgzb/dGOKCs=
github.com/libp2p/go-libp2p-kbucket v0.7.0/go.mod h1:blOINGIj1yiPYlVEX0Rj9QwEkmVnz3EP8LK1dRKBC6g=
github.com/libp2p/go-libp2p-pubsub v0.14.2 h1:nT5lFHPQOFJcp9CW8hpKtvbpQNdl2udJuzLQWbgRum8=
github.com/libp2p/go-libp2p-pubsub v0.14.2/go.mod h1:MKPU5vMI8RRFyTP0HfdsF9cLmL1nHAeJm44AxJGJx44=
github.com/libp2p/go-libp2p-kad-dht v0.39.0 h1:mww38eBYiUvdsu+Xl/GLlBC0Aa8M+5HAwvafkFOygAM=
github.com/libp2p/go-libp2p-kad-dht v0.39.0/go.mod h1:Po2JugFEkDq9Vig/JXtc153ntOi0q58o4j7IuITCOVs=
github.com/libp2p/go-libp2p-kbucket v0.8.0 h1:QAK7RzKJpYe+EuSEATAaaHYMYLkPDGC18m9jxPLnU8s=
github.com/libp2p/go-libp2p-kbucket v0.8.0/go.mod h1:JMlxqcEyKwO6ox716eyC0hmiduSWZZl6JY93mGaaqc4=
github.com/libp2p/go-libp2p-pubsub v0.15.0 h1:cG7Cng2BT82WttmPFMi50gDNV+58K626m/wR00vGL1o=
github.com/libp2p/go-libp2p-pubsub v0.15.0/go.mod h1:lr4oE8bFgQaifRcoc2uWhWWiK6tPdOEKpUuR408GFN4=
github.com/libp2p/go-libp2p-record v0.3.1 h1:cly48Xi5GjNw5Wq+7gmjfBiG9HCzQVkiZOUZ8kUl+Fg=
github.com/libp2p/go-libp2p-record v0.3.1/go.mod h1:T8itUkLcWQLCYMqtX7Th6r7SexyUJpIyPgks757td/E=
github.com/libp2p/go-libp2p-routing-helpers v0.7.5 h1:HdwZj9NKovMx0vqq6YNPTh6aaNzey5zHD7HeLJtq6fI=
@@ -719,8 +708,8 @@ github.com/libp2p/go-netroute v0.4.0 h1:sZZx9hyANYUx9PZyqcgE/E1GUG3iEtTZHUEvdtXT
github.com/libp2p/go-netroute v0.4.0/go.mod h1:Nkd5ShYgSMS5MUKy/MU2T57xFoOKvvLR92Lic48LEyA=
github.com/libp2p/go-reuseport v0.4.0 h1:nR5KU7hD0WxXCJbmw7r2rhRYruNRl2koHw8fQscQm2s=
github.com/libp2p/go-reuseport v0.4.0/go.mod h1:ZtI03j/wO5hZVDFo2jKywN6bYKWLOy8Se6DrI2E1cLU=
github.com/libp2p/go-yamux/v5 v5.0.1 h1:f0WoX/bEF2E8SbE4c/k1Mo+/9z0O4oC/hWEA+nfYRSg=
github.com/libp2p/go-yamux/v5 v5.0.1/go.mod h1:en+3cdX51U0ZslwRdRLrvQsdayFt3TSUKvBGErzpWbU=
github.com/libp2p/go-yamux/v5 v5.1.0 h1:8Qlxj4E9JGJAQVW6+uj2o7mqkqsIVlSUGmTWhlXzoHE=
github.com/libp2p/go-yamux/v5 v5.1.0/go.mod h1:tgIQ07ObtRR/I0IWsFOyQIL9/dR5UXgc2s8xKmNZv1o=
github.com/libp2p/zeroconf/v2 v2.2.0 h1:Cup06Jv6u81HLhIj1KasuNM/RHHrJ8T7wOTS4+Tv53Q=
github.com/libp2p/zeroconf/v2 v2.2.0/go.mod h1:fuJqLnUwZTshS3U/bMRJ3+ow/v9oid1n0DmyYyNO1Xs=
github.com/lithammer/fuzzysearch v1.1.8 h1:/HIuJnjHuXS8bKaiTMeeDlW2/AyIWk2brx1V8LFgLN4=
@@ -765,8 +754,8 @@ github.com/microcosm-cc/bluemonday v1.0.27 h1:MpEUotklkwCSLeH+Qdx1VJgNqLlpY2KXwX
github.com/microcosm-cc/bluemonday v1.0.27/go.mod h1:jFi9vgW+H7c3V0lb6nR74Ib/DIB5OBs92Dimizgw2cA=
github.com/miekg/dns v1.0.14/go.mod h1:W1PPwlIAgtquWBMBEV9nkV9Cazfe8ScdGz/Lj7v3Nrg=
github.com/miekg/dns v1.1.43/go.mod h1:+evo5L0630/F6ca/Z9+GAqzhjGyn8/c+TBaOyfEl0V4=
github.com/miekg/dns v1.1.66 h1:FeZXOS3VCVsKnEAd+wBkjMC3D2K+ww66Cq3VnCINuJE=
github.com/miekg/dns v1.1.66/go.mod h1:jGFzBsSNbJw6z1HYut1RKBKHA9PBdxeHrZG8J+gC2WE=
github.com/miekg/dns v1.1.72 h1:vhmr+TF2A3tuoGNkLDFK9zi36F2LS+hKTRW0Uf8kbzI=
github.com/miekg/dns v1.1.72/go.mod h1:+EuEPhdHOsfk6Wk5TT2CzssZdqkmFhf8r+aVyDEToIs=
github.com/mikioh/tcp v0.0.0-20190314235350-803a9b46060c h1:bzE/A84HN25pxAuk9Eej1Kz9OUelF97nAc82bDquQI8=
github.com/mikioh/tcp v0.0.0-20190314235350-803a9b46060c/go.mod h1:0SQS9kMwD2VsyFEB++InYyBJroV/FRmBgcydeSUcJms=
github.com/mikioh/tcpinfo v0.0.0-20190314235526-30a79bb1804b h1:z78hV3sbSMAUoyUMM0I83AUIT6Hu17AWfgjzIbtrYFc=
@@ -828,14 +817,12 @@ github.com/mr-tron/base58 v1.3.0 h1:K6Y13R2h+dku0wOqKtecgRnBUBPrZzLZy5aIj8lCcJI=
github.com/mr-tron/base58 v1.3.0/go.mod h1:2BuubE67DCSWwVfx37JWNG8emOC0sHEU4/HpcYgCLX8=
github.com/mschoch/smat v0.2.0 h1:8imxQsjDm8yFEAVBe7azKmKSgzSkZXDuKkSq9374khM=
github.com/mschoch/smat v0.2.0/go.mod h1:kc9mz7DoBKqDyiRL7VZN8KvXQMWeTaVnttLRXOlotKw=
github.com/mudler/LocalAGI v0.0.0-20260507074708-c1a12317930d h1:PYrydMGkFcEzNpazJ4ptaZdxG29CIQbUE1j0YRDFswA=
github.com/mudler/LocalAGI v0.0.0-20260507074708-c1a12317930d/go.mod h1:x77p9W1zKZr+W+UcEwg8/qdp00p4XXOI69wE7WlXZc0=
github.com/mudler/LocalAGI v0.0.0-20260508125235-37810d918a87 h1:az+2umaD/sT1rRvI3WZHWXjzdJVJHxcyxp0SNYbqlFk=
github.com/mudler/LocalAGI v0.0.0-20260508125235-37810d918a87/go.mod h1:x77p9W1zKZr+W+UcEwg8/qdp00p4XXOI69wE7WlXZc0=
github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b h1:A74T2Lauvg61KodYqsjTYDY05kPLcW+efVZjd23dghU=
github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b/go.mod h1:6sfja3lcu2nWRzEc0wwqGNu/eCG3EWgij+8s7xyUeQ4=
github.com/mudler/edgevpn v0.31.1 h1:7qegiDWd0kAg6ljhNHxqvp8hbo/6BbzSdbb7/2WZfiY=
github.com/mudler/edgevpn v0.31.1/go.mod h1:ftV5B0nKFzm4R8vR80UYnCb2nf7lxCRgAALxUEEgCf8=
github.com/mudler/edgevpn v0.32.2 h1:umTPyyZgkom/A81Bk4HbP0p1ZSEU5EFPW3Bg+YPxI8A=
github.com/mudler/edgevpn v0.32.2/go.mod h1:UaMc8MORbcRsAjuO5gVJj9Bn3Nq2AP5U9NTb6epVyv8=
github.com/mudler/go-piper v0.0.0-20241023091659-2494246fd9fc h1:RxwneJl1VgvikiX28EkpdAyL4yQVnJMrbquKospjHyA=
github.com/mudler/go-piper v0.0.0-20241023091659-2494246fd9fc/go.mod h1:O7SwdSWMilAWhBZMK9N9Y/oBDyMMzshE3ju8Xkexwig=
github.com/mudler/go-processmanager v0.1.1 h1:c/1NRZOZpW8HuFv9RhBG57nQu1oDMRomEHedwBFMlrw=
@@ -861,8 +848,8 @@ github.com/multiformats/go-base36 v0.2.0/go.mod h1:qvnKE++v+2MWCfePClUEjE78Z7P2a
github.com/multiformats/go-multiaddr v0.1.1/go.mod h1:aMKBKNEYmzmDmxfX88/vz+J5IU55txyt0p4aiWVohjo=
github.com/multiformats/go-multiaddr v0.16.1 h1:fgJ0Pitow+wWXzN9do+1b8Pyjmo8m5WhGfzpL82MpCw=
github.com/multiformats/go-multiaddr v0.16.1/go.mod h1:JSVUmXDjsVFiW7RjIFMP7+Ev+h1DTbiJgVeTV/tcmP0=
github.com/multiformats/go-multiaddr-dns v0.4.1 h1:whi/uCLbDS3mSEUMb1MsoT4uzUeZB0N32yzufqS0i5M=
github.com/multiformats/go-multiaddr-dns v0.4.1/go.mod h1:7hfthtB4E4pQwirrz+J0CcDUfbWzTqEzVyYKKIKpgkc=
github.com/multiformats/go-multiaddr-dns v0.5.0 h1:p/FTyHKX0nl59f+S+dEUe8HRK+i5Ow/QHMw8Nh3gPCo=
github.com/multiformats/go-multiaddr-dns v0.5.0/go.mod h1:yJ349b8TPIAANUyuOzn1oz9o22tV9f+06L+cCeMxC14=
github.com/multiformats/go-multiaddr-fmt v0.1.0 h1:WLEFClPycPkp4fnIzoFoV9FVd49/eQsuaL3/CWe167E=
github.com/multiformats/go-multiaddr-fmt v0.1.0/go.mod h1:hGtDIW4PU4BqJ50gW2quDuPVjyWNZxToGUh/HwTZYJo=
github.com/multiformats/go-multibase v0.3.0 h1:8helZD2+4Db7NNWFiktk2NePbF0boolBe6bDQvM4r68=
@@ -902,8 +889,8 @@ github.com/onsi/ginkgo v1.16.5 h1:8xi0RTUf59SOSfEtZMvwTvXYMzG4gV23XVHOZiXNtnE=
github.com/onsi/ginkgo v1.16.5/go.mod h1:+E8gABHa3K6zRBolWtd+ROzc/U5bkGt0FwiG042wbpU=
github.com/onsi/ginkgo/v2 v2.28.2 h1:DTrMfpqxiNUyQ3Y0zhn1n3cOO2euFgQPYIpkWwxVFps=
github.com/onsi/ginkgo/v2 v2.28.2/go.mod h1:CLtbVInNckU3/+gC8LzkGUb9oF+e8W8TdUsxPwvdOgE=
github.com/onsi/gomega v1.39.1 h1:1IJLAad4zjPn2PsnhH70V4DKRFlrCzGBNrNaru+Vf28=
github.com/onsi/gomega v1.39.1/go.mod h1:hL6yVALoTOxeWudERyfppUcZXjMwIMLnuSfruD2lcfg=
github.com/onsi/gomega v1.40.0 h1:Vtol0e1MghCD2ZVIilPDIg44XSL9l2QAn8ZNaljWcJc=
github.com/onsi/gomega v1.40.0/go.mod h1:M/Uqpu/8qTjtzCLUA2zJHX9Iilrau25x1PdoSRbWh5A=
github.com/openai/openai-go/v3 v3.26.0 h1:bRt6H/ozMNt/dDkN4gobnLqaEGrRGBzmbVs0xxJEnQE=
github.com/openai/openai-go/v3 v3.26.0/go.mod h1:cdufnVK14cWcT9qA1rRtrXx4FTRsgbDPW7Ia7SS5cZo=
github.com/opencontainers/go-digest v1.0.0 h1:apOUWs51W5PlhuyGyz9FCeeBIOUDA/6nW8Oi/yOhh5U=
@@ -968,8 +955,8 @@ github.com/pion/turn/v4 v4.1.4 h1:EU11yMXKIsK43FhcUnjLlrhE4nboHZq+TXBIi3QpcxQ=
github.com/pion/turn/v4 v4.1.4/go.mod h1:ES1DXVFKnOhuDkqn9hn5VJlSWmZPaRJLyBXoOeO/BmQ=
github.com/pion/webrtc/v4 v4.2.11 h1:QUX1QZKlNIn4O7U5JxLPGP0sV5RTncZkzu9SPR3jVNU=
github.com/pion/webrtc/v4 v4.2.11/go.mod h1:s/rAiyy77GyRFrZMx+Ls6aua26dIBPudH8/ZHYbIRWY=
github.com/pjbgf/sha1cd v0.3.2 h1:a9wb0bp1oC2TGwStyn0Umc/IGKQnEgF0vVaZ8QF8eo4=
github.com/pjbgf/sha1cd v0.3.2/go.mod h1:zQWigSxVmsHEZow5qaLtPYxpcKMMQpa09ixqBxuCS6A=
github.com/pjbgf/sha1cd v0.6.0 h1:3WJ8Wz8gvDz29quX1OcEmkAlUg9diU4GxJHqs0/XiwU=
github.com/pjbgf/sha1cd v0.6.0/go.mod h1:lhpGlyHLpQZoxMv8HcgXvZEhcGs0PG/vsZnEJ7H0iCM=
github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
@@ -1019,7 +1006,6 @@ github.com/rs/zerolog v1.31.0/go.mod h1:/7mN4D5sKwJLZQ2b/znpjC3/GQWY/xaDXUM0kKWR
github.com/russross/blackfriday v1.6.0 h1:KqfZb0pUVN2lYqZUYRddxF4OR8ZMURnJIG5Y3VRLtww=
github.com/russross/blackfriday v1.6.0/go.mod h1:ti0ldHuxg49ri4ksnFxlkCfN+hvslNlmVHqNRXXJNAY=
github.com/russross/blackfriday/v2 v2.0.1/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
github.com/ruudk/golang-pdf417 v0.0.0-20181029194003-1af4ab5afa58/go.mod h1:6lfFZQK844Gfx8o5WFuvpxWRwnSoipWe/p622j1v06w=
github.com/ryanuber/columnize v0.0.0-20160712163229-9b3edd62028f/go.mod h1:sm1tb6uqfes/u+d4ooFouqFdy9/2g9QGwK3SQygK0Ts=
github.com/rymdport/portal v0.4.2 h1:7jKRSemwlTyVHHrTGgQg7gmNPJs88xkbKcIL3NlcmSU=
@@ -1078,13 +1064,8 @@ github.com/spf13/cast v1.3.1/go.mod h1:Qx5cxh0v+4UWYiBimWS+eyWzqEqokIECu5etghLkU
github.com/spf13/cast v1.7.0 h1:ntdiHjuueXFgm5nzDRdOS4yfT43P5Fnud6DH50rz/7w=
github.com/spf13/cast v1.7.0/go.mod h1:ancEpBxwJDODSW/UG4rDrAqiKolqNNh2DX3mk86cAdo=
github.com/spf13/cobra v1.2.1/go.mod h1:ExllRjgxM/piMAM+3tAZvg8fsklGAf3tPfi+i8t68Nk=
github.com/spf13/cobra v1.10.2 h1:DMTTonx5m65Ic0GOoRY2c16WCbHxOOw6xxezuLaBpcU=
github.com/spf13/cobra v1.10.2/go.mod h1:7C1pvHqHw5A4vrJfjNwvOdzYu0Gml16OCs2GRiTUUS4=
github.com/spf13/jwalterweatherman v1.1.0/go.mod h1:aNWZUN0dPAAO/Ljvb5BEdw96iTZ0EXowPYD95IqWIGo=
github.com/spf13/pflag v1.0.5/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
github.com/spf13/pflag v1.0.9/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
github.com/spf13/pflag v1.0.10 h1:4EBh2KAYBwaONj6b2Ye1GiHfwjqyROoF4RwYO+vPwFk=
github.com/spf13/pflag v1.0.10/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
github.com/spf13/viper v1.8.1/go.mod h1:o0Pch8wJ9BVSWGQMbra6iw0oQ5oktSIBaujf1rJH9Ns=
github.com/srwiley/oksvg v0.0.0-20221011165216-be6e8873101c h1:km8GpoQut05eY3GiYWEedbTT0qnSxrCjsVbb7yKY1KE=
github.com/srwiley/oksvg v0.0.0-20221011165216-be6e8873101c/go.mod h1:cNQ3dwVJtS5Hmnjxy6AgTPd0Inb3pW05ftPSX7NZO7Q=
@@ -1092,6 +1073,8 @@ github.com/srwiley/rasterx v0.0.0-20220730225603-2ab79fcdd4ef h1:Ch6Q+AZUxDBCVqd
github.com/srwiley/rasterx v0.0.0-20220730225603-2ab79fcdd4ef/go.mod h1:nXTWP6+gD5+LUJ8krVhhoeHjvHTutPxMYl5SvkcnJNE=
github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf h1:pvbZ0lM0XWPBqUKqFU8cmavspvIl9nulOYwdy6IFRRo=
github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf/go.mod h1:RJID2RhlZKId02nZ62WenDCkgHFerpIOmW0iT7GKmXM=
github.com/standard-webhooks/standard-webhooks/libraries v0.0.0-20260508151727-1282bb917829 h1:zGlGD0Zfk2HaIo4EnUVBRhnXQ+cnGQz5X2PdBcplOyw=
github.com/standard-webhooks/standard-webhooks/libraries v0.0.0-20260508151727-1282bb917829/go.mod h1:L1MQhA6x4dn9r007T033lsaZMv9EmBAdXyU/+EF40fo=
github.com/streamer45/silero-vad-go v0.2.1 h1:Li1/tTC4H/3cyw6q4weX+U8GWwEL3lTekK/nYa1Cvuk=
github.com/streamer45/silero-vad-go v0.2.1/go.mod h1:B+2FXs/5fZ6pzl6unUZYhZqkYdOB+3saBVzjOzdZnUs=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
@@ -1167,9 +1150,8 @@ github.com/valyala/fasttemplate v1.2.2 h1:lxLXG0uE3Qnshl9QyaK6XJxMXlQZELvChBOCmQ
github.com/valyala/fasttemplate v1.2.2/go.mod h1:KHLXt3tVN2HBp8eijSv/kGJopbvo7S+qRAEEKiv+SiQ=
github.com/vbatts/tar-split v0.12.2 h1:w/Y6tjxpeiFMR47yzZPlPj/FcPLpXbTUi/9H7d3CPa4=
github.com/vbatts/tar-split v0.12.2/go.mod h1:eF6B6i6ftWQcDqEn3/iGFRFRo8cBIMSJVOpnNdfTMFA=
github.com/vishvananda/netlink v1.3.0 h1:X7l42GfcV4S6E4vHTsw48qbrV+9PVojNfIhZcwQdrZk=
github.com/vishvananda/netlink v1.3.0/go.mod h1:i6NetklAujEcC6fK0JPjT8qSwWyO0HLn4UKG+hGqeJs=
github.com/vishvananda/netns v0.0.4/go.mod h1:SpkAiCQRtJ6TvvxPnOSyH3BMl6unz3xZlaprSwhNNJM=
github.com/vishvananda/netlink v1.3.1 h1:3AEMt62VKqz90r0tmNhog0r/PpWKmrEShJU0wJW6bV0=
github.com/vishvananda/netlink v1.3.1/go.mod h1:ARtKouGSTGchR8aMwmkzC0qiNPrrWO5JS/XMVl45+b4=
github.com/vishvananda/netns v0.0.5 h1:DfiHV+j8bA32MFM7bfEunvT8IAqQ/NzSJHtcmW5zdEY=
github.com/vishvananda/netns v0.0.5/go.mod h1:SpkAiCQRtJ6TvvxPnOSyH3BMl6unz3xZlaprSwhNNJM=
github.com/warpfork/go-wish v0.0.0-20220906213052-39a1cc7a02d0 h1:GDDkbFiaK8jsSDJfjId/PEGEShv6ugrt4kYsC5UIDaQ=
@@ -1220,8 +1202,8 @@ go.opencensus.io v0.24.0 h1:y73uSU6J157QMP2kn2r30vwW1A2W2WFwSCGnAVxeaD0=
go.opencensus.io v0.24.0/go.mod h1:vNK8G9p7aAivkbmorf4v+7Hgx+Zs0yY+0fOtgBfjQKo=
go.opentelemetry.io/auto/sdk v1.2.1 h1:jXsnJ4Lmnqd11kwkBV2LgLoFMZKizbCi5fNZ/ipaZ64=
go.opentelemetry.io/auto/sdk v1.2.1/go.mod h1:KRTj+aOaElaLi+wW1kO/DZRXwkF4C5xPbEe3ZiIhN7Y=
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.61.0 h1:F7Jx+6hwnZ41NSFTO5q4LYDtJRXBf2PD0rNBkeB/lus=
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.61.0/go.mod h1:UHB22Z8QsdRDrnAtX4PntOl36ajSxcdUMt1sF7Y6E7Q=
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.65.0 h1:7iP2uCb7sGddAr30RRS6xjKy7AZ2JtTOPA3oolgVSw8=
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.65.0/go.mod h1:c7hN3ddxs/z6q9xwvfLPk+UHlWRQyaeR1LdgfL/66l0=
go.opentelemetry.io/otel v1.43.0 h1:mYIM03dnh5zfN7HautFE4ieIig9amkNANT+xcVxAj9I=
go.opentelemetry.io/otel v1.43.0/go.mod h1:JuG+u74mvjvcm8vj8pI5XiHy1zDeoCS2LB1spIq7Ay0=
go.opentelemetry.io/otel/exporters/prometheus v0.65.0 h1:jOveH/b4lU9HT7y+Gfamf18BqlOuz2PWEvs8yM7Q6XE=
@@ -1253,8 +1235,8 @@ go.uber.org/multierr v1.11.0/go.mod h1:20+QtiLqy0Nd6FdQB9TLXag12DsQkrbs3htMFfDN8
go.uber.org/tools v0.0.0-20190618225709-2cfd321de3ee/go.mod h1:vJERXedbb3MVM5f9Ejo0C68/HhF8uaILCdgjnY+goOA=
go.uber.org/zap v1.16.0/go.mod h1:MA8QOfq0BHJwdXa996Y4dYkAqRKB8/1K1QMMZVaNZjQ=
go.uber.org/zap v1.17.0/go.mod h1:MXVU+bhUf/A7Xi2HNOnopQOrmycQ5Ih87HtOu4q5SSo=
go.uber.org/zap v1.27.0 h1:aJMhYGrd5QSmlpLMr2MftRKl7t8J8PTZPA732ud/XR8=
go.uber.org/zap v1.27.0/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E=
go.uber.org/zap v1.27.1 h1:08RqriUEv8+ArZRYSTXy1LeBScaMpVSTBhCeaZYfMYc=
go.uber.org/zap v1.27.1/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E=
go.yaml.in/yaml/v2 v2.4.4 h1:tuyd0P+2Ont/d6e2rl3be67goVK4R6deVxCUX5vyPaQ=
go.yaml.in/yaml/v2 v2.4.4/go.mod h1:gMZqIpDtDqOfM0uNfy0SkpRhvUryYH0Z6wdMYcacYXQ=
go.yaml.in/yaml/v3 v3.0.4 h1:tfq32ie2Jv2UxXFdLJdh3jXuOzWiL1fo0bu/FbuKpbc=
@@ -1289,8 +1271,8 @@ golang.org/x/exp v0.0.0-20191227195350-da58074b4299/go.mod h1:2RIsYlXP63K8oxa1u0
golang.org/x/exp v0.0.0-20200119233911-0405dc783f0a/go.mod h1:2RIsYlXP63K8oxa1u096TMicItID8zy7Y6sNkU49FU4=
golang.org/x/exp v0.0.0-20200207192155-f17229e696bd/go.mod h1:J/WKrq2StrnmMY6+EHIKF9dgMWnmCNThgcyBT1FY9mM=
golang.org/x/exp v0.0.0-20200224162631-6cc2880d07d6/go.mod h1:3jZMyOhIsHpP37uCMkUooju7aAi5cS1Q23tOzKc+0MU=
golang.org/x/exp v0.0.0-20250606033433-dcc06ee1d476 h1:bsqhLWFR6G6xiQcb+JoGqdKdRU6WzPWmK8E0jxTjzo4=
golang.org/x/exp v0.0.0-20250606033433-dcc06ee1d476/go.mod h1:3//PLf8L/X+8b4vuAfHzxeRUl04Adcb341+IGKfnqS8=
golang.org/x/exp v0.0.0-20260410095643-746e56fc9e2f h1:W3F4c+6OLc6H2lb//N1q4WpJkhzJCK5J6kUi1NTVXfM=
golang.org/x/exp v0.0.0-20260410095643-746e56fc9e2f/go.mod h1:J1xhfL/vlindoeF/aINzNzt2Bket5bjo9sdOYzOsU80=
golang.org/x/image v0.0.0-20190227222117-0694c2d4d067/go.mod h1:kZ7UVZpmo3dzQBMxlp+ypCbDeSB+sBbTgSJuh5dn5js=
golang.org/x/image v0.0.0-20190802002840-cff245a6509b/go.mod h1:FeLwcggjj3mMvU+oOTbSwawSJRM1uh48EjtB4UJZlP0=
golang.org/x/image v0.0.0-20190910094157-69e4b8554b2a/go.mod h1:FeLwcggjj3mMvU+oOTbSwawSJRM1uh48EjtB4UJZlP0=
@@ -1662,8 +1644,8 @@ google.golang.org/genproto v0.0.0-20210310155132-4ce2db91004e/go.mod h1:FWY/as6D
google.golang.org/genproto v0.0.0-20210319143718-93e7006c17a6/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no=
google.golang.org/genproto v0.0.0-20210402141018-6c239bbf2bb1/go.mod h1:9lPAdzaEmUacj36I+k7YKbEc5CXzPIeORRgDAUOu28A=
google.golang.org/genproto v0.0.0-20210602131652-f16073e35f0c/go.mod h1:UODoCrxHCcBojKKwX1terBiRUaqAsFqJiF615XL43r0=
google.golang.org/genproto/googleapis/rpc v0.0.0-20260120221211-b8f7ae30c516 h1:sNrWoksmOyF5bvJUcnmbeAmQi8baNhqg5IWaI3llQqU=
google.golang.org/genproto/googleapis/rpc v0.0.0-20260120221211-b8f7ae30c516/go.mod h1:j9x/tPzZkyxcgEFkiKEEGxfvyumM01BEtsW8xzOahRQ=
google.golang.org/genproto/googleapis/rpc v0.0.0-20260128011058-8636f8732409 h1:H86B94AW+VfJWDqFeEbBPhEtHzJwJfTbgE2lZa54ZAQ=
google.golang.org/genproto/googleapis/rpc v0.0.0-20260128011058-8636f8732409/go.mod h1:j9x/tPzZkyxcgEFkiKEEGxfvyumM01BEtsW8xzOahRQ=
google.golang.org/grpc v1.19.0/go.mod h1:mqu4LbDTu4XGKhr4mRzUsmM4RtVoemTSY81AxZiDr8c=
google.golang.org/grpc v1.20.1/go.mod h1:10oTOabMzJvdu6/UiuZezV6QK5dSlG84ov/aaiqXj38=
google.golang.org/grpc v1.21.1/go.mod h1:oYelfM1adQP15Ek0mdvEgi9Df8B9CZIaU1084ijfRaM=

@@ -1,8 +1,10 @@
package xsysinfo
import (
"bufio"
"bytes"
"encoding/json"
"io"
"os"
"os/exec"
"strconv"
@@ -801,14 +803,15 @@ func GetResourceAggregateInfo() AggregateMemoryInfo {
return resourceInfo.Aggregate
}
// getVulkanGPUMemory queries GPUs using vulkaninfo as a fallback
// Note: Vulkan provides memory heap info but not real-time usage
// getVulkanGPUMemory queries GPUs using vulkaninfo as a fallback.
// Note: vulkaninfo JSON is a Vulkan Profiles export and does not include
// VkPhysicalDeviceMemoryProperties, so memory heaps are parsed from text output.
func getVulkanGPUMemory() []GPUMemoryInfo {
if _, err := exec.LookPath("vulkaninfo"); err != nil {
return nil
}
cmd := exec.Command("vulkaninfo", "--json")
cmd := exec.Command("vulkaninfo", "--text")
var stdout, stderr bytes.Buffer
cmd.Stdout = &stdout
@@ -819,60 +822,174 @@ func getVulkanGPUMemory() []GPUMemoryInfo {
return nil
}
// Parse Vulkan JSON output
var result struct {
VkPhysicalDevices []struct {
DeviceName string `json:"deviceName"`
DeviceType string `json:"deviceType"`
VkPhysicalDeviceMemoryProperties struct {
MemoryHeaps []struct {
Flags int `json:"flags"`
Size uint64 `json:"size"`
} `json:"memoryHeaps"`
} `json:"VkPhysicalDeviceMemoryProperties"`
} `json:"VkPhysicalDevices"`
}
return parseVulkanGPUMemoryText(strings.NewReader(stdout.String()))
if err := json.Unmarshal(stdout.Bytes(), &result); err != nil {
xlog.Debug("failed to parse vulkaninfo output", "error", err)
return nil
}
}
type vulkanGPUTextInfo struct {
index int
name string
deviceType string
totalVRAM uint64
}
func parseVulkanGPUMemoryText(r io.Reader) []GPUMemoryInfo {
var gpus []GPUMemoryInfo
var current *vulkanGPUTextInfo
for i, device := range result.VkPhysicalDevices {
// Skip non-discrete/integrated GPUs if possible
if device.DeviceType == "VK_PHYSICAL_DEVICE_TYPE_CPU" {
continue
inMemoryProperties := false
inMemoryHeaps := false
inHeap := false
heapSize := uint64(0)
heapDeviceLocal := false
flushHeap := func() {
if current != nil && inHeap && heapDeviceLocal {
current.totalVRAM += heapSize
}
heapSize = 0
heapDeviceLocal = false
inHeap = false
}
// Sum up device-local memory heaps
var totalVRAM uint64
for _, heap := range device.VkPhysicalDeviceMemoryProperties.MemoryHeaps {
// Flag 1 = VK_MEMORY_HEAP_DEVICE_LOCAL_BIT
if heap.Flags&1 != 0 {
totalVRAM += heap.Size
}
}
if totalVRAM == 0 {
continue
flushGPU := func() {
if current == nil || current.totalVRAM == 0 || current.deviceType == "PHYSICAL_DEVICE_TYPE_CPU" {
return
}
gpus = append(gpus, GPUMemoryInfo{
Index: i,
Name: device.DeviceName,
Index: current.index,
Name: current.name,
Vendor: VendorVulkan,
TotalVRAM: totalVRAM,
UsedVRAM: 0, // Vulkan doesn't provide real-time usage
FreeVRAM: totalVRAM,
TotalVRAM: current.totalVRAM,
UsedVRAM: 0, // Vulkan heap size is capacity, not real-time usage.
FreeVRAM: current.totalVRAM,
UsagePercent: 0,
})
}
scanner := bufio.NewScanner(r)
for scanner.Scan() {
line := strings.TrimSpace(scanner.Text())
if line == "" {
continue
}
if index, ok := parseVulkanGPUHeader(line); ok {
flushHeap()
flushGPU()
current = &vulkanGPUTextInfo{index: index}
inMemoryProperties = false
inMemoryHeaps = false
continue
}
if current == nil {
continue
}
if strings.HasPrefix(line, "deviceType") {
current.deviceType = parseVulkanValue(line)
continue
}
if strings.HasPrefix(line, "deviceName") {
current.name = parseVulkanValue(line)
continue
}
if line == "VkPhysicalDeviceMemoryProperties:" {
inMemoryProperties = true
inMemoryHeaps = false
flushHeap()
continue
}
if !inMemoryProperties {
continue
}
if strings.HasPrefix(line, "memoryHeaps:") {
inMemoryHeaps = true
continue
}
if strings.HasPrefix(line, "memoryTypes:") {
flushHeap()
inMemoryProperties = false
inMemoryHeaps = false
continue
}
if !inMemoryHeaps {
continue
}
if strings.HasPrefix(line, "memoryHeaps[") {
flushHeap()
inHeap = true
continue
}
if !inHeap {
continue
}
if strings.HasPrefix(line, "size") {
if size, ok := parseVulkanUintValue(line); ok {
heapSize = size
}
continue
}
if strings.Contains(line, "MEMORY_HEAP_DEVICE_LOCAL_BIT") {
heapDeviceLocal = true
}
}
flushHeap()
flushGPU()
return gpus
}
func parseVulkanGPUHeader(line string) (int, bool) {
if !strings.HasPrefix(line, "GPU") || !strings.HasSuffix(line, ":") {
return 0, false
}
index, err := strconv.Atoi(strings.TrimSuffix(strings.TrimPrefix(line, "GPU"), ":"))
if err != nil {
return 0, false
}
return index, true
}
func parseVulkanValue(line string) string {
_, value, ok := strings.Cut(line, "=")
if !ok {
return ""
}
return strings.TrimSpace(value)
}
func parseVulkanUintValue(line string) (uint64, bool) {
value := parseVulkanValue(line)
fields := strings.Fields(value)
if len(fields) == 0 {
return 0, false
}
parsed, err := strconv.ParseUint(fields[0], 0, 64)
if err != nil {
return 0, false
}
return parsed, true
}
// getAppleGPUMemory detects Apple Silicon GPUs using system_profiler (macOS only).
// Apple Silicon uses unified memory, so GPU memory is reported as system RAM.
func getAppleGPUMemory() []GPUMemoryInfo {
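
Reviewer note: the heap-summing logic in parseVulkanGPUMemoryText can be exercised without a GPU. The following is a minimal sketch, not the code from this change. parseVulkanValue and parseVulkanUintValue follow the hunk above, while sumDeviceLocalHeaps and the sample text are hypothetical. The sample only approximates what `vulkaninfo --text` prints (an assumption; real output may differ by driver and tool version).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample is an illustrative, hand-written approximation of the memory
// section of vulkaninfo text output. It is an assumption, not captured output.
const sample = `memoryHeaps: count = 2
	memoryHeaps[0]:
		size = 8589934592 (0x200000000) (8.00 GiB)
		flags: count = 1
			MEMORY_HEAP_DEVICE_LOCAL_BIT
	memoryHeaps[1]:
		size = 17179869184 (0x400000000) (16.00 GiB)
		flags:
			None
memoryTypes: count = 1
	memoryTypes[0]:
		heapIndex = 0
`

// parseVulkanValue returns the trimmed text to the right of the first "=".
func parseVulkanValue(line string) string {
	_, value, ok := strings.Cut(line, "=")
	if !ok {
		return ""
	}
	return strings.TrimSpace(value)
}

// parseVulkanUintValue parses the first whitespace-separated field of the
// value. Base 0 lets strconv accept both decimal and 0x-prefixed hex.
func parseVulkanUintValue(line string) (uint64, bool) {
	fields := strings.Fields(parseVulkanValue(line))
	if len(fields) == 0 {
		return 0, false
	}
	parsed, err := strconv.ParseUint(fields[0], 0, 64)
	if err != nil {
		return 0, false
	}
	return parsed, true
}

// sumDeviceLocalHeaps is a hypothetical, single-GPU simplification of the
// parser in this change: it adds up the size of every heap whose flags
// include MEMORY_HEAP_DEVICE_LOCAL_BIT.
func sumDeviceLocalHeaps(text string) uint64 {
	var total, heapSize uint64
	inHeap, deviceLocal := false, false
	// flush commits the heap collected so far and resets the per-heap state.
	flush := func() {
		if inHeap && deviceLocal {
			total += heapSize
		}
		heapSize, inHeap, deviceLocal = 0, false, false
	}
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case strings.HasPrefix(line, "memoryHeaps["):
			flush()
			inHeap = true
		case strings.HasPrefix(line, "memoryTypes:"):
			flush()
		case inHeap && strings.HasPrefix(line, "size"):
			if size, ok := parseVulkanUintValue(line); ok {
				heapSize = size
			}
		case inHeap && strings.Contains(line, "MEMORY_HEAP_DEVICE_LOCAL_BIT"):
			deviceLocal = true
		}
	}
	flush()
	return total
}

func main() {
	// Only heap 0 is device-local in the sample, so only its size is counted.
	fmt.Println(sumDeviceLocalHeaps(sample))
}
```

Since the parser in this change takes an io.Reader, a captured vulkaninfo dump could presumably be fed to it from a unit test in the same way.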

scripts/build/ds4-darwin.sh Executable file
@@ -0,0 +1,51 @@
#!/bin/bash
# Darwin/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh:
# native make, otool -L for dylib bundling, then assemble an OCI tar that
# `local-ai backends install` can consume.
set -ex
IMAGE_NAME="${IMAGE_NAME:-localai/ds4-darwin}"
pushd backend/cpp/ds4
make NATIVE=false grpc-server package
popd
mkdir -p build/darwin
mkdir -p build/darwin/lib
mkdir -p backend-images
cp -rf backend/cpp/ds4/grpc-server build/darwin/
cp -rf backend/cpp/ds4/run.sh build/darwin/
# Apple Silicon: pick up Homebrew-installed protobuf utf8_validity if present.
if [[ "$(uname -s)" == "Darwin" && "$(uname -m)" == "arm64" ]]; then
ADDITIONAL_LIBS=${ADDITIONAL_LIBS:-$(ls /opt/homebrew/Cellar/protobuf/**/lib/libutf8_validity*.dylib 2>/dev/null)}
else
ADDITIONAL_LIBS=${ADDITIONAL_LIBS:-""}
fi
for file in $ADDITIONAL_LIBS; do
cp -rfv "$file" build/darwin/lib
done
# Walk dylibs via otool -L and bundle anything that isn't a system framework.
for file in build/darwin/grpc-server; do
LIBS="$(otool -L "$file" | awk 'NR > 1 { system("echo " $1) } ' | xargs echo)"
for lib in $LIBS; do
if [[ "$lib" == *.dylib ]] && [[ -e "$lib" ]]; then
cp -rvf "$lib" build/darwin/lib
fi
done
done
echo "Bundled libraries:"
ls -la build/darwin/lib
# Build an OCI tar that local-ai backends install can consume.
# scripts/build/oci-pack.sh is the existing helper used by llama-cpp-darwin
# - if your tree doesn't have it, write one (5 lines: tar + manifest.json).
if [ -f scripts/build/oci-pack.sh ]; then
bash scripts/build/oci-pack.sh build/darwin backend-images/ds4.tar "$IMAGE_NAME"
else
# Fallback: simple tar - local-ai accepts a flat tar in dev environments.
tar -C build/darwin -cvf backend-images/ds4.tar .
fi

@@ -32,6 +32,9 @@ function inferBackendPath(item) {
// via a thin wrapper Makefile. Changes to either dir should retrigger it.
return `backend/cpp/turboquant/`;
}
if (item.dockerfile.endsWith("ds4")) {
return `backend/cpp/ds4/`;
}
if (item.dockerfile.endsWith("llama-cpp")) {
return `backend/cpp/llama-cpp/`;
}
@@ -128,11 +131,15 @@ async function getChangedFilesForPush(event) {
return res.data.files.map(f => f.filename);
}
// Group filtered linux matrix entries by tag-suffix and emit a merge-matrix
// entry for any tag-suffix that appears 2+ times. That's the trigger for
// "this backend has multiple per-arch legs and we need a manifest list".
// Singletons aren't merged — single-arch backends push by digest and don't
// need a manifest list assembled across legs.
// Group matrix entries by tag-suffix and emit a merge-matrix entry per group.
// Both multi-leg groups (per-arch fan-out) and singletons get one entry each:
// the build job pushes by digest only with no tags applied, so every backend
// needs a downstream merge step to apply its tags via `imagetools create`,
// regardless of how many per-arch legs feed it. Callers split entries by
// arch class first (see splitByArch) and call this once per class so the
// resulting matrices can be wired to merge jobs that `needs:` only their
// corresponding build matrix — preventing slow single-arch builds from
// gating multi-arch merges (the bug fixed in PR #9746).
function computeMergeMatrix(entries) {
const groups = new Map();
for (const item of entries) {
@@ -143,7 +150,6 @@ function computeMergeMatrix(entries) {
}
const include = [];
for (const [tagSuffix, group] of groups) {
if (group.length < 2) continue;
// tag-latest must agree across legs — they're going to publish under
// the same final tag, so disagreeing on whether it's also the :latest
// tag is an authoring bug. Warn loudly so a Task 2.5 fan-out typo is
@@ -177,17 +183,21 @@ function splitByArch(entries) {
function emitFullMatrix() {
const { multiarch, singlearch } = splitByArch(includes);
const mergeMatrix = computeMergeMatrix(includes);
const hasMerges = mergeMatrix.include.length > 0 ? 'true' : 'false';
const mergeMatrixMultiarch = computeMergeMatrix(multiarch);
const mergeMatrixSinglearch = computeMergeMatrix(singlearch);
const hasMergesMultiarch = mergeMatrixMultiarch.include.length > 0 ? 'true' : 'false';
const hasMergesSinglearch = mergeMatrixSinglearch.include.length > 0 ? 'true' : 'false';
fs.appendFileSync(process.env.GITHUB_OUTPUT, `run-all=true\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-singlearch=${singlearch.length > 0 ? 'true' : 'false'}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-multiarch=${multiarch.length > 0 ? 'true' : 'false'}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-darwin=true\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges=${hasMerges}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges-multiarch=${hasMergesMultiarch}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges-singlearch=${hasMergesSinglearch}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-singlearch=${JSON.stringify({ include: singlearch })}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-multiarch=${JSON.stringify({ include: multiarch })}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-darwin=${JSON.stringify({ include: includesDarwin })}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix=${JSON.stringify(mergeMatrix)}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix-multiarch=${JSON.stringify(mergeMatrixMultiarch)}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix-singlearch=${JSON.stringify(mergeMatrixSinglearch)}\n`);
for (const backend of allBackendPaths.keys()) {
fs.appendFileSync(process.env.GITHUB_OUTPUT, `${backend}=true\n`);
}
@@ -218,18 +228,22 @@ function emitFilteredMatrix(changedFiles) {
console.log("Has multi-arch backends?:", hasBackendsMultiarch);
console.log("Has Darwin backends?:", hasBackendsDarwin);
const mergeMatrix = computeMergeMatrix(filtered);
const hasMerges = mergeMatrix.include.length > 0 ? 'true' : 'false';
const mergeMatrixMultiarch = computeMergeMatrix(multiarch);
const mergeMatrixSinglearch = computeMergeMatrix(singlearch);
const hasMergesMultiarch = mergeMatrixMultiarch.include.length > 0 ? 'true' : 'false';
const hasMergesSinglearch = mergeMatrixSinglearch.include.length > 0 ? 'true' : 'false';
fs.appendFileSync(process.env.GITHUB_OUTPUT, `run-all=false\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-singlearch=${hasBackendsSinglearch}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-multiarch=${hasBackendsMultiarch}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-darwin=${hasBackendsDarwin}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges=${hasMerges}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges-multiarch=${hasMergesMultiarch}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges-singlearch=${hasMergesSinglearch}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-singlearch=${JSON.stringify({ include: singlearch })}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-multiarch=${JSON.stringify({ include: multiarch })}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-darwin=${JSON.stringify({ include: filteredDarwin })}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix=${JSON.stringify(mergeMatrix)}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix-multiarch=${JSON.stringify(mergeMatrixMultiarch)}\n`);
fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix-singlearch=${JSON.stringify(mergeMatrixSinglearch)}\n`);
// Per-backend boolean outputs
for (const [backend, pathPrefix] of allBackendPaths) {
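
Reviewer note: the grouping rule described in the new computeMergeMatrix comment (one merge entry per tag-suffix, singletons included) can be sketched as follows. This is a Go transliteration for illustration only. The real script is JavaScript, and the entry and mergeEntry types, their field names, and the sample tag suffixes are all hypothetical.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for one build-matrix item.
type entry struct {
	TagSuffix string
	Platform  string
}

// mergeEntry is a hypothetical stand-in for one merge-matrix item.
type mergeEntry struct {
	TagSuffix string
	Legs      int // number of build legs feeding this merge
}

// sampleEntries has one two-leg group and one singleton.
var sampleEntries = []entry{
	{TagSuffix: "-cpu-llama-cpp", Platform: "linux/amd64"},
	{TagSuffix: "-cpu-llama-cpp", Platform: "linux/arm64"},
	{TagSuffix: "-cpu-ds4", Platform: "linux/amd64"},
}

// computeMergeMatrix groups entries by tag suffix and emits one merge entry
// per group. Singletons are kept: with builds pushed by digest only, every
// backend still needs a merge step to apply its tags.
func computeMergeMatrix(entries []entry) []mergeEntry {
	counts := map[string]int{}
	var order []string // first-seen order, so output is deterministic
	for _, e := range entries {
		if _, seen := counts[e.TagSuffix]; !seen {
			order = append(order, e.TagSuffix)
		}
		counts[e.TagSuffix]++
	}
	out := make([]mergeEntry, 0, len(order))
	for _, suffix := range order {
		out = append(out, mergeEntry{TagSuffix: suffix, Legs: counts[suffix]})
	}
	return out
}

func main() {
	for _, m := range computeMergeMatrix(sampleEntries) {
		fmt.Printf("%s legs=%d\n", m.TagSuffix, m.Legs)
	}
}
```

In the change itself the function is called once per arch class (after splitByArch), so each resulting matrix can be wired to a merge job that depends only on its own build matrix.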

@@ -194,7 +194,18 @@ var _ = Describe("Backend container", Ordered, func() {
BeforeAll(func() {
image := os.Getenv("BACKEND_IMAGE")
Expect(image).NotTo(BeEmpty(), "BACKEND_IMAGE env var must be set (e.g. local-ai-backend:llama-cpp)")
// BACKEND_BINARY is an escape hatch for hardware-gated backends (e.g. ds4)
// where building a full Docker image around an 80+ GB model is impractical.
// Points at a `run.sh` produced by `make -C backend/cpp/<name> package`.
binary := os.Getenv("BACKEND_BINARY")
Expect(image != "" || binary != "").To(BeTrue(),
"either BACKEND_IMAGE or BACKEND_BINARY env var must be set")
Expect(image != "" && binary != "").To(BeFalse(),
"BACKEND_IMAGE and BACKEND_BINARY are mutually exclusive")
if binary != "" {
Expect(filepath.Base(binary)).To(Equal("run.sh"),
"BACKEND_BINARY must point at a run.sh produced by 'make -C backend/cpp/<name> package'")
}
modelURL := os.Getenv("BACKEND_TEST_MODEL_URL")
modelFile = os.Getenv("BACKEND_TEST_MODEL_FILE")
@@ -203,7 +214,11 @@ var _ = Describe("Backend container", Ordered, func() {
"one of BACKEND_TEST_MODEL_URL, BACKEND_TEST_MODEL_FILE, or BACKEND_TEST_MODEL_NAME must be set")
caps = parseCaps()
GinkgoWriter.Printf("Testing image=%q with capabilities=%v\n", image, keys(caps))
src := image
if src == "" {
src = binary
}
GinkgoWriter.Printf("Testing src=%q with capabilities=%v\n", src, keys(caps))
prompt = os.Getenv("BACKEND_TEST_PROMPT")
if prompt == "" {
@@ -223,10 +238,13 @@ var _ = Describe("Backend container", Ordered, func() {
workDir, err = os.MkdirTemp("", "backend-e2e-*")
Expect(err).NotTo(HaveOccurred())
// Extract the image filesystem so we can run run.sh directly.
binaryDir = filepath.Join(workDir, "rootfs")
Expect(os.MkdirAll(binaryDir, 0o755)).To(Succeed())
extractImage(image, binaryDir)
if image != "" {
binaryDir = filepath.Join(workDir, "rootfs")
Expect(os.MkdirAll(binaryDir, 0o755)).To(Succeed())
extractImage(image, binaryDir)
} else {
binaryDir = filepath.Dir(binary)
}
Expect(filepath.Join(binaryDir, "run.sh")).To(BeAnExistingFile())
// Download the model once if not provided and no HF name given.