Compare commits

...

273 Commits

Author SHA1 Message Date
dependabot[bot]
2342c9348e chore(deps): bump torch
Bumps the pip group with 1 update in the /backend/python/transformers directory: torch.


Updates `torch` from 2.7.1 to 2.7.1+xpu

---
updated-dependencies:
- dependency-name: torch
  dependency-version: 2.7.1+xpu
  dependency-type: direct:production
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-06-05 23:31:07 +00:00
Copilot
352b7ec604 Harden gallery-agent Hugging Face fetches against transient rate limiting (#10187)
* Initial plan

* fix: retry HuggingFace trending fetch on transient rate limits

* fix: handle body close/write errors in huggingface retry paths

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-06-05 23:43:06 +02:00
LocalAI [bot]
ba706422fb chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.22.1 (#10188)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-05 23:42:50 +02:00
LocalAI [bot]
e837921c2c feat: forward reasoning_effort to the backend so jinja models honor it (#10184)
* feat: forward reasoning_effort to the backend so jinja models honor it

reasoning_effort was only mapped to the binary enable_thinking toggle and
otherwise reached Go-side templates — it was never sent to the backend. So
jinja-templated models whose chat template keys on reasoning_effort (gpt-oss
Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and
kept emitting <think>.

Forward the effective reasoning_effort to the backend as a chat_template_kwarg
(mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions
metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort
and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort
(request value overrides config default, none->disable / level->enable, an
operator's reasoning.disable wins). request.go now uses that helper.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): set the pipeline LLM's reasoning_effort

Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime
model is built (per-session copy, overrides the LLM's own reasoning_effort),
and surface the resolved effort on the template input so Go-templated models
get it too. jinja models receive it via the backend metadata. This lets a
realtime pipeline disable thinking on models that only honor reasoning_effort
(e.g. LFM2.5), which enable_thinking can't.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 13:45:43 +00:00
Richard Palethorpe
73385713ca feat(distributed): enforce registration token for worker file transfer (#10183)
The worker HTTP file-transfer server is authenticated by the registration
token via checkBearerToken, which fails open on an empty token: every
/v1/files, /v1/files-list and /v1/backend-logs request is then served
unauthenticated, granting read/write to the worker's models/staging/data
directories. The fail-open was also silent (the only auth log sat on the
unreachable reject branch), and the worker process never runs
DistributedConfig.Validate(), so the existing frontend warning did not
cover the component that exposes the server.

Mirror the NatsRequireAuth pattern: keep anonymous as the default but make
it loud and opt-in enforceable.

- Log a prominent warning when the file-transfer server starts tokenless.
- Add LOCALAI_REGISTRATION_REQUIRE_AUTH: DistributedConfig.Validate() errors
  on an empty token (frontend) and the worker refuses to start (fail-fast,
  before registration), so production can fail closed. Also satisfies the
  F-003 suggestion to fail Validate() on distributed + empty token.
- Add LOCALAI_DISTRIBUTED_REQUIRE_AUTH umbrella switch implying both
  RegistrationRequireAuth and NatsRequireAuth — one production knob locking
  down the registration/file-transfer layer and the NATS bus together; the
  granular flags remain available as single-layer overrides. Wired into the
  frontend, supervisor worker, and agent worker (vLLM worker has neither a
  NATS connection nor a file-transfer server, so it is left untouched).
- Document in distributed-mode.md (warning callout + flag tables).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-05 14:34:28 +02:00
LocalAI [bot]
a4e671779a chore: ⬆️ Update ggml-org/whisper.cpp to 99613cb720b65036237d44b52f753b51f75c2797 (#10178)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-05 09:04:25 +02:00
LocalAI [bot]
7051b2e0a1 chore: ⬆️ Update ggml-org/llama.cpp to 7c158fbb4aec1bdc9c81d6ca0e785139f4826fae (#10179)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-05 09:04:10 +02:00
LocalAI [bot]
469737101a chore: ⬆️ Update ikawrakow/ik_llama.cpp to 1520eda980564241434b791ce2bbbd128c4be9ea (#10180)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-05 09:03:08 +02:00
LocalAI [bot]
858257eaf0 fix(distributed): self-heal stale 'model not loaded' routing (#10181)
* fix(distributed): self-heal stale 'model not loaded' routing

In distributed mode the registry can list a model as loaded on a node
while the worker has evicted it (autonomous LRU eviction, an out-of-band
unload, etc.) yet the backend process survives. The router's cached-node
check only verifies the process is alive (probeHealth), so it routes there
and inference fails with "<backend>: model not loaded" — and stays broken
until the controller restarts and rebuilds its registry.

InFlightTrackingClient now reconciles this: when a tracked inference call
returns a model-not-loaded error, it drops the stale replica row
(RemoveNodeModel) so the next request reloads the model on a healthy node
instead of routing back to the evicted one. The original error is returned
unchanged; only the registry is corrected.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed): typed model-not-loaded error via gRPC status code

Replace the controller-side error-string match with a shared, code-aware
helper. Go error types don't survive the gRPC boundary, so the signal is
carried as a status code (FailedPrecondition):

- pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor +
  IsModelNotLoaded(err) checker (status-code first, message fallback for
  backends not yet migrated).
- InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded.
- Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy,
  rfdetr-cpp) to the typed constructor.

Acting on a false positive is harmless (the model is just reloaded).

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 09:01:36 +02:00
Adira
ef80a0e825 fix(config): add face/speaker recognition constants and register insightface + speaker-recognition (#10110)
FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION already existed as
ModelConfigUsecase bitmask flags, and GuessUsecases already gate-checks
both backends by name — but BackendCapabilities had no entries for
either, so the UI could not classify them.

Also missing were the Method* constants for the five proto-defined RPCs
these backends implement (FaceVerify, FaceAnalyze, VoiceVerify,
VoiceEmbed, VoiceAnalyze) and the corresponding Usecase* strings
and UsecaseInfoMap entries needed to wire them into the rest of the
capability system.

Changes:
- Add MethodFaceVerify, MethodFaceAnalyze, MethodVoiceVerify,
  MethodVoiceEmbed, MethodVoiceAnalyze GRPCMethod constants
- Add UsecaseFaceRecognition ("face_recognition") and
  UsecaseSpeakerRecognition ("speaker_recognition") Usecase constants
- Add UsecaseInfoMap entries for both new usecases, referencing the
  existing FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION flags
- Register insightface: Embedding + Detect + FaceVerify + FaceAnalyze
- Register speaker-recognition: VoiceVerify + VoiceEmbed + VoiceAnalyze

Follows up on #10107 which left these two out because they needed new
constants first.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
2026-06-04 21:48:01 +02:00
LocalAI [bot]
92726f7631 fix(distributed): stage directory-based models to remote nodes (#10175)
Distributed file-staging treated every model path field (ModelFile, etc.)
as a single regular file: it os.Open'd the path and streamed its fd as the
HTTP PUT body. For directory-based models — e.g. qwen3-tts-cpp, whose
weights and tokenizer ggufs live under one directory referenced by
parameters.model — opening the directory succeeds but reading its fd
returns EISDIR, so routing the model to a remote NATS worker failed with
"read /models/<model>: is a directory". Single-file models were unaffected,
so only multi-file pipelines (e.g. the realtime TTS stage) broke.

stageModelFiles now detects a directory path field and stages each
contained file individually (via the new stageDirectory helper), preserving
structure with the existing StagingKeyMapper and rewriting the field to the
remote directory (deriving ModelPath as before). countStageableFiles makes
the progress total count a directory's files so the staging tracker stays
accurate.

Assisted-by: Claude:claude-opus-4-8 go vet

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 18:05:38 +02:00
LocalAI [bot]
994063ba9a feat(qwen3-tts-cpp): normalize request language for flexible matching (#10174)
The qwen3-tts.cpp backend honored the request `language` field only via exact lowercase two-letter codes in the C++ language_to_id table, silently defaulting to English for anything else (en-US, EN, english, ...).

Add normalizeLanguage() in the Go handler: lowercase + trim, strip the region/locale suffix (en-US, pt_BR, zh-Hans -> en/pt/zh), and resolve common English full names (english -> en). The canonical codes match the existing C++ table, so no C++ change is needed. Covered by a pure-Go Ginkgo spec. Also document the language field and accepted forms under the Qwen3-TTS docs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 17:26:31 +02:00
LocalAI [bot]
c1a55cf72d chore: ⬆️ Update mudler/parakeet.cpp to b11fe5bca78ad8b342dd559a43d76df3984bb447 (#10167)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 12:07:09 +02:00
LocalAI [bot]
96758841d8 chore: ⬆️ Update predict-woo/qwen3-tts.cpp to 136e5d36c17083da0321fd96512dc7b263f94a44 (#10165)
⬆️ Update predict-woo/qwen3-tts.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 12:06:55 +02:00
LocalAI [bot]
7a59260621 chore: ⬆️ Update CrispStrobe/CrispASR to 13d54e110e1538e0f0bc3af0680b9ab246cfb48d (#10145)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 12:06:32 +02:00
LocalAI [bot]
27e63b9a78 feat(tts): support per-request instructions and params (#10172)
The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it
was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest
nor the gRPC TTSRequest proto carried it, so backends could only read such a
value from static YAML options (identical for every request). This blocked
per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config
to a single designed voice.

Plumb a generic per-request instruction string end to end, plus an optional
backend-specific params map:

- proto: add `optional string instructions` and `map<string,string> params`
  to TTSRequest.
- schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI
  extension) to schema.TTSRequest.
- core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper
  that attaches instructions only when non-empty (so backends can fall back to
  YAML when unset); forward them from the /v1/audio/speech handler.
- qwen-tts: prefer the per-request instruction over the YAML `instruct` option
  (used by both mode detection and generation) and merge per-request params.
- chatterbox: merge per-request params (coerced to float/int/bool) over YAML
  options into generate() kwargs.

Fully backward compatible: empty instructions fall back to the YAML option and
backends that don't support style/voice instructions ignore the field.

Closes #10164


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 11:45:02 +02:00
LocalAI [bot]
55c0911c23 chore: ⬆️ Update leejet/stable-diffusion.cpp to 1f9ee88e09c258053fa59d5e05e23dfb10fa0b13 (#10166)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 09:34:34 +02:00
LocalAI [bot]
f6cb6ab6d9 chore: ⬆️ Update ggml-org/llama.cpp to 94a220cd6745e6e3f8de62870b66fd5b9bc92700 (#10168)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 09:34:13 +02:00
LocalAI [bot]
9f11b09c6a chore(model-gallery): ⬆️ update checksum (#10169)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 00:32:15 +02:00
LocalAI [bot]
a5c4f822f0 chore: ⬆️ Update antirez/ds4 to 477c0e82e2699b35a65fd0a1ed6fe66b41087dfe (#10142)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 19:45:23 +02:00
LocalAI [bot]
fb36c262fe chore(model gallery): 🤖 add 1 new models via gallery agent (#10163)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 19:44:51 +02:00
LocalAI [bot]
0e4e8980e6 chore: ⬆️ Update ggml-org/llama.cpp to 5c394fdc8b564eff6faacc50a139529d875f0e36 (#10143)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 19:44:21 +02:00
Richard Palethorpe
3a932a9803 feat(distributed): Add NATS JWT authentication and TLS/mTLS options (#10159)
* feat(distributed): NATS JWT auth, TLS/mTLS options, and e2e coverage

Mint per-node NATS user JWTs at registration when LOCALAI_NATS_ACCOUNT_SEED
is set, and connect workers with scoped credentials from the register response.
Add optional LOCALAI_NATS_TLS_CA/CERT/KEY for private CA and mTLS alongside
tls:// URLs, plus test-e2e-distributed and NatsJWT container e2e specs.

Document JWT setup (nats-auth-setup.sh) and TLS env vars in distributed-mode.

Assisted-by: Grok:grok grok-build
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(distributed): correct NATS JWT scoping and harden client auth

The JWT-auth path added in 46467cc7 had several gaps that fail silently
under LOCALAI_NATS_REQUIRE_AUTH:

- Agent-worker minted JWTs did not allow the subjects the agent worker
  actually subscribes to (jobs.mcp-ci.new and nodes.<id>.backend.stop),
  so MCP-CI jobs and backend-stop session cleanup were silently dropped.
  Scope the agent permission set to those subjects.
- NATS subscription permission violations were swallowed (Subscribe
  returned a live-but-dead subscription). Confirm subscriptions with a
  server round-trip so a denial surfaces synchronously, and log async
  permission errors.
- The backend worker connected anonymously when given a JWT without its
  paired seed; reject the unpaired credential instead.
- The documented service-user permissions in nats-auth-setup.sh omitted
  prefixcache.>, which the frontend publishes and subscribes; add it.

Also: add a credential-provider hook to the messaging client (consumed by
the follow-up credential-lifecycle change), drop the always-nil error from
NatsMessagingOptions, run go mod tidy (jwt/v2 and nkeys are now direct),
and gofmt the feature's files.

Tests: an agent-JWT e2e spec that connects to the enforcing NATS server
and exercises every subscription the agent worker makes, plus permission
allow-list coverage unit tests.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(distributed): acquire and auto-refresh worker NATS credentials

Workers fetched NATS credentials once at startup, which broke two cases
under JWT auth: a worker that registered while still pending admin
approval never received a minted JWT (it connected unauthenticated and
gave up), and a long-running worker's 24h JWT expired with no way to renew
it.

Introduce workerregistry.NATSCredentialManager, built on idempotent
re-registration (the frontend preserves the node row and mints a fresh JWT
each call):

- Acquire re-registers through admin approval until the node is approved
  and credentials are minted (or returns the first success when auth is
  not required, preserving anonymous-NATS behavior).
- RefreshLoop re-registers before the JWT expires (~75% of its lifetime),
  updating the credentials served to the connection.
- Both are bounded (default 100 attempts / consecutive failures) and
  return an error on exhaustion, so an unapprovable or unrenewable worker
  exits non-zero and surfaces the problem instead of hanging or drifting
  toward an expired credential.

The messaging client gains WithUserJWTProvider, fetching credentials on
each (re)connect so the connection transparently adopts a refreshed JWT
when the server expires the old one. RegisterFull exposes the approval
status and full response; Register delegates to it.

Both the backend worker and the agent worker are wired to this: explicit
env credentials are used as-is, minted credentials are acquired-with-wait
and refreshed, and a permanent refresh failure shuts the worker down so it
restarts and re-acquires.

Tests cover Acquire (wait-through-pending, bounded give-up, context
cancel), RefreshLoop (refresh-before-expiry, bounded failure, no-expiry
exit) and jwtExpiry decoding. Docs updated in distributed-mode.md.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-03 19:43:56 +02:00
LocalAI [bot]
9d10418593 fix(parakeet-cpp): convert audio before the non-batched transcribe path (#10161)
The direct (non-batched) transcription path handed the original upload
path straight to the C library via parakeet_capi_transcribe_path_json.
That loader only understands 16 kHz mono WAV/PCM, so any other format
(MP3, etc.) failed with "parakeet: failed to load audio: <file>".

Only the batched path converted the input (via decodeWavMono16k ->
utils.AudioToWav). Every other audio backend (whisper, crispasr)
converts unconditionally with utils.AudioToWav before handing the file
to its engine; the parakeet-cpp fallback was the lone exception.

Extract a convertToWavMono16k helper (reused by decodeWavMono16k) that
produces a 16 kHz mono WAV in a temp dir, and run the non-batched path
through it before calling the C loader. WAV inputs already in the target
format are passed through without ffmpeg.

Add specs covering the helper (decodable copy + cleanup, and an error on
a missing input) that need neither the model, the C library, nor ffmpeg.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-03 15:06:57 +02:00
dependabot[bot]
5470051d4d chore(deps): bump grpcio from 1.80.0 to 1.81.0 in /backend/python/transformers (#10158)
chore(deps): bump grpcio in /backend/python/transformers

Bumps [grpcio](https://github.com/grpc/grpc) from 1.80.0 to 1.81.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.80.0...v1.81.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.81.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:38:43 +02:00
LocalAI [bot]
68c5eeebc3 chore: ⬆️ Update ggml-org/whisper.cpp to 610e664ba7cfe3af46125ed1b5a1184fccb51bcd (#10140)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 10:38:28 +02:00
dependabot[bot]
1531fabe23 chore(deps): bump securego/gosec from 2.22.9 to 2.27.1 (#10147)
Bumps [securego/gosec](https://github.com/securego/gosec) from 2.22.9 to 2.27.1.
- [Release notes](https://github.com/securego/gosec/releases)
- [Commits](https://github.com/securego/gosec/compare/v2.22.9...v2.27.1)

---
updated-dependencies:
- dependency-name: securego/gosec
  dependency-version: 2.27.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:38:07 +02:00
LocalAI [bot]
b7673d5b76 chore: ⬆️ Update leejet/stable-diffusion.cpp to 2d40a8b2adcdf8b5b0ca0535f3bb7801b6ba13e5 (#10144)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 10:37:51 +02:00
dependabot[bot]
b64bdaf406 chore(deps): bump github.com/google/go-containerregistry from 0.21.5 to 0.21.6 (#10149)
chore(deps): bump github.com/google/go-containerregistry

Bumps [github.com/google/go-containerregistry](https://github.com/google/go-containerregistry) from 0.21.5 to 0.21.6.
- [Release notes](https://github.com/google/go-containerregistry/releases)
- [Commits](https://github.com/google/go-containerregistry/compare/v0.21.5...v0.21.6)

---
updated-dependencies:
- dependency-name: github.com/google/go-containerregistry
  dependency-version: 0.21.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:37:33 +02:00
dependabot[bot]
eebf08ff1d chore(deps): bump grpcio from 1.80.0 to 1.81.0 in /backend/python/vllm (#10157)
Bumps [grpcio](https://github.com/grpc/grpc) from 1.80.0 to 1.81.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.80.0...v1.81.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.81.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:37:16 +02:00
dependabot[bot]
42e51894c3 chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.65.0 to 0.66.0 (#10151)
chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus

Bumps [go.opentelemetry.io/otel/exporters/prometheus](https://github.com/open-telemetry/opentelemetry-go) from 0.65.0 to 0.66.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/open-telemetry/opentelemetry-go/compare/exporters/prometheus/v0.65.0...metric/x/v0.66.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/exporters/prometheus
  dependency-version: 0.66.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 09:14:42 +02:00
LocalAI [bot]
d9ae6481fb chore: ⬆️ Update mudler/parakeet.cpp to 9edf17c3ada66e0f881dcff155492867db7ac4cf (#10141)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 08:49:47 +02:00
dependabot[bot]
f1c495a748 chore(deps): bump github.com/mudler/edgevpn from 0.32.2 to 0.34.0 (#10153)
Bumps [github.com/mudler/edgevpn](https://github.com/mudler/edgevpn) from 0.32.2 to 0.34.0.
- [Release notes](https://github.com/mudler/edgevpn/releases)
- [Commits](https://github.com/mudler/edgevpn/compare/v0.32.2...v0.34.0)

---
updated-dependencies:
- dependency-name: github.com/mudler/edgevpn
  dependency-version: 0.34.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 08:34:16 +02:00
LocalAI [bot]
415b561947 docs: fix distributed-mode diagram (workers use NATS, not PostgreSQL) (#10138)
docs: fix distributed-mode diagram - workers coordinate via NATS, not PostgreSQL

The architecture diagram drew the worker-bound arrows from the PostgreSQL area of the control plane, implying workers connect to PostgreSQL. They do not: PostgreSQL is the frontends shared state, while workers coordinate over NATS (backend.install events) and receive LoadModel over gRPC from a frontend. Re-route the worker arrows to originate from the NATS chip.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 22:05:33 +02:00
Ettore Di Giacinto
e6a0d4c375 Remove diagram from distributed mode documentation
Removed ASCII diagram of distributed mode architecture from the documentation.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-06-02 18:48:12 +02:00
LocalAI [bot]
7e59a5c7c5 docs: architecture & feature diagrams (blueprint style) (#10137)
* docs: add 'how LocalAI works' architecture diagram

Add a blueprint-style architecture diagram: clients -> small core (API,
router, WebUI, agents) -> gRPC -> backend processes pulled on demand as
OCI images. Place it on the overview page and replace the stale external
architecture image on the reference page.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: add blueprint diagrams across feature, distributed & getting-started docs

Add 24 architecture/flow/comparison diagrams (PNG + HTML source) under
docs/static/images/diagrams/, wired into their docs pages, from an
impact-vs-effort audit of the docs. Broaden the API surface on the
overview architecture diagram (OpenAI, Anthropic, ElevenLabs, Ollama,
and LocalAI's own API) and move the gRPC boundary label clear of the arrows.

Pages: distributed mode (architecture, scheduling, ds4 layer-split),
distributed inferencing, MLX, realtime, quantization, MCP, agents,
mitm & cloud proxy, middleware, reverse-proxy TLS, VRAM, voice & face
recognition, reranker, function calling, fine-tuning (recipe + jobs),
diarization, audio transform, quickstart, model resolution.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: add composable-core diagram to README hero

Commit the composable-core card (small core + on-demand backend tiles)
alongside the other diagrams and reference it from the README hero via a
repo-relative path, so it renders on GitHub.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: fix composable-core connectors/badge and federated-vs-worker layout

- composable-core: thicken the plug-in connectors so they read clearly, and
  widen the SEPARATE IMAGE badge so its text no longer overflows the box.
- federated-vs-worker: shorten the WHOLE/SPLIT REQUEST pills to fit, and
  replace the tangled node-to-node activation arrows with a clean fan-out
  (request split across all sharded nodes), mirroring the federated panel.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 18:43:22 +02:00
LocalAI [bot]
aea954a482 docs: position LocalAI as a composable engine, not a bundle (#10136)
Reframe the README hero and docs (homepage, overview, FAQ) around the
composable architecture: a small core, with backends built as dedicated
gRPC services around best-in-class engines, shipped as separate OCI
images and pulled on demand. Lead from strength: drop the "36+ backends"
kitchen-sink framing and the "All-in-One Complete AI Stack" / "single
binary that gives you everything" lines that read as a monolith.

- README: small-core differentiator; composable + open/extensible bullets
- _index.md: composable tagline; install only what you use
- overview.md: core vs on-demand backends; gRPC/OCI mechanics as benefits;
  bring-your-own model and backend
- faq.md: "Do I need to install all the backends?" and
  "Can I bring my own model or backend?"

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 17:34:43 +02:00
Ettore Di Giacinto
595e448714 docs(llama.cpp): note tensor split now works with quantized KV cache (#10135)
The split_mode: tensor description claimed tensor parallelism requires
KV-cache quantization to be disabled. ggml-org/llama.cpp#23792 lifts that
restriction by extending the meta backend to preserve shape information
through KV-cache flatten/reshape, so cache_type_k/cache_type_v
quantization can be combined with -sm tensor on builds that include it.

Documentation only: no backend code, grpc-server.cpp comment, or
llama.cpp pin changes.


Assisted-by: Claude Code:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 15:52:23 +02:00
LocalAI [bot]
860f9d63ad feat(parakeet-cpp): dynamic batching for concurrent transcription requests (#10112)
* feat(parakeet-cpp): dynamic-batching scheduler (queue + dispatcher)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): dynamic batching for AudioTranscription via batched JSON C-API

Drop SingleThread; route unary transcription through the in-process batcher
which coalesces concurrent requests into one batched engine call. Streaming
stays mutually exclusive via engineMu. Adds batch_max_size / batch_max_wait_ms
options (size=1 disables; recommended on CPU).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): tear down dispatcher in Free; log batch config; preallocate; clarify stream lock

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): Ginkgo batcher tests; optional batch C-API binding with per-request fallback

The batched JSON C-API symbol exists only in newer libparakeet.so (ABI >= 2);
probe it with Dlsym and register optionally so the backend still loads against
an older library, falling back to per-request transcription. Rewrites the
batcher unit tests as Ginkgo/Gomega specs (forbidigo bans t.Fatal in tests).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): debug-log coalesced batch size in runBatch

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): default batch_max_size to 1 (batching opt-in)

Dynamic batching now defaults off (batch_max_size:1, one request at a
time). Raise batch_max_size to opt in: it is a large throughput win on
GPU under concurrent load, but on CPU and low-concurrency setups it only
adds latency, so off is the safer default. The startup log now states
whether batching is on or off, and the audio-to-text docs are updated to
match.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(parakeet-cpp): bump parakeet.cpp to 8a7c482 (batched decode + B=1 fast-path)

parakeet.cpp PR #1 merged the batched encoder/decode and the B=1 encoder
fast-path to master. Point PARAKEET_VERSION at that commit so the backend
builds the batched C-API (parakeet_capi_transcribe_pcm_batch_json) that the
dynamic batcher calls; the prior pin (30a3075) predated it, so only the
per-request fallback path was exercised. Verified the shared lib builds with
the backend's CMake flags and exports the batch symbol.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 14:49:02 +02:00
LocalAI [bot]
a5a0b3dc4e chore: ⬆️ Update CrispStrobe/CrispASR to 05e60432bcb5bc2113f8c395a41e86497c11504a (#10115)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 14:48:47 +02:00
番茄摔成番茄酱
94eca04c60 fix(nemo): pin texterrors to 1.1.6 for GLIBCXX compatibility (#10134)
Pin texterrors==1.1.6 before nemo_toolkit[asr] in requirements-cublas13.txt.

The texterrors package (a NeMo transitive dependency) contains a compiled
C++ extension (texterrors_align.so) that may be built from source during
OCI image creation. When built on systems with GCC 14+ (e.g. Ubuntu 24.04),
the resulting binary requires GLIBCXX_3.4.32, which is not available in
the default LocalAI container (Ubuntu 22.04, GLIBCXX up to 3.4.30).

Pinning to 1.1.6 (the latest release) ensures:
- Reproducible builds across environments
- pip resolves the pre-built manylinux2014 wheel (needs only GLIBCXX_3.4.11)
  instead of potentially building from source with a newer toolchain

Fixes #10056

Signed-off-by: 番茄摔成番茄酱 <fqscfqj@outlook.com>
2026-06-02 14:48:27 +02:00
LocalAI [bot]
35bd485d6a chore: ⬆️ Update ggml-org/llama.cpp to 5dcb71166686799f0d873eab7386234302d05ecf (#10128)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 09:06:35 +02:00
LocalAI [bot]
1fe96f8d9a chore: ⬆️ Update mudler/parakeet.cpp to 8a7c48209d7882a7ce79a6b306270e4703194543 (#10129)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 09:06:19 +02:00
LocalAI [bot]
c508e9d7c6 chore: ⬆️ Update leejet/stable-diffusion.cpp to 7948df8ac1070f5f6881b8d34675821893eb97d6 (#10127)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 09:06:03 +02:00
LocalAI [bot]
55e754fd05 chore: ⬆️ Update ggml-org/whisper.cpp to 23ee03506a91ac3d3f0071b40e66a430eebdfa1d (#10130)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 01:43:03 +02:00
LocalAI [bot]
a17753f7d1 chore(model-gallery): ⬆️ update checksum (#10131)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 23:39:47 +02:00
Zhao73
c61838dba6 docs: fix documentation typos (#10125)
Correct clear spelling mistakes in documentation without changing behavior.

Confidence: high
Scope-risk: narrow
Tested: git diff --check; uvx codespell on changed files
Not-tested: Full docs build not run; text-only changes
Assisted-by: Codex:gpt-5 codespell
2026-06-01 14:31:08 +02:00
LocalAI [bot]
7013e13f05 chore: ⬆️ Update ggml-org/llama.cpp to 399739d5c5978351f39e3454bfbfbab4f369088f (#10119)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 14:24:51 +02:00
Richard Palethorpe
5a0013defe test(react-ui): add page render-smoke specs, reset the coverage gate (#10122)
The UI coverage gate was tightened to 0.1pp against a fast-local
measurement (39.86% baseline); CI's slower runners measure ~0.9pp lower,
so tests-ui-e2e failed there. UI e2e coverage is diffusely
non-deterministic and tracks machine speed — a 0.1pp band can't hold
across environments.

Rather than loosen the gate, raise the floor under it: a render-smoke
spec mounts each lazy page (navigate + assert the header renders),
covering a dozen previously-untested pages and lifting coverage from
~39% to ~42.7% locally. Restore the tolerance to 0.8pp and set the
baseline conservatively (40.0), below the slow-CI floor, so the ratchet
holds without flapping.

Document the coverage policy — install the git hooks and don't bypass
them (no --no-verify, no hand-lowering the baseline or widening the
tolerance); raise coverage by adding tests instead; set the UI baseline
below the slow-CI floor — in AGENTS.md, CONTRIBUTING.md and
.agents/building-and-testing.md.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-01 14:24:36 +02:00
LocalAI [bot]
c01ed631d6 refactor(routing): extract replica picker into pkg/clusterrouting (#10123)
Move ReplicaCandidate and PickBestReplica out of core/services/nodes (which depends on gorm) into a new dependency-light leaf package pkg/clusterrouting, so the p2p federation server can later share the same replica-selection policy without pulling in a database driver.

core/services/nodes keeps a type alias and a thin delegator, so every existing reference (the LoadedReplicaStats interface method, the ReplicaCandidate row conversion in registry.go, and the SQL policy-mirror test) compiles and behaves unchanged. This is a pure, behavior-preserving refactor: the full nodes suite, including the policy-mirror spec that pins the SQL ORDER BY to PickBestReplica, stays green.

Assisted-by: Claude Code:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-01 09:38:55 +02:00
LocalAI [bot]
d47464cb06 docs: ⬆️ update docs version mudler/LocalAI (#10114)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 08:16:29 +02:00
LocalAI [bot]
63f176346e chore: ⬆️ Update leejet/stable-diffusion.cpp to be65ac7511b30379b003626c15224798929e33d4 (#10118)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 00:43:50 +02:00
LocalAI [bot]
af94d08729 chore: ⬆️ Update ggml-org/whisper.cpp to fe69461618ffc50ba8afa65c25cc6c6e34d4537f (#10117)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 00:43:34 +02:00
LocalAI [bot]
6795d38f50 chore: ⬆️ Update mudler/parakeet.cpp to cb45f68068081af01e7092e91b038ee353eb56be (#10116)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 23:57:15 +02:00
Richard Palethorpe
718223f33b feat(localvqe/audio): v1.3 release and add spectrograms to audio transform UI (#10113)
* chore(localvqe): update backend to v1.3, add v1.2/v1.3 gallery models

Bump the LocalVQE backend pin 72bfb4c6 -> b0f0378a, which adds the v1.2
(1.3 M) and v1.3 (4.8 M) GGUF SHA-256s to the upstream released-models
allowlist (and the arch_version=3 loader) so both load without
LOCALVQE_ALLOW_UNHASHED.

Add gallery entries for localvqe-v1.2-1.3m and localvqe-v1.3-4.8m
(SHA-256 verified against the downloaded weights) and update the
audio-transform docs to make v1.3 the current default while noting the
compact v1.1/v1.2 alternatives.

Assisted-by: Claude:claude-opus-4-8 Claude-Code
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(flake): add ffmpeg-headless to the dev shell

pkg/utils/ffmpeg_test.go shells out to the `ffmpeg` CLI, and the
pre-commit gate runs those tests via `make test-coverage`. Without
ffmpeg in the dev shell the gate fails with "executable file not found
in $PATH". The headless build provides the CLI without GUI/X deps.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(localvqe): parse WAV by walking RIFF sub-chunks

Walk the RIFF chunk list instead of assuming the canonical 44-byte
header layout. Real inputs (browser-recorded clips, ffmpeg output with
an 18/40-byte extensible `fmt ` chunk or trailing LIST/INFO metadata)
would otherwise splice header/metadata bytes into the PCM stream as an
audible impulse. Honour the `data` chunk size and validate that both
`fmt ` and `data` chunks are present.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(security-headers): allow blob: in connect-src for waveform fetch

The waveform renderer XHRs/fetches a freshly-created blob: object URL
(e.g. an uploaded or enhanced clip before it has a server URL). XHR/fetch
of blob: is governed by connect-src, not media-src, so it was blocked by
the CSP. Add blob: to connect-src.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(react-ui): add input/output spectrogram view to AudioTransform

The transform page only showed time-domain amplitude waveforms, so you
could see how loud a clip was but not which frequencies the model
touched. Add a time x frequency spectrogram heatmap and render the input
and output spectrums side by side, so it's visible which bands the
enhancement attenuates (bright input bands that go dark in the output).

Computed client-side via a Hann-windowed STFT over both clips (a small
dependency-free radix-2 FFT), defaulting to the LocalVQE 512/256 frame
geometry. This shows the net input->output spectral change; the model's
internal gain mask is not exposed by the backend.

- src/utils/fft.js            radix-2 FFT
- src/hooks/useSpectrogram.js decode + STFT -> normalised dB magnitude grid
- src/components/audio/Spectrogram.jsx  canvas heatmap (magma colormap)
- AudioTransform.jsx          dual-spectrogram panel + CSS
- e2e spec + UI coverage baseline bump (38.29 -> 39.0; measured ~39.4-40.2)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(react-ui): make UI coverage deterministic, tighten the gate

UI e2e line coverage swung ~1pp run-to-run (39.1% <-> 40.2%), which forced
a loose 0.8pp tolerance on the monotonic gate — a band wide enough to let
a real ~300-line regression through silently. The swing was a bug, not
inherent jitter: the 'Create Agent navigates' spec ended on the URL
assertion, so AgentCreate.jsx's ~400 lines were collected only when its
render happened to beat the coverage teardown.

Wait for the page to actually render (assert its heading) so those lines
are covered every run. With the race gone, repeated runs land within
~0.013pp of each other, so:

- tighten UI_COVERAGE_TOLERANCE 0.8 -> 0.1 (noise floor, not a drift band)
- set the baseline to the real, reliably-achieved value (39.0 -> 39.86)

Localised by running the V8-coverage suite repeatedly and diffing per-file
line coverage; AgentCreate.jsx was the sole ~1pp flipper.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-31 23:56:46 +02:00
LocalAI [bot]
39e050d9e2 fix(parakeet-cpp): cublas/hipblas/vulkan builds were silently CPU-only (#10120)
fix(parakeet-cpp): forward PARAKEET_GGML_* so cublas/hipblas/vulkan builds aren't silently CPU-only

parakeet.cpp gates its GGML backends behind PARAKEET_GGML_CUDA/HIP/VULKAN and
does set(GGML_CUDA ${PARAKEET_GGML_CUDA} CACHE BOOL "" FORCE), which overwrites
a bare -DGGML_CUDA=ON back to OFF. So the backend's BUILD_TYPE=cublas (and hipblas,
vulkan) produced a CPU-only libparakeet.so. Forward the PARAKEET_GGML_* options
instead. Verified on a GB10 (CUDA 13): the lib now links libcudart/libcublas and
registers the CUDA backend, vs a CPU-only lib before.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 23:56:07 +02:00
LocalAI [bot]
c222161291 feat(distributed): resumable file uploads via HTTP Content-Range (#10109)
Large model GGUFs (multi-GB) transferred between master and worker over
flaky / bandwidth-throttled paths (e.g. libp2p relays with byte caps) used
to restart from byte 0 on every transport error. This change adds standard
HTTP Range/resume semantics to the worker's PUT /v1/files/<key> endpoint
and teaches the master-side HTTPFileStager to consult the worker for the
last accepted offset and resume from there.

Server side (file_transfer_server.go):
- PUT now honors Content-Range: bytes <start>-<end>/<total>. The handler
  validates that <start> matches the current on-disk size; mismatches
  return 416 with the actual size in X-File-Size.
- Mid-upload chunks return 308 Permanent Redirect ("Resume Incomplete")
  with the new size, so the client can keep going.
- An optional X-Content-SHA256 request header binds an upload to a target
  hash; cross-attempt drift returns 409. On the final chunk the server
  re-computes SHA-256 and returns 400 if it doesn't match.
- HEAD now advertises Accept-Ranges: bytes and Content-Length, and exposes
  X-Target-SHA256 for in-progress files (so clients can resume only when
  the partial bytes belong to the file they want to upload).
- Legacy PUTs with no Content-Range keep the original truncate-create
  semantics — zero behavior change on the happy path.

Client side (file_stager_http.go):
- Pre-PUT HEAD probe reads X-File-Size + X-Target-SHA256 to determine the
  resume offset.
- doUpload seeks to that offset and sends Content-Range + X-Content-SHA256.
- Retry loop switches from fixed 3 attempts / 5s-10s-20s backoff to an
  outer time budget
  with exponential backoff (1s -> 30s cap), so a 5GB upload over a flaky
  link can outlast many short disconnects.
- 308 and 416 responses are treated as transient: the next iteration
  re-HEADs to learn the correct offset.

Tests:
- Two-chunk Content-Range round-trip produces the correct file + sidecar.
- 416 on a Content-Range/file-size mismatch.
- 409 on X-Content-SHA256 drift between chunks.
- 400 on final-hash mismatch.
- HEAD on a partial upload exposes X-Target-SHA256 (not a misleading
  hash-of-partial-bytes via X-Content-SHA256).
- Pre-existing finished file with a different hash is transparently
  overwritten when a new PUT starts at byte 0.
- End-to-end resume: EnsureRemote against a worker that already holds a
  partial file transfers only the remainder.
- Mid-stream connection drop on attempt #1 is recovered by attempt #2
  resuming from the partial offset.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 11:02:20 +00:00
LocalAI [bot]
aa80d4681b chore: ⬆️ Update ggml-org/llama.cpp to d6588daa800058dfa54f1d7ea695b1a810c8ae18 (#10093)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): skip begin-of-stream null partial in PredictStream

Upstream llama.cpp (ggml-org/llama.cpp#23884), pulled in by this bump,
now emits an initial "begin" partial whose to_json() returns null. It
exists only to signal the HTTP layer to flush 200 status headers before
any token is produced.

gRPC has no such concept, and PredictStream had no guard: the null result
was fed straight into build_reply_from_json, which threw an uncaught
exception. That surfaced as a generic "Unexpected error in RPC handling"
and the task was cancelled the instant it launched, breaking the
PredictStream e2e spec.

Skip null results in both the first-result handling and the streaming
loop, mirroring upstream's own `if (first_result_json == nullptr)` guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 10:26:03 +00:00
LocalAI [bot]
0d57957ebb feat(worker): add LOCALAI_PREFETCH_MODELS for boot-time gallery prefetch (#10108)
In LocalAI distributed mode the master streams a model GGUF to a
worker on first inference. On bandwidth-constrained cluster networks
(libp2p circuit-v2 relays under NAT, double-NAT residential, slow
overlays) that transfer can be slow or unreliable — meanwhile each
worker's outbound internet is usually fine.

LOCALAI_PREFETCH_MODELS lets the operator name gallery model IDs to
download at worker boot, BEFORE the worker subscribes to backend.install
events. Reuses gallery.InstallModelFromGallery so the on-disk /models
layout matches what the master would have pushed, and the master can
still push files on demand if the gallery is unreachable at boot
(prefetch is non-fatal on every error path).

The installer is wrapped in a function-value indirection so tests can
swap a fake without touching the real gallery; production never
reassigns the binding.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 12:22:45 +02:00
LocalAI [bot]
76fe0bb929 feat(crispasr): add CrispASR backend — multi-architecture ASR + TTS (#10099)
* feat(crispasr): backend source files (Go gRPC server, C-ABI shim, build files)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* polish(crispasr): brand error strings + fix stale shim comment

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): register backend in root Makefile

Mirror the whisper Go backend registration for the new crispasr
backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks,
BACKEND_CRISPASR definition, docker-build target generation, and the
docker-build-backends aggregate target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): add backend build matrix entries

Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64,
CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T
arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add crispasr backend gallery entries

Add the crispasr meta anchor and its full set of image gallery entries
(cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64,
L4T cuda13 arm64, plus -development variants), mirroring the whisper
backend gallery block.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow

Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in
backend/go/crispasr/Makefile.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): don't wire fixture-gated test into test-extra

Mirror the whisper Go backend: its AudioTranscription test is gated on
model/audio fixtures and skips in CI, so building crispasr (the heaviest
ggml compile in the tree) inside the unit-test lane adds a long compile
for zero coverage. The backend image build in backend-matrix.yml remains
the authoritative compile check.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): add darwin metal build entry (mirror whisper)

The metal-crispasr gallery entries and capabilities.metal mapping
reference -metal-darwin-arm64-crispasr, which is only produced by an
includeDarwin entry. Mirror whisper's darwin metal entry so the tag
actually gets built.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): place hipblas matrix entry next to whisper twin

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): register crispasr as pref-only ASR backend + test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): port whisper behavioral suite (cancellation + streaming)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): fix skip message env var names to CRISPASR_*

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): switch shim to crispasr_session_* multi-architecture API

The shim used whisper_full(), which in CrispASR is the whisper-only path:
libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture
transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR,
Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI,
which auto-detects the architecture from the GGUF and dispatches to the
matching backend.

Rewrite the C shim around crispasr_session_open / _transcribe_lang /
_result_* and add get_backend() so the selected backend is logged.
load_model now takes a threads param (session_open binds n_threads at
open). The session result is segment+word based with no token IDs and no
per-decode callback, so drop n_tokens / get_token_id /
get_segment_speaker_turn_next / set_new_segment_callback. set_abort is
kept for API parity but is best-effort: the session transcribe is blocking
with no abort hook.

Update the purego bindings and gocrispasr.go to match: tokens are left
empty, speaker-turn handling is removed, and AudioTranscriptionStream
emits one delta per non-empty segment after the blocking decode returns
(no progressive streaming via the session API), preserving the
concat(deltas) == final.Text invariant.

crispasr_session_set_translate is exported by libcrispasr but not declared
in crispasr.h, so it is forward-declared in the shim alongside the
open/transcribe/result functions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): link full CrispASR backend set for multi-arch support

The shim's crispasr_session_* dispatch calls into the per-architecture
backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer,
sensevoice, ...), which CrispASR builds as static archives. Linking only
crispasr + ggml dead-stripped every backend object from the final module
(nm backend-symbol count: 0), leaving a whisper-only .so.

Link the same backend set as crispasr-cli so the static archives are
pulled in. After this the module carries the backend symbols (nm count
407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to
every compiled-in architecture.

Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to
${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR
locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when
CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend
dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone
and as a subproject; the sed is idempotent.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): adapt suite to session API (blocking, no decode callback)

Register the new symbol set (drop the removed token/speaker/callback funcs,
add get_backend; load_model now takes 2 args). The session transcribe is
blocking with no abort hook, so a mid-decode cancel can't interrupt it:
change the cancellation spec to cancel the context before the call and
assert codes.Canceled from the pre-call ctx.Err() check, dropping the
<5s mid-decode timing assertion. The streaming spec still holds with
per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add CrispASR ASR model entries (-crispasr)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(gallery): keep only session-auto-detectable CrispASR ASR models

The crispasr backend loads models via crispasr_session_open, which
auto-detects the backend from the GGUF general.architecture using
crispasr_detect_backend_from_gguf. Architectures not in that detect
map cannot be opened, so those gallery entries fail to load.

Removed entries whose architecture is not wired into CrispASR
v0.6.11's session auto-detect router (they can be re-added when
upstream maps them):

- Not in the detect map: data2vec, firered-asr, funasr,
  fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr,
  moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b},
  paraformer, sensevoice.
- Pending verification (filename-heuristic routed, not arch-detected):
  parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the
  fastconformer-ctc backend by a filename heuristic in the model
  registry, which implies general.architecture is not a mapped string.

Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py
writes general.architecture="parakeet" unconditionally and encodes the
rnnt/ctc distinction in metadata fields, so they session-auto-detect.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): TTS synthesis via crispasr_session_synthesize (24kHz)

Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse
the already-open g_session (crispasr_session_open auto-detects a TTS
model) and dispatch to the upstream synthesis call, which returns
malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we
do not set, so it returns NULL here and surfaces as an error Go-side.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): implement TTS/TTSStream gRPC methods

Bind the new shim functions via purego and implement TTS, TTSStream and
a writeWAV24k helper. synthesize copies the C-owned PCM out before
freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via
go-audio/wav. CrispASR has no progressive synth, so TTSStream
synthesizes fully, encodes to WAV, and emits the bytes as a single
chunk; it owns the results-channel close (the gRPC server wrapper ranges
until close), mirroring vibevoice-cpp's TTSStream.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): log when a TTS voice override is not honored

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add CrispASR vibevoice-tts model entry

Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox,
and orpheus require companion codec/s3gen/SNAC paths (set_codec_path /
set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2
aren't in the session auto-detect map. Those are follow-ups.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): gated TTS synthesis spec

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(crispasr): satisfy golangci-lint (errcheck defers + unsafeptr nolint)

The crispasr Go file is entirely new, so new-from-merge-base lints every
line (unlike the grandfathered whisper backend it was forked from):
- handle os.RemoveAll / fh.Close return values in AudioTranscription
- annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): backend: and codec: model options (explicit arch + companion files)

Add two model-config options to the CrispASR backend via opts.Options:

- backend:<name> selects an explicit CrispASR backend (bypassing
  auto-detect) by routing load_model through
  crispasr_session_open_explicit, unlocking architectures the
  detector won't pick on its own (qwen3, cohere, granite, voxtral,
  moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.).
- codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC,
  chatterbox s3gen, or mimo-asr tokenizer) via the universal
  crispasr_session_set_codec_path setter after the session opens. A
  relative path resolves against the model directory. rc==0 means
  success or not-applicable; only a negative rc is fatal.

The C shim load_model gains a backend_name argument and a new
set_codec_path entry point; the Go bridge parses the prefix:value
options and registers the new symbol. The vad_only path is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): expand CrispASR models via backend:/codec: options (explicit arch + companions)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(gallery): use virtual.yaml base for crispasr models

The crispasr entries are just backend + model + a couple options, fully
expressed inline via overrides:/files: in gallery/index.yaml. Point each
url: at the shared gallery/virtual.yaml (the established 'virtual' model
trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(gallery): drop voice-requiring TTS entries (keep vibevoice-tts)

Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the
current shim: the codec: companion loads fine, but these engines additionally
need a voice pack / voice prompt / reference clip (qwen3-tts base errors
'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that
the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is
'qwen3tts', unmapped by the detector — would need backend:qwen3-tts.) Removed
to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice,
e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): speaker: and voice: TTS options (baked speakers + voice packs/prompts)

speaker:<name> -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts
CustomVoice, orpheus). voice:<path>(+voice_text:<ref>) -> crispasr_session_set_voice
(voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the
default voice; req.Voice still overrides the speaker per request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts-customvoice, orpheus)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 12:11:03 +02:00
Adira
baa11133f1 fix(config): register parakeet-cpp as a transcript backend (#9718) (#10106)
parakeet-cpp was added in #10084 but not registered in
BackendCapabilities, so GuessUsecases only allowed "whisper" for
FLAG_TRANSCRIPT and the UI could not classify parakeet-cpp models as
speech-to-text. The result was that parakeet models appeared only in
the LLM selector in the speech-to-speech pipeline, making them
unusable for transcription through the UI.

Closes #9718

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 11:15:15 +02:00
Adira
1bdd3338a6 fix(config): register 5 backends missing from BackendCapabilities (#10107)
Cross-referencing backend/ directories against BackendCapabilities found
five backends that exist and work but have no entry in the map, so
GuessUsecases falls back to heuristics that mis-classify them (e.g.
a TTS backend appears as an LLM in the UI).

Added entries, each modelled on the corresponding Python twin or the
nearest equivalent already in the map:

  sglang        — LLM (Predict/PredictStream/TokenizeString, vision)
  vibevoice-cpp — ASR + TTS/TTSStream (mirrors vibevoice Python)
  sherpa-onnx   — ASR + TTS/TTSStream + VAD (multi-model toolkit)
  qwen3-tts-cpp — TTS (mirrors qwen-tts Python)
  rfdetr-cpp    — object detection (mirrors rfdetr Python)

Found by diffing `ls backend/{go,python}/` against the keys in
BackendCapabilities. Remaining gaps (insightface, speaker-recognition,
sam3-cpp) use custom gRPC methods not yet in the Method* constants —
left for a follow-up.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 11:14:52 +02:00
LocalAI [bot]
e08492a2c3 chore: ⬆️ Update leejet/stable-diffusion.cpp to d2797b86670622b6538123b4aeb5fbb6be2653c5 (#10094)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 00:42:13 +02:00
LocalAI [bot]
d5d8fe909d docs: ⬆️ update docs version mudler/LocalAI (#10091)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 00:11:41 +02:00
LocalAI [bot]
8a82753277 chore: ⬆️ Update antirez/ds4 to ba00a8a88c4c5810a3d1fed6b7b8fa2b44b82fdc (#10095)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 00:10:33 +02:00
LocalAI [bot]
51ca109067 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3f40e73c367ad9f0c1b1819f28c7348c26aa340d (#10097)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 00:10:16 +02:00
LocalAI [bot]
07f6c15a37 feat(ds4): layer-split distributed inference (#10098)
* feat(ds4): add standalone ds4-worker distributed worker binary

Add worker_main.c, a minimal standalone worker that owns a slice of the
model's transformer layers and serves activations over ds4's own TCP
transport via ds4_dist_run(). It links the same engine objects the
backend already builds (including ds4_distributed.o) and has NO
gRPC/protobuf dependency, so it builds even on hosts lacking protobuf/grpc
dev headers. Launched by `local-ai worker ds4-distributed`.

Wire the ds4-worker CMake target (mirrors grpc-server's object/GPU/native
handling) and have the Makefile copy + clean the binary alongside
grpc-server. Ignore the built ds4-worker artifact.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(ds4): package ds4-worker alongside grpc-server

Copy the standalone ds4-worker binary into the backend package (Linux
package.sh) and the Darwin OCI tar (ds4-darwin.sh: both the explicit copy
and the otool dylib-bundling loop) so distributed workers ship with the
backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(ds4): tighten ds4-worker integer arg validation to match upstream

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(ds4): wire grpc-server as distributed coordinator

Add distributed COORDINATOR support to the ds4 backend's gRPC server.
Distributed inference is an engine backend: when LoadModel receives
'ds4_role:coordinator', the process populates ds4_engine_options.distributed
(role, layer slice, listen host/port) before ds4_engine_open, then the normal
ds4_session_* generation path runs transparently once the worker route covers
all layers.

- New LoadModel options: ds4_role, ds4_layers (START:END or START:output),
  ds4_listen (host:port), ds4_route_timeout.
- parse_layers_spec() maps the layer spec onto ds4_distributed_layers.
- wait_route_ready() blocks generation until
  ds4_session_distributed_route_ready() reports full coverage (or timeout),
  gating both Predict and PredictStream; returns UNAVAILABLE on timeout/error.
- No ds4_role => g_distributed stays false and wait_route_ready is a no-op,
  so single-node behavior is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(ds4): don't block Status during route wait; validate coordinator opts

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(cli): add ds4-distributed worker exec helper

Add the ds4WorkerArgs helper plus findDS4Backend/DS4Distributed.Run that
resolve the ds4 backend via the gallery and exec the packaged ds4-worker
binary. Unlike worker_llamacpp.go, ds4 bundles its own dynamic loader
(lib/ld.so) for glibc compatibility, so when present we exec ds4-worker
through that loader with LD_LIBRARY_PATH=<backend>/lib, mirroring
backend/cpp/ds4/run.sh; otherwise we exec it directly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(cli): register the ds4-distributed worker subcommand

Wire DS4Distributed into the Worker kong command tree so
`local-ai worker ds4-distributed` is available.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(ds4): document layer-split distributed inference

Add a ds4 section to the distributed-mode feature docs (coordinator
model YAML, manual worker command, layer-range semantics, the
'GGUF on every machine' requirement, coordinator-listens dial
direction vs llama.cpp) and a terse Distributed mode section to the
ds4 backend agent guide.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(ds4): opt-in hardware-gated distributed e2e spec

Add a self-contained, opt-in Ginkgo spec to the backend e2e suite that
spins a ds4 coordinator (via the packaged run.sh, loaded with
ds4_role/ds4_layers/ds4_listen options) plus a ds4-worker process for
the upper layers, then uses Eventually to assert a short successful
Predict once the layer route forms, before tearing the worker down.

Gated by BACKEND_TEST_DS4_DISTRIBUTED=1 (plus the existing
BACKEND_BINARY + BACKEND_TEST_MODEL_FILE and optional layer/listen/accel
knobs); compiles and skips cleanly with no env, hardware, or model.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(ds4): pass coordinator ctx to worker; lowercase error string

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(ds4): note distributed transport is plaintext/unauthenticated

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* style(ds4): replace em dashes in distributed docs/agent/test per repo convention

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(ds4): link ds4-worker with the C++ driver for CUDA/Metal builds

The ds4-worker target is built from worker_main.c (C), so CMake linked it
with the C driver. The nvcc-built ds4_cuda.o (and Obj-C++ ds4_metal.o)
reference the C++ runtime, so the CUDA/Metal builds failed with undefined
libstdc++ symbols (std::__throw_length_error). The CPU build passed because
ds4_cpu.o is pure C. Force LINKER_LANGUAGE CXX so libstdc++ is linked.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 00:09:55 +02:00
LocalAI [bot]
a44bdb29d4 feat: prefix-cache-aware routing for distributed mode (#10071)
* feat(radixtree): generic prefix tree skeleton with longest-match

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): Insert with path recency refresh and entry cap

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): recency-weighted per-value Weight

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): Remove all entries for a value

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(radixtree): race-free concurrency smoke test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard

Address review findings on the generic prefix tree:

- Extract a shared pruneWalk helper parameterized by a shouldClear
  predicate and use it from Evict, Remove, and the MaxEntries path.
  Previously evictOldestLocked cleared a victim's value but never
  removed the now value-less node or its childless ancestors, so
  internal nodes accumulated under sustained churn at the cap. The
  MaxEntries path now prunes the victim and its empty ancestors.
- DRY: pruneWalk replaces the duplicated logic in the former
  pruneLocked and Remove's inner closure.
- Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take
  the read lock (RLock) while Insert, Evict and Remove keep the write
  lock. Confirmed race-clean under go test -race.
- Document the strict greater-than TTL boundary on Options.TTL and
  expired: age exactly equal to TTL is still live.
- Guard Insert against an empty key (no-op): the root never holds a
  value.

Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation,
the no-growth-past-cap invariant, the TTL boundary, and empty-key
behavior for both Insert and LongestMatch.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): RoutePolicy enum with parse/resolve

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): Config with defaults and validation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): deterministic xxhash prefix-chain extractor

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): pure filter-then-score replica selection

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): Provider interface and radix-tree-backed Index

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(prefixcache): gofmt policy enum comment alignment

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): head-first prefix chunking and hoist Weight out of sort

Address code-quality review findings in the prefixcache package.

Correctness: ExtractChain now chunks from absolute offset 0 with fixed
[0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth
head blocks. The previous tail-keeping logic shifted the byte offset by a
non-window amount once a conversation grew past MaxDepth*WindowBytes,
changing every hash each turn and silently breaking cross-turn
longest-prefix matching. The reusable KV/prefix cache lives at the head
of the prompt, so anchoring at offset 0 makes the chain a true
prefix-chain: P and P+suffix share their full leading overlap. Add a
regression spec proving cross-turn stability past the cap.

Performance: Index.Decide precomputes each candidate's Weight once
(decorate-sort-undecorate) instead of calling the O(tree size) Weight
inside the O(n log n) sort comparator. Behavior is unchanged.

Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual
byte loop, clearing the modernize rangeint finding.

Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's
documented concurrency safety under go test -race.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(messaging): prefixcache observe/invalidate subjects and payloads

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): NATS sync publish/apply for observe and invalidate

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributedhdr): ctx carrier for prefix-hash chain

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributedhdr): PrefixChainHook indirection for backend-side chain build

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): stash prompt prefix chain on ctx before distributed routing

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend): mirror modelID fallback for prefix-chain salt parity

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): scheduling config columns for prefix-cache routing

Add RoutePolicy and per-model balance/prefix-match override columns to
ModelSchedulingConfig and include them in the SetModelScheduling upsert
DoUpdates list so updates are not dropped on conflict.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): optional route preference in FindAndLockNodeWithModel

Add a RoutePreference type and a new pref parameter so the atomic
pick+lock+increment can be biased toward a preferred node without
weakening atomicity. A nil preference reproduces the previous ORDER BY
behavior exactly. Update the ModelRouter interface, both router.go call
sites (pass nil for now; Phase 5 builds the real preference), the test
doubles, and the distributed e2e caller.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): make Sync satisfy Provider with Evict

Sync.Observe now returns whether the local index treated the assignment as
new or extended, and Sync gains an Evict method that delegates to the wrapped
index. Together these let SmartRouter hold a single prefixcache.Provider that
broadcasts via NATS. Adds a compile-time Provider assertion and an
Evict-delegates behavioral test.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route

Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil
keeps routing byte-for-byte the round-robin floor). On each request Route now
calls buildPreference: it reads the prompt prefix chain from ctx
(distributedhdr.PrefixChain), resolves the per-model policy/thresholds over
the global config, loads candidate replica in-flight via a new registry read
LoadedReplicaStats (deduped to one entry per node using the MIN in-flight
across that node's replicas), asks the provider to Decide, and runs
prefixcache.Select. The chosen node is passed as the RoutePreference to
FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick,
cold scheduleAndLoad), and the served node is recorded via Observe only when
the resolved policy is prefix_cache so round-robin models never pollute the
tree.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): invalidate prefix-cache entries on unload and stale removal

UnloadModel and both staleness fall-through paths in Route (after a failed
gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model,
nodeID), guarded by a nil-provider check so the round-robin floor is
unchanged. At runtime the provider is the *prefixcache.Sync, so invalidations
also broadcast to peer frontends. Adds a test that a previously hot prefix no
longer Decides to a node after UnloadModel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): rolling forced-disturb pressure counter

Add a concurrency-safe per-model rolling counter that tracks how many
times a request had a usable hot prefix match but the load guard forced
it off the warm node. Entries outside the window are dropped lazily on
Count so the backing slice stays bounded.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): autoscale on prefix-cache forced-disturb pressure

Wire the rolling forced-disturb counter into the SmartRouter and the
ReplicaReconciler.

Router: in buildPreference, after Decide + Select, record a forced-disturb
when a usable hot prefix match existed (d.HotNodeID != "" and
d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or
nothing) because the load guard ruled the warm node out. This is the
scale-worthy signal: the cache-warm replica is saturated. It deliberately
does not fire for all-unique workloads (no hot match), avoiding
false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil
keeps the path a no-op.

Reconciler: read the same Pressure instance in reconcileModel as an extra
scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel
guards and the UnsatisfiableUntil cooldown that gates the whole method.
Pressure never overrides MaxReplicas and never force-evicts; a no-capacity
model does not spin. Window and threshold come from prefixcache.Config
(PressureWindow default 1m, PressureScaleThreshold default 1) and are
configurable via ReplicaReconcilerOptions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow

Record now prunes entries older than the rolling window (the same prune
Count does), via a shared pruneLocked helper, so a model that takes
forced-disturb records but is never Counted (e.g. one with zero loaded
replicas the reconciler skips) no longer grows its backing slice
unbounded.

Also removes the dead pressureWindow struct field and the
ReplicaReconcilerOptions.PressureWindow option from the reconciler: they
were stored but never read (the window lives inside the *prefixcache.Pressure
instance). The scale block now reads pressure.Count once into a local.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(api): prefix-cache fields in scheduling endpoint DTO with validation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): prefix-cache routing controls in node scheduling form

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): wire prefix-cache index, NATS sync, and config

Activates prefix-cache-aware routing in distributed mode. Builds the
prefixcache Index + NATS-backed Sync + Pressure counter, installs the
distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix
chain per request, subscribes to prefixcache.observe/prefixcache.invalidate
to apply peers' events to the local index (no re-broadcast), threads
PrefixProvider/PrefixConfig/Pressure into the SmartRouter and
Pressure/PressureThreshold into the ReplicaReconciler, and runs a
background eviction ticker (every TTL/2) bound to the app context.

Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE)
opts out and leaves the provider/pressure nil so routing stays round-robin.
--distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m)
controls entry idle-timeout and eviction cadence.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(nodes): round-robin-floor invariant for prefix-cache routing

Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never
picked even with a perfect prefix match (round-robin floor holds), while a
balanced hot node within the load slack is reused.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(prefixcache): clear branch lint findings and em dashes

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): validate prefix-cache config at startup wiring

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(radixtree): single-walk WeightsFor for batch value weights

Add Tree.WeightsFor(values, now) which computes the recency-weighted
weight for many values in a single O(N + len(values)) tree traversal,
versus calling Weight once per value (O(len(values) * N)). Consumers
that score K candidates against the tree under the read lock no longer
pay K full walks.

Extract the per-entry contribution math into an unexported helper shared
by both Weight and WeightsFor so the metric stays identical (DRY).
Weight's public behavior is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): add ModelConfig.ModelID() single source of truth

The c.Name fallback to c.Model was duplicated in core/backend/options.go
(feeding model.WithModelID) and hand-copied into core/backend/llm.go (the
prefix-chain salt). These MUST agree or the prefix-cache salt diverges
silently from the id the model loader tracks. Consolidate both into a new
config.ModelConfig.ModelID() helper and call it from both sites. Behavior
is identical.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(prefixcache): reuse one xxhash.Digest in ExtractChain

ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth
per call) and grew the chain slice without preallocation. Reuse a single
Digest via Reset() before each block and preallocate the chain to
min(nBlocks, MaxDepth).

xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical
value to a fresh New()+Write. Output hashes are unchanged, preserving the
cross-process determinism that peers rely on over NATS. Verified by capturing
ExtractChain output for the existing test inputs before and after the
refactor: identical. Existing extractor tests pass unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk

Index.Decide called radixtree.LongestMatch over the whole tree, so the
deepest match could be a node that is offline, unloaded, or simply not in
the passed candidate set. Honoring that as HotNodeID produced a false
forced-disturb signal upstream (buildPreference records pressure when
chosen != HotNodeID), making it look like a warm replica was load
saturated when it was actually absent.

Build the candidate set once and only set HotNodeID/MatchRatio when the
matched node is an actual candidate; otherwise fall back to cold
placement. A future refinement could ask the tree for the longest match
restricted to the candidate nodes (shallower-but-valid) instead of
dropping it.

Also replace the per-candidate tree.Weight call in the cold-order sort
with a single tree.WeightsFor walk, turning O(K*N) under the read lock
into O(N + K).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): remove Select's unreachable deterministic fallback

buildPreference always passes ColdOrder as a permutation of the full
candidate set, so the cold-order loop hits every eligible candidate. The
trailing best/bestIF scan was dead. Replace it with a plain "return """
and document that ColdOrder is guaranteed to cover all candidates, so ""
means none were eligible.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(nodes): fetch model scheduling config once per Route

GetModelScheduling was read three times per request - in
resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling -
three DB round-trips for one row that is immutable for the life of the
request, and not a consistent snapshot. Fetch it once near the top of
Route and thread the *ModelSchedulingConfig (may be nil) into all three
helpers. scheduleNewModel keeps its own fetch since it runs outside the
Route snapshot. Behavior is identical for nil sched.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(autoscale): add Pressure.Reset to consume forced-disturb signal

Pressure.Count is non-draining (it prunes only by age), so a single burst
of forced-disturbs stays within the rolling window for the whole window and
keeps Count >= threshold on every reconciler tick. The reconciler will use
Reset to clear a model's events after acting on the signal so a fresh
scale-up requires fresh forced-disturbs to accumulate, rather than one burst
driving the model toward MaxReplicas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(autoscale): at most one scale-up per reconcile tick, consume pressure

Two autoscale bugs:

1. Over-scaling: the pressure scale-up block read Pressure.Count but never
   consumed it. With a non-draining counter a single forced-disturb burst
   kept Count >= threshold across the whole window, firing scaleUp on every
   tick and pushing the model toward MaxReplicas off one transient burst.
   After a successful pressure-triggered scale-up the reconciler now calls
   Pressure.Reset to consume the signal.

2. Double scale-up in one tick: the all-replicas-busy block and the pressure
   block could both fire in the same reconcileModel pass, each calling
   scaleUp(+1) against the same `current` read once at the top, so a model
   that was both busy and over threshold scaled +2 and could overshoot
   MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1)
   per tick: the pressure block is skipped if the busy block already scaled,
   and scale-down is skipped in any tick that scaled up.

MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are
unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation

Add SetReplicaRemovedHook to NodeRegistry and fire it from both
RemoveNodeModel and RemoveAllNodeModelReplicas after a successful
delete. This is the single chokepoint every replica-removal path funnels
through (router eviction, reconciler scale-down, probe reaper,
health-monitor node-down reap, RemoteUnloaderAdapter), so the
prefix-cache index can be invalidated by construction rather than wiring
each call site individually.

The hook is stored in an atomic.Pointer so the startup wiring (setter)
and the request/reconcile-time fire are race-free; it is nil-safe when
unset. GORM Delete reports no error for a no-op delete, so the hook also
fires when nothing was removed; the consumer's Invalidate(model, node)
is idempotent so this is harmless.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): invalidate prefix-cache on any replica removal via registry hook

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): single source of truth for threshold bounds

Extract ValidateThresholds into prefixcache/config.go so the per-model
override validation (nodes.go endpoint) and Config.Validate share one
implementation of the numeric bounds (min_prefix_match in [0,1],
balance_abs_threshold >= 0, balance_rel_threshold == 0-or->= 1) instead
of hard-coding them in two places. The route_policy allow-list stays
explicit (not ParsePolicy, which maps typos to Default).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): preserve prefix-cache settings on partial scheduling update

A scheduling POST that omitted route_policy/thresholds (e.g. a
min_replicas-only update) full-replaced every column and silently reset
the model's previously-configured prefix-cache settings to empty/zero.

Make the four prefix-cache request fields pointers so omitted is
distinguishable from explicit zero, and merge PATCH-style in
SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves
the existing config value (zero default when none). Non-prefix fields
keep their full-replace PUT semantics. Validation now runs on the
resolved values via prefixcache.ValidateThresholds.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts

A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica
removal of every model, including round-robin models that never used the
prefix cache. Index.Invalidate previously called tree(model), which lazily
created and permanently retained an empty radix tree for any model that ever
lost a replica, growing the trees map without bound. Sync.Invalidate also
published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op
removals across the cluster.

Index.Invalidate now looks the tree up read-only via existingTree and returns
without allocating when none exists. The Provider interface is unchanged;
Sync gates the broadcast through an optional invalidateExisting(bool) capability
type-asserted from the wrapped Index, falling back to the prior always-broadcast
behavior for other Provider implementations.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort

WeightsFor already returns a map keyed by every requested candidate, so the
separate candidates set built to validate the hot match was redundant: a node
is a candidate iff it is a key in the weights map. Drop the extra map and gate
the hot-match check on weights membership. Also skip the sort when there is at
most one candidate, since the input order is already the cold order. Behavior
is unchanged.

Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins
would need lazy cross-file changes and is out of scope here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns

Bulk node-scoped node_models deletes (Register re-register cleanup,
MarkOffline, MarkDraining, Deregister) removed rows directly without
firing the replica-removed hook, so the prefix-cache index kept
pointing at nodes whose models were gone. Capture the DISTINCT model
names before each bulk delete and fire fireReplicaRemoved once per
model after a successful delete, restoring the single-chokepoint
invariant for all removal paths. The pre-query is skipped when no hook
is set so the no-hook path stays cheap.

Also narrow LoadedReplicaStats to SELECT only node_id and in_flight
(the only fields the router consumer reads), dropping the JOIN-side
available_vram fetch and unused columns while keeping the
[]ReplicaCandidate return type unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(reconciler): consume autoscale signals only on a real scale-up

scaleUp was fire-and-forget (void) yet its callers unconditionally
consumed the pressure signal (Pressure.Reset) and the MinReplicas
hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp
added nothing (ScheduleAndLoadModel errored, or no node could be
loaded) the saturated warm replica got no new replica AND its
accumulated forced-disturb history was wiped, forcing the signal to
re-accumulate over a full PressureWindow before the next attempt.

Make scaleUp return whether at least one replica was actually
scheduled, and gate the side effects on it:

- pressure block (2b): set scaledUp and call Pressure.Reset only on
  success; on failure preserve the signal so the next tick retries off
  the same accumulated pressure.
- busy-burst block (2): set scaledUp from the return value so a failed
  attempt does not suppress the pressure path or scale-down.
- MinReplicas block: call ClearUnsatisfiable only on success so a
  failed attempt does not reset the unsatisfiable counter.

All existing invariants (MaxReplicas, capacity gating,
UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are
preserved.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(nodes): drop router's redundant prefix-cache Invalidate calls

The NodeRegistry removal chokepoint (RemoveNodeModel /
RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which
invalidates the prefix-cache index. The router was also calling
prefixProvider.Invalidate explicitly right after each registry removal
on the two stale-replica health-probe fall-throughs in Route and in
UnloadModel, so every router-side eviction invalidated twice (double
tree-prune + double NATS broadcast).

Remove the three redundant explicit Invalidate calls and their empty
nil-guards. Each removed call sat immediately after a registry removal
that fires the hook, so invalidation is preserved via the chokepoint.
Decide/Observe usage is untouched.

Re-point the unit test (fake registry fires no hook) to assert the
removal chokepoint is exercised on unload instead of the router's
direct invalidation.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): make capture+delete atomic in bulk node_models removal paths

MarkOffline, MarkDraining, and the Register re-register cleanup ran the
nodeModelNames SELECT and the bulk node_models DELETE as two separate
statements on r.db with no transaction. A SetNodeModel landing between
the two was deleted but its replica-removed hook never fired, leaving
the prefix-cache index pointing at a removed replica until TTL or
candidacy self-heal.

Wrap the capture and the delete in a single db.Transaction in each path
(mirroring how Deregister already does it). The captured model names are
collected into a slice declared outside the closure; the
replica-removed hook fires for each only after the transaction commits,
so a rollback never invalidates the index for a removal that did not
persist. The set of fired hooks now equals exactly the set of
node_models rows actually deleted, with no interleaving gap.

The status flip in MarkOffline/MarkDraining (setStatus) is a separate,
pre-existing operation and routing already filters non-healthy nodes, so
it stays outside the transaction; return contracts are unchanged.
Deregister was already correct and is untouched. The cheap-path skip
(no hook -> skip the SELECT) is preserved.

Adds a spec asserting MarkOffline fires hooks for exactly the rows it
deletes and leaves no node_models row behind (consistent snapshot).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(nodes): debug logging for prefix-cache routing decisions and observations

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(radixtree): match shared prefixes by valuing every node on insert

Insert recorded the value (node id) only on the final node of the key
chain, leaving every intermediate prefix node valueless. LongestMatch
returns the deepest node that hasValue, so two chains that share a
leading block but diverge in the tail never matched: only exact-repeat
queries hit. That broke the prefix-cache routing core use cases (shared
system prompt, multi-turn extension, volatile tail), all of which rely
on prefix matching rather than exact-repeat.

Set value/hasValue/lastSeen at every node along the chain so each
prefix-block node remembers the node id that served that prefix
(SGLang/vLLM-style). The deepest match wins, and the last writer owns a
shared prefix node (a recency heuristic: the most recent chain through a
block is the one most likely still warm). size now counts valued nodes,
which is the intended meaning.

Updated radixtree tests to the new semantics: deepest-prefix test uses
non-overlapping chains, a new test asserts last-writer-owns-shared-node,
Evict/Remove/MaxEntries expectations recomputed for per-prefix-node
counting, and a shared-prefix LongestMatch red test added. Added a
prefixcache Decide test proving a prefix-only query routes to the warm
node. No prefixcache .go logic changed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): lock in prefix-cache routing behavior end to end

Add a DB-backed e2e spec that drives SmartRouter against a real
NodeRegistry (Postgres testcontainer) and the real prefixcache.Index
radix-tree provider, using a fake gRPC backend factory so no real
inference runs. Covers the five behaviors validated by hand:

1. Cold miss + observe: an unseen prefix chain cold-places and is recorded.
2. Hot-match affinity: the same chain returns to its warm node X.
3. Shared-prefix match: a divergent chain sharing X's leading prefix
   still routes to X (the radix-tree regression we fixed).
4. Negative control: an unrelated chain is a cold miss, not a false
   hot match on X.
5. Failover + invalidation: removing X's replica fires the registry
   chokepoint hook to invalidate the prefix entry, and the chain fails
   over to surviving node Y and re-homes there.

Replaces the need for manual docker-compose re-runs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): make prefix-cache affinity replica-granular

Track prefix-cache affinity per loaded replica (a backend process with its
own KV cache) instead of per node, so multiple replicas of the same model on
one node each keep distinct affinity and a hot prefix routes back to the exact
replica that served it.

- radixtree: add RemoveFunc(pred) and reimplement Remove on top of it.
- prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/
  PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to
  drop every replica of a node; Invalidate drops one replica. Select returns
  (ReplicaKey, bool) and gains a deterministic least-in-flight eligible
  fallback (tiebreak NodeID then Replica).
- messaging: carry Replica on PrefixCacheObserveEvent and
  PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node).
- Sync delegates + broadcasts with replica; InvalidateNode broadcasts
  Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode.

This is part 1 of 2; the registry/router/wiring consumers are updated
separately.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): make prefix-cache routing replica-granular

Wire the SmartRouter, NodeRegistry, and distributed startup to the
replica-keyed prefixcache API. Affinity is now tracked per replica
(each replica is a separate process with its own KV cache), so a prefix
served by (node,0) no longer leaks onto the same-node sibling (node,1).

- RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks
  the EXACT (node_id, replica_index) row, falling through to the default
  ORDER BY when that replica is not loaded.
- SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires
  the specific replica, RemoveAllNodeModelReplicas and the four bulk
  node-scoped deletes fire replica<0 (all replicas of the node).
- buildPreference builds one Candidate per loaded replica and locks the
  exact replica the policy chose; observePrefix records the served
  ReplicaKey at every call site.
- distributed.go routes the hook to InvalidateNode (replica<0) or
  Invalidate(key).
- Tests updated to the replica-keyed API plus new coverage: a hot prefix
  on (node,0) prefers replica 0 over the same-node sibling (router unit +
  e2e), and FindAndLock locks the exact preferred replica.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): derive prefix chain from messages for tokenizer-template models

Prefix-cache-aware routing built its prompt-prefix chain from the rendered
prompt string `s` in ModelInference. For models with
TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the
backend tokenizes the structured messages itself - so `s` is empty, the chain
is empty, and routing silently falls back to round-robin. That covers the bulk
of modern chat models (qwen3, llama3, ...), so the feature effectively never
engaged for them.

Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable
head-first serialization of the conversation (role + content per turn). Two
requests sharing a leading system prompt and early turns share a leading byte
prefix, which ExtractChain maps to a shared chain prefix - landing both on the
same cache-warm replica. The rendered `s` is still preferred when present
(higher fidelity for non-template models).

Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision"
logs despite per-request Route calls, traced to the empty-chain guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document prefix-cache routing roadmap

Add a routing-and-caching roadmap section to the distributed-mode guide,
linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced
from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and
NVIDIA Dynamo.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-30 23:24:22 +02:00
LocalAI [bot]
aee4611ab2 chore: ⬆️ Update mudler/parakeet.cpp to 30a307553f1965ceb38a1a922069a71e7dd67bf3 (#10092)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 22:48:09 +02:00
LocalAI [bot]
486467623c chore: ⬆️ Update antirez/ds4 to e16ead1e29c81a67bbb64e5b001117679cf9ce6e (#10076)
* ⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(ds4): link new ds4_distributed.o into grpc-server build

Upstream ds4 e16ead1e split distributed inference into a new translation
unit (ds4_distributed.c/.h). ds4.c and ds4_cpu.o now reference its
ds4_dist_* symbols, so the grpc-server link fails with undefined
references unless that object is built and linked.

Add ds4_distributed.o to both the upstream object build (Makefile) and
the grpc-server link set (CMakeLists.txt) for every GPU mode. It is a
single GPU-agnostic object, so it is built/linked unconditionally.

Verified: the six undefined ds4_dist_session_* references in ds4_cpu.o
are all defined by the newly built ds4_distributed.o (nm cross-check).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-30 22:08:30 +02:00
LocalAI [bot]
4912c9b73a feat(parakeet-cpp): add NVIDIA NeMo Parakeet ASR backend (parakeet.cpp) (#10084)
* feat(parakeet-cpp): L0 backend scaffold, LoadModel + AudioTranscription (text)

Add a Go gRPC backend that bridges LocalAI to parakeet.cpp via the flat
C-API (parakeet_capi.h), loaded with purego (cgo-less, mirrors the
whisper / vibevoice-cpp backends).

L0 scope:
- main.go: dlopen libparakeet.so (override via PARAKEET_LIBRARY), register
  the C-API entry points, start the gRPC server.
- goparakeetcpp.go: Load (parakeet_capi_load), AudioTranscription
  (parakeet_capi_transcribe_path, decoder=0 = per-arch default head),
  Free, serialized through base.SingleThread since the C engine is a
  thread-unsafe singleton. char* returns are bound as uintptr so the
  malloc'd buffer is freed via parakeet_capi_free_string after copy.
- AudioTranscriptionStream returns a clear "not implemented in L0" error
  (closes the channel so the server doesn't hang), wired in L2.
- Makefile: clone-at-pin + cmake (PARAKEET_VERSION for bump_deps.sh),
  with a local-symlink dev shortcut; run.sh / package.sh mirror whisper.
- Test auto-skips without PARAKEET_BACKEND_TEST_MODEL/_WAV fixtures.

Builds clean (CGO_ENABLED=0), gofmt clean, test passes. The single
unsafeptr vet note in goStringFromCPtr is documented and matches the
whisper backend's tolerated pattern.

Word/segment timestamps (L1) and cache-aware streaming (L2) follow.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): L1 word/segment timestamps via transcribe_path_json

AudioTranscription now calls parakeet_capi_transcribe_path_json and shapes
the per-word / per-token timestamps into the TranscriptResult:

- Bind parakeet_capi_transcribe_path_json (purego, char* as uintptr like
  the other returns) and register it in main.go + the test loader.
- Parse the JSON document ({"text","words":[{w,start,end,conf}],
  "tokens":[{id,t,conf}]}) into typed structs.
- Synthesise a single whole-clip segment (parakeet emits no native segment
  boundaries) spanning the first word start to the last word end; token ids
  populate Segment.Tokens.
- Attach word-level timings only when timestamp_granularities=["word"],
  matching the OpenAI API (segment-level default). secondsToNanos mirrors
  the whisper backend's nanosecond convention.

Verified end-to-end against tdt_ctc-110m (f16): both the default and
word-granularity specs pass; builds clean, gofmt clean, vet shows only the
one documented unsafeptr note shared with the whisper backend.

Cache-aware streaming (L2) follows.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): L2 cache-aware streaming with EOU segmentation

Wire AudioTranscriptionStream to the streaming RNN-T C-API:

- Bind parakeet_capi_stream_{begin,feed,finalize,free}; feed takes 16 kHz
  mono float PCM ([]float32 via purego) and writes *eou_out on <EOU>/<EOB>.
- Decode opts.Dst to 16 kHz mono PCM (utils.AudioToWav + go-audio, same as
  the whisper backend), feed it in 1 s chunks, and emit each newly-finalized
  text run as a TranscriptStreamResponse delta.
- <EOU>/<EOB> events close the current segment; a closing FinalResult carries
  the full transcript plus the per-utterance segments (with a whole-clip
  fallback segment when no EOU fired).
- stream_begin returns 0 for non-streaming models, surfaced as a clear
  error instead of an empty stream. Honours context cancellation between
  chunks. Frees every malloc'd delta and the session.

Verified end-to-end against realtime_eou_120m-v1 (f16): the streamed
transcript matches the offline 110m reference word-for-word, deltas
reconstruct the final text, and the spec passes alongside the offline
specs. Builds clean, gofmt clean, vet shows only the shared documented
unsafeptr note.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): L3 register backend in build/CI/gallery (whisper parity)

Wire the new Go gRPC parakeet-cpp backend (parakeet.cpp ggml port of NVIDIA
NeMo Parakeet ASR) into LocalAI's build/CI/gallery surfaces, matching the
existing ggml whisper Go backend 1:1.

- .github/backend-matrix.yml: add 11 linux entries + 1 darwin entry mirroring
  every whisper build (cpu amd64/arm64, intel sycl f32/f16, vulkan amd64/arm64,
  nvidia cuda-12, nvidia cuda-13, nvidia-l4t-arm64, nvidia-l4t-cuda-13-arm64,
  rocm hipblas, metal-darwin-arm64), all on ./backend/Dockerfile.golang with
  backend: "parakeet-cpp" and -*-parakeet-cpp tag-suffixes.
- scripts/changed-backends.js: explicit inferBackendPath branch resolving
  parakeet-cpp to backend/go/parakeet-cpp/ before the generic golang branch.
- .github/workflows/bump_deps.yaml: track the PARAKEET_VERSION pin in
  backend/go/parakeet-cpp/Makefile (repo mudler/parakeet.cpp, branch master).
- backend/index.yaml: add &parakeetcpp meta + latest/development image entries
  for every matrix tag-suffix.
- Makefile: add backends/parakeet-cpp to .NOTPARALLEL, BACKEND_PARAKEET_CPP
  definition, docker-build target eval, and test-extra-backend-parakeet-cpp-
  transcription target (mirrors test-extra-backend-whisper-transcription).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): L4 gallery importer for parakeet GGUFs

Add ParakeetCppImporter so parakeet.cpp GGUFs auto-detect on /import-model
and route to the parakeet-cpp backend (it also surfaces in /backends/known,
which drives the import dropdown).

- Match is narrow: a .gguf whose name carries a parakeet architecture token
  (<arch>-<size>-<quant>.gguf, e.g. tdt_ctc-110m-f16.gguf, rnnt-0.6b-q4_k.gguf,
  realtime_eou_120m-v1-q8_0.gguf), a direct URL to one, or
  preferences.backend="parakeet-cpp". It deliberately does NOT claim arbitrary
  llama-style GGUFs, nor the upstream nvidia/parakeet-* NeMo repos (.nemo, not
  runnable here).
- Registered in the ASR batch BEFORE LlamaCPPImporter so its GGUFs aren't
  swallowed by the generic .gguf importer.
- Import nests files under parakeet-cpp/models/<name>/, defaults to the
  smallest quant (q4_k, near-lossless on parakeet) with a size-ladder
  fallback, and honours preferences.quantizations / name / description.

Tested with synthetic HF details (no network): metadata, positive matches
(HF repo, direct URL, preference), narrowness negatives (llama GGUF, NeMo
repo), and import (default quant, override, direct URL), 9 specs pass,
build/vet/gofmt clean.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(parakeet-cpp): document the parakeet-cpp transcription backend

Add parakeet-cpp to the audio-to-text backend list and a dedicated usage
section: direct GGUF import (auto-detects to the backend), model YAML,
word-level timestamps via timestamp_granularities[]=word, and cache-aware
streaming with the realtime_eou model. Points at the mudler/parakeet-cpp-gguf
collection repo.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(parakeet-cpp): wire transcription gRPC e2e test into test-extra

The L3 commit added the test-extra-backend-parakeet-cpp-transcription
Makefile target but never invoked it in CI. Mirror the whisper job:

- Add a parakeet-cpp output to detect-changes (emitted by
  changed-backends.js from the matrix entry).
- Add tests-parakeet-cpp-grpc-transcription, gated on the parakeet-cpp
  path filter / run-all, building the backend image and running the
  transcription e2e against tdt_ctc-110m + the JFK clip.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(parakeet-cpp): drop em dashes from comments and docs

Replace em dashes with plain punctuation in the backend comments, the
importer, package.sh, and the audio-to-text docs section (and use "and"
instead of the multiplication sign). No behaviour change.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add parakeet-cpp f16 models to the model gallery

Add the 10 NVIDIA Parakeet models (f16, the recommended quality/speed
default) as gallery entries that install on the parakeet-cpp backend from
mudler/parakeet-cpp-gguf: tdt_ctc-110m/1.1b, tdt-0.6b-v2/v3, tdt-1.1b,
ctc-0.6b/1.1b, rnnt-0.6b/1.1b, and the cache-aware streaming
realtime_eou_120m-v1. Each pins the file sha256 and routes transcript
usecases to the backend.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): satisfy govet lint + bump PARAKEET_VERSION

- goparakeetcpp.go: //nolint:govet on the C-owned-pointer unsafe.Pointer
  conversion (golangci-lint reports new-only issues, so unlike the whisper
  backend's identical line this one is flagged).
- Makefile: bump PARAKEET_VERSION to the current parakeet.cpp master commit
  (the previous pin's commit no longer exists after upstream history was
  squashed), so the backend image clone/build resolves again.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): pin PARAKEET_VERSION to a tag-stable commit

The previous SHA pin was orphaned when parakeet.cpp's single-commit master
was amended/force-pushed, so the backend image clone (git fetch <sha>) failed
across every build variant. Repoint to 845c29e, which upstream now keeps
permanently fetchable via the `localai-backend-pin` tag, so future upstream
amends no longer break the backend build.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): init the ggml submodule in the backend image clone

The backend Dockerfile clones parakeet.cpp at PARAKEET_VERSION with a shallow
fetch + checkout but never initialised submodules, so third_party/ggml was
empty and the parakeet.cpp cmake build failed at
`add_subdirectory(third_party/ggml)` (CMakeLists.txt:53) on every build
variant. Add `git submodule update --init --recursive --depth 1
--single-branch` after checkout, mirroring the whisper backend. Verified
locally: clone + submodule + cmake configure now succeeds.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): statically link ggml into libparakeet.so

The shared libparakeet.so linked ggml's shared libs (libggml*.so), but the
package only ships libparakeet.so, so at runtime dlopen failed with
"libggml.so.0: cannot open shared object file" (the e2e transcription test
panicked on load). Build ggml static + PIC (BUILD_SHARED_LIBS=OFF,
CMAKE_POSITION_INDEPENDENT_CODE=ON) so libparakeet.so embeds ggml and depends
only on system libs already present in the runtime image. Verified locally:
ldd shows no libggml dependency.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): non-streaming fallback in AudioTranscriptionStream

The e2e streaming test ran AudioTranscriptionStream against tdt_ctc-110m
(not a cache-aware streaming model), so stream_begin returned 0 and the call
errored. Per LocalAI's streaming contract (and the whisper backend), a
non-streaming model should fall back to a single offline transcription
emitted as one delta plus a closing FinalResult. Do that instead of erroring,
so the streaming endpoint works for every parakeet model. Verified locally:
the streaming spec passes against the non-streaming 110m model via fallback.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-30 14:46:10 +02:00
Richard Palethorpe
12d1f3a697 security(http): refuse redirects on outbound clients via hardened pkg/httpclient (#10087)
LocalAI's outbound HTTP clients used Go's default redirect policy, which
follows up to 10 redirects. On a cross-host redirect Go forwards custom
request headers — including credential headers such as Anthropic's
x-api-key — to the redirect target (Go strips Authorization, Cookie and
WWW-Authenticate cross-host, but NOT arbitrary custom headers). An
attacker able to elicit a redirect from an upstream (a hijacked or
spoofed upstream, DNS trickery, or a malicious upstream_url) then
harvests the operator's provider API key.

This was first reported against the cloud-proxy / MITM PII path
(GHSA-3mj3-57v2-4636); the same class affects every other outbound
client. Rather than patch each call site, add pkg/httpclient as the one
sanctioned constructor for outbound HTTP and route everything through it.

pkg/httpclient:
  - New(...)             refuses redirects, TLS 1.2 floor, no body
                         deadline (streaming/SSE safe)
  - NewWithTimeout(d)    simple request/response calls
  - WithFollowRedirects  opt-in following that still strips credential
                         headers on any cross-host hop; different
                         scheme/host/port == different origin, guarding
                         the curl CVE-2022-27774 port-confusion class
  - WithTransport(rt)    keep a custom transport (IP-pin, HTTP/2, a
                         credential-injecting RoundTripper)
  - HardenedTransport()  base transport with the TLS floor + bounded setup
  - Harden(c)            apply the policy to a library-supplied *http.Client
  - NoRedirect           the CheckRedirect policy; wraps ErrRedirectBlocked

Lint: a forbidigo rule flags http.DefaultClient and http.Get/Post/
PostForm/Head, pointing at pkg/httpclient (.golangci.yml,
.agents/coding-style.md). forbidigo cannot match the &http.Client{}
composite literal without also flagging legitimate *http.Client type
references, so that form is enforced by review.

Migrates every non-test outbound call site across core/, pkg/, cmd/, and
the Go backend (backend/go/cloud-proxy). Credential-bearing and
internal-RPC clients refuse redirects; download / CDN / registry clients
use WithFollowRedirects so they keep working while stripping secrets
cross-host. The only credential-bearing client that follows redirects is
the gated-download path (pkg/downloader/uri.go), which strips the token
on the cross-host hop to the CDN. Hardening this closes, in passing:
  - MCP remote-server bearer token leaking via a redirect (the
    RoundTripper re-injected Authorization on every hop)
  - agent multimedia/webhook clients leaking user-supplied auth headers
  - cors_proxy following redirects, bypassing its SSRF IP-pin
  - downloader's authorized read path leaking the token cross-host

Fixes: GHSA-3mj3-57v2-4636 (cloud-proxy leaks operator provider API key
(x-api-key) to attacker host on cross-host redirect)
Reported-by: tonghuaroot
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-30 12:04:10 +02:00
LocalAI [bot]
a7cad704b9 chore: ⬆️ Update ggml-org/llama.cpp to 22d66b567eef11cf2e9832f04db64ee0323a0fd0 (#10080)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 08:34:00 +02:00
LocalAI [bot]
7e4df67556 chore: ⬆️ Update ggml-org/whisper.cpp to f24588a272ae8e23280d9c220536437164e6ed28 (#10078)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 01:09:52 +02:00
LocalAI [bot]
5b24b4dacc chore: ⬆️ Update mudler/rf-detr.cpp to 65c0ffcc9a9bc9dae38252f63d0417c9845a6cf7 (#10075)
⬆️ Update mudler/rf-detr.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 00:55:41 +02:00
LocalAI [bot]
52fdb46892 docs: ⬆️ update docs version mudler/LocalAI (#10074)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 00:24:34 +02:00
LocalAI [bot]
b389f0fe5f chore(model-gallery): ⬆️ update checksum (#10081)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 00:11:57 +02:00
LocalAI [bot]
74281be340 chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.22.0 (#10079)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 00:11:41 +02:00
LocalAI [bot]
cacf2f7a2c chore: ⬆️ Update ikawrakow/ik_llama.cpp to 8960c5ba5ee9db30ba838304373aa4dbec9f7cbd (#10077)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 00:11:27 +02:00
LocalAI [bot]
4a2cc64d07 feat(reasoning): honor per-request reasoning_effort on chat completions (#10082)
The OpenAI `reasoning_effort` field only reached the prompt template; it
never toggled the backend's thinking. Map it onto
ReasoningConfig.DisableReasoning (which becomes the enable_thinking gRPC
metadata) in the request merge, so reasoning_effort="none" disables
reasoning per request: the use case from #10072 (run a single Qwen3-style
model and turn reasoning off for low-latency tasks while keeping it on
for others).

Effort levels (minimal/low/medium/high) enable thinking unless the model
config explicitly disabled it (reasoning.disable: true wins and is never
re-enabled by a request); "none" always disables.

Closes #10072


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-29 22:09:07 +00:00
Richard Palethorpe
4647770316 fix(model): track intentional stops, stop misreading clean shutdowns as crashes (#10060)
Two separate issues made graceful backend shutdown look ungraceful in the
logs, even though the processes were being terminated correctly
(go-processmanager defaults to process-group SIGTERM + 15s grace + SIGKILL):

1. "failed to read PID" — startProcess registers a per-process graceful-
   termination handler that calls Stop(), but StopAllGRPC (registered
   earlier, via app.Shutdown) already stopped and released store-tracked
   backends first. The second Stop() then failed reading the removed
   pidfile. Guard the handler with IsAlive() so it skips already-stopped
   processes; it still covers backends StopAllGRPC doesn't track (worker-
   supervised ones).

2. "Backend process exited unexpectedly" exitCode=-1 — the exit watcher
   treated only exit codes 0/143 as clean. But a child killed by our own
   SIGTERM/SIGKILL is reported by Go as exitCode -1 (signal termination),
   not the shell's 128+signal convention, so every intentional stop logged
   a false crash warning. The exit code can't distinguish an intended stop
   from a signal-induced crash.

Track intent directly instead: a stoppingProcs sync.Map (keyed by the
*process.Process pointer) is marked wherever LocalAI calls Stop() on
purpose, and the exit watcher uses it to pick the log level — Info
"stopped" when intentional, Warn "exited unexpectedly" otherwise (still
catching real crashes). The raw exit code is reported as a field but no
longer interpreted.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-29 18:54:27 +02:00
LocalAI [bot]
3c9b9529c0 chore(model gallery): 🤖 add 1 new models via gallery agent (#10061)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 16:39:14 +02:00
TLoE419
fc2bd0986c test(utils): cover path verification, sanitization, and unique naming (#9978)
pkg/utils/path.go provides the security primitives for download paths
(VerifyPath, InTrustedRoot) and the file-naming helpers used by every
import flow (SanitizeFileName, GenerateUniqueFileName). None of them had
test coverage, so a future regression in the traversal check or in the
".." stripping inside SanitizeFileName would land unnoticed.

The new specs pin the lexical contract for each helper:

- VerifyPath accepts strict descendants and inner traversal that stays
  inside the base, rejects "..", compound traversal, and the base path
  itself. An explicit spec documents that the check is purely lexical
  (filepath.Clean, not EvalSymlinks) so any future caller that needs
  symlink-aware defence knows to EvalSymlinks first.
- InTrustedRoot rejects the trusted root and sibling directories,
  accepts deeply nested descendants.
- SanitizeFileName covers the leading-directory and absolute-prefix
  paths plus the embedded ".." case ("foo..bar" -> "foobar") that the
  Clean+Base layer alone would leave intact.
- GenerateUniqueFileName covers the no-collision, single-collision,
  walk-the-counter, and empty-extension cases using GinkgoT().TempDir()
  so the suite stays hermetic.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: TLoE419 <tloemizuchizu@gmail.com>
2026-05-29 10:40:08 +00:00
Ching
a473a32678 test(react-ui): cover models gallery empty-state reset flow (#10019)
Exercise the filtered empty-state path in the models gallery and verify
that the clear-filters action restores the list and resets the filter
selection.

Assisted-by: Codex:gpt-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-05-29 10:39:33 +00:00
LocalAI [bot]
3e220373b0 fix(functions): validate auto-detected XML tool-call names — robust glm-4.5/Hermes guard (#9722, supersedes #9940) (#10059)
fix(functions): validate auto-detected XML tool-call names (#9722)

The XML tool-call auto-detector tries every preset, including glm-4.5 whose
tool block is <tool_call>name...</tool_call>. When a Hermes/NousResearch model
emits <tool_call>{"name":"bash","arguments":{...}}</tool_call>, glm-4.5
mis-claims the block and returns the entire JSON object (or leading prose, or a
JSON array) as the function NAME. The misparse then wins over the JSON parser,
so streaming clients receive a tool call whose name is a JSON blob.

Guard the auto-detect paths in ParseXMLIterative: a returned tool name must look
like a real function name ([A-Za-z0-9_.-]+). Results that don't are dropped so
auto-detection falls through to the next format and ultimately to JSON parsing,
which handles Hermes correctly. An explicitly forced format (format != nil) is
left untouched and trusted verbatim.

This supersedes PR #9940, which dropped only names with a leading "{". That
narrower check misses leading prose ("Sure: {...}"), JSON arrays ("[{...}]")
and brace-less garbage ("name: bash, ..."); the name-shape check rejects all of
them while still accepting legitimate glm-4.5 calls. The fix applies to both the
streaming worker and the non-streaming ParseFunctionCall path, which both call
ParseXMLIterative with auto-detection.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-29 12:03:33 +02:00
Richard Palethorpe
fbcd886a47 fix(application): stop backend processes synchronously on shutdown (#10058)
application.New wires a fire-and-forget goroutine that runs
StopAllGRPC + distributed.Shutdown when the app context is cancelled.
Callers (tests, CLI signal handler) cancel the context and then exit
immediately, so the test binary / process can terminate before that
goroutine kills the spawned backend children. go-processmanager sets no
Pdeathsig, so the orphans are reparented to init and survive — leaving
dozens of stray mock-backend processes after an e2e run.

Add Application.Shutdown(), which runs the same cleanup synchronously on
the caller's stack and is idempotent via sync.Once. The context-cancel
goroutine, the CLI signal handler, and the test suites all call it, so
cleanup is deterministic and the duplicated teardown logic collapses to
one place. The async goroutine remains as a safety net for callers that
forget; sync.Once dedupes the double call.

Wire e2e_suite_test and the two mock-backend Contexts in app_test to
call Shutdown in their AfterSuite/AfterEach.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-29 11:40:43 +02:00
泊舟
e1a782b70f fix(openai): stop streaming tool-call double-emission when autoparser is active (#10055)
Streaming /v1/chat/completions could emit the same logical tool call at
multiple `index` values. In processStreamWithTools the Go-side iterative
parser (ParseXMLIterative / ParseJSONIterative) runs on every token and
emits tool-call deltas, while the C++ chat-template autoparser delivers
its own tool calls via ChatDeltas that are flushed at end-of-stream by
ToolCallsFromChatDeltas -> buildDeferredToolCallChunks. With both paths
active the same call is emitted twice at different indices, so OpenAI
clients that accumulate tool calls by `index` dispatch the tool N times.

Skip the Go-side iterative parser once the autoparser is producing tool
calls (hasChatDeltaToolCalls). The deferred flush stays guarded by
lastEmittedCount, so the race where the Go parser emitted before the flag
flipped also remains single-emission. Backends without an autoparser
(e.g. vLLM) keep hasChatDeltaToolCalls=false and are unaffected.

Refs #9722

Signed-off-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com>
Co-authored-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-29 11:39:09 +02:00
LocalAI [bot]
73cfedc023 fix: tool-call JSON leaks into content with stream+tools on tokenizer-template models (#10052) (#10057)
* fix(grammars): honor properties_order entry at index 0

The JSON-schema-to-GBNF property sort used `aOrder != 0 && bOrder != 0` as
its "is this key ordered?" guard. That treats index 0 — the first key listed
in properties_order — as unset, so `properties_order: name,arguments` fell
back to alphabetical ordering and still emitted "arguments" before "name".

Use presence in the order map instead: listed keys sort by their index and
ahead of unlisted keys, which keep a stable alphabetical order. This makes
the documented `properties_order: name,arguments` actually produce
name-first tool-call JSON. Relates to #10052.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(functions): defer tool grammar to the backend when the tokenizer template owns templating (#10052)

When use_tokenizer_template delegates templating to the backend (llama.cpp),
the backend also owns tool-call grammar generation and parsing. LocalAI was
still generating its own GBNF grammar and sending it down. With a grammar
present, llama.cpp does not hand the tools to its template, so its native
peg/json tool parser never engages: it streams the grammar-constrained
tool-call JSON back as plain content instead of emitting tool_calls. In
streaming mode the JSON object leaked into the content field, and the
Go-side incremental detector never gated content because the
LocalAI-generated grammar emitted "arguments" before "name".

The GGUF auto-import path already couples use_tokenizer_template with
grammar.disable, but that block is skipped when a template is already
configured, so gallery and hand-written configs (e.g. qwen3) that set the
tokenizer template directly never got the paired grammar.disable.

- SetDefaults now enforces the coupling for every config: when
  use_tokenizer_template is set, grammar generation is disabled and tools
  flow to the backend's native (name-first) pipeline. This also fixes
  already-installed models without editing each config.
- Set function.grammar.disable in the shared gallery/qwen3.yaml, which is
  the base config referenced by every qwen3 gallery entry.

Verified end to end against qwen3-4b with stream:true + tools: content no
longer carries the tool-call JSON, reasoning is classified separately, and
tool calls stream as proper name-first tool_calls deltas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-29 10:12:53 +02:00
LocalAI [bot]
b982c977d5 chore: ⬆️ Update ggml-org/whisper.cpp to c932729a304f7d9eb5354afa38624cfa86a780cf (#10051)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:42:06 +02:00
LocalAI [bot]
532ca1b3a2 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 6eff055a0cc0e427a6849cfcb5de531b4b82d667 (#10050)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:41:50 +02:00
LocalAI [bot]
00ad55b590 chore: ⬆️ Update ggml-org/llama.cpp to 751ebd17a58a8a513994509214373bb9e6a3d66c (#10049)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:41:35 +02:00
LocalAI [bot]
4c58fd302f chore: ⬆️ Update leejet/stable-diffusion.cpp to 0e4ee04488159b81d95a9ffcd983a077fd5dcb77 (#10048)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:41:18 +02:00
LocalAI [bot]
66582e7035 chore: ⬆️ Update antirez/ds4 to 22393e770ea8eb7501d8718d6f66c6374004e03f (#10047)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:41:02 +02:00
LocalAI [bot]
1d13949588 docs: ⬆️ update docs version mudler/LocalAI (#10046)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:40:47 +02:00
LocalAI [bot]
c8ad67bbca chore: ⬆️ Update mudler/rf-detr.cpp to ecf64d7f7f20d73ebd906a983f398ed287256320 (#10035)
⬆️ Update mudler/rf-detr.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:39:47 +02:00
LocalAI [bot]
1c92b00918 fix(turboquant): guard upstream-only grpc-server fields for fork (#10043)
fix(turboquant): guard upstream-only grpc-server fields for fork build

backend/cpp/llama-cpp/grpc-server.cpp is reused by the turboquant build,
which compiles against an older llama.cpp fork (TheTom/llama-cpp-turboquant).
Two recent changes added references to upstream-only struct fields outside the
existing LOCALAI_LEGACY_LLAMA_CPP_SPEC guards:

  - common_params::checkpoint_min_step (default + option handler), added with
    the ggml-org/llama.cpp 35c9b1f3 bump (#9998)
  - the common_params_speculative::draft tensor_buft_overrides sentinel
    termination (#9919), which sat after the guard's #endif

The fork has neither field, so grpc-server.cpp failed to compile for every
turboquant flavor. Wrap the three references in #ifndef
LOCALAI_LEGACY_LLAMA_CPP_SPEC, matching the existing fork-compat guards, so the
stock llama-cpp build is unchanged and the fork build skips them. Update
patch-grpc-server.sh's doc comment to record what the macro now gates out.

Verified by a local fallback-flavor turboquant build: grpc-server.cpp compiles
against the fork and the backend image builds.


Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-28 17:37:54 +02:00
Richard Palethorpe
b81a6d01b3 perf(react-ui): code-split bundle, speed up coverage suite (#10042)
* Curate the highlight.js build to ~29 languages (lib/core + the
  common set) instead of the full ~190-grammar default: -787 KB raw /
  -230 KB gz on the base bundle.
* Code-split every route via React.lazy with a per-layout <Suspense>
  in App.jsx so the sidebar stays mounted on navigation. Initial entry
  chunk drops from 3194 KB raw / 887 KB gz to 397 KB / 122 KB (-87%).
  Warm chunks on sidebar hover/focus/touch via a preload registry so
  the click finds the chunk already in flight or cached.
* Migrate Playwright coverage from istanbul (build-time counters) to
  native Chromium V8 coverage, with per-worker accumulation +
  conversion. Suite drops from 71s to 30s at 20 workers (~58%) at the
  non-instrumented floor.
* Keep the coverage gate bundling-invariant: the coverage build inlines
  dynamic imports so every shipped source file lands in the denominator
  (otherwise untested page chunks silently drop out and inflate the
  percentage). Production builds stay code-split.
* Add UI_TEST_WORKERS=N Makefile knob; tighten coverage tolerance to
  0.8pp now that jitter sits near istanbul's ~0.5pp again.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-28 13:43:15 +02:00
Tai An
0fd666ee6e fix(openresponses): populate Content and accept bare {role,content} items (#10039) (#10040)
* fix(openresponses): populate Content and accept bare {role,content} items (#10039)

Fixes mudler/LocalAI#10039 — `/v1/responses` silently returned empty
output on any model whose YAML doesn't include a Go-side
`template.chat_message` block.

Three cooperating bugs:

* `convertORInputToMessages` populated only `StringContent` for string
  input and for the `input.Instructions` system message, leaving the
  `Content` (any) field nil.
* `TemplateMessages` gated all fallback content-rendering branches on
  `Content != nil && StringContent != ""` — but every branch in that
  function consumes `StringContent`, not `Content`. The `&&` silently
  dropped messages that had StringContent set and Content nil, producing
  an empty prompt that the 5× empty-retry guard then turned into a
  200 OK with `output: []`.
* The array-input branch of `convertORInputToMessages` dispatched on
  `itemMap["type"]` with no default, dropping bare `{role, content}`
  items emitted by the OpenAI Python SDK helper
  `client.responses.create(input=[{...}])`.

Fix:

* Set both `Content` and `StringContent` in the two openresponses
  message-construction sites that only set one.
* Treat a bare `{role, content}` item (no `type`) as
  `type: "message"` for OpenAI-SDK compatibility.
* Gate `TemplateMessages` fallback rendering on `StringContent != ""`,
  which is what every downstream branch in that function actually
  reads.

Regression test added to `evaluator_test.go` covering the fallback
path (no `ChatMessage` template) with a StringContent-only message,
both with and without a role mapping.

* test(openresponses): guard Content population and ToProto path (#10039)

Add regression tests for the two seams the original fix touched but
left uncovered:

* convertORInputToMessages must populate both Content and StringContent
  for plain string input and for bare {role, content} array items (the
  OpenAI SDK shape that omits the type discriminator). Both are
  functional reds against the pre-fix code.
* Messages.ToProto reads Content, not StringContent — this is the path
  UseTokenizerTemplate backends (imported GGUFs) take. The cases pin
  that contract so a future regression on the producer side is caught.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-28 07:21:48 +00:00
LocalAI [bot]
7763fb23a3 chore: ⬆️ Update antirez/ds4 to 072bc0feb187be5f374c08b16d0045e1ad7bc9bc (#10036)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 08:41:03 +02:00
LocalAI [bot]
324277ccfd chore: ⬆️ Update ggml-org/whisper.cpp to 6dcdd6536456158667747f724d6bd3a2ceaa8d88 (#10032)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 00:25:20 +02:00
LocalAI [bot]
10d02e6c59 chore: ⬆️ Update leejet/stable-diffusion.cpp to 29ab511fc75f89fbab148665eab1a8e10a139a72 (#10033)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 00:24:59 +02:00
LocalAI [bot]
05ae06c17b chore: ⬆️ Update ggml-org/llama.cpp to aa50b2c2ae91326d5aad956ceeb015d1d48e626b (#10034)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 00:23:40 +02:00
LocalAI [bot]
2671e0c6f7 chore(model-gallery): ⬆️ update checksum (#10038)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 00:22:19 +02:00
LocalAI [bot]
81b6b94f0b chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3bf7e836c2c5a895e8d12d3eb7e398ae7ab2f9ce (#10037)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 00:21:45 +02:00
LocalAI [bot]
373dc44992 fix(react-ui): force .check() on hidden Toggle input in fits-filter e2e (#10031)
* fix(react-ui): force .check() on hidden Toggle input in fits-filter e2e

The polish PR (#10030) swapped the raw <input type=checkbox> for the
shared <Toggle> component, which visually hides its native input via
opacity:0;width:0;height:0. Playwright's .check() waits for visibility
before clicking and times out after 30 s, breaking two UI E2E tests:

  - enabling fits filter hides models that exceed available VRAM
  - fits filter state persists after reload

Pass { force: true } to skip the visibility check; the input is still
the real focusable checkbox and toggles state on click. The companion
.toBeChecked() assertion only reads state and works unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* fix(react-ui): click visible Toggle track in fits-filter e2e

force:true skips the actionability checks but not the viewport check,
and the Toggle's hidden input has width:0;height:0 so Playwright still
reports "Element is outside of the viewport". Click the visible
.toggle__track inside the filter-bar-group__toggle wrapper instead —
that's what a real user clicks, and label-input association toggles
the wrapped checkbox naturally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-27 22:41:01 +02:00
LocalAI [bot]
02a0e70396 fix(react-ui): polish 'Fits in my GPU' filter to use design-system Toggle (#10030)
* fix(react-ui): polish 'Fits in my GPU' filter to use design-system Toggle

The recently added VRAM-fit filter in the Models page used a raw
<input type="checkbox"> next to the themed range slider, breaking the
visual language of the rest of the row. Swap it for the shared
<Toggle> component (already used by Backends, Settings, Traces,
AgentCreate), adopt the filter-bar-group__toggle class to drop the
duplicated inline styles, add a fa-microchip icon to mirror the
per-row fit indicator, and add a subtle left divider so the filter
reads as separate from the context-size slider on its left.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(react-ui): move 'Fits in GPU' filter to filter row and unify copy

Two follow-ups on the previous polish pass:

1. Move the toggle from the context-slider row into the filter-button
   row above. The toggle is a filter on the result set, not a config
   for VRAM estimation, so it belongs with the type chips and backend
   select. The context slider stays its own thing.

2. Unify the label copy. The same locale file had "Fits in my GPU"
   for the filter and "Fits in GPU" for the per-row indicator; pick
   the shorter, possessive-free variant everywhere (en/de/es/it/zh-CN).
   Update e2e selectors to match.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-27 21:09:14 +02:00
LocalAI [bot]
7a4ca8f60d feat(backend): rfdetr-cpp native object detection + segmentation backend (#10028)
Adds a Go native gRPC backend that dlopens librfdetrcpp.so (built from
mudler/rf-detr.cpp at the pinned RFDETR_VERSION) via purego and exposes
the rfdetr.cpp inference pipeline through LocalAI's existing Detect RPC.

Supports all 5 RF-DETR detection variants (Nano/Small/Base/Medium/Large)
and 6 segmentation variants (SegNano/SegSmall/SegMedium/SegLarge/
SegXLarge/Seg2XLarge) with F32/F16/Q8_0/Q4_K quantizations. Pre-built
GGUFs ship at mudler/rfdetr-cpp-* on HuggingFace.

Detection returns Bbox + class_name + confidence; segmentation also
returns PNG-encoded per-detection masks via the rfdetr_capi accessor
functions (rfdetr_capi_get_detection_{class_id,box,score,class_name,
mask_png}).

End-to-end verified through POST /v1/detection: HTTP -> gRPC -> purego
dlopen -> rfdetr.cpp -> ggml -> response (9 detections on the detection
model, 21 detections + valid PNG masks on the seg-nano model against
the kitchen fixture).

Wiring:
  - backend/go/rfdetr-cpp/{main.go,gorfdetrcpp.go,CMakeLists.txt,
    Makefile,run.sh,package.sh,test.sh,.gitignore}
  - Top-level Makefile: BACKEND_RFDETR_CPP, docker-build target,
    .NOTPARALLEL, prepare-test-extra, test-extra
  - backend/go/rfdetr-cpp/Makefile: `test` target invoked by test-extra
  - .github/backend-matrix.yml: CPU + CUDA-12/13 + L4T CUDA-12/13
    (arm64) + HIP + Vulkan (amd64 + arm64) + SYCL f32/f16
  - backend/index.yaml: rfdetr-cpp meta anchor + latest/development
    image entries for every matrix tag-suffix
  - .github/workflows/bump_deps.yaml: RFDETR_VERSION pin tracking
    (mudler/rf-detr.cpp branch main)
  - gallery/index.yaml: 11 rfdetr-cpp-* entries (nano + 4 detection
    variants + 6 seg variants), all backed by mudler/rfdetr-cpp-*
    on HuggingFace with sha256 pinning on the F16 default
  - core/gallery/importers/rfdetr.go: GGUF auto-routing for HF imports
    (mudler/rfdetr-cpp-* repos route to rfdetr-cpp, Transformer-format
    repos stay on the Python rfdetr backend; explicit preferences.backend
    overrides both heuristics)
  - core/gallery/importers/rfdetr_test.go: table-driven coverage of the
    auto-routing + a live mudler/rfdetr-cpp-nano cross-check

scripts/changed-backends.js needs no change: the existing
Dockerfile.golang -> backend/go/${item.backend}/ branch already routes
the 9 rfdetr-cpp matrix entries to the correct backend path.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-27 18:43:57 +02:00
LocalAI [bot]
893e69cbf8 fix(react-ui): share single /api/operations poller across consumers (#10029)
useOperations() spun up its own setInterval per hook instance, so on
pages like /app/models the OperationsBar in App.jsx plus the page's
own useOperations() call each polled /api/operations at 1 Hz - 2 RPS
sustained for the whole session, repeated on Backends and Chat.

Lift the poller into an OperationsProvider mounted under AuthProvider
so all consumers (OperationsBar, Models, Backends, Chat) share one
timer. The hook file re-exports from the context to keep call sites
unchanged.


Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-27 16:39:09 +02:00
Siddharth More
c9a1a7e6a0 UI: add 'Fits in my GPU' filter on Install Models (#10017)
* feat(ui): add GPU fit filter on models install page

* Delete docs/vram-fits-filter-backend-optionals.md

Signed-off-by: Siddharth More <siddimore@gmail.com>

---------

Signed-off-by: Siddharth More <siddimore@gmail.com>
2026-05-27 15:17:44 +02:00
LocalAI [bot]
4d01298048 chore: ⬆️ Update antirez/ds4 to e8e8779b261c10f36ad6270ba732c8f0be5b62e3 (#10024)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-27 15:16:43 +02:00
LocalAI [bot]
b6055e7ecf chore: ⬆️ Update leejet/stable-diffusion.cpp to 92dc7268fc4ffb0c0cc0bd52dfcefea91326e797 (#10023)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-27 15:16:23 +02:00
LocalAI [bot]
51bad74bf8 chore: ⬆️ Update ggml-org/llama.cpp to 0d18aaa9d1a8af3df9abccd828e22eeaac7f840b (#10022)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-27 00:29:14 +02:00
LocalAI [bot]
f3236b74cf chore: ⬆️ Update ggml-org/whisper.cpp to 27101c01dcac1676e2b6422256233cd0f1f9ae28 (#10021)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-27 00:28:55 +02:00
LocalAI [bot]
eed3ecff82 chore: ⬆️ Update ikawrakow/ik_llama.cpp to d2da6da05c73aeb658a3d1751f386c24e6963856 (#10020)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-27 00:28:32 +02:00
番茄摔成番茄酱
df7623fd87 fix(nemo): extract Hypothesis.text for TDT/RNNT ASR models (#10012)
* fix(nemo): extract Hypothesis.text for TDT/RNNT ASR models

CTC models (e.g. Whisper) return List[str] from transcribe(), but
TDT/RNNT models (e.g. parakeet-tdt-0.6b-v3) return List[Hypothesis]
where the decoded text lives in the Hypothesis.text attribute.

Previously, results[0] was assigned directly to the protobuf string
field, causing silent empty output for non-CTC models.

Now checks the return type and extracts .text from Hypothesis objects,
with a safe fallback via getattr().

* refactor: simplify Hypothesis text extraction per Copilot review

Use single getattr() call instead of hasattr() + double access,
and return empty string for unknown types instead of str(result)
to avoid leaking internal repr to clients.
2026-05-26 20:35:23 +00:00
番茄摔成番茄酱
4e5ec6f67b fix(qwen-asr): enable timestamp output when forced_aligner is configured (#10013)
* fix(qwen-asr): enable timestamp output when forced_aligner is configured

Two bugs prevented timestamps from working in the qwen-asr backend:

1. transcribe() was called without return_time_stamps=True, so the
   forced aligner was loaded but never invoked. Now we pass
   return_time_stamps=True when a forced_aligner is present.

2. The timestamp parsing code expected (list, tuple) items, but the
   qwen_asr library returns ForcedAlignItem dataclass instances with
   .text, .start_time, .end_time attributes. Added hasattr() check
   to handle this correctly, falling back to tuple parsing for
   backward compatibility.

* refactor: address Copilot review for qwen-asr timestamps

- Wrap return_time_stamps kwarg in try/except TypeError for safety
- Add defensive float() normalization for timestamp times
- Use str() for text extraction to ensure string type

* fix(qwen-asr): convert seconds to nanoseconds for Go time.Duration

The Go server reads TranscriptSegment.start/end via time.Duration,
which is in nanoseconds. Previously the backend sent milliseconds
(* 1000), causing timestamps to be 1000x too small (e.g. 8e-8
instead of 0.08). Convert seconds → nanoseconds (* 1e9) instead.

Also applies to the legacy tuple path for consistency.

* feat(qwen-asr): respect timestamp_granularities (segment vs word)

Read request.timestamp_granularities from the gRPC request.
- 'word': return one segment per aligned item (character / word)
- 'segment' (default): merge consecutive items at sentence boundaries

Sentence boundaries detected via CJK punctuation (。!?;…)
and Latin endings (. ! ? ;). This matches the OpenAI Whisper API
contract where omitting the parameter defaults to segment-level.

* fix(qwen-asr): escape smart quotes in punctuation set

Unicode curly quotes (U+2018/2019) were being interpreted as Python
string delimiters, causing SyntaxError. Use explicit unicode escapes.

* fix(qwen-asr): use time-gap threshold for segment boundaries

The forced aligner strips punctuation from its output, so text-based
sentence detection doesn't work. Instead, detect segment boundaries
by measuring time gaps between consecutive aligned items.

Threshold = max(median_gap * 4, 0.3s). This cleanly separates
intra-sentence gaps (< 0.24s) from inter-sentence gaps (> 0.3s)
across Chinese, English, and other languages.

* fix(qwen-asr): smart join with spaces for non-CJK tokens

The forced aligner strips whitespace from tokenized text, so English
words like ['hello', 'world'] were joined as 'helloworld'. Add
_smart_join() that inserts spaces between non-CJK tokens while
keeping CJK characters and punctuation unspaced. Works for Chinese,
English, Korean, Japanese, and mixed-language text.

---------

Co-authored-by: fqscfqj <fqsfqj@outlook.com>
2026-05-26 20:34:21 +00:00
Richard Palethorpe
8d70855ea6 test: add Go + React UI coverage gates and fill test gaps (#9989)
- Strict monotonic Go coverage gate (make test-coverage-check, 45% baseline)
  run in CI; fixes ginkgo dropping all-but-one coverprofile across multiple
  recursive roots, builds with -tags auth, and folds in the in-process
  tests/e2e suite via --coverpkg.
- React UI e2e coverage (make test-ui-coverage: vite-plugin-istanbul + nyc,
  nix-provided Chromium) plus e2e specs for 6 previously-untested pages, and a
  UI coverage gate (make test-ui-coverage-check) with a small tolerance since
  e2e line coverage jitters ~0.5pp run-to-run.
- pre-commit hook: lint + coverage on Go changes, Playwright e2e + UI coverage
  gate on react-ui changes; install with make install-hooks.
- New Go handler tests (settings, branding), hermetic base64 download test.
- fix(ui): model editor reads vram_display (snake_case), so the VRAM estimate
  renders again; covered by a regression test.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-26 22:06:10 +02:00
dependabot[bot]
90c29f9258 chore(deps): bump protobuf from 6.33.5 to 7.35.0 in /backend/python/transformers (#10004)
chore(deps): bump protobuf in /backend/python/transformers

Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 6.33.5 to 7.35.0.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Commits](https://github.com/protocolbuffers/protobuf/commits)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-version: 7.35.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-26 22:03:59 +02:00
dependabot[bot]
66aaa525e5 chore(deps): update transformers requirement from >=5.8.1 to >=5.9.0 in /backend/python/transformers (#10005)
chore(deps): update transformers requirement

Updates the requirements on [transformers](https://github.com/huggingface/transformers) to permit the latest version.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v5.8.1...v5.9.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.9.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-26 22:03:38 +02:00
dependabot[bot]
a29a06f6b6 chore(deps): bump sentence-transformers from 5.5.0 to 5.5.1 in /backend/python/transformers (#10007)
chore(deps): bump sentence-transformers in /backend/python/transformers

Bumps [sentence-transformers](https://github.com/huggingface/sentence-transformers) from 5.5.0 to 5.5.1.
- [Release notes](https://github.com/huggingface/sentence-transformers/releases)
- [Commits](https://github.com/huggingface/sentence-transformers/compare/v5.5.0...v5.5.1)

---
updated-dependencies:
- dependency-name: sentence-transformers
  dependency-version: 5.5.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-26 22:03:05 +02:00
LocalAI [bot]
80893a298b chore(model gallery): 🤖 add 1 new models via gallery agent (#10016)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-26 22:02:12 +02:00
Richard Palethorpe
9d90e04418 fix(dockerignore): exclude local-only artifacts from build context (#10015)
The build context shipped to the daemon included several large
untracked directories the image never needs: saved image tarballs
(backend-images), locally-installed backends (local-backends), the
host-built binary (local-ai), the rust target/ build output, and
host node_modules/protoc/tests. This bloated the context to ~23GB.

Exclude them so only the sources the Dockerfile actually copies are
transferred. backend/rust sources stay tracked; only target/ is ignored.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-26 22:01:50 +02:00
Ettore Di Giacinto
11f43ad8b0 docs(readme): refresh News section for 4.0-4.3 and split Past sponsors
Walk down the release history and add per-release one-liners for 4.3.0,
4.2.0, 4.1.0, and 4.0.0 in the Latest News section, leading with the
headline win for each release. Move Prem into a collapsible "Past
sponsors" block under the active sponsors row.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.7 [claude-code]
2026-05-26 16:40:40 +00:00
LocalAI [bot]
437f0fa193 chore(model gallery): 🤖 add 1 new models via gallery agent (#10011)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-26 08:45:10 +02:00
dependabot[bot]
aa743f8824 chore(deps): bump actions/stale from 10.2.0 to 10.3.0 (#10002)
Bumps [actions/stale](https://github.com/actions/stale) from 10.2.0 to 10.3.0.
- [Release notes](https://github.com/actions/stale/releases)
- [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md)
- [Commits](b5d41d4e1d...eb5cf3af3a)

---
updated-dependencies:
- dependency-name: actions/stale
  dependency-version: 10.3.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-26 08:39:13 +02:00
dependabot[bot]
2162611dca chore(deps): bump github.com/aws/aws-sdk-go-v2/credentials from 1.19.15 to 1.19.17 (#10008)
chore(deps): bump github.com/aws/aws-sdk-go-v2/credentials

Bumps [github.com/aws/aws-sdk-go-v2/credentials](https://github.com/aws/aws-sdk-go-v2) from 1.19.15 to 1.19.17.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/credentials/v1.19.15...credentials/v1.19.17)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/credentials
  dependency-version: 1.19.17
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-26 08:34:56 +02:00
LocalAI [bot]
4aad97971c chore: ⬆️ Update ggml-org/llama.cpp to 35c9b1f39ebe5a7bb83986d64415a079218be78d (#9998)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): track upstream rename checkpoint_every_nt -> checkpoint_min_step

Upstream llama.cpp renamed common_params::checkpoint_every_nt to
checkpoint_min_step and changed its default from 8192 to 256. The semantics
also shifted: it used to enforce a fixed checkpoint cadence during prefill,
now it sets a minimum spacing between context checkpoints. Track the new
field name in grpc-server.cpp and accept the old option names as backward-
compatible aliases for users with existing configs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-26 08:34:41 +02:00
LocalAI [bot]
e4c70fca7a fix(streaming/tools): don't leak prefill-misclassified content as trailing reasoning chunk (#10000)
When the C++ autoparser is in pure-content fallback mode (qwen3-4b after
model emits a tool-call JSON in non-thinking mode, the streaming worker
ended the SSE stream with a spurious

    data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}}

chunk carrying the same JSON that was already in delta.tool_calls.

The Go-side ReasoningExtractor is configured from
DetectThinkingStartToken, which scans the model's jinja chat template
verbatim and finds <think> inside an {% if enable_thinking %} block
without evaluating the conditional. Every output chunk then runs through
PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and
makes ExtractReasoning treat everything after as reasoning. The autoparser
correctly classifies zero reasoning (qwen3's tool format isn't on
llama.cpp's recognized-tool list, so all tokens land in
ChatDelta.Content), but processStreamWithTools then preferred
extractor.Reasoning() over functions.ReasoningFromChatDeltas at the
end-of-stream flush — handing the polluted Go-side state to
buildDeferredToolCallChunks, which emitted it as a trailing reasoning
chunk.

Two changes:

* Add a sticky preferAutoparser flag to processStreamWithTools, mirroring
  the analogous flag in processStream from #9985. Once any ChatDelta
  carries content or reasoning, the flag stays on for the rest of the
  stream and the worker stops falling back to the Go-side extractor for
  per-token deltas. This avoids the per-chunk leak path and the cumulative
  pollution.

* Extract chooseDeferredReasoning, a small helper that selects the
  end-of-stream reasoning source. When preferAutoparser is set, return
  functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to
  extractor.Reasoning() (the correct source for vLLM and other backends
  with no autoparser).

The helper has a focused test suite covering both sides of the contract:
autoparser-active with empty reasoning (the qwen3 case — the fix's
purpose), autoparser-active with real reasoning_content
(jinja-with-recognized-format models), and autoparser-not-active with
genuine Go-side reasoning (vLLM-style backends).

E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery
shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with
name='exec' and the right arguments, finish_reason=tool_calls, and zero
reasoning chunks — down from one polluted reasoning chunk before this
fix.

Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to
make the trailing chunk observable end-to-end; the helper unit tests are
independent.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-26 08:34:26 +02:00
dependabot[bot]
4b398c9798 chore(deps): bump github.com/nats-io/nats.go from 1.50.0 to 1.52.0 (#10003)
Bumps [github.com/nats-io/nats.go](https://github.com/nats-io/nats.go) from 1.50.0 to 1.52.0.
- [Release notes](https://github.com/nats-io/nats.go/releases)
- [Commits](https://github.com/nats-io/nats.go/compare/v1.50.0...v1.52.0)

---
updated-dependencies:
- dependency-name: github.com/nats-io/nats.go
  dependency-version: 1.52.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-26 08:34:09 +02:00
LocalAI [bot]
4a5219fa9c chore: ⬆️ Update ggml-org/whisper.cpp to e0fd1f6787a5bd4a4957dd97c5b64df882ee7b0c (#9997)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-26 08:33:53 +02:00
LocalAI [bot]
b5a620294e chore: ⬆️ Update leejet/stable-diffusion.cpp to 1ceb5bd9df7784bcdf67dd9ed8bf0198b542ebc9 (#9994)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-26 08:33:37 +02:00
LocalAI [bot]
5d544a7868 chore: ⬆️ Update ikawrakow/ik_llama.cpp to b4e1d916c5ec7e75ea3c124dd090425a99fc613f (#9995)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 23:57:17 +02:00
LocalAI [bot]
87e01aa290 chore: ⬆️ Update antirez/ds4 to ad0209f6a4b067574d2b4afe896c08c177156b31 (#9996)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 23:56:33 +02:00
LocalAI [bot]
f17d99f6e5 fix(streaming/tools): stop healing-marker stubs from gating off content (#9999)
* fix(streaming/tools): stop healing-marker stubs from gating off content

When the C++ autoparser is in pure-content fallback mode (e.g. qwen3
without --jinja) and the model emits a tool call as JSON, the streaming
worker calls ParseJSONIterative on each new chunk. parseJSONWithStack
heals partial input like `{` into `{"<marker>":1}` where <marker> is a
random integer. removeHealingMarkerFromJSON only stripped the marker
from values, so the synthetic key survived and downstream callers saw
a stub object with a random-looking key.

chat_stream_workers.go's JSON tool-call detector then bumped
lastEmittedCount past the stub even though no real tool call was
emitted, gating off ALL subsequent content chunks. The qwen3 + tools +
streaming case ended up dribbling only the first `{"` to clients and
then nothing, even when the model went on to call the noAction
`answer({"message": "…"})` pseudo-tool.

Three changes, each with its own regression test:

* removeHealingMarkerFromJSON now strips the marker suffix from keys
  too, dropping the entry when the truncated key is empty. Inputs like
  `{` no longer leak `{"<marker>":1}` to callers; partial keys like
  `{ "code` still preserve the model-typed prefix `code`.

* ParseJSONIterative skips empty-after-healing maps so a healed `{`
  doesn't surface as a stub result.

* The streaming JSON detector now breaks (not continues) on entries
  without a usable `name`, and only bumps lastEmittedCount past
  successfully-emitted entries. Defense-in-depth against any future
  partial-parse shape.

The parser tests cover eight partial-JSON-prefix shapes and verify no
marker characters leak into keys, plus the two early shapes (`{`,
`{"`) that should not surface a stub at all.

Fixes #9988

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(streaming/tools): cover the autoparser-correctly-working path

Extract the JSON tool-call streaming emit loop into emitJSONToolCallDeltas
and unit-test it against every shape that can hit the streaming worker:

* the bug case — a healing-marker stub at index 0 must NOT bump
  lastEmittedCount, so subsequent content chunks keep flowing;
* the autoparser-correctly-working case — empty jsonResults (because
  the C++ autoparser cleared the raw text and delivers tool calls via
  TokenUsage.ChatDeltas) is a no-op, leaving the deferred end-of-stream
  emitter to ship the autoparser's tool calls;
* a single complete tool call — emit one chunk, advance to 1;
* arguments arriving as a JSON-string vs as a nested object — both
  serialize to the wire as JSON-string arguments;
* multiple parallel tool calls — one chunk each;
* a real tool call followed by a partial stub — emit the real one,
  stop at the stub, resume on a later chunk once the stub completes.

Locks down the no-regression guarantee the user asked for: this PR's
fix is scoped to the pure-content fallback path; when the autoparser
actually classifies tool calls (jinja-recognized chat format with tool
support), the helper is a no-op and nothing changes.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 23:55:35 +02:00
LocalAI [bot]
597daa925b docs: ⬆️ update docs version mudler/LocalAI (#9993)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 22:40:52 +02:00
LocalAI [bot]
3e0612b8b4 feat(swagger): update swagger (#9992)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 22:40:32 +02:00
LocalAI [bot]
de2ce74bea fix(stablediffusion-ggml): mux LTX-2 audio into output MP4 (#9990)
feat(stablediffusion-ggml): mux LTX-2 audio into output MP4

sd.cpp's generate_video now returns a sd_audio_t* alongside the video
frames for models with an audio VAE (LTX-2.3). Our gosd wrapper was
already collecting that pointer but immediately freed it without ever
muxing it into the output, so LTX-2 generations landed as silent MP4s
even though the audio VAE decode succeeded.

Stage the planar float32 waveform to a temp WAV (IEEE float, header
hand-built; samples interleaved on the fly), then add it as a second
ffmpeg input with -c:a aac -map 0:v:0 -map 1:a:0 -shortest. The temp
WAV is cleaned up unconditionally after ffmpeg exits, including on
the write/waitpid error paths.

Non-LTX models (Wan i2v / FLF2V) keep their current behaviour: audio
arg is nullptr, the audio-related ffmpeg flags are not added, and no
temp file is created.

Assisted-by: Claude:claude-opus-4-7

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 22:40:16 +02:00
LocalAI [bot]
1c6c3adad6 fix(reasoning): stop <think> leaking into content when autoparser is in pure-content mode (#9991)
When LocalAI templates a thinking model outside of jinja (the default for
the qwen3 gallery family), llama.cpp's chat parser falls back to a
"pure content" PEG parser that dumps the entire raw response into
ChatDelta.Content with an empty ReasoningContent. The Go side then
trusted that content verbatim and overrode tokenCallback's
correctly-split reasoning, so <think>...</think> blocks ended up in the
OpenAI `content` field. Regression from v4.0.0 introduced when the
autoparser ChatDeltas path was added (#9224).

The override now runs Go-side reasoning extraction defensively when the
autoparser delivered content but no reasoning. The streaming worker
gains a sticky preferAutoparser flag that flips on the first chunk
where the autoparser classified reasoning_content; until then we use
the streaming Go-side extractor. Realtime mirrors the non-streaming
fallback. When the autoparser already populated ReasoningContent we
trust it untouched, so jinja-enabled installs are not regressed.

gallery/qwen3.yaml now enables use_jinja, letting the autoparser
classify <think> natively for all 20+ qwen3 family entries that share
this template.

Fixes #9985

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 22:39:50 +02:00
Ettore Di Giacinto
c2cd3b9ada fix(gallery/ltx-2.3): add vae_decode_only:false for i2v / flf2v (#9987)
LTX-2.3 i2v inference fails inside generate_video with:

  [ERROR] LTXAV image conditioning requires VAE encoder weights;
  create the context with vae_decode_only=false

Without vae_decode_only:false in the options block, gosd.cpp creates
the sd_ctx with VAE encoder weights freed, so latent encoding of the
init_image is impossible. Adding the option mirrors what we already
do for Wan i2v entries.

Affects all six LTX-2.3 entries (dev/distilled × UD-Q4_K_M, Q4_K_M,
Q8_0). T2V wasn't impacted by the missing option since it has no
init image to encode, which is why the T2V smoke earlier passed.

Assisted-by: Claude:claude-opus-4-7
2026-05-25 21:40:12 +02:00
Ettore Di Giacinto
9ff270eb65 fix(gallery/ltx-2.3): add diffusion_model flag to all variants (#9986)
LTX-2.3 entries (dev / distilled, UD-Q4_K_M / Q4_K_M / Q8_0) were
missing the `diffusion_model` option in their overrides. Without it,
gosd.cpp routes the main GGUF through the regular `model_path` code
path in sd.cpp, which doesn't apply the `model.diffusion_model.` tensor
prefix. sd.cpp's LTX-2.3 architecture detection (`VERSION_LTXAV`) in
get_sd_version checks for prefixed tensor names — without the prefix,
detection fails and load_model returns "could not load model".

This is the same bug we hit for Wan when the option was missing.
Adding `- diffusion_model` to all six LTX-2.3 entries' option blocks
makes load_model take the diffusion_model_path branch so detection
succeeds.

Assisted-by: Claude:claude-opus-4-7
2026-05-25 21:10:40 +02:00
LocalAI [bot]
8d6548c0b9 fix(distributed): sync gallery OpCache + caches across frontend replicas (#9983)
When the LocalAI frontend deployment is scaled past one replica, the UI's
/api/operations poll round-robins between pods. Each pod kept the OpCache
(galleryID->jobID), OpStatus map, and the post-install in-memory caches
(ModelConfigLoader, UpgradeChecker) purely in-process. Reads never
consulted PostgreSQL or NATS even though writes already published to PG.
Symptoms:

- A user installing a model on replica A saw the operation card flicker
  in and out as the load balancer alternated.
- The Models page re-fetched the whole gallery on every flicker because
  useEffect([operations.length]) re-fires when the count changes.
- A chat completion that landed on replica B after the install completed
  on replica A failed to find the new model — B's ModelConfigLoader was
  still the old one because nothing told it to reload from disk.
- The UpgradeChecker 6-hour cache stayed stale on peer replicas after a
  backend upgrade, so /api/backends/upgrades kept surfacing an upgrade
  that had already shipped.

Mirror the jobs Dispatcher pattern for gallery ops:

- OpCache learns SetMessagingClient/SetGalleryStore + a Start(ctx) that
  hydrates from PostgreSQL and subscribes to gallery.opcache.{start,end}.
  Set/SetBackend now upsert cache_key + is_backend_op on the gallery_
  operations row and broadcast OpCacheEvent so peers merge it in. The
  hydrate path uses a new GalleryStore.ListActive() (status in {pending,
  downloading, processing} and updated within 30 min).
- GalleryService.SubscribeBroadcasts wires a SubjectGalleryProgress-
  Wildcard subscriber that calls a new lock-light mergeStatus into the
  local statuses map, plus a SubjectGalleryCancelWildcard subscriber that
  runs the locally-registered cancel func. Hydrate() restores active rows
  from PostgreSQL on startup so a freshly-started replica is not
  observably empty mid-install. CancelOperation tolerates the cancel func
  living on a different replica and publishes anyway.
- modelHandler and backendHandler publish on the new
  SubjectCacheInvalidateModels / SubjectCacheInvalidateBackends after
  a successful install/delete/upgrade. SubscribeBroadcasts wires peers
  to refresh: OnModelsChanged (re-runs LoadModelConfigsFromPath) and
  OnBackendOpCompleted (re-triggers UpgradeChecker). The originating
  replica reloads inline so it never enters the broadcast handler.
- OpStatus.Error (an error interface) flat-marshalled to "{}" over JSON,
  so a failed install replicated to a peer arrived with a nil error and
  the UI's failure banner never appeared. Add MarshalJSON/UnmarshalJSON
  via an opStatusWire shim that round-trips Error as a string.
- UpdateStatus and CancelOperation now drop the mutex before publishing
  to NATS or persisting to PostgreSQL. The wildcard subscriber's
  mergeStatus loops back into the same service on the publishing replica
  and would deadlock otherwise; this also prevents future PG round-trips
  from stalling concurrent readers on every progress tick.

Tests cover the OpStatus error round-trip, OpCache propagation through a
shared in-memory bus, OpCache PostgreSQL hydration (active-only),
GalleryService progress + cancel broadcast, Nodes preservation across a
peer's bare progress tick, GalleryService hydration from PG, and the
two cache-invalidation broadcasts (models + backends). 44 specs total
in galleryop; routes/operations specs and jobs/agents suites still pass.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 17:28:14 +02:00
LocalAI [bot]
b02e3ffe61 feat(stablediffusion-ggml): LTX-2 support + LTX-2.3 GGUF gallery entries (#9980)
stable-diffusion.cpp gained LTX-2 video generation, which requires an
audio VAE and an embeddings_connectors safetensors in addition to the
usual diffusion model, VAE, and LLM text encoder. The pinned commit
exposes audio_vae_path and embeddings_connectors_path on
sd_ctx_params_t; wire both through the option parser so gallery entries
can point at the LTX-specific assets.

Ship six LTX-2.3 GGUF gallery entries (dev + distilled, UD-Q4_K_M /
Q4_K_M / Q8_0 each) backed by a new ltx-ggml.yaml template that
defaults to euler / cfg_scale 6.0 / vae_decode_only:false /
diffusion_flash_attn / offload_params_to_cpu — matching the upstream
LTX-2 CLI recipe. Each entry pulls the model GGUF plus the QAT
gemma-3-12b-it text encoder, video VAE, audio VAE, and embeddings
connectors needed for T2V / I2V / FLF2V.


Assisted-by: Claude:claude-opus-4-7 [Claude-Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 13:00:28 +02:00
LocalAI [bot]
a891eedd08 fix(distributed): persist per-model load info so reconciler survives frontend restart (#9981)
* feat(distributed): add per-model ModelLoadInfo persistence

Adds a dedicated ModelLoadInfo table keyed by model name, decoupled from
the per-replica NodeModel rows. The reconciler can now recover model load
metadata after every NodeModel row has been removed (worker death,
eviction, MarkOffline reaping, frontend restart with stale heartbeats),
which is the read side of Bug-1 from the distributed mode bug hunt.

Registry exposes:
  - UpsertModelLoadInfo: ON CONFLICT (model_name) update; last-write-wins,
    matching the existing per-replica blob semantics under concurrent
    multi-frontend dispatch.
  - GetModelLoadInfo: read from the new table first; fall back to the
    legacy NodeModel-blob scan for rows written before any frontend in
    the cluster ran an UpsertModelLoadInfo (rolling-upgrade transition).

SetNodeModelLoadInfo (per-replica blob) is preserved for backward
compatibility and per-replica diagnostics; the dispatch-path hook in the
next commit calls both.

The new table joins the existing nodes AutoMigrate set under the same
schema-migration advisory lock.

Refs: Bug-1, docs/superpowers/specs/2026-05-24-distributed-mode-bug-hunt-findings.md

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

* fix(distributed): persist per-model load info on dispatch

scheduleAndLoad now writes the (backendType, ModelOptions blob) pair to
the new ModelLoadInfo table in addition to the existing per-replica
NodeModel.model_opts_blob field. The per-replica blob still works for
the hot path; the per-model row outlives every NodeModel row going away,
which is what unblocks the reconciler on the read side.

Both writes are best-effort with warn-level logging on failure: a write
miss here just means the reconciler may need a fresh inference request
to repopulate, which is the pre-fix behavior.

Concurrency: two frontends loading the same model at the same time both
fire UpsertModelLoadInfo; ON CONFLICT (model_name) makes the row
converge to whichever commits last. Matches the existing per-replica
blob semantics.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

* test(distributed): cover load info persistence and Bug-1 recovery

Adds Ginkgo specs that prove the persistence layer behaves correctly and
that the reconciler actually recovers from the frontend-restart scenario
that was failing in production:

registry_test.go:
  - per-model row survives RemoveAllNodeModelReplicas (the bug repro)
  - ON CONFLICT (model_name) updates backend type + blob, last-write-wins
  - legacy NodeModel-blob fallback still works (rolling-upgrade transition)
  - GetModelLoadInfo returns ErrRecordNotFound when both sources are empty
  - UpsertModelLoadInfo rejects empty model names

reconciler_test.go:
  - Bug-1 end-to-end: with min_replicas=2, no NodeModel rows, but a
    ModelLoadInfo row present, one reconcile tick fires two scheduler
    calls. Pre-fix this returned "no load info" and the scheduler never
    got called until a fresh inference request arrived.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

* docs(distributed): note restart-safe reconciler behavior

Adds a bullet to the Replica Reconciler section explaining that per-model
load metadata is persisted across frontend restarts via the new
model_load_infos PostgreSQL table, so a rolling upgrade no longer needs a
fresh inference request per model before the reconciler can replace dead
replicas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 13:00:06 +02:00
LocalAI [bot]
06e777b75e feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper) (#9976)
* feat(distributed): add per-request node ID context holder

Introduce pkg/distributedhdr, a leaf package carrying a per-request
*atomic.Value holder for the picked worker node ID from the
SmartRouter (core/services/nodes) up to the HTTP response writer
wrapper (core/http/middleware). Avoids the import cycle that a shared
key in either consumer would create.

Exposes NewHolder, WithHolder, Holder, Stamp, Load, Inherit. The
holder is atomic.Value so cross-goroutine publish from the router to
the response writer wrapper is race-clean.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): add ExposeNodeHeader middleware + response writer wrapper

New ApplicationConfig.ExposeNodeHeader bool + --expose-node-header CLI
flag / LOCALAI_EXPOSE_NODE_HEADER env var (default off; the node ID
reveals internal topology and is opt-in).

The middleware creates a per-request *atomic.Value holder, attaches it
to c.Request().Context() via distributedhdr.WithHolder, and wraps
c.Response().Writer with a custom http.ResponseWriter that sets the
X-LocalAI-Node header on first Write / WriteHeader / Flush by reading
the holder. Implements http.Flusher, http.Hijacker, Unwrap so it
composes cleanly with Echo and http.NewResponseController.

request.go propagates the holder onto derived contexts via
distributedhdr.Inherit so the holder survives the correlation-ID
context replacement.

Unit + race-clean concurrency + integration specs.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): stamp node ID in router and wire middleware to inference routes

ModelRouterAdapter.Route stamps the picked node ID into the
per-request holder via distributedhdr.Stamp(ctx, result.Node.ID) right
after replica selection.

Wire ExposeNodeHeader middleware to:
- OpenAI chat/completion/embeddings + audio transcriptions/speech + image generations/inpainting
- Anthropic /v1/messages
- Ollama /api/chat, /api/generate, /api/embed, /api/embeddings
- Jina /v1/rerank
- LocalAI /v1/vad

The middleware's wrapper reads the holder on first byte and sets the
X-LocalAI-Node response header before delegating to the underlying
writer. Per-request scope means no race under concurrent multi-replica
routing.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): thread request context through backend Load + cover ctx propagation

Five non-OpenAI backend helpers were silently using app.Context instead
of the request context for the gRPC backend call: transcription, TTS,
image generation, rerank, VAD. Effect: distributedhdr.Stamp in the
router callback was a silent no-op for these paths, AND client
cancellation didn't propagate to in-flight inference.

Thread c.Request().Context() (or the equivalent input.Context after
the request middleware has installed the correlation-ID derived
context) through each helper and into ModelOptions via
model.WithContext(ctx). ImageGeneration's signature gains a leading
ctx parameter; in-tree callers (openai image, openai inpainting,
openai inpainting_test) are updated to match.

ModelEmbedding gains a leading ctx parameter for the same reason; the
openai and ollama embedding handlers pass the request context through.

chat_stream_workers.go defers the initial role=assistant chunk
emission until the first token callback so the wrapper's lazy
X-LocalAI-Node lookup against the loader runs AFTER ml.Load has
stamped the per-modelID node ID; semantically identical for clients
(role still arrives before any text).

Regression test core/backend/ctx_propagation_test.go pins ctx
propagation for all five helpers.

Docs updated to enumerate the full endpoint coverage of the
--expose-node-header flag.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 10:51:48 +02:00
Richard Palethorpe
90ea327178 fix(intel): VRAM detection (#9944)
* fix(gpu-detect): clinfo --json fallback for Intel discrete VRAM

ghw returns 0 VRAM for any i915-driven Intel GPU because the kernel
driver doesn't expose VRAM through the sysfs paths ghw checks (no
mem_info_vram_total — that's an amdgpu interface). xpu-smi, the
canonical Intel tool, isn't in the oneAPI base image (it lives in a
separate xpumanager package). The capability gate added in 19c92c70
("default to CPU if there is less than 4GB of GPU available") then
demotes the host to CPU even on a 16 GB Arc A770.

clinfo ships with the OpenCL ICD loader and is present in the oneAPI
base image, so plug it in as the last-resort Intel VRAM source:

  xpu-smi -> intel_gpu_top -> clinfo --json

The parser drops UMA devices via HOST_UNIFIED_MEMORY=true so an iGPU
sibling can't double-count system RAM, and dedups by PCI BDF when
multiple ICDs enumerate the same physical device (POCL caps reported
GLOBAL_MEM_SIZE at 4 GiB; the largest non-capped value wins).

Subprocess is wrapped in a 2s timeout and memoised with sync.OnceValue
— GPU hardware is static for the process lifetime. The Intel branch
also short-circuits when ghw saw no Intel vendor, so NVIDIA-only hosts
don't pay the spawn cost.

Verified end-to-end on Intel Arc A770: ghw -> 0, clinfo path reports
16,225,243,136 bytes (15.11 GiB), capability gate now passes naturally
without LOCALAI_FORCE_META_BACKEND_CAPABILITY=intel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(gpu-detect): live VRAM usage from DRM fdinfo

The clinfo fallback reports total VRAM correctly but leaves UsedVRAM
at 0 because OpenCL has no portable live-memory property — the UI
ends up showing 0% utilisation even when llama-cpp is actually
holding gigabytes in device memory.

Fill that gap with the standardised Linux DRM fdinfo interface
(Documentation/gpu/drm-usage-stats.rst, kernel ≥5.19). Walking
/proc/<pid>/fdinfo for any fd that points at /dev/dri/render* yields
drm-total-<region> / drm-resident-<region> keys; aggregate per
render-node, resolve the render node to a PCI BDF via
/sys/class/drm/<name>/device, and merge the result into the matching
GPUMemoryInfo by BDF.

Region naming is driver-defined — i915 uses "local0" for device-local
VRAM, amdgpu and xe use "vram0" — so a prefix-match on local/vram
covers all three DRM drivers that LocalAI cares about. system/gtt/
stolen regions are deliberately excluded since they're host RAM
mirrors and would double-count against system RAM.

GPUMemoryInfo gains an optional BDF field (`bdf,omitempty` in JSON)
so future vendor-specific detectors can plug into the same matcher.
Empty BDF skips the merge — non-PCI devices and detection paths that
don't surface PCI location keep their existing behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 09:29:00 +02:00
Richard Palethorpe
6a80e23733 feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802)
Add a routing middleware stack and a cloud-proxy backend.

* cloud-proxy: a Go gRPC backend that forwards OpenAI- and
  Anthropic-shaped chat requests to upstream providers, with an
  optional translate mode (OpenAI request -> Anthropic /v1/messages
  -> OpenAI response) and full tool-calling support.

* routing: admission control, content-aware model routing
  (embedding cache + classifier + rerank + Arch-Router score),
  PII detection/redaction (regex + NER) with streaming filter and
  OpenAI/Anthropic adapters, and a per-user/per-key billing recorder
  backed by GORM or in-memory storage.

* middleware: UsageMiddleware records usage via the billing recorder,
  plus admission, route-model, usage-stamp and trace middlewares.

* observability: BackendTrace ring buffer stores full request bodies
  (capped), MITM proxy emits structured trace events, and router
  classifier decisions surface at /api/router/decide.

* gallery: Arch-Router-1.5B (Q4_K_M and Q8_0).

* UI: cloud-proxy model-editor fields, classifier system-prompt and
  score-normalization config, and a Traces page rendering request
  bodies.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-25 09:28:27 +02:00
LocalAI [bot]
1dcd1ae915 chore: ⬆️ Update ggml-org/llama.cpp to 549b9d84330c327e6791fa812a7d60c0cf63572e (#9974)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 09:22:56 +02:00
LocalAI [bot]
acad78a95a chore: ⬆️ Update ikawrakow/ik_llama.cpp to 9f7ba245ab41e118f03aa8dd5134d18a81159d02 (#9973)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 00:05:29 +02:00
LocalAI [bot]
c94d1e1f5b chore: ⬆️ Update antirez/ds4 to f91c12b50a1448527c435c028bfc70d1b00f6c33 (#9975)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-25 00:05:15 +02:00
Copilot
270c256409 Fix kokoros backend build break from Backend trait drift (#9972)
* Initial plan

* fix(kokoros): implement missing AudioToAudioStream trait stubs

Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/e3c6b042-f055-4df9-a05e-e2d8434ee58b

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-24 22:39:15 +02:00
Ettore Di Giacinto
1a30020a82 ci(backend-signing): set COSIGN_EXPERIMENTAL=1 for oci-1-1 referrers mode
cosign v2.4.1 still gates --registry-referrers-mode=oci-1-1 behind the
experimental flag, so the first signing run after the backend-signing
merge failed with "you must set COSIGN_EXPERIMENTAL=1". Set it at the
job env level so both the quay and dockerhub cosign steps inherit it,
and note the requirement in .agents/backend-signing.md so a future
cosign bump can drop the flag.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-05-24 08:21:05 +00:00
LocalAI [bot]
8bbe89a537 fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)
* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC,
available_vram DESC) out of the SQL ORDER BY into a pure Go function in
the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity
and remains the production path used by SmartRouter; PickBestReplica is
the canonical implementation that the future per-frontend rotating
replica cache (TODO referenced from pkg/model) will call against an
in-memory snapshot without paying a DB round-trip per inference.

A new registry_test mirror spec seeds a multi-tier scenario and asserts
both layers pick the same replica, so any future tweak to either side
fails the test until the other side is updated.

No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(distributed): route per inference request and cache probeHealth

Two related fixes that together restore load balancing across loaded
replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local *Model cache when
   modelRouter is set. The cached *Model wraps an InFlightTrackingClient
   bound to a single (nodeID, replicaIndex) — reusing it pinned every
   subsequent request to whichever node won the very first pick, so
   FindAndLockNodeWithModel's round-robin never got a chance to run
   even after the reconciler scaled the model out to a second node. In
   distributed mode SmartRouter.Route now runs per request, and
   PickBestReplica picks the least-loaded replica each time.

   SmartRouter has its own coalescing (advisory DB lock for first-time
   loads + singleflight on backend.install RPC) so concurrent first
   requests for a not-yet-loaded model still produce a single worker
   side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results
   in a new probeCache (probe_cache.go) with a 30s TTL. With per-request
   routing every inference call hits probeHealth, and llama.cpp-style
   backends serialize HealthCheck behind active Predict — so a burst of
   incoming requests stalled on the probe to a node already mid-stream,
   tripping the 2s timeout and falling through to the install path.
   singleflight collapses N concurrent first-time probes for the same
   (node, addr) into one round-trip, failed probes invalidate the entry
   so the staleness-recovery path still triggers, and the TTL matches
   pkg/model/model.go's healthCheckTTL so the single-process and
   distributed paths share a staleness budget. The background
   HealthMonitor still reaps actually-dead backends within ~45s.

The bypass introduces one short FindAndLockNodeWithModel transaction per
inference. A TODO in pkg/model/loader.go documents the future per modelID
rotating-replica cache that would reuse PickBestReplica against an
in-memory snapshot and skip the DB round-trip for hot paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-24 08:15:27 +00:00
LocalAI [bot]
dcc5599f89 chore: ⬆️ Update leejet/stable-diffusion.cpp to a397e03488cc27e1a42da646b82dfce9f50741c0 (#9965)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-24 08:35:36 +02:00
LocalAI [bot]
a95f4e63e0 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 642c038ccdf3dd08e6d9ac6fdc3b1c311ebd8a02 (#9966)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:52:51 +02:00
LocalAI [bot]
dfd19a3f88 chore: ⬆️ Update ggml-org/llama.cpp to c0c7e147e7efa6c5858754b47259ba4880f8a906 (#9963)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:52:36 +02:00
LocalAI [bot]
d7387c725c feat(swagger): update swagger (#9962)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:52:10 +02:00
LocalAI [bot]
63d84a5705 chore: ⬆️ Update antirez/ds4 to 444afce822057d87f14c4dec307dce24fd49b3ee (#9964)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 23:51:53 +02:00
LocalAI [bot]
1198d10b58 fix(traces): cap backend trace Data to keep admin UI responsive (#9960)
* fix(traces): cap backend trace Data field so the admin UI stays responsive

The previous fix (#9946) capped API trace bodies but missed backend traces,
which carry the same blast radius:

  - LLM backend traces store the full chat messages JSON, full response, and
    full streaming deltas. Every agent-pool reasoning step ships the full
    RAG-augmented history (50-500 KiB per trace, often 100+ traces queued).
  - TTS / audio_transform / transcript traces embed a 30s audio snippet as
    base64, around 1.3 MiB per trace.

Both blow the /api/backend-traces JSON past tens of MiB. The admin Traces
page then keeps re-downloading and re-parsing the buffer faster than the
5s auto-refresh and stays in the loading state forever, the same symptom
the API-side fix addressed.

Apply two complementary caps, both honoring LOCALAI_TRACING_MAX_BODY_BYTES:

Option A (safety net in core/trace): RecordBackendTrace walks the Data map
recursively and replaces any string value larger than the cap with
"<truncated: N bytes>". Catches anything a future producer forgets.

Option B (head-preserving at the producer):
  - core/backend/llm.go: TruncateToBytes on messages, response, and
    chat_deltas content/reasoning_content so the leading content stays
    readable in the UI.
  - core/trace/audio_snippet.go: omit audio_wav_base64 when the encoded
    blob would exceed the cap (truncated base64 is undecodable). The
    quality metrics still ship and the UI's WaveformPlayer simply skips
    when the field is absent.

TruncateToBytes is bounded to <= maxBytes so Option A leaves the producer's
head-preserving output alone instead of replacing it with the bare marker.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* fix(react-ui): expose tracing_max_body_bytes in Settings and Traces panels

The setting was already plumbed through env (LOCALAI_TRACING_MAX_BODY_BYTES),
CLI flag, and the runtime_settings.json GET/PUT schema, but neither the main
Settings page nor the inline Traces panel offered an input for it. Admins
hitting the "Traces UI stuck loading" symptom had to know to set an env var
or PUT raw JSON to /api/settings to dial the cap.

Add a "Max Body Bytes" row next to "Max Items" in both places. Same input
type, same disabled-when-tracing-off semantics, placeholder shows the 65536
default so users see what they're inheriting.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* test(react-ui): disambiguate Max Items locator after adding Max Body Bytes

The Tracing settings panel now has two number inputs. The previous spec
matched 'input[type="number"]' which became ambiguous and triggered a
Playwright strict-mode violation in CI. Switch to getByPlaceholder('100')
for Max Items and add a parallel spec for the new Max Body Bytes field
using getByPlaceholder('65536').

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-23 14:50:40 +02:00
LocalAI [bot]
a0f3e26245 fix(distributed): make admin backend installs resilient and observable (#9958)
* feat(distributed): add configurable NATS backend install/upgrade timeouts

Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig
with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout
pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter
so admin-driven backend installs across the cluster survive long OCI image
pulls that previously timed out at 3m.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(distributed): gofmt alignment after timeout fields

Re-aligns the Validate() negative-duration map and the Default* const
block so the new BackendInstall/UpgradeTimeout entries do not leave
the surrounding columns mis-padded.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT

Parses the two new env vars on the run CLI and threads them through the
existing AppOption builder so DistributedConfig picks them up. Invalid
duration strings now fail loudly at startup rather than silently falling
back to the default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter

Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and
threads in DistributedConfig.BackendInstallTimeoutOrDefault() and
BackendUpgradeTimeoutOrDefault() at construction. Install now defaults
to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew
past the old ceiling. Scripted messaging client captures the timeout
so tests can assert the configured value actually reaches the NATS
request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel

When the NATS request-reply for backend.install (or .upgrade) times out
the worker is almost always still pulling the OCI image. Wrap the timeout
in a typed sentinel so the manager above can distinguish "worker hung"
from "worker still working" and leave the pending_backend_ops row in
place for the reconciler to confirm via backend.list.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): treat NATS install timeout as in-progress, not failure

When a worker times out replying to backend.install but the install is
still running on the worker, enqueueAndDrainBackendOp now reports a
running_on_worker status and pushes NextRetryAt out by the install
timeout so the reconciler does not immediately re-fire another install
while the worker is still pulling the image. The pending_backend_ops
row stays in place for the next reconciler pass to confirm via
backend.list.

InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling
so callers can branch (galleryop renders yellow in-progress instead of
red error). UpgradeBackend uses the same wrap.

Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push
NextRetryAt by the configured timeout without reaching into a private
field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft
cousin of RecordPendingBackendOpFailure.

Also includes incidental gofmt-driven struct-field alignment in
registry.go on lines unrelated to the change (touched files are
re-formatted to canonical form per project policy).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): don't increment Attempts on in-flight install timeout

An in-flight timeout (worker still pulling the OCI image) is not a
failed attempt, it's a delayed one. Incrementing Attempts let
genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi)
trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter
the queue row while the worker was still legitimately working.

RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt.
Also documents "running_on_worker" in the NodeOpStatus.Status enum
comment so Task 6 implementers see the full surface.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus

When the distributed backend manager returns an error that wraps
ErrWorkerStillInstalling, backendHandler now completes the op with a
"still installing in background" message rather than marking it as a
red failure. Admin UI sees a yellow in-progress state; reconciler
confirms completion on its next pass.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): end-to-end install-timeout-then-reconcile

Wires Task 1-6 end-to-end so any seam mismatch surfaces in CI rather
than during a real cluster install. NATS times out, the queue row
stays alive with running_on_worker status, the worker eventually
reports the backend installed via backend.list, the manager surfaces
it via ListBackends.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT

Add the two new operator-tunable env vars to the Frontend Configuration
table in the distributed-mode docs. Explains the 15m default, when to
raise it (slow links pulling multi-GB OCI images), and the new
"still installing in background" admin-UI state when the round-trip
times out but the worker is still working.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): clear pending install rows when backend.list confirms

DistributedBackendManager.ListBackends now proactively clears
pending_backend_ops install rows whose (nodeID, backend) is reported
installed by backend.list. Operator UI updates immediately instead of
waiting up to installTimeout (default 15m) for the next reconciler
tick after NextRetryAt.

Only install rows are cleared; upgrade and delete intents are not
satisfied by presence in backend.list and continue to drain through
their normal reconciler paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(messaging): add BackendInstallProgressEvent wire type and subject

New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the
worker publish transient progress events (file, current/total bytes,
percentage, phase) while a long-running install pulls its OCI image.
BackendInstallRequest gains an optional OpID field so the worker knows
which subject to publish on.

Transient pub/sub (not JetStream): the install reply remains ground
truth for success/failure; dropped progress events are tolerable.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(messaging): drop em-dash from BackendInstallProgress test comment

Per project convention (no em-dashes anywhere). Comment substance is
unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): worker publishes debounced install progress over NATS

When BackendInstallRequest.OpID is set, the worker's backend.install
handler wires a debounced publisher (250ms window) into the gallery
download callback. Each tick becomes a BackendInstallProgressEvent on
nodes.<nodeID>.backend.install.<opID>.progress; the publisher always
emits a final event on Flush so the UI sees the terminal percentage.

Old masters that do not set OpID continue to run silent installs: no
behavior change for them. Lock ordering: the publisher releases its
mutex before calling messaging.Publish so a slow network never stalls
the install loop.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): RemoteUnloaderAdapter subscribes to install progress

InstallBackend gains opID + onProgress parameters. When both are set,
the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress
BEFORE publishing the install request, decodes each message into the
caller's onProgress callback in a goroutine (so a slow callback never
stalls the NATS reader thread), and unsubscribes after RequestJSON
returns.

When onProgress is nil OR opID is empty (the reconciler retry path),
subscription is skipped entirely - silent installs cost nothing extra.

Subscribe failure is logged at Warn and the install proceeds without
progress streaming; the NATS round-trip still owns terminal status.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): forward backend install progress into galleryop OpStatus

DistributedBackendManager.InstallBackend now passes the gallery op ID
and a progress bridge into the adapter call. Each
BackendInstallProgressEvent from the worker becomes a
galleryop.ProgressCallback tick - which the existing backendHandler
already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling
sees per-byte progress for distributed installs without any UI-side
change.

UpgradeBackend is intentionally left silent for now: its wire request
(BackendUpgradeRequest) does not carry OpID, and rolling-update
fallback is the rarer path. Will be picked up in a follow-up if the
worker upgrade path also gets a progress channel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers

A worker on pre-Phase-2 code never publishes progress events. The new
master subscribes optimistically; this spec pins that a silent worker
still produces a green install with no progressCb ticks. The install
reply is the source of truth for terminal state; the progress stream
is a best-effort UX enrichment.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document install progress streaming

Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and
the silent-worker compatibility behavior so operators know to expect
real-time progress and what happens on a mixed-version cluster.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): note progress-event ordering trade-off in InstallBackend

Document near the goroutine dispatch why ordering at the consumer is
best-effort, why it rarely matters in practice (worker debounce >>
goroutine jitter), and what a future hardening pass would look like
(Seq field + stale-by-seq drop). Stops the next reader from accidentally
"fixing" the goroutine pool away.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown

Adds the data model the UI needs to render an expandable per-node
breakdown of a fanned-out backend install. NodeProgress carries node
identity (ID + name), per-node status (queued / running_on_worker /
success / error / downloading), the current file + bytes + percentage
from the Phase 2 progress stream, and any per-node error.

OpStatus.Nodes is the slice the /api/operations handler will surface
in a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID

GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress
into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the
latest tick into the aggregate Progress / FileName /
DownloadedFileSize / TotalFileSize fields so the legacy single-bar
OperationsBar view keeps working unchanged alongside the new per-node
breakdown.

Concurrent-safe via the existing g.Mutex.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): write per-node OpStatus entries during install fan-out

DistributedBackendManager now accepts a nodeProgressSink and feeds it
two streams:

1. enqueueAndDrainBackendOp emits a per-node terminal entry on each
   status it appends to BackendOpResult (queued, success, error,
   running_on_worker). The opID is threaded through the function so
   the sink gets the right gallery op identity.

2. The install apply closure fans each BackendInstallProgressEvent
   into the sink as a downloading entry, alongside the legacy
   progressCb path so the aggregate single-bar view stays correct.

Production wiring passes the GalleryService (which implements
UpdateNodeProgress via Task 2) as the sink. Single-node tests pass
nil. DeleteBackend and UpgradeBackend pass an empty opID so the
sink path no-ops for ops that aren't gallery-tracked the same way
as Install.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(operations): expose per-node breakdown on /api/operations

When an operation's OpStatus has Nodes entries (populated by the
Phase 4 progress sink wiring), surface them as a "nodes" array on the
/api/operations response, sorted by node_name for stable rendering.

Backward compatible: legacy clients ignore the field; ops without any
node entries (single-node mode, model installs) omit the array entirely
thanks to the empty-slice guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): per-node breakdown in OperationsBar

When an install op fans out to more than one worker, the operations
bar now shows a "N nodes" chevron that expands into a per-node list.
Each row carries the node's status (color-coded pill), the current
file being downloaded, byte counts, percentage, and a thin per-node
progress bar. Yellow "Worker busy" pill marks running_on_worker
status with a tooltip explaining the NATS round-trip timed out but
the worker is still installing in the background.

Backward compatible: ops without a nodes field (legacy or single-node
mode) render as before. State for expand/collapse is local to the
component, keyed by jobID/id - reload starts collapsed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document per-node breakdown in the operations bar

Adds a short subsection covering the expandable "N nodes" chevron in
the OperationsBar admin UI, the meaning of each status pill, and
how it relates to the /api/operations nodes array.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(galleryop): UpdateStatus preserves Nodes when caller sends none

Real-world bug surfaced by the Phase 4 multi-worker smoke test: the
nodes[] array in /api/operations flickered between a single node at a
time on a 2-worker install. Root cause: the Phase 2 progress bridge
also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on
every tick. UpdateStatus then overwrote the entire status pointer,
wiping the Nodes slice that UpdateNodeProgress had just merged in.

Fix: in UpdateStatus, if the incoming op has an empty Nodes slice,
carry forward the previous status's Nodes before storing. Callers
that explicitly populate Nodes still win (their slice replaces the
prior one, no merge across the two code paths).

Two regression specs added pinning both directions of the contract.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): strip implementation details from user-facing docs

Trim the new install/upgrade timeout rows and the install-progress
sections to focus on what the operator sees and tunes. Drops:

- the NATS subject names and pub/sub mechanics
- "round-trip" / reconciler / backend.list jargon
- /api/operations polling cadence
- "pre-2026-05-22" version references

Reframes the breakdown text around the admin UI (Operations Bar,
chevron, status pills, "Worker busy" tooltip). Implementation context
lives in the agent notes and code comments.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): move DistributedConfig.Validate flag names to constants

The negative-duration check map was a wall of literal kebab-case
strings that had to stay in sync with the kong-derived CLI flag names
manually. Move them to a Flag* const block alongside the existing
Default* block so a rename of either the Go field or the CLI naming
convention forces a compile error rather than silent drift.

Sole consumer today is Validate; the constants are exported so future
operator-facing surfaces (e.g. error messages on other validation
paths) can reference them by name instead of repeating the literals.

Tests pin both the literal values (so a future "let's just rename
this" doesn't accidentally regress the CLI flag) and the negative-
duration error message for the new BackendInstall / BackendUpgrade
fields.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed): extract NodeStatus and Phase enums to constants

Sweep for the same literal-string-as-identifier pattern called out on
the Validate flag names: the per-node install status enum
("queued" | "downloading" | "running_on_worker" | "success" | "error")
appeared as raw literals across managers_distributed.go (10+ sites,
including 3 separate `n.Status == "running_on_worker"` checks),
operation.go, and the test suite. Same shape for the Phase enum
("resolving" | "downloading" | "extracting" | "starting") in the
worker-side progress publisher.

Promote both to exported const blocks:

- galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error}
  shared between galleryop.NodeProgress.Status (the wire field) and
  nodes.NodeOpStatus.Status (the in-process per-node summary)
- messaging.Phase{Resolving,Downloading,Extracting,Starting}
  shared between the worker publisher and any future consumer that
  needs to switch on phase

Tests pin both the literal values (so a future "let's just rename" doesn't
silently change the JSON wire) and use the constants in setup (so the
producer side stays drift-protected). Wire-format assertions on the
/api/operations JSON output keep their literals deliberately, so the
constant value can never silently diverge from what the UI receives.

Out of scope for this PR (separate cleanup): the finetune and
quantization job-status enums have the same anti-pattern with 14+
literal sites each, but predate this PR's work.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-23 12:35:44 +02:00
LocalAI [bot]
e4cc1f11f3 chore: ⬆️ Update ggml-org/llama.cpp to 1acee6bf8939948f9bcbf4b14034e4b475f06069 (#9952)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 08:38:29 +02:00
LocalAI [bot]
6ed269d0b9 chore: ⬆️ Update ggml-org/whisper.cpp to 0ccd896f5b882628e1c077f9769735ef4ce52860 (#9954)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 08:37:26 +02:00
LocalAI [bot]
5756fb046d chore: ⬆️ Update leejet/stable-diffusion.cpp to 0baf721215f45335a5df8caf0ecb34e870c956e7 (#9955)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 08:37:10 +02:00
Copilot
7980629bc5 Fix backend manifest merge signing on current cosign releases (#9957)
* Initial plan

* fix: remove deprecated cosign bundle flag from backend merge workflow

Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/4207dabc-14ec-4655-9594-487338977fcf

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-23 00:20:28 +02:00
LocalAI [bot]
d0a59be9de chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3d39cff8bffbd67296d6badd4076a1486a0715c (#9953)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 23:58:48 +02:00
LocalAI [bot]
5cda4f1ccf fix(L4T13 backends): switch vllm/sglang/vllm-omni to PyPI aarch64+cu130 wheels (#9950)
* fix(vllm): switch L4T13 backend to PyPI aarch64+cu130 wheels

The L4T13 vllm backend pulled torch / torchvision / torchaudio / vllm from
pypi.jetson-ai-lab.io's sbsa/cu130 mirror via [tool.uv.sources] with no
version pins. That mirror started shipping torch 2.11.0 next to a
vllm-0.20.0+cu130 wheel that was still compiled against torch 2.10's c10
ABI, so uv landed on the mismatched pair and vllm crashed at import:

  ImportError: vllm/_C.abi3.so: undefined symbol:
  _ZN3c1013MessageLoggerC1EPKciib

(c10::MessageLogger's constructor signature changed between torch 2.10 and
2.11; the vllm wheel referenced the 2.10 form, the installed libc10.so
exported only the 2.11 form.)

Since torch 2.11 (April 2026) PyPI publishes its own aarch64 + cu130
manylinux wheels, and vllm 0.20.0 ships an aarch64 wheel whose Requires-
Dist locks torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0. That
makes uv's resolver produce an ABI-consistent set automatically, so the
mirror and the [tool.uv.sources] pinning are no longer needed.

flash-attn is dropped from the dep list: PyPI has no aarch64 wheel, but
vLLM 0.20+ already bundles its own vllm_flash_attn (fa2 + fa3) inside the
main wheel, so the Dao-AILab package isn't required at runtime.

Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(vllm): retire l4t13 pyproject.toml in favor of requirements-*.txt

pyproject.toml only existed because uv pip install -r requirements.txt
doesn't honor [tool.uv.sources]. The previous commit dropped [tool.uv.
sources] (PyPI now serves the aarch64 + cu130 wheels directly), so the
file no longer carries any logic the requirements-*.txt path can't.

Replace with the same two-file pattern every other build profile uses:

  - requirements-l4t13.txt       (accelerate / torch / transformers /
                                  bitsandbytes - matches cublas13's split)
  - requirements-l4t13-after.txt (vllm; runs after the base resolve so
                                  the cu130 torch wheel lands first)

install.sh's whole l4t13 elif branch goes away; libbackend.sh's
installRequirements already handles the requirements-install.txt build-
deps pass, the C_INCLUDE_PATH export for PORTABLE_PYTHON, and the
runProtogen call, so falling through to the standard else: branch
produces identical install behavior with less surface area.

No functional change at install time - same wheels, same order.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(sglang,vllm-omni): switch L4T13 backends to PyPI aarch64+cu130 wheels

Same root cause and same fix as the vllm backend in the previous commits:
the L4T13 sglang and vllm-omni backends both pulled their accelerator
stack from pypi.jetson-ai-lab.io's sbsa/cu130 mirror with no version
pins, so they would silently land on the same torch 2.11 vs cu130-built
wheel ABI mismatch the moment the mirror published an out-of-sync pair.

sglang
------

- Drop pyproject.toml + [tool.uv.sources]. The historical comment said
  the [all] extra was unsafe on aarch64 because of decord, but sglang
  0.5.x now uses `decord2` on aarch64/arm/armv7l (which ships cp312
  aarch64 wheels), so we can match cublas13's sglang[all]>=0.5.11 pin
  and stop being capped at the 0.5.1.post2 the L4T mirror shipped.
  That unblocks Gemma 4 / MTP recipes on Jetson Thor.
- New requirements-l4t13.txt mirrors the cublas13 split (accelerate /
  torch / torchvision / torchaudio / transformers), requirements-l4t13-
  after.txt carries sglang[all]>=0.5.11.
- install.sh's l4t13 elif branch goes away; falls through to the
  standard installRequirements path.

vllm-omni
---------

- requirements-l4t13.txt drops --extra-index-url to jetson-ai-lab and
  drops flash-attn (PyPI has no aarch64 wheel, vLLM 0.20+ bundles its
  own vllm_flash_attn fa2 + fa3 internally).
- install.sh's l4t13 vllm-install branch collapses into the cublas13
  branch since both now just run `pip install vllm --torch-backend=auto`
  against PyPI.
- --index-strategy=unsafe-best-match is dropped from the top-level
  l4t13 guard; without the L4T mirror in the picture it had no purpose.

The from-source vllm-omni install on top still keeps its existing
`sed -i '/^fa3-fwd[[:space:]]*==/d' requirements/cuda.txt` workaround -
fa3-fwd has no aarch64 wheel and no sdist, unrelated to flash-attn.

Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(sglang): drop [all] extra on l4t13 - xatlas has no aarch64 wheel

CI revealed that sglang[all]==0.5.12 transitively pulls xatlas via the
[diffusion] sub-extra, and xatlas ships no aarch64 wheel. Its sdist
depends on scikit_build_core without declaring it in build-system.
requires, so under --no-build-isolation uv can't build it from source:

    × Failed to build `xatlas==0.0.11`
    ├─▶ The build backend returned an error
    ╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1)
        ModuleNotFoundError: No module named 'scikit_build_core'
    help: `xatlas` (v0.0.11) was included because `sglang[all]` (v0.5.12)
          depends on `xatlas`

Upstream sglang explicitly gates st_attn and vsa on
`platform_machine != aarch64` inside the same [diffusion] extra but
forgot xatlas - same class of bug that bit the old decord pin.

Use plain `sglang>=0.5.11` on l4t13. backend.py imports only base
sglang.srt symbols (Engine, ServerArgs, FunctionCallParser,
ReasoningParser); the [all] extras are optional accelerators not
required at import time. cublas13 (x86_64) keeps [all] because xatlas
has x86_64 wheels there.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 23:01:22 +02:00
LocalAI [bot]
c500461c69 feat(config): default prompt_cache_all to true (#9951)
Upstream llama.cpp defaults `cache_prompt = true` (common/common.h),
but `parse_options` in the grpc-server backend unconditionally forwards
the proto `PromptCacheAll` field, so any model that didn't set
`prompt_cache_all: true` in its YAML was getting `cache_prompt=false` —
silently overriding llama.cpp's own default. With `kv_unified` and
`cache_idle_slots` already on by default, this was the last piece
preventing the per-request prompt cache from being usable out of the
box.

Make `PromptCacheAll` tristate (`*bool`), default it to `true` in
`SetDefaults`, and dereference at the proto boundary. Users can still
opt out with an explicit `prompt_cache_all: false`. Same pattern as
`MMap`, `MMlock`, `Reranking`, etc.

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 22:06:22 +02:00
LocalAI [bot]
834ecc36bf fix(react-ui): unify backend-logs entry point for distributed mode (#9949)
In distributed mode the local /api/backend-logs WebSocket has nothing
behind it (inference runs on workers), so the "View backend logs" link
in Traces (and the action in Manage when previously not hidden) dead-
ended on /app/backend-logs/<modelId>. Manage worked around it by
hiding the action; Traces still rendered the link.

Make /app/backend-logs/:modelId the single, mode-aware entry point.
A new BackendLogsRouter probes useDistributedMode and forks:

  - standalone: existing local WebSocket view (BackendLogsDetail).
  - distributed: DistributedBackendLogsResolver fans out to each node
    via nodesApi.getModels, filters by model_name, and routes:
      * 0 hits   -> empty state with a link to the Nodes page.
      * 1 hit    -> <Navigate replace> to
                    /app/node-backend-logs/<nodeId>/<modelId>,
                    preserving the ?from= deep-link timestamp.
      * N hits   -> picker listing each hosting worker (node id,
                    replica index, load state) so the operator can
                    choose which worker's logs to view.

Bare modelId in the redirect target intentionally aggregates that
node's replicas via the worker's BackendLogStore, matching the
existing per-node link pattern in Nodes.jsx.

Revert the per-caller distributed checks now that routing is
centralised: drop the hidden:distributedMode guard on Manage's
Backend logs action, and remove the prop threading in Traces so the
link is unconditional. Any future view that wants to link to backend
logs uses the same URL and gets correct behaviour in both modes.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 22:00:08 +02:00
LocalAI [bot]
61bf34ea2f fix(traces): cap captured body size to keep admin Traces UI responsive (#9946)
The trace middleware buffered the full request and response bodies for every
JSON exchange. With a chatty agent-pool RAG workload, /embeddings responses
(large vector arrays) accumulated to tens of MB in the in-memory buffer; the
admin Traces page would then download and parse 40+ MB on every load and on
every 5s auto-refresh, locking the UI in a loading state.

Add LOCALAI_TRACING_MAX_BODY_BYTES (default 64 KiB) that caps each captured
body. The full payload still flows through to the real client; only the
trace copy is bounded. Exchanges record body_truncated and original
body_bytes so the dashboard can show that truncation happened. The cap is
configurable via env, CLI, and runtime_settings.json.

Also unblock recovery: the Traces page now keeps the Clear button enabled
while loading, since "buffer too large to render" is exactly when the user
needs to clear it.


Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 15:29:24 +02:00
LocalAI [bot]
0b2ae3c6ca fix(openai): stream usage non-zero when tools are enabled (#9941)
* chore: ignore local .worktrees directory

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(openai): stream usage non-zero when tools are enabled

The streaming chat-completions worker for tool-bearing requests
(processTools in core/http/endpoints/openai/chat.go) never forwarded the
cumulative TokenUsage from ComputeChoices to the chunks it placed on the
responses channel. The outer streaming loop's running usage tracker
therefore stayed at the zero value, and the include_usage trailer
reported {prompt_tokens:0, completion_tokens:0, total_tokens:0} whenever
the request carried a `tools` array. Without tools, the alternative
`process` path stamps Usage on every chunk, so that path was unaffected.

Forward the final TokenUsage via a usage-only sentinel chunk (empty
Choices, populated Usage) emitted right before close(responses). The
outer loop's per-chunk Usage capture moves above the empty-Choices skip
so the sentinel updates the tracker without ever reaching the wire,
keeping the existing OpenAI spec contract (intermediate chunks carry no
`usage` field, and the deferred-final-chunk helpers remain Usage-free
per the regression test for issue #8546).

Adds streamUsageFromTokenUsage, usageSentinelChunk, and
applyChunkToUsage helpers with focused Ginkgo coverage plus a flow-level
test that mirrors the outer-loop sequence.

Fixes #9927

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

* refactor(openai): return final TokenUsage from stream workers

Replace the usage-only sentinel SSE chunk introduced in the previous
commit with a plain return value. The streaming workers process and
processTools (now extracted as package-level processStream and
processStreamWithTools) return (backend.TokenUsage, error); the outer
ChatEndpoint loop reads the cumulative counts off the existing `ended`
channel (now carrying streamWorkerResult{usage, err}) and builds the
include_usage trailer from a normal Go value after the LOOP exits.

This drops the empty-Choices "skip but capture Usage" rule from the
outer loop and removes the usageSentinelChunk / applyChunkToUsage
helpers entirely. The SSE responses channel is back to a single
purpose: wire chunks only.

processStream and processStreamWithTools move into chat_stream_workers.go
so they can be exercised directly from tests. The chat_stream_usage_test.go
suite now drives the workers with a mocked backend.ModelInferenceFunc
and asserts on the returned TokenUsage. The regression coverage for
issue #9927 is therefore behavioral: reverting the fix (discarding
ComputeChoices' usage return) makes the assertions fail with concrete
count mismatches.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 10:13:41 +02:00
LocalAI [bot]
4735345105 chore: ⬆️ Update ggml-org/llama.cpp to bb28c1fe246b72276ee1d00ce89306be7b865766 (#9934)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 09:49:33 +02:00
LocalAI [bot]
7384fd800b chore: ⬆️ Update antirez/ds4 to 8d576642c39b9a2d782a80159ba84ef5a81c0b81 (#9932)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 08:31:49 +02:00
LocalAI [bot]
6942713d85 chore: ⬆️ Update leejet/stable-diffusion.cpp to 3a8788cb7d74f185d6b18688e9563015524ecaf5 (#9933)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-22 00:31:19 +02:00
LocalAI [bot]
0cf52c44d4 chore: ⬆️ Update ggml-org/whisper.cpp to 8443cf05e3fa8ce1b32348e1bcbcf8fc31f7f3ae (#9929)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 23:24:01 +02:00
LocalAI [bot]
0d34cf7cbd chore: ⬆️ Update ikawrakow/ik_llama.cpp to 48a55f74e4c6e2aeda363dd386c1ac9170a0af71 (#9930)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 23:23:37 +02:00
LocalAI [bot]
f0cb02afb8 feat(usage): attribute Sources rows to user accounts in admin view (#9935)
The merged feature (#9920) let admins see per-API-key and per-source
totals but did not surface which user owned each key, and lumped
every user's Web UI traffic into a single global Web UI row. This
makes the admin Sources tab properly per-user attributable:

- KeyTotal gains UserID + UserName, populated from the snapshot the
  usage middleware already records. The by_key roll-up now groups by
  (api_key_id, api_key_name, user_id, user_name).
- New SourceTotals.ByUserSource roll-up groups (source, user_id,
  user_name) for sources without a key identity (web, legacy). Only
  populated on the admin path (includeLegacy=true); the non-admin
  endpoint stays unchanged for backwards compatibility.
- SourcesTable accepts showUserColumn={isAdmin}; admin view renders
  a User column, makes the search match user name/id, and expands
  Web UI / legacy pseudo-rows from the global aggregate to one row
  per user using by_user_source.

Refs: #9862

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 23:23:06 +02:00
LocalAI [bot]
a39e025d64 fix(nodes): make per-node backend install async via gallery job queue (#9928)
* feat(galleryop): add TargetNodeID to ManagementOp for single-node installs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): add NodeScopedKey helpers for per-node opcache rows

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(galleryop): use strings.Cut for NodeScopedKey parsing, reject empty nodeID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): scope DistributedBackendManager.InstallBackend to single node via TargetNodeID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(http): make /api/nodes/:id/backends/install async via gallery service job queue

The handler previously called unloader.InstallBackend synchronously and
blocked the browser for up to 3 minutes waiting on the NATS reply. It now
enqueues a TargetNodeID-scoped ManagementOp on BackendGalleryChannel and
returns HTTP 202 + jobID immediately, matching /api/backends/install/:id.

The opcache key is built via NodeScopedKey(nodeID, backend) so concurrent
installs of the same backend across different nodes do not stomp each
other. galleryService/opcache/appConfig are threaded through
RegisterNodeAdminRoutes for this.

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(http): log malformed backend_galleries override and stop test drain goroutine

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(api): expose nodeID for node-scoped backend ops in /api/operations

Node-scoped backend installs land in opcache under "node:<nodeID>:<backend>"
keys. Without splitting that prefix back out, the operations panel renders
the full key as the display name and has no structured way to label which
worker an install is targeting. Detect the prefix, surface nodeID as its own
response field, and reduce the display name back to the bare backend slug.
Bare (non-scoped) ops are left untouched so legacy installs do not gain a
misleading empty nodeID.

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(react-ui): poll job status for node-targeted backend installs

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(react-ui): make NodeInstallPicker state updates pure and surface cancellations as errors

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(react-ui): clarify async semantics in handleInstallOnTarget

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(http): use statusUrl casing for node install response to match codebase precedent

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 22:25:53 +02:00
Ettore Di Giacinto
05e8e1e9f4 ci(images): publish chronologically-orderable master-<epoch>-<sha> tags
The existing master push pipeline produces `master` (rolling) and
`sha-<short>` tags. Neither is orderable by build time, so downstream
GitOps that want to auto-bump to the newest master build (e.g. Flux
ImagePolicy) can't pick the latest from the tag list — alphabetical
sort over hex shas is effectively random, and the rolling `master`
tag can't be referenced as an immutable bump target.

Add a third tag of the form `master-<epoch>-<sha>` (Unix epoch in
seconds + short sha), gated on default-branch pushes via metadata-
action's `is_default_branch` predicate. The sha is retained for
traceability; the epoch makes the tags numerically orderable, so a
Flux ImagePolicy like

  filterTags:
    pattern: '^master-(?P<ts>[0-9]+)-[a-f0-9]+$'
    extract: '$ts'
  policy:
    numerical:
      order: asc

will reliably bump to the newest master build.

Applied to both image_build.yml (OCI labels stay consistent) and
image_merge.yml (the actual tag publisher via buildx imagetools).
2026-05-21 17:18:30 +00:00
Rin
a7f6cc8956 [utils] Fail immediately on extraction errors (#9926)
utils: fail immediately on extraction errors

Setting ContinueOnError to false ensures that ExtractArchive does not
leave the model or backend directory in an inconsistent state if a
partial failure occurs. This improves robustness against malformed
archives or unexpected I/O issues during installation.

Signed-off-by: RinZ27 <222222878+RinZ27@users.noreply.github.com>
2026-05-21 19:00:33 +02:00
LocalAI [bot]
f15b9178ec feat(usage): track and visualise usage per API key (#9920)
* feat(usage): add Source, APIKeyID, APIKeyName columns to UsageRecord

Adds three additive columns plus UsageSource* constants. The columns
are auto-migrated by InitDB. APIKeyID is a nullable foreign reference
to UserAPIKey.ID; APIKeyName is snapshotted on each row so revoked
keys keep showing their name in history.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): backfill Source on pre-feature usage rows

InitDB now classifies any pre-existing usage_record with an empty
source: 'legacy-api-key' user -> legacy, everything else -> web.
The backfill is idempotent (only touches NULL/empty rows).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add GetUserUsageBySource aggregator

Groups by (bucket, source, api_key_id, api_key_name). Filters out
legacy by default. Returns both per-bucket detail and roll-ups
(by_source, by_key sorted desc and capped at 200, grand_total).

The MAX(created_at) projection is iterated via Rows().Scan into a
string column and parsed manually because the SQLite driver surfaces
the aggregated timestamp as a string, which database/sql refuses to
scan directly into time.Time. Postgres returns a real timestamp; the
same string path handles its RFC3339 form too.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(usage): log Rows() errors and assert LastUsed in tests

Adds rows.Err() and Rows() open-failure logging in
computeSourceTotals so silent data drops surface in logs. Logs on
parseLastUsedString format misses for the same reason. Strengthens
the snapshot-survival test to assert LastUsed is a recent timestamp,
locking the SQLite time-string parser behaviour.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add admin GetAllUsageBySource with filters and truncation

Optional user_id and api_key_id filters (composed with AND). Legacy
bucket is included for admin callers. truncated=true when more than
200 distinct keys would be in the by_key roll-up.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(auth): plumb auth_source and auth_apikey through Echo context

tryAuthenticate now sets auth_source on every successful branch
(web for session/Bearer-session, apikey for Bearer-key/x-api-key/
token-cookie, legacy for legacy env key match). For named-key
branches it also stores the resolved *UserAPIKey under auth_apikey
so downstream middlewares can snapshot id+name without re-validating.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(auth): expand tryAuthenticate godoc and cover Bearer-session branch

Documents all three context-keys side effects (auth_source,
auth_apikey, _auth_session) plus the split of responsibilities with
the parent Middleware. Adds a test for the Bearer-as-session-token
classification so future regressions there fail loudly.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): UsageMiddleware records source + snapshots key name

Reads auth_source and auth_apikey from the Echo context (set by
auth.Middleware in the previous task). Snapshots UserAPIKey.ID and
Name onto each row so revoked keys remain readable in history.
Falls back to source=web when no auth_source is set (auth disabled
or unrecognised path).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add /api/auth/usage/sources and admin variant

Self endpoint filters legacy server-side; admin endpoint includes
legacy and accepts user_id + api_key_id filters. Response includes
buckets, totals.{by_source, by_key, grand_total}, and a truncated
flag set when the per-key roll-up was capped at 200.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(routes): mark test mirror handlers as keep-in-sync with production

The newTestAuthApp helper duplicates production route handlers
inline because it cannot use RegisterAuthRoutes (which requires a
*application.Application). Naming the source path on each mirror
makes the drift contract explicit for future maintainers.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add usageApi.getMySources/getAdminSources + i18n strings

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add Sources tab skeleton with data fetch

Adds Usage page tab that fetches /api/auth/usage/sources (or the
admin variant). Renders raw totals plus a placeholder key list;
real visualisations land in subsequent commits. Restructures the
existing tab button block so Models and Sources are visible to
non-admins (Users remains admin-only).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): source mix ribbon + searchable/sortable sources table

Replaces the SourcesTab placeholder rendering with two reusable
components: SourceMixRibbon (one segmented bar per source class)
and SourcesTable (search + sort + revoked-key dim). Pulls the
current API key list to detect revoked keys.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): skip revoked-key detection until the key list is known

existingKeyIds defaulted to an empty Set, which made every live
api_key row render as (revoked) during the brief window before
apiKeysApi.list() resolved, and permanently after a fetch failure.
Use null as the unknown state and suppress the revoked badge until
the parent provides a real Set.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): top-N stacked time chart and drill-in chip for Sources tab

Top 7 sources by total tokens get distinct colours; the rest roll up
into 'Other'. Clicking a row in the SourcesTable dims everything
except that series in the chart; the chip is the canonical clear.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(usage): document per-API-key Sources tab and endpoints

Extends features/authentication.md Usage Tracking section with:
- A 'Sources' tab description and source-class taxonomy
- Endpoint documentation for /api/auth/usage/sources and the
  admin variant
- Response shape example with by_source / by_key / grand_total
- Migration note about pre-feature row backfill

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(usage): silence errcheck on deferred rows.Close

CI errcheck flagged the bare 'defer rows.Close()' in
computeSourceTotals. Wrap in a closure that discards the close
error explicitly; an error here is non-actionable since we have
already drained the rows and logged any iteration failure.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(usage): bound batcher intake and add Shutdown/FlushNow hooks

The pre-existing usage batcher had no cap on its add() path; the
usageMaxPending=5000 constant only guarded the re-queue path after
a failed write, leaving memory growth unbounded if the DB fell
behind. This commit:

- Adds the cap to add() so saturation drops new records (rate-limited
  warn at 1/1024) instead of growing unbounded.
- Raises usageMaxPending to 50000 to absorb realistic inference bursts.
- Replaces the package-level batcher global with a mutex-guarded pair
  plus a currentBatcher() accessor so Init / Shutdown cycles are
  race-free.
- Adds ShutdownUsageRecorder() for graceful drain on process exit
  (not yet wired into app shutdown, just published).
- Adds FlushNow() for deterministic tests; the middleware suite no
  longer needs 6s sleeps per spec and now runs in ~50ms instead of 18s.
- Re-queue on failed flush is now cap-aware: prepends as much of the
  failed batch as fits alongside concurrent arrivals, instead of
  dropping the whole batch when full.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): drain usage batcher on graceful shutdown

Registers ShutdownUsageRecorder with the existing
signals.RegisterGracefulTerminationHandler so SIGINT/SIGTERM
synchronously flushes any in-memory usage records before the
process exits. Without this, up to one flush interval (5s) of
recorded usage was lost when LocalAI restarted.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 16:34:02 +02:00
LocalAI [bot]
959de86761 feat(llama-cpp): make server-side prompt cache work by default (#9925)
Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt
cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible
CLIs, coding assistants) skip prefill on subsequent calls without any
YAML changes. Reported in #9921.

Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4)
when slot count is auto, which unlocks `cache_idle_slots`. LocalAI
hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`,
which silently force-disables idle-slot saving at server init. The host
prompt cache was allocated but never written across requests.

Changes in backend/cpp/llama-cpp/grpc-server.cpp:
- params.kv_unified: false -> true (single-slot path now benefits from
  the prompt cache; users can opt out with `kv_unified:false`)
- params.n_ctx_checkpoints: 8 -> 32 (match upstream default)
- params.cache_idle_slots = true initialized explicitly (upstream default)
- params.checkpoint_every_nt = 8192 initialized explicitly (upstream default)
- New option parsers: cache_idle_slots / idle_slots_cache,
  checkpoint_every_nt / checkpoint_every_n_tokens

Docs:
- features/text-generation.md: fix misleading `cache_ram` description
  (it's the host-side prompt cache, not the KV cache), document the
  kv_unified + cache_ram + cache_idle_slots interaction, add rows for
  the two newly-exposed options, and add a worked example for the
  agent/CLI workload from the issue.
- advanced/model-configuration.md: mark the legacy `prompt_cache_path`
  / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the
  llama-cpp gRPC backend (they target upstream's CLI completion tool
  and are not consumed by grpc-server.cpp) and point readers at the
  new prompt-cache explainer.

Closes #9921

Assisted-by: claude:opus-4.7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 16:31:48 +02:00
LocalAI [bot]
4c234abc2c refactor(agents): bump skillserver, drop redundant Name from list_skills output (#9916)
refactor(agents): bump skillserver, drop redundant Name from list_skills/search_skills

skillserver's list_skills MCP tool used to ship every entry with name=""
(field was commented out), while search_skills populated it - two tools
with inconsistent shape for the same data. skill.Name and skill.ID are
populated from the same source string anyway (the directory name), so
returning both was pure duplication.

Bumps github.com/mudler/skillserver to a7317cb, which drops the Name
field from both SkillInfo and SearchResult and leaves ID as the single
canonical identifier (already what read_skill consumes).

Adds core/services/skills/skills_mcp_test.go, a regression that drives
the LocalAI FilesystemManager through an in-process MCP session and
asserts a newly-created skill is visible by ID on the still-open session.

This is a cleanup, not the root cause of #9868 - the reporter likely
sees something deeper than a cosmetic JSON shape issue.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 14:45:53 +02:00
Richard Palethorpe
c68818a62e fix(llama-cpp): terminate tensor_buft_overrides with sentinel (#9919)
llama.cpp's model loader asserts back().pattern == nullptr on
params.tensor_buft_overrides (and on params.kv_overrides.back().key[0]
== 0) before binding them into llama_model_params. PR #8560 attempted
to satisfy llama_params_fit's placeholder requirement by pre-filling
params.tensor_buft_overrides up to llama_max_tensor_buft_overrides()
*before* the option-parse loop. Any subsequent push_back from
override_tensor / draft_cpu_moe / draft_n_cpu_moe / draft_override_tensor
then appended real entries after the placeholders, leaving back() with
a real pattern and tripping the assert. The draft override vector
likewise had no terminator at all.

Mirror upstream common/arg.cpp:645-658 instead: real entries are
pushed during option parsing, and after parsing we pad the main vector
up to ntbo (placeholders land at the end, so back() is always nullptr)
and append a single {nullptr, nullptr} to the draft vector when it is
non-empty. The existing kv_overrides terminator block already matches
upstream and stays.

Verified against ggml-org/llama.cpp@5cbaa5e: only tensor_buft_overrides
(main + draft) and kv_overrides are sentinel-terminated common_params
fields; everything else is size-driven std::vector.

Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-21 12:55:06 +02:00
LocalAI [bot]
11d5bd0cc3 fix(react-ui/chat): stop wiping selection on every /api/operations poll (#9904) (#9917)
useOperations() was calling setOperations() with a fresh array on every
1s poll, even when the payload was identical. In React 19 the DOM diff
no longer short-circuits dangerouslySetInnerHTML on equal __html, so the
forced Chat re-render re-assigned innerHTML on every assistant message
once per second — wiping any text the user had selected.

Skip the state update when the serialised operations payload is
unchanged, and switch loading/error to functional setters so they also
short-circuit at the source.

Also fixes the chat copy button on plain HTTP: navigator.clipboard is
undefined in non-secure contexts (a common LXC+Docker deployment), but
the previous code called it unconditionally and showed a success toast
regardless. Routed Chat, AgentChat and CanvasPanel through a new
copyToClipboard() helper that uses navigator.clipboard when available
and falls back to a hidden-textarea + execCommand('copy') trick that
browsers still honour outside secure contexts. The fallback preserves
the user's existing selection.

Regression coverage in e2e/chat-polling-selection.spec.js: a
MutationObserver counts mutations on the assistant content node across
3s of polling (must be 0); the copy test stubs out navigator.clipboard
and asserts that execCommand('copy') is invoked.


Assisted-by: claude-opus-4-7-1m

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 12:17:51 +02:00
LocalAI [bot]
12e056e96d chore: ⬆️ Update ggml-org/llama.cpp to ad277572619fcfb6ddd38f4c6437283a4b2b8636 (#9915)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 09:07:31 +02:00
LocalAI [bot]
308aa8908a chore: ⬆️ Update ace-step/acestep.cpp to ed53caf164e4492a5620b2e3f2264629cf66da24 (#9913)
⬆️ Update ace-step/acestep.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-21 00:15:57 +02:00
LocalAI [bot]
b2d68a53a2 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 11a1fea9e291f12ce2c803a9d7812c30ca806bcf (#9914)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 22:04:06 +00:00
LocalAI [bot]
e3706c0512 chore(model-gallery): ⬆️ update checksum (#9910)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 23:38:45 +02:00
LocalAI [bot]
1ffd82a050 chore: ⬆️ Update antirez/ds4 to 2606543be7a8c125a32cee37f5d1d85dc78f2fcf (#9909)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 21:22:26 +00:00
LocalAI [bot]
f515168dbe chore(acestep-cpp): bump pin to ed53caf and adapt wrapper to new API (#9908)
The new ace-step.cpp revision moves backend initialization inside each
`*_load` call and drops the separate `DiTGGMLConfig` argument from
`dit_ggml_load` (config now lives in `DiTGGML::cfg`, populated from GGUF
metadata at load time). Drop the now-removed `*_init_backend` calls and
replace `g_dit_cfg` accesses with `g_dit.cfg`.


Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-20 21:05:32 +00:00
LocalAI [bot]
ef6ca34513 chore: ⬆️ Update leejet/stable-diffusion.cpp to 5b0267e941cade15bd80089d89838795d9f4baa6 (#9907)
Adapt the C++ wrapper to the new `generate_video()` signature: upstream now
returns `bool` and writes frames/audio via out-parameters (`sd_image_t**`,
`sd_audio_t**`). Also set `p->fps` on the params struct (new upstream field)
and free the returned audio handle on both the success and error paths.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-20 20:53:19 +00:00
dependabot[bot]
9413c3767f chore(deps): update transformers requirement from >=5.8.0 to >=5.8.1 in /backend/python/transformers (#9883)
chore(deps): update transformers requirement

Updates the requirements on [transformers](https://github.com/huggingface/transformers) to permit the latest version.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v5.8.0...v5.8.1)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.8.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-20 22:16:02 +02:00
dependabot[bot]
3bf3cce232 chore(deps): bump sentence-transformers from 5.4.0 to 5.5.0 in /backend/python/transformers (#9888)
chore(deps): bump sentence-transformers in /backend/python/transformers

Bumps [sentence-transformers](https://github.com/huggingface/sentence-transformers) from 5.4.0 to 5.5.0.
- [Release notes](https://github.com/huggingface/sentence-transformers/releases)
- [Commits](https://github.com/huggingface/sentence-transformers/compare/v5.4.0...v5.5.0)

---
updated-dependencies:
- dependency-name: sentence-transformers
  dependency-version: 5.5.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-20 22:13:39 +02:00
LocalAI [bot]
06f8159035 chore: ⬆️ Update ggml-org/llama.cpp to 67ace021da905e27ecbdf1176b0eef578a5288c0 (#9897)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 22:05:58 +02:00
LocalAI [bot]
f6a73f54fa feat(swagger): update swagger (#9872)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 22:05:35 +02:00
LocalAI [bot]
24e04d8e81 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 77413bc900f9a2bfd8a5407f184427bcc0825f6c (#9899)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 01:02:53 +02:00
LocalAI [bot]
b9a49449ae chore: ⬆️ Update ggml-org/whisper.cpp to afa2ea544fb4b0448916b4a31ecd33c8685bd482 (#9898)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 01:02:25 +02:00
LocalAI [bot]
1879e11042 chore: ⬆️ Update antirez/ds4 to 599e49d253971451f710cb8323344e789906ed6c (#9900)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 01:01:45 +02:00
LocalAI [bot]
403d391316 chore(model-gallery): ⬆️ update checksum (#9901)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-20 01:01:20 +02:00
Daniel Liljeberg
fc3980dadd fix: inject text-file content into chat completions messages (#9896)
Non-image/non-audio file attachments (txt, md, csv, json) were being
  stored in the 'files' metadata field but never added to the message
  content array sent to /v1/chat/completions. Images and audio correctly
  received content blocks; files did not.

  Fix: push a text content block into messageContent when textContent is
  present, matching the pattern used for image_url and audio_url.

  Also fixes Home.jsx addFiles which never called file.text() at all,
  meaning files attached on the home screen had empty textContent even
  before reaching useChat.js.

  Note: PDF files use file.text() which returns raw bytes rather than
  parsed text. Proper PDF support would require PDF.js or server-side
  extraction and is not part of this fix.

Signed-off-by: Daniel Liljeberg <damien_@hotmail.com>
2026-05-20 01:00:32 +02:00
Richard Palethorpe
2009544b44 fix(nix): correct flake src path and add dev shell (#9894)
The flake set `src = ./sources;` referencing a non-existent subdirectory,
so `nix build` and `nix develop` both failed evaluation. Point `src` at
the repo root and refresh `vendorHash` accordingly.

Add `devShells.default` with the Go toolchain, protobuf generators,
Node.js/bun for the React UI (`make react-ui`), and the linters used by
`make lint` (golangci-lint, gofumpt, goimports, staticcheck).

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-19 19:28:30 +02:00
dependabot[bot]
e859345b12 chore(deps): bump github.com/alecthomas/kong from 1.14.0 to 1.15.0 (#9881)
Bumps [github.com/alecthomas/kong](https://github.com/alecthomas/kong) from 1.14.0 to 1.15.0.
- [Commits](https://github.com/alecthomas/kong/compare/v1.14.0...v1.15.0)

---
updated-dependencies:
- dependency-name: github.com/alecthomas/kong
  dependency-version: 1.15.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-19 08:07:07 +02:00
dependabot[bot]
f30712f8e8 chore(deps): bump github.com/aws/aws-sdk-go-v2 from 1.41.6 to 1.41.7 (#9892)
Bumps [github.com/aws/aws-sdk-go-v2](https://github.com/aws/aws-sdk-go-v2) from 1.41.6 to 1.41.7.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/v1.41.6...v1.41.7)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2
  dependency-version: 1.41.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-19 08:06:50 +02:00
dependabot[bot]
a19c77c5f8 chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.2 to 2.29.0 (#9882)
Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.28.2 to 2.29.0.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](https://github.com/onsi/ginkgo/compare/v2.28.2...v2.29.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.29.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-19 08:06:34 +02:00
LocalAI [bot]
4b02d23c0c chore: ⬆️ Update ggml-org/llama.cpp to 5cbaa5e69e09bde3334cd8c355570553a0dca027 (#9876)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-19 08:06:16 +02:00
LocalAI [bot]
21140e96b2 chore: ⬆️ Update ggml-org/whisper.cpp to 47b9eb37a33c5031a1b667ace64477330b9f36c1 (#9877)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-19 08:05:56 +02:00
dependabot[bot]
fc803e8d48 chore(deps): bump golang.org/x/crypto from 0.50.0 to 0.51.0 (#9886)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.50.0 to 0.51.0.
- [Commits](https://github.com/golang/crypto/compare/v0.50.0...v0.51.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.51.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-19 08:04:15 +02:00
LocalAI [bot]
ca51606bfe chore: ⬆️ Update ikawrakow/ik_llama.cpp to 40aae0b6d86d50c0ee7011b3ce59a233203e430a (#9875)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-19 08:01:41 +02:00
Azteczek
cb502de309 feat: add flake.nix for dockerless setup (#9851)
* Add flake.nix

Signed-off-by: Azteczek <243776410+Azteczek@users.noreply.github.com>

* Add flake.lock

Signed-off-by: Azteczek <243776410+Azteczek@users.noreply.github.com>

---------

Signed-off-by: Azteczek <243776410+Azteczek@users.noreply.github.com>
2026-05-18 15:23:10 +01:00
Richard Palethorpe
5d0b549049 feat(gallery): verify backend OCI images with keyless cosign (#9823)
* feat(gallery): verify backend OCI images with keyless cosign

Close a trust gap where a registry compromise or MITM could silently
replace a backend image: the gallery YAML tells LocalAI which image to
pull, but until now nothing verified the bytes came from our CI.

Consumer (pkg/oci/cosignverify):
- New package using sigstore-go to verify keyless-cosign signatures.
- OCI 1.1 referrers API + new bundle format (no legacy :tag.sig).
- Policy fields: Issuer / IssuerRegex / Identity / IdentityRegex /
  NotBefore. NotBefore is the revocation lever — keyless Fulcio certs
  are ephemeral so revocation is policy-side; advancing not_before in
  the gallery YAML invalidates every signature predating the cutoff.
- TUF trusted root cached process-wide so N backends from one gallery
  do 1 fetch, not N.

Plumbing:
- pkg/downloader: ImageVerifier interface + WithImageVerifier option
  threaded through DownloadFileWithContext. Verification runs between
  oci.GetImage and oci.ExtractOCIImage, with digest pinning via
  pinnedImageRef to close the TOCTOU window. Skips the verifier's HEAD
  when the ref is already digest-pinned.
- core/config: Gallery.Verification YAML block.
- core/gallery: backendDownloadOptions builds the verifier from the
  policy; applied on initial URI, mirrors, and tag fallbacks.
- core/gallery/upgrade: the upgrade path now routes through the same
  options builder. A regression Ginkgo spec pins this contract —
  without it, UpgradeBackend silently bypassed verification.
- core/cli: --require-backend-integrity (LOCALAI_REQUIRE_BACKEND_INTEGRITY)
  escalates missing policy / empty SHA256 from warn to hard-fail.

Producer (.github/workflows/backend_merge.yml):
- id-token: write at job scope (PR-fork-safe via existing event gate).
- sigstore/cosign-installer@v3 pinned to v2.4.1.
- After each docker buildx imagetools create, resolve the manifest
  list digest and run cosign sign --recursive --new-bundle-format
  --registry-referrers-mode=oci-1-1 against repo@digest. --recursive
  signs the index and every per-arch entry, matching how the consumer
  resolves a tag to a platform-specific manifest before verifying.

Rollout: backend/index.yaml has no `verification:` block yet, so this
PR is backward-compatible — installs proceed with a warning until the
gallery is populated. Strict mode is opt-in.

Assisted-by: claude-code:claude-opus-4-7 [Bash] [Edit] [Read] [Write] [WebSearch] [WebFetch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* refactor(gallery): plumb RequireBackendIntegrity through config instead of env

The previous implementation re-exported the --require-backend-integrity
CLI flag into LOCALAI_REQUIRE_BACKEND_INTEGRITY via os.Setenv, then
re-read it in core/gallery via os.Getenv. This leaked process state
into the gallery package and made the flag impossible to override
per-call or test without touching the env.

Add RequireBackendIntegrity to ApplicationConfig (with a matching
WithRequireBackendIntegrity AppOption) and thread the bool through
every install/upgrade path: InstallBackend, InstallBackendFromGallery,
UpgradeBackend, InstallModelFromGallery, InstallExternalBackend,
ApplyGalleryFromString/File, startup.InstallModels. Worker subcommands
gain the same env-bound flag on WorkerFlags so distributed-worker
installs honor it consistently with the worker daemon path.

Add a forbidigo lint rule against os.Getenv / os.LookupEnv / os.Environ
to keep the env-leak pattern from creeping back. Existing offenders
(p2p, config loaders, etc.) are baseline-grandfathered by the existing
new-from-merge-base: origin/master setting; targeted path exclusions
cover the legitimate cases — kong CLI entry points, backend
subprocesses, system capability probes, gRPC AUTH_TOKEN inheritance,
test gating env vars.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-18 08:02:20 +02:00
LocalAI [bot]
11cff1b309 chore: ⬆️ Update ggml-org/llama.cpp to 87589042cac2c390cec8d68fb2fad64e0a2a252a (#9855)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-18 08:01:30 +02:00
LocalAI [bot]
4ca3d2cdc0 docs: ⬆️ update docs version mudler/LocalAI (#9863)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:20:16 +02:00
LocalAI [bot]
3cba35ed32 chore: ⬆️ Update antirez/ds4 to c9dd9499bfa57c1bbfbb4446eff963330ab5329b (#9864)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:19:58 +02:00
LocalAI [bot]
265ae35231 chore: ⬆️ Update ikawrakow/ik_llama.cpp to c35189d83c91aad780aba62b89f2830cb2916223 (#9866)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:19:43 +02:00
LocalAI [bot]
6a48157a80 chore: ⬆️ Update leejet/stable-diffusion.cpp to bd17f53b7386fb5f60e8587b75e73c4b2fed3426 (#9854)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 23:12:05 +02:00
LocalAI [bot]
41c838b2df chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3e573cfea6e0a332eff822ffbdb1dd3b112e9051 (#9856)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:44:08 +02:00
LocalAI [bot]
21e793ad2a chore: ⬆️ Update antirez/ds4 to ef0a4905d05263df8e63689f2dd1efac618a752c (#9857)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:43:46 +02:00
LocalAI [bot]
7c190bb4b9 docs: ⬆️ update docs version mudler/LocalAI (#9853)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:43:06 +02:00
LocalAI [bot]
d77a9137d8 feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852)
* feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type

Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge,
2026-05-16) to pick up Multi-Token Prediction support.

No grpc-server.cpp changes are required: the existing `spec_type` option
delegates to upstream's `common_speculative_types_from_names()`, which
already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed
by MTP is auto-derived inside `common_context_params_to_llama` from
`params.speculative.need_n_rs_seq()`, and when no `draft_model` is set
the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration
guide with the new type, both load paths (MTP head embedded in the main
GGUF vs. separate `mtp-*.gguf` sibling), the PR's recommended
`spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also
notes that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is
not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are
picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load

Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by
`convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and,
when present and the user has not configured a `spec_type` explicitly,
auto-append the upstream-recommended speculative-decoding tuple:

  - spec_type:draft-mtp
  - spec_n_max:6
  - spec_p_min:0.75

The 0.75 p_min is pinned defensively because upstream marks the current
default with a "change to 0.0f" TODO; locking it here keeps acceptance
thresholds stable across future llama.cpp bumps.

Detection runs in two places:

  - The model importer (`POST /models/import-uri`, the `/import-model`
    UI) range-fetches the GGUF header for HuggingFace / direct-URL
    imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and
    non-fatal error handling. OCI/Ollama URIs are skipped because the
    artifact is not directly streamable; the load-time hook covers them
    once the file is on disk.
  - The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local
    header on every model start and appends the same options if
    `spec_type` is not already set.

Both paths share `ApplyMTPDefaults` and respect an explicit user-set
`spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo
specs cover the append, preserve-user-choice, legacy alias, and nil
safety paths.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(importer): resolve huggingface:// URIs before MTP header probe

`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was
handing it the raw `huggingface://...` URI directly (and similarly for
any other custom downloader scheme). Live-test against
`huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf`
exposed this: the probe failed with `unsupported protocol scheme
"huggingface"`, was caught by the non-fatal error path, and the MTP
options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and
require the resolved form to be HTTP(S). After the fix the probe
successfully reads `<arch>.nextn_predict_layers=1` from the real HF
GGUF and the emitted ConfigFile carries spec_type:draft-mtp,
spec_n_max:6, spec_p_min:0.75 as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-16 22:42:48 +02:00
LocalAI [bot]
661a0c3b9d fix(ollama): accept float-encoded integer options (fixes #9837) (#9849)
fix(ollama): accept float-encoded integer options (num_ctx, top_k, ...)

Home Assistant's Ollama integration encodes integer options as JSON
floats (e.g. `"num_ctx": 8192.0`). Stdlib `json.Unmarshal` refuses to
decode a number with fractional notation into an `int` field, so the
entire request was rejected with HTTP 400 before reaching the backend:

  Unmarshal type error: expected=int, got=number 8192.0,
  field=options.num_ctx

Add a custom `UnmarshalJSON` on `OllamaOptions` that routes the int
fields (`top_k`, `num_predict`, `seed`, `repeat_last_n`, `num_ctx`)
through `*json.Number`, then converts via `Int64()` with a `Float64()`
fallback. Public field types are unchanged, so endpoint code is
untouched. Float fields and `stop` continue to parse via the default
path.

Fixes #9837

Assisted-by: Claude Code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-16 18:38:19 +02:00
LocalAI [bot]
00b8989886 chore: ⬆️ Update ggml-org/llama.cpp to 1348f67c58f561808136e8a152a9eddec168f221 (#9842)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 08:41:09 +02:00
LocalAI [bot]
43e0d397ca chore: ⬆️ Update ggml-org/whisper.cpp to 968eebe77225d25e57a3f981da7c696310f0e881 (#9843)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 00:30:04 +02:00
LocalAI [bot]
a1a7a219ed chore: ⬆️ Update antirez/ds4 to 950e8e6474a1c9fabe04e669d607606a7ef8824f (#9844)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:46:29 +02:00
LocalAI [bot]
3937ec6527 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 5cc0d86c760e9858e4bed4418400bb39dbe025f2 (#9845)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:45:54 +02:00
LocalAI [bot]
1355b55794 chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.21.0 (#9846)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:45:41 +02:00
Richard Palethorpe
5a2626d465 fix(deps): bump gomarkdown/markdown for GHSA-77fj-vx54-gvh7 (#9841)
Out-of-bounds read in SmartypantsRenderer.smartLeftAngle (CWE-125,
CVSS 7.5). Reachable transitively via LocalAGI's Email connector,
which renders inbound HTML email replies using html.CommonFlags
(includes Smartypants). An unmatched `<` in the inbound body could
panic the agent service.

Bump to v0.0.0-20260411013819-759bbc3e3207 (contains the fix). The
klauspost/compress entry loses its `// indirect` tag because
go mod tidy noticed pkg/utils/untar.go imports it directly.

Assisted-by: Claude:claude-opus-4-7 [Claude-Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-15 21:48:59 +02:00
LocalAI [bot]
a39591f144 realtime: honor output_modalities to skip TTS in text-only mode (#9838)
* realtime: honor output_modalities to skip TTS in text-only mode

The emulated realtime pipeline previously ignored the OpenAI Realtime spec
field output_modalities and always synthesized TTS. Add resolveOutputModalities
+ modalitiesContainAudio helpers and gate the TTS / ResponseOutputAudio*
emission so a client requesting ["text"] gets only ResponseOutputText* events.

This lets thin clients (e.g. thing5-poc) cache TTS on the client side while
still using the realtime WS for VAD + STT + LLM + tool-call parsing.

Assisted-by: Claude:claude-opus-4-7

* realtime: plumb response-level output_modalities and echo on session

Follow-up to the previous commit:
- Resolve response.create's output_modalities at the gate so a per-response
  override of an audio session is honored (the test asserted this contract
  but the production call site was passing nil).
- Mirror OutputModalities in the RealtimeSession echo so session.update
  round-trips the client-supplied value, matching MaxOutputTokens's pattern.

Assisted-by: Claude:claude-opus-4-7

* realtime: silence errcheck on deferred os.Remove of TTS file

CI's errcheck flagged the pre-existing `defer os.Remove(audioFilePath)`
inside the audio-emission block (now wrapped by the modality gate). Wrap
the call in a closure that explicitly discards the error — the canonical
Go pattern for "I want to defer a cleanup whose error I genuinely don't
care about."

Assisted-by: Claude:claude-opus-4-7 golangci-lint

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-15 12:39:47 +02:00
massy_o
8c785dbe4a Validate archive member paths before extraction (#9820)
Signed-off-by: massy-o <telitos000@gmail.com>
2026-05-15 11:12:13 +02:00
LocalAI [bot]
4abf5befbb chore: ⬆️ Update ggml-org/llama.cpp to 834a243664114487f99520370a7a7b00fc7a486f (#9826)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:29:22 +02:00
LocalAI [bot]
195b910260 chore: ⬆️ Update leejet/stable-diffusion.cpp to 0b8296915c4094090cff6bd2e09a5e98288c3c7d (#9827)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:19:52 +02:00
LocalAI [bot]
ba21bf667c docs: ⬆️ update docs version mudler/LocalAI (#9825)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:19:34 +02:00
LocalAI [bot]
7bd1693ad0 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 0fcffdb64d21e57f0778f342415754156e01adfa (#9828)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:46 +02:00
LocalAI [bot]
b5ac3a7373 chore: ⬆️ Update ggml-org/whisper.cpp to 46ca43d6399fdeada1b49fb2126ba373bd9ebc38 (#9829)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:24 +02:00
LocalAI [bot]
53de474ef5 chore: ⬆️ Update antirez/ds4 to 04b6fda2be395094cbf2d20d921e7a705a4166ef (#9830)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:09 +02:00
LocalAI [bot]
c33d36b870 fix(ollama): guard nil filter in galleryop.ListModels (#9817) (#9836)
The Ollama /api/tags handler passes a nil filter to galleryop.ListModels.
When ModelsPath contains any non-skipped loose file the function then
calls filter(name, nil) and panics, which Echo surfaces to clients as
"Server disconnected without sending a response" - the exact failure
Home Assistant's Ollama integration reports against LocalAI.

Mirror the nil guard already present in
ModelConfigLoader.GetModelConfigsByFilter so every caller is safe, and
add a regression test that exercises the loose-file path with a nil
filter.

Assisted-by: claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-15 10:07:50 +02:00
LocalAI [bot]
57fa178a64 feat(swagger): update swagger (#9824)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 09:30:29 +02:00
massy_o
745473cbe6 Validate video image URLs before download (#9819)
Signed-off-by: massy-o <telitos000@gmail.com>
2026-05-14 15:07:17 +02:00
massy_o
594c9fd92e Close Hugging Face scan response body (#9818)
Signed-off-by: massy-o <telitos000@gmail.com>
2026-05-14 12:35:29 +02:00
LocalAI [bot]
8af963bdd9 fix(streaming): comply with OpenAI usage / stream_options spec (#9815)
* fix(streaming): comply with OpenAI usage / stream_options spec (#8546)

LocalAI emitted `"usage":{"prompt_tokens":0,...}` on every streamed
chunk because `OpenAIResponse.Usage` was a value type without
`omitempty`. The official OpenAI Node SDK and its consumers
(continuedev/continue, Kilo Code, Roo Code, Zed, IntelliJ Continue)
filter on a truthy `result.usage` to detect the trailing usage chunk;
LocalAI's zero-but-non-null usage on every intermediate chunk made
that filter swallow every content chunk and surface an empty chat
response while the server log looked successful.

Changes:

- `core/schema/openai.go`: `Usage *OpenAIUsage \`json:"usage,omitempty"\``
  so intermediate chunks no longer carry a `usage` key. Add
  `OpenAIRequest.StreamOptions` with `include_usage` to mirror OpenAI's
  request field.
- `core/http/endpoints/openai/chat.go` and `completion.go`: keep using
  the `Usage` struct field as an in-process channel for the running
  cumulative, but strip it before JSON marshalling. When the request
  set `stream_options.include_usage: true`, emit a dedicated trailing
  chunk with `"choices": []` and the populated usage (matching the
  OpenAI spec and llama.cpp's server behavior).
- `chat_emit.go`: new `streamUsageTrailerJSON` helper; drop the
  `usage` parameter from `buildNoActionFinalChunks` since chunks no
  longer carry usage.
- Update `image.go`, `inpainting.go`, `edit.go` to wrap their Usage
  values with `&` for the new pointer field.
- UI: send `stream_options:{include_usage:true}` from the React
  (`useChat.js`) and legacy (`static/chat.js`) chat clients so the
  token-count badge keeps populating now that the server is
  spec-compliant.

Tests:

- New `chat_stream_usage_test.go` pins the spec invariants:
  intermediate chunks have no `usage` key, the trailer JSON has
  `"choices":[]` and a populated `usage`, and `OpenAIRequest` parses
  `stream_options.include_usage`.
- Update `chat_emit_test.go` to reflect that finals no longer embed
  usage.

Verified against the live LocalAI instance: before the fix Continue's
filter logic swallowed 16/16 token chunks; with the new shape it
yields 4/5 and routes usage through the dedicated trailer chunk.

Fixes #8546

Assisted-by: Claude:opus-4.7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(streaming): silence errcheck on usage trailer Fprintf

The new spec-compliant `stream_options.include_usage` trailer writes
were flagged by errcheck since they're new code (golangci-lint runs
new-from-merge-base on master); the surrounding `fmt.Fprintf` data:
writes are grandfathered. Drop the return values explicitly to match
the linter's contract without adding a nolint shim.

Assisted-by: Claude:opus-4.7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:46 +02:00
LocalAI [bot]
6e1dbae256 feat(llama-cpp): expose 12 missing common_params via options[] (#9814)
The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.

Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).

Top-level / batching / IO:
  - n_ubatch (alias ubatch) -- physical batch size; was previously
    force-aliased to n_batch at line 482, blocking embedding/rerank
    workloads that need independent control
  - threads_batch (alias n_threads_batch) -- main-model batch threads;
    mirrors the existing draft_threads_batch
  - direct_io (alias use_direct_io) -- O_DIRECT model loads
  - verbosity -- llama.cpp log threshold (line 479 had this commented
    out)
  - override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
    overrides for the main model; mirrors draft_override_tensor

Embedding / multimodal:
  - pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
    only auto-flipped to RANK for rerankers
  - embd_normalize (alias embedding_normalize) -- and the embedding
    handler now reads params_base.embd_normalize instead of a hardcoded
    2 at the previous embd_normalize literal in Embedding()
  - mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
  - image_min_tokens / image_max_tokens -- per-image vision token budget

Reasoning surface (the audit-focus three; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
  - reasoning_format -- none/auto/deepseek/deepseek-legacy parser
  - enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
  - prefill_assistant -- trailing-assistant-message prefill toggle

All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.

Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:34 +02:00
LocalAI [bot]
53bdb18d10 chore: ⬆️ Update ggml-org/llama.cpp to 7f3f843c31cd32dc4adc10b393342dfee071c332 (#9809)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): adapt to upstream COMMON_SPECULATIVE_TYPE_DRAFT rename

ggml-org/llama.cpp#22964 ("spec: update CLI arguments for better
consistency") renamed the speculative type enum values:
  COMMON_SPECULATIVE_TYPE_DRAFT  -> COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE
  COMMON_SPECULATIVE_TYPE_EAGLE3 -> COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3
and the registered name strings flipped from underscore- to dash-
separated form (e.g. ngram_simple -> ngram-simple), with the bare
draft/eagle3 aliases replaced by draft-simple/draft-eagle3.

This broke the build with the new LLAMA_VERSION on every variant
(vulkan/arm64, darwin and likely all the rest) at grpc-server.cpp:461.

Update the upstream branch of the speculative-type fallback to use the
new identifier (the LOCALAI_LEGACY_LLAMA_CPP_SPEC fork branch keeps the
old name), and normalize spec_type option tokens before passing them to
common_speculative_types_from_names so existing model configs that say
spec_type:draft / spec_type:ngram_simple keep working.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:23 +02:00
LocalAI [bot]
42a8db3573 ci(image): publish missing :latest-* and :v<X>-* singleton image tags (#9812)
* ci(image): wire singleton merges + `--` artifact separator

Closes the same singletons gap on the LocalAI server image workflow that
PR #9781 closed for backends. The user observed it as missing
:latest-gpu-nvidia-cuda-12 etc. on quay.io/go-skynet/local-ai — the
build matrix has six single-arch entries with no corresponding merge
step, so their per-arch digests push (push-by-digest=true) and never
get tagged:

  - -gpu-hipblas              (hipblas-jobs)
  - -gpu-nvidia-cuda-12       (core-image-build)
  - -gpu-nvidia-cuda-13       (core-image-build)
  - -gpu-intel                (core-image-build)
  - -nvidia-l4t-arm64         (gh-runner)
  - -nvidia-l4t-arm64-cuda-13 (gh-runner)

Only :latest, :v<X>, :latest-gpu-vulkan and :v<X>-gpu-vulkan were
actually being published before this commit (the two multiarch suffixes
that had merge jobs).

Changes:

1. image.yml: add six new merge jobs, one per single-arch entry. Each
   `needs:` only its parent build job (matching the existing pattern
   for core-image-merge / gpu-vulkan-image-merge).
2. image_build.yml: switch artifact name to
   `digests-localai<suffix>--<platform-tag-or-"single">`. The `--`
   separator anchors the merge-side glob so a singleton tag-suffix
   doesn't over-match a longer suffix that shares its prefix
   (-nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13). Same convention
   as backend_build.yml's fix.
3. image_merge.yml: update the download pattern to match.

Next master push or tag release should produce :latest-gpu-hipblas,
:latest-gpu-nvidia-cuda-12, :latest-gpu-nvidia-cuda-13, :latest-gpu-intel,
:latest-nvidia-l4t-arm64, :latest-nvidia-l4t-arm64-cuda-13 (and their
:v<X>-* equivalents) for the first time on the post-#9781 workflow.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(image): add !cancelled() guard to all 8 image merge jobs

Parity pass with backend.yml's merge jobs (8521af14). Without
!cancelled(), GHA's default `needs:` cascade skips the merge when ANY
matrix cell of the parent build job fails or is cancelled — so a single
flaky leg would suppress publication of every other tag-suffix's
manifest list. Same fix the backend got after v4.2.1 showed 2 failed
singlearch builds cascade-skip 199 singlearch merge entries.

Applied to all 8 image merges:

  - core-image-merge
  - gpu-vulkan-image-merge
  - gpu-nvidia-cuda-12-image-merge       (added in e5300f1a)
  - gpu-nvidia-cuda-13-image-merge       (added in e5300f1a)
  - gpu-intel-image-merge                (added in e5300f1a)
  - gpu-hipblas-image-merge              (added in e5300f1a)
  - nvidia-l4t-arm64-image-merge         (added in e5300f1a)
  - nvidia-l4t-arm64-cuda-13-image-merge (added in e5300f1a)

Build jobs (hipblas-jobs, core-image-build, gh-runner) are
intentionally NOT changed — they have no upstream `needs:` to cascade-
skip from.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 00:28:48 +02:00
LocalAI [bot]
0353d3bd77 chore: ⬆️ Update ggml-org/whisper.cpp to 3e9b7d0fef3528ee2208da3cdb873a2c53d2ae2f (#9808)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:20:14 +02:00
LocalAI [bot]
ec49995190 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 949bb8f1d660fc1264c137a6f3dbd619375f6134 (#9807)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:15:32 +02:00
Tai An
67c34bbb96 fix(middleware): parse OpenAI-spec tool_choice in /v1/chat/completions (#9559)
* fix(middleware): parse OpenAI-spec tool_choice in /v1/chat/completions

Follows up on #9526 (the 3-site setter fix) by addressing the remaining
clause in #9508 — string mode and OpenAI-spec specific-function shape both
silently failed in the /v1/chat/completions parsing path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(middleware): restore LF endings and cover tool_choice parsing with specs

The previous commit on this branch saved core/http/middleware/request.go
with CRLF line endings, ballooning the diff against master to 684 / 651
for what is in reality a ~50-line parsing change. Restore LF (matches
.editorconfig end_of_line = lf).

Add 11 Ginkgo specs under "SetModelAndConfig tool_choice parsing
(chat completions)" that parallel the existing MergeOpenResponsesConfig
specs from #9509. They drive the full middleware chain (SetModelAndConfig
+ SetOpenAIRequest) and assert:

  * "required"  -> ShouldUseFunctions=true, no specific name
  * "none"      -> ShouldUseFunctions=false (tools disabled per OpenAI spec)
  * "auto"      -> default, tools available, no specific name
  * {type:function, function:{name:X}}  (spec)    -> X is forced
  * {type:function, name:X}             (legacy)  -> X is forced
  * nested wins when both forms are present
  * malformed shapes (no type, wrong type, no name, empty name) are no-ops

Update the inline comment on the string case to describe the actual
mechanism: "none" reaches SetFunctionCallString("none") downstream and
is then honored by ShouldUseFunctions() returning false. Before this PR
json.Unmarshal([]byte("none"), &functions.Tool{}) failed silently, so
"none" was ignored - making "none" actually work is a real behavior fix
this PR brings.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

* fix(middleware): preserve pre-#9559 support for JSON-string-encoded tool_choice

Some non-spec clients send tool_choice as a JSON-encoded string of an
object form, e.g. "{\"type\":\"function\",\"function\":{\"name\":\"X\"}}".
The pre-#9559 code accepted this by accident: its case string: branch
ran json.Unmarshal([]byte(content), &functions.Tool{}), which succeeded
for that double-encoded shape even though it failed for the legitimate
plain string modes "auto" / "none" / "required".

The first version of this PR routed every string straight to
SetFunctionCallString as a mode, which fixed the plain-string cases but
silently regressed the double-encoded one (funcs.Select("{...}") returns
nothing). Restore the fallback: when a string looks like a JSON object,
try parsing it as a tool_choice map first; fall through to mode-string
handling only when no usable name comes out.

Factor the map-name extraction into a small helper
(extractToolChoiceFunctionName) so the string-fallback and the regular
map case go through identical code, and accept both the OpenAI-spec
nested shape and the legacy/Anthropic flat shape from either entry
point.

Add 3 Ginkgo specs covering the double-encoded case (nested form, legacy
form, and the fall-through when the JSON has no usable name).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

* test(middleware): silence errcheck on AfterEach os.RemoveAll

The new tool_choice parsing tests added a second AfterEach that calls
os.RemoveAll(modelDir) without checking the error; errcheck flagged it.
Suppress with the standard _ = idiom. The pre-existing AfterEach on the
earlier Describe still elides the check the same way it did before -
leaving that untouched to keep this commit minimal.

Assisted-by: Claude:opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 00:14:38 +02:00
LocalAI [bot]
4430fae779 chore: ⬆️ Update antirez/ds4 to 0cba357ca1bc0e7510421cc26888e420ea942123 (#9806)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:14:23 +02:00
LocalAI [bot]
ab01ed1a3e fix(agentpool): close truncate-then-read race in agent_jobs.json persistence (#9811)
* fix(agentpool): close truncate-then-read race in agent_jobs.json persistence

Three call sites wrote and read agent_jobs.json (and agent_tasks.json)
through three independent mutexes:

  - AgentJobService.ExecuteJob spawns go saveJobs(job) -> fileJobPersister
    holding p.mu
  - AgentJobService.SaveJobsToFile holding service.fileMutex
  - AgentJobService.LoadJobsFromFile on a separate service instance holding
    a different service.fileMutex

Nothing serialized those mutexes, and both writers used os.WriteFile, which
opens O_TRUNC. A reader landing between the truncate and the write saw a
zero-byte file and surfaced as `unexpected end of JSON input` at offset 0.
The macOS tests-apple job started hitting this consistently once the path
filter was removed from .github/workflows/test.yml and the file-mode race
test ran on every push (run 25823124797 was the first observed failure).

Two changes close the window:

1. fileJobPersister.saveTasksToFile / saveJobsToFile now write to a
   same-directory temp file and os.Rename to the final path. rename(2) is
   atomic on POSIX, so concurrent readers see either the prior contents or
   the new contents and never a zero-byte window. The helper Syncs before
   close so a crash mid-write leaves either the old file intact or the temp
   behind (cleaned up on next save).

2. AgentJobService.{Load,Save}{Tasks,Jobs}{FromFile,ToFile} are collapsed
   to thin wrappers around fileJobPersister, removing the duplicate write
   path and the redundant service.fileMutex / service.tasksFile /
   service.jobsFile fields. Within a single service all task/job I/O now
   serializes on the persister's mutex; the atomic rename handles the
   cross-instance case the tests exercise.

Adds a regression test that hammers SaveJobsToFile and LoadJobsFromFile
concurrently for 500ms across two service instances on the same paths.
On master this reproduces `unexpected end of JSON input` on Linux within
~500ms; with the fix the suite ran -until-it-fails for 30s (54 attempts,
all green).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(agentpool): route service flush/load through JobPersister interface

The first cut of the race fix made AgentJobService.{Save,Load}{Tasks,Jobs}*
type-assert s.persister to *fileJobPersister so they could reach the
unexported saveTasksToFile / saveJobsToFile helpers. That defeats the
JobPersister interface: the service is back to reasoning about a concrete
implementation instead of an abstraction.

Promote the bulk-flush operations to the interface as FlushTasks / FlushJobs:

  - fileJobPersister.FlushTasks/FlushJobs call the existing private helpers
    (atomic temp+rename writes from the prior commit).
  - dbJobPersister.FlushTasks/FlushJobs are no-ops because SaveTask/SaveJob
    are already write-through to the database.

The service's four file-named methods now talk only to the interface:
LoadTasks/LoadJobs read through s.persister.LoadTasks/LoadJobs, and the
Save side calls FlushTasks/FlushJobs. The "FromFile"/"ToFile" suffixes
stay for backward compat with user_services.go and the existing tests,
but they no longer claim a file-only contract.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 23:58:43 +02:00
Ettore Di Giacinto
6bfe7f8c05 ci(image-merge): apply the keepalive+ci-cache source fix to image_merge
Mirror of 8521af14 (which fixed backend_merge.yml) for image_merge.yml.
Today's master-push run 25823024353 failed the gpu-vulkan-image-merge job
with the exact same error pattern the backend merge had on v4.2.2:

  ERROR: quay.io/go-skynet/local-ai@sha256:68b22611...: not found

Same root cause: image_build.yml pushes the per-arch manifest to
quay.io/go-skynet/local-ai with push-by-digest=true (no tag), then the
merge runs minutes-to-hours later, by which time quay's per-repo manifest
GC has reaped the untagged digest from local-ai. The blob still lives in
quay's storage but local-ai@<digest> no longer resolves.

Three matching edits:

1. image_build.yml: anchor each per-arch digest into ci-cache immediately
   after the push, reusing .github/scripts/anchor-digest-in-cache.sh with
   SOURCE_IMAGE=quay.io/go-skynet/local-ai and TAG_SUFFIX defaulting to
   "-core" for the core image (matches the artifact-name convention).
2. image_merge.yml: change the quay merge source from local-ai@<digest>
   to ci-cache@<digest>. Same correctness argument as backend_merge.yml —
   the manifest content is alive in ci-cache; buildx imagetools create
   republishes it into local-ai and writes the user-facing manifest list
   pointing at it. End state in local-ai is self-contained.
3. image_merge.yml: add a sparse `actions/checkout@v6` (only
   .github/scripts) so cleanup-keepalive-tags.sh is available, plus the
   cleanup step itself with TAG_SUFFIX matching the anchor's "-core"
   placeholder.

v4.2.3's image.yml run completed successfully (~50 min between push and
merge — beat quay's GC). This commit closes the race for future releases
and master pushes regardless of run length.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 21:37:03 +00:00
LocalAI [bot]
5a42dbf3ec docs: ⬆️ update docs version mudler/LocalAI (#9805)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 22:42:54 +02:00
Adira
c2fe0a6475 fix(http): honor X-Forwarded-Prefix when proxy strips the prefix (#9614)
* fix(http): honor X-Forwarded-Prefix when proxy strips the prefix

Closes #9145.

Two related issues kept the React UI from loading when a reverse proxy
rewrites a sub-path with prefix-stripping (e.g. Caddy `handle_path`):

1. `BaseURL` only computed a prefix from the path StripPathPrefix had
   removed, so when the proxy strips the prefix before forwarding, the
   request arrives without it and the base URL was returned without a
   prefix. Extract a `BasePathPrefix` helper and add an
   `X-Forwarded-Prefix` header fallback so the prefix is recovered.
2. `<base href>` only changes how relative URLs resolve; the build
   emits path-absolute references like `/assets/...` and
   `/favicon.svg`, which still resolve against the origin and bypass
   the proxy prefix. Rewrite those references in the served
   `index.html` so the browser requests them through the proxy.

Adds unit coverage for `BaseURL` with a pre-stripped path and an
end-to-end test for the proxy-stripped scenario.

Assisted-by: Claude:claude-opus-4-7

* fix(http): gate X-Forwarded-Prefix through SafeForwardedPrefix in BasePathPrefix

BasePathPrefix consumed X-Forwarded-Prefix directly, so a value the
codebase elsewhere rejects (e.g. "//evil.com") slipped through and was
interpolated into the SPA index.html — both into the path-absolute asset
URL rewrite in serveIndex (turning "/assets/..." into "//evil.com/assets/...",
a protocol-relative URL that loads JS from a foreign origin) and into
<base href>. Route the header through the existing SafeForwardedPrefix
validator that StripPathPrefix and prefixRedirect already use, and
HTML-escape the prefix before injecting it into the asset rewrite as
defense in depth against attribute breakout.

Tests cover //evil.com, backslashes, control chars, CR/LF and a missing
leading slash; the integration test asserts an unsafe prefix can't poison
asset URLs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 21:59:33 +02:00
LocalAI [bot]
ddbbdf45b9 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403 (#9740)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:58:33 +02:00
LocalAI [bot]
b4fdb41dcc fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status (#9754)
* fix(distributed): cascade-clean stale node_models on drain and filter routing by healthy status

Stale node_models rows (state="loaded") were surviving past the healthy
state of their owning node, causing /embeddings (and other inference
paths) to dispatch to a backend whose process was gone or drained. The
downstream symptom in a live cluster was pgvector rejecting inserts
with "vector cannot have more than 16000 dimensions (SQLSTATE 54000)"
because the misbehaving backend silently returned a malformed
(oversized) tensor; the Models page showed the model as "running"
without an associated node, like a stale entry, even though the node
was no longer visible in the Nodes view.

Two changes here, plus a third in a follow-up commit:

- MarkDraining now cascade-deletes node_models rows for the affected
  node, mirroring MarkOffline. Drains are explicit operator actions —
  the box has been intentionally taken out of rotation — so clearing
  the rows stops the Models UI from misreporting and prevents the
  routing layer from picking those rows if scheduling logic is ever
  relaxed. In-flight requests already hold their gRPC client through
  Route() and finish normally; the only observable effect is a
  non-fatal IncrementInFlight warning, acceptable for a drain.

  MarkUnhealthy is deliberately left status-only: it fires from
  managers_distributed / reconciler on a single nats.ErrNoResponders
  with no retry, so a transient NATS hiccup must not nuke every loaded
  model and force a full reload on recovery.

- FindAndLockNodeWithModel's inner JOIN now filters on
  backend_nodes.status = healthy in addition to node_models.state =
  loaded. The previous version relied on the second node-fetch step to
  reject non-healthy nodes, but a concurrent reader could still pick
  the same stale row in the same window. Belt-and-braces.

- DistributedConfig.PerModelHealthCheck renamed to
  DisablePerModelHealthCheck and inverted at the call site so
  per-model gRPC probing is on by default. The probe (now made
  consecutive-miss aware in a follow-up commit) independently health-
  checks each model's gRPC address and removes stale node_models rows
  when the backend has crashed even though the worker's node-level
  heartbeat is still arriving.

  Migration: the field had no CLI flag, env var binding, or YAML key
  in tree (only the bare struct field), so there is no user-facing
  migration. Anything constructing DistributedConfig in code needs to
  drop the assignment (default now does the right thing) or invert it.

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): require consecutive misses before per-model probe removes a row

The per-model gRPC probe used to remove a node_models row on a single
failed health check. With the per-model probe now on by default, that
made any 5-second gRPC blip (network jitter, a long-running request
hogging the worker's gRPC server thread, brief GC pause) trigger a
full reload of the affected model — too eager for production.

Require perModelMissThreshold (3) consecutive failed probes before
removal. At the default 15s tick a model must be unreachable for ~45s
before reap; a single successful probe in between resets the streak.
Per-(node, model, replica) state tracked under a mutex on the monitor.

If the removal call itself fails, the miss counter is left in place
so the next tick retries rather than starting the streak over.

Tests:
- removes stale model via per-model health check after consecutive
  failures (replaces the single-shot expectation)
- preserves model row when an intermittent failure is followed by a
  success (covers the reset-on-success path and verifies the counter
  reset by failing twice more without crossing threshold)
- newTestHealthMonitor initializes the misses map so direct-construct
  test helpers don't nil-map-panic in the probe path

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 21:57:50 +02:00
Richard Palethorpe
0245b33eab feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801)
* feat(liquid-audio): add LFM2.5-Audio any-to-any backend + realtime_audio usecase

Wires LiquidAI's LFM2.5-Audio-1.5B as a self-contained Realtime API model:
single engine handles VAD, transcription, LLM, and TTS in one bidirectional
stream — drop-in alternative to a VAD+STT+LLM+TTS pipeline.

Backend
- backend/python/liquid-audio/ — new Python gRPC backend wrapping the
  `liquid-audio` package. Modes: chat / asr / tts / s2s, voice presets,
  Load/Predict/PredictStream/AudioTranscription/TTS/VAD/AudioToAudioStream/
  Free and StartFineTune/FineTuneProgress/StopFineTune. Runtime monkey-patch
  on `liquid_audio.utils.snapshot_download` so absolute local paths from
  LocalAI's gallery resolve without a HF round-trip. soundfile in place of
  torchaudio.load/save (torchcodec drags NVIDIA NPP we don't bundle).
- backend/backend.proto + pkg/grpc/{backend,client,server,base,embed,
  interface}.go — new AudioToAudioStream RPC mirroring AudioTransformStream
  (config/frame/control oneof in; typed event+pcm+meta out).
- core/services/nodes/{health_mock,inflight}_test.go — add stubs for the
  new RPC to the test fakes.

Config + capabilities
- core/config/backend_capabilities.go — UsecaseRealtimeAudio, MethodAudio
  ToAudioStream, UsecaseInfoMap entry, liquid-audio BackendCapability row.
- core/config/model_config.go — FLAG_REALTIME_AUDIO bitmask, ModalityGroups
  membership in both speech-input and audio-output groups so a lone flag
  still reads as multimodal, GetAllModelConfigUsecases entry, GuessUsecases
  branch.

Realtime endpoint
- core/http/endpoints/openai/realtime.go — extract prepareRealtimeConfig()
  so the gate is unit-testable; accept realtime_audio models and self-fill
  empty pipeline slots with the model's own name (user-pinned slots win).
- core/http/endpoints/openai/realtime_gate_test.go — six specs covering nil
  cfg, empty pipeline, legacy pipeline, self-contained realtime_audio,
  user-pinned VAD slot, and partial legacy pipeline.

UI + endpoints
- core/http/routes/ui.go — /api/pipeline-models accepts either a legacy
  VAD+STT+LLM+TTS pipeline or a realtime_audio model; surfaces a
  self_contained flag so the Talk page can collapse the four cards.
- core/http/routes/ui_api.go — realtime_audio in usecaseFilters.
- core/http/routes/ui_pipeline_models_test.go — covers both code paths.
- core/http/react-ui/src/pages/Talk.jsx — self-contained badge instead of
  the four-slot grid; rename Edit Pipeline → Edit Model Config; less
  pipeline-specific wording.
- core/http/react-ui/src/pages/Models.jsx + locales/en/models.json — new
  realtime_audio filter button + i18n.
- core/http/react-ui/src/utils/capabilities.js — CAP_REALTIME_AUDIO.
- core/http/react-ui/src/pages/FineTune.jsx — voice + validation-dataset
  fields, surfaced when backend === liquid-audio, plumbed via
  extra_options on submit/export/import.

Gallery + importer
- gallery/liquid-audio.yaml — config template with known_usecases:
  [realtime_audio, chat, tts, transcript, vad].
- gallery/index.yaml — four model entries (realtime/chat/asr/tts) keyed by
  mode option. Fixed pre-existing `transcribe` typo on the asr entry
  (loader silently dropped the unknown string → entry never surfaced as a
  transcript model).
- gallery/lfm.yaml — function block for the LFM2 Pythonic tool-call format
  `<|tool_call_start|>[name(k="v")]<|tool_call_end|>` matching
  common_chat_params_init_lfm2 in vendored llama.cpp.
- core/gallery/importers/{liquid-audio,liquid-audio_test}.go — detector
  matches LFM2-Audio HF repos (excludes -gguf mirrors); mode/voice
  preferences plumbed through to options.
- core/gallery/importers/importers.go — register LiquidAudioImporter
  before LlamaCPPImporter.
- pkg/functions/parse_lfm2_test.go — seven specs for the response/argument
  regex pair on the LFM2 pythonic format.

Build matrix
- .github/backend-matrix.yml — seven liquid-audio targets (cuda12, cuda13,
  l4t-cuda-13, hipblas, intel, cpu amd64, cpu arm64). Jetpack r36 cuda-12
  is skipped (Ubuntu 22.04 / Python 3.10 incompatible with liquid-audio's
  3.12 floor).
- backend/index.yaml — anchor + 13 image entries.
- Makefile — .NOTPARALLEL, prepare-test-extra, test-extra,
  docker-build-liquid-audio.

Docs
- .agents/plans/liquid-audio-integration.md — phased plan; PR-D (real
  any-to-any wiring via AudioToAudioStream), PR-E (mid-audio tool-call
  detector), PR-G (GGUF entries once upstream llama.cpp PR #18641 lands)
  remain.
- .agents/api-endpoints-and-auth.md — expand the capability-surface
  checklist with every place a new FLAG_* needs to be registered.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): function calling + history cap for any-to-any models

Three pieces, all on the realtime_audio path that just landed:

1. liquid-audio backend (backend/python/liquid-audio/backend.py):
   - _build_chat_state grows a `tools_prelude` arg.
   - new _render_tools_prelude parses request.Tools (the OpenAI Chat
     Completions function array realtime.go already serialises) and
     emits an LFM2 `<|tool_list_start|>…<|tool_list_end|>` system turn
     ahead of the user history. Mirrors gallery/lfm.yaml's `function:`
     template so the model sees the same prompt shape whether served
     via llama-cpp or here. Without this the backend silently dropped
     tools — function calling was wired end-to-end on the Go side but
     the model never saw a tool list.

2. Realtime history cap (core/http/endpoints/openai/realtime.go):
   - Session grows MaxHistoryItems int; default picked by new
     defaultMaxHistoryItems(cfg) — 6 for realtime_audio models (LFM2.5
     1.5B degrades quickly past a handful of turns), 0/unlimited for
     legacy pipelines composing larger LLMs.
   - triggerResponse runs conv.Items through trimRealtimeItems before
     building conversationHistory. Helper walks the cut left if it
     would orphan a function_call_output, so tool result + call pairs
     stay intact.
   - realtime_gate_test.go: specs for defaultMaxHistoryItems and
     trimRealtimeItems (zero cap, under cap, over cap, tool-call pair
     preservation).

3. Talk page (core/http/react-ui/src/pages/Talk.jsx):
   - Reuses the chat page's MCP plumbing — useMCPClient hook,
     ClientMCPDropdown component, same auto-connect/disconnect effect
     pattern. No bespoke tool registry, no new REST endpoints; tools
     come from whichever MCP servers the user toggles on, exactly as
     on the chat page.
   - sendSessionUpdate now passes session.tools=getToolsForLLM(); the
     update re-fires when the active server set changes mid-session.
   - New response.function_call_arguments.done handler executes via
     the hook's executeTool (which round-trips through the MCP client
     SDK), then replies with conversation.item.create
     {type:function_call_output} + response.create so the model
     completes its turn with the tool output. Mirrors chat's
     client-side agentic loop, translated to the realtime wire shape.

UI changes require a LocalAI image rebuild (Dockerfile:308-313 bakes
react-ui/dist into the runtime image). Backend.py changes can be
swapped live in /backends/<id>/backend.py + /backend/shutdown.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): LocalAI Assistant ("Manage Mode") for the Talk page

Mirrors the chat-page metadata.localai_assistant flow so users can ask the
realtime model what's loaded / installed / configured. Tools are run
server-side via the same in-process MCP holder that powers the chat
modality — no transport switch, no proxy, no new wire protocol.

Wire:
- core/http/endpoints/openai/realtime.go:
  - RealtimeSessionOptions{LocalAIAssistant,IsAdmin}; isCurrentUserAdmin
    helper mirrors chat.go's requireAssistantAccess (no-op when auth
    disabled, else requires auth.RoleAdmin).
  - Session grows AssistantExecutor mcpTools.ToolExecutor.
  - runRealtimeSession, when opts.LocalAIAssistant is set: gate on admin,
    fail closed if DisableLocalAIAssistant or the holder has no tools,
    DiscoverTools and inject into session.Tools, prepend
    holder.SystemPrompt() to instructions.
  - Tool-call dispatch loop: when AssistantExecutor.IsTool(name), run
    ExecuteTool inproc, append a FunctionCallOutput to conv.Items, skip
    the function_call_arguments client emit (the client can't execute
    these — it doesn't know about them). After the loop, if any
    assistant tool ran, trigger another response so the model speaks the
    result. Mirrors chat's agentic loop, driven server-side rather than
    via client round-trip.

- core/http/endpoints/openai/realtime_webrtc.go: RealtimeCallRequest
  gains `localai_assistant` (JSON omitempty). Handshake calls
  isCurrentUserAdmin and builds RealtimeSessionOptions.

- core/http/react-ui/src/pages/Talk.jsx: admin-only "Manage Mode"
  checkbox under the Tools dropdown; passes localai_assistant: true to
  realtimeApi.call's body, captured in the connect callback's deps.

Mirroring chat's pattern means the in-process MCP tools surface "just
works" for the Talk page without exposing a Streamable-HTTP MCP endpoint
(which was the alternative). Clients with their own MCP servers can
still use the existing ClientMCPDropdown path in parallel; the realtime
handler distinguishes them by AssistantExecutor.IsTool() at dispatch
time.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): render Manage Mode tool calls in the Talk transcript

Previously the realtime endpoint only emitted response.output_item.added
for the FunctionCall item, and Talk.jsx's switch ignored the event — so
server-side tool runs were invisible in the UI. The model would speak
the result but the user had no way to see what tool was actually
called.

realtime.go: after executing an assistant tool inproc, emit a second
output_item.added/.done pair for the FunctionCallOutput item. Mirrors
the way the chat page displays tool_call + tool_result blocks.

Talk.jsx: handle both response.output_item.added and .done. Render
FunctionCall (with arguments) and FunctionCallOutput (pretty-printed
JSON when possible) as two transcript entries — `tool_call` with the
wrench icon, `tool_result` with the clipboard icon, both in mono-space
secondary-colour. Resets streamingRef after the result so the next
assistant text delta starts a fresh transcript entry instead of
appending to the previous turn.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* refactor(realtime): bound the Manage Mode tool-loop + preserve assistant tools

Fallout from a review pass on the Manage Mode patches:

- Bound the server-side agentic loop. triggerResponse used to recurse on
  executedAssistantTool with no cap — a model that kept calling tools
  would blow the goroutine stack. New maxAssistantToolTurns = 10 (mirrors
  useChat.js's maxToolTurns). Public triggerResponse is now a thin shim
  over triggerResponseAtTurn(toolTurn int); recursion increments the
  counter and stops at the cap with an xlog.Warn.

- Preserve Manage Mode tools across client session.update. The handler
  used to blindly overwrite session.Tools, so toggling a client MCP
  server mid-session silently wiped the in-process admin tools. Session
  now caches the original AssistantTools slice at session creation and
  the session.update handler merges them back in (client names win on
  collision — the client is explicit).

- strconv.ParseBool for the localai_assistant query param instead of
  hand-rolled "1" || "true". Mirrors LocalAIAssistantFromMetadata.

- Talk.jsx: render both tool_call and tool_result on
  response.output_item.done instead of splitting them across .added and
  .done. The server's event pairing (added → done) stays correct; the
  UI just doesn't need to inspect both phases of the same item. One
  switch case instead of two, no behavioural change.

Out of scope (noted for follow-ups): extract a shared assistant-tools
helper between chat.go and realtime.go (duplication is small enough
that two parallel implementations stay readable for now), and an i18n
key for the Manage Mode helper text (Talk.jsx doesn't use i18n
anywhere else yet).

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(test-extra): wire liquid-audio backend smoke test

The backend ships test.py + a `make test` target and is listed in
backend-matrix.yml, so scripts/changed-backends.js already writes a
`liquid-audio=true|false` output when files under backend/python/liquid-audio/
change. The workflow just wasn't reading it.

- Expose the `liquid-audio` output on the detect-changes job
- Add a tests-liquid-audio job that runs `make` + `make test` in
  backend/python/liquid-audio, gated on the per-backend detect flag

The smoke covers Health() and LoadModel(mode:finetune); fine-tune mode
short-circuits before any HuggingFace download (backend.py:192), so the
job needs neither weights nor a GPU. The full-inference path remains
gated on LIQUID_AUDIO_MODEL_ID, which CI doesn't set.

The four new Go test files (core/gallery/importers/liquid-audio_test.go,
core/http/endpoints/openai/realtime_gate_test.go,
core/http/routes/ui_pipeline_models_test.go, pkg/functions/parse_lfm2_test.go)
are already picked up by the existing test.yml workflow via `make test` →
`ginkgo -r ./pkg/... ./core/...`; their packages all carry RunSpecs entries.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-13 21:57:27 +02:00
Andreas Egli
a2940e5d47 feat: also parse VRAM budget/usage from vulkaninfo (#9800)
Signed-off-by: Andreas Egli <github@kharan.ch>
2026-05-13 21:43:12 +02:00
LocalAI [bot]
a645c1f4aa chore: ⬆️ Update ggml-org/llama.cpp to a9883db8ee021cf16783016a60996d41820b5195 (#9796)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:40:31 +02:00
LocalAI [bot]
957619af53 chore: ⬆️ Update ikawrakow/ik_llama.cpp to f9a93c37e2fc021760c3c1aa99cf74c73b7591a7 (#9795)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:48 +02:00
LocalAI [bot]
ad0ab37230 docs: ⬆️ update docs version mudler/LocalAI (#9792)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:37 +02:00
LocalAI [bot]
0b81e36504 chore: ⬆️ Update antirez/ds4 to f8b4ed635d559b3a5b44bf2df6a77e21b3e9178f (#9794)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:09 +02:00
LocalAI [bot]
602866a9d8 chore: ⬆️ Update ggml-org/whisper.cpp to 338cce1e58133261753243802a0e7a430118866d (#9793)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:39:57 +02:00
Ettore Di Giacinto
8521af145f ci(merge): source per-arch digests from ci-cache, not local-ai-backends
Follow-up to PR #9781. v4.2.2 (run 25745181433) showed the keepalive
anchor in ci-cache wasn't enough on its own: 19 of 37 multiarch merges
still failed with "manifest not found" for the same digests we'd just
anchored.

Quay's manifest GC is per-repository. The anchor tag in ci-cache
protects the manifest copy that lives in ci-cache, but the same digest
in local-ai-backends is independently tracked and gets reaped because
nothing in local-ai-backends references it (push-by-digest=true leaves
it untagged). The merge then asks
`local-ai-backends@sha256:<digest>` and quay correctly says "not found"
in that repo, even though `ci-cache@sha256:<digest>` is alive and well.

Empirical confirmation against a live failed digest from v4.2.2:

  $ docker buildx imagetools inspect quay.io/go-skynet/ci-cache@sha256:05377fe6...
  Name: quay.io/go-skynet/ci-cache@sha256:05377fe6...
  MediaType: application/vnd.docker.distribution.manifest.v2+json
  $ docker buildx imagetools inspect quay.io/go-skynet/local-ai-backends@sha256:05377fe6...
  ERROR: ... not found

Switch the source of the quay merge step to ci-cache. The blobs the
manifest references are already accessible from local-ai-backends
(verified via direct registry HEAD: HTTP 200 from both repos — the
original push cross-mounted blobs at content-addressable storage time
and they outlive the per-repo manifest GC). buildx imagetools create
republishes the manifest into local-ai-backends, then writes the
user-facing manifest list pointing at it. End state is self-contained:
the published manifest list references child manifests by digest only,
no embedded reference to ci-cache.

Dockerhub merge step is unchanged. Dockerhub's GC isn't aggressive
enough to reap untagged manifests at the timescales we operate on
(verified: localai/localai-backends@<same digest> still resolves cleanly
after >24h).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 22:35:55 +00:00
LocalAI [bot]
bc4cd3dd85 feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.

Adapt the grpc-server wrapper accordingly:

  * `common_params_speculative::type` (single enum) became `types`
    (`std::vector<common_speculative_type>`). Update both the
    "default to draft when a draft model is set" branch and the
    `spec_type`/`speculative_type` option parser. The parser now also
    tolerates comma-separated lists, mirroring the upstream
    `common_speculative_types_from_names` semantics.
  * `common_params_speculative_draft::n_ctx` is gone (draft now shares
    the target context size). Keep the `draft_ctx_size` option name for
    backward compatibility and ignore the value rather than failing.
  * `server_context_impl::model` was renamed to `model_tgt`; update the
    two reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:

  * `ctx_server.impl->model_tgt` (fork still has `model`)
  * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
    (none of these sub-structs exist in the fork)
  * `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
    tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
  * `params.speculative.types` vector / `common_speculative_types_from_names`
    (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
   discriminations (the "default to draft when a draft model is set" branch
   and the `spec_type` / `speculative_type` option parser) fall back to the
   singular scalar form, and the entire new-option block (ngram_mod / map_k
   / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
   tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
   in the source tree — stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

   Both patches are idempotent and follow the existing sed/awk pattern in
   this script (KV cache types, `get_media_marker`, flat speculative
   renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                  // legacy: chain ends here
#else
    } else if (... "spec_ngram_mod_n_min") {  // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                  // closes last branch
#endif
    }                                  // closes for-loop

Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

  Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
  list of gfx targets e.g. gfx1100,gfx1101.  Stop.
  make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:37 +02:00
LocalAI [bot]
86a7f6c9fa ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 (#9781)
* ci: close the GC race + cascade-skip + darwin grpc gaps from v4.2.1

v4.2.1's backend.yml run (#25701862853) exposed three independent issues
on top of the singletons fix shipped in ea001995. Address all three plus
two related cleanups:

1. quay GC race in backend-merge-jobs-multiarch (12/37 merges failed with
   "manifest not found"). Even after PR #9746 split multi/single-arch
   merges, the multiarch matrix itself takes ~2h to drain at
   max-parallel: 8, and the earliest per-arch digests (push-by-digest,
   no tag) get reaped by quay's GC before the merge runs. The split
   bounded the race for multiarch; it doesn't eliminate it. Anchor each
   per-arch digest immediately to a tag in the internal ci-cache image
   (`keepalive-<run_id><tag-suffix>-<platform-tag>`). Quay won't GC
   tagged manifests. backend_merge.yml deletes the keepalive tags via
   quay REST API after publishing the user-facing manifest list.
   Cleanup is best-effort: if the quay token is not OAuth-scoped the
   merge does NOT fail, the orphan tags just persist.

2. cascade-skip on backend-merge-jobs-singlearch. v4.2.1 had 2 failed
   and 2 cancelled singlearch builds (out of 199); GHA's default
   `needs:` semantics cascade-skipped the entire singlearch merge
   matrix, so zero singleton tags were applied even though 197
   singletons built successfully. Wrap the merge `if:` in
   `!cancelled() && ...` for both multi and single arch in backend.yml
   and backend_pr.yml so partial build failures publish the successful
   tag-suffixes.

3. Darwin llama-cpp grpc-server build fails with `find_package(absl)`
   not found. Same shape as the ccache/blake3/fmt/hiredis/xxhash/zstd
   fix already in `Dependencies`: a brew cache hit restores
   `/opt/homebrew/Cellar/grpc` so `brew install grpc` no-ops, but
   abseil isn't in our Cellar cache list and never gets installed
   alongside, leaving grpc's CMake unable to resolve it. Mirror the
   `brew reinstall ccache` line with `brew reinstall grpc` to
   re-validate grpc's full transitive dep closure on every cache-hit
   run.

4. Move the four heaviest CUDA cpp builds back to bigger-runner. v4.2.1
   wall-clock: -gpu-nvidia-cuda-12-llama-cpp 5h36m,
   -gpu-nvidia-cuda-12-turboquant 6h05m,
   -gpu-nvidia-cuda-13-llama-cpp 5h37m,
   -gpu-nvidia-cuda-13-turboquant 6h05m. The cuda-12 turboquant and
   cuda-13 turboquant entries are over GHA's 6h job timeout. Phase 5.3
   of the free-tier migration (PR #9730) had explicitly flagged this
   batch as 'highest-risk' with a per-entry revert path. All other
   matrix entries (vulkan-llama-cpp ~47m, ROCm hipblas-llama-cpp ~2h,
   intel sycl-f32 ~1h49m) stay on free-tier ubuntu-latest.

Verified locally: all six edited workflow YAMLs parse cleanly. Real
verification has to come from the next tag release run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: extract keepalive anchor + cleanup into .github/scripts/

The two inline shell blocks from the previous commit are long enough to
hurt readability of the workflow YAML and benefit from their own files
with self-contained docs. Move them to .github/scripts/:

  anchor-digest-in-cache.sh    backend_build.yml's keepalive anchor
  cleanup-keepalive-tags.sh    backend_merge.yml's best-effort cleanup

Workflow steps reduce to a single `run:` invocation each, with all the
parameter plumbing handled by env vars on the step. backend_merge.yml
also gains a sparse `actions/checkout@v6` step (sparse to .github/scripts
only) so the cleanup script is available on the runner — backend_build
already checks out for the docker build.

Net workflow diff: -36 lines across the two files. Script logic and
behavior are byte-identical to the inline version.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:09 +02:00
LocalAI [bot]
a57e73691d fix(ollama): accept prompt alias on /api/embed for Ollama parity (#9780)
Ollama's embedding endpoint accepts both `input` and `prompt` as the
input string value (see ollama/ollama docs/api.md#generate-embeddings).
LocalAI only accepted `input`, which broke client libraries that send
the `prompt` form.

Add `Prompt` to OllamaEmbedRequest and have GetInputStrings fall back
to it when Input is unset. Input still wins when both are provided.

Fixes #9767.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:21:20 +02:00
dependabot[bot]
a689100d61 chore(deps): bump the npm_and_yarn group across 1 directory with 3 updates (#9728)
Bumps the npm_and_yarn group with 3 updates in the /core/http/react-ui directory: [fast-uri](https://github.com/fastify/fast-uri), [hono](https://github.com/honojs/hono) and [ip-address](https://github.com/beaugunderson/ip-address).


Updates `fast-uri` from 3.1.0 to 3.1.2
- [Release notes](https://github.com/fastify/fast-uri/releases)
- [Commits](https://github.com/fastify/fast-uri/compare/v3.1.0...v3.1.2)

Updates `hono` from 4.12.14 to 4.12.18
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.14...v4.12.18)

Updates `ip-address` from 10.1.0 to 10.2.0
- [Commits](https://github.com/beaugunderson/ip-address/commits)

---
updated-dependencies:
- dependency-name: fast-uri
  dependency-version: 3.1.2
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: hono
  dependency-version: 4.12.18
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: ip-address
  dependency-version: 10.2.0
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:54:38 +02:00
Andreas Egli
03815e3b59 fix: parse vulkan VRAM from text (#9669)
* fix: parse vulkan VRAM from text

Assisted-by: opencode:gpt-5.5
Signed-off-by: Andreas Egli <github@kharan.ch>

* fix: replace string.split with streaming iteration

Assisted-by: Opencode:Gemma4
Signed-off-by: Andreas Egli <github@kharan.ch>

---------

Signed-off-by: Andreas Egli <github@kharan.ch>
2026-05-12 09:53:48 +02:00
dependabot[bot]
37991c8a18 chore(deps): bump github.com/mudler/edgevpn from 0.31.1 to 0.32.2 (#9773)
Bumps [github.com/mudler/edgevpn](https://github.com/mudler/edgevpn) from 0.31.1 to 0.32.2.
- [Release notes](https://github.com/mudler/edgevpn/releases)
- [Commits](https://github.com/mudler/edgevpn/compare/v0.31.1...v0.32.2)

---
updated-dependencies:
- dependency-name: github.com/mudler/edgevpn
  dependency-version: 0.32.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:51:39 +02:00
dependabot[bot]
61c9b187fa chore(deps): update charset-normalizer requirement from >=3.4.0 to >=3.4.7 in /backend/python/vllm (#9779)
chore(deps): update charset-normalizer requirement

Updates the requirements on [charset-normalizer](https://github.com/jawah/charset_normalizer) to permit the latest version.
- [Release notes](https://github.com/jawah/charset_normalizer/releases)
- [Changelog](https://github.com/jawah/charset_normalizer/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jawah/charset_normalizer/compare/3.4.0...3.4.7)

---
updated-dependencies:
- dependency-name: charset-normalizer
  dependency-version: 3.4.7
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:22:23 +02:00
dependabot[bot]
c66014312e chore(deps): bump github.com/fsnotify/fsnotify from 1.9.0 to 1.10.1 (#9778)
Bumps [github.com/fsnotify/fsnotify](https://github.com/fsnotify/fsnotify) from 1.9.0 to 1.10.1.
- [Release notes](https://github.com/fsnotify/fsnotify/releases)
- [Changelog](https://github.com/fsnotify/fsnotify/blob/main/CHANGELOG.md)
- [Commits](https://github.com/fsnotify/fsnotify/compare/v1.9.0...v1.10.1)

---
updated-dependencies:
- dependency-name: github.com/fsnotify/fsnotify
  dependency-version: 1.10.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:21:18 +02:00
dependabot[bot]
abc2a51641 chore(deps): update transformers requirement from >=5.0.0 to >=5.8.0 in /backend/python/transformers (#9775)
chore(deps): update transformers requirement

Updates the requirements on [transformers](https://github.com/huggingface/transformers) to permit the latest version.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v5.0.0...v5.8.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.8.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:21:05 +02:00
dependabot[bot]
cd7d163178 chore(deps): bump github.com/onsi/gomega from 1.39.1 to 1.40.0 (#9774)
Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.39.1 to 1.40.0.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](https://github.com/onsi/gomega/compare/v1.39.1...v1.40.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-version: 1.40.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:36 +02:00
dependabot[bot]
7aac599deb chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.27.0 to 1.42.0 (#9772)
chore(deps): bump github.com/anthropics/anthropic-sdk-go

Bumps [github.com/anthropics/anthropic-sdk-go](https://github.com/anthropics/anthropic-sdk-go) from 1.27.0 to 1.42.0.
- [Release notes](https://github.com/anthropics/anthropic-sdk-go/releases)
- [Changelog](https://github.com/anthropics/anthropic-sdk-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/anthropics/anthropic-sdk-go/compare/v1.27.0...v1.42.0)

---
updated-dependencies:
- dependency-name: github.com/anthropics/anthropic-sdk-go
  dependency-version: 1.42.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:24 +02:00
dependabot[bot]
d75173dd2a chore(deps): bump actions/download-artifact from 4 to 8 (#9771)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 8.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v4...v8)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:14 +02:00
dependabot[bot]
9be5310394 chore(deps): bump actions/upload-artifact from 4 to 7 (#9770)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:03 +02:00
dependabot[bot]
cdf50fd723 chore(deps): bump node from 25-slim to 26-slim (#9769)
Bumps node from 25-slim to 26-slim.

---
updated-dependencies:
- dependency-name: node
  dependency-version: 26-slim
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:19:51 +02:00
804 changed files with 72057 additions and 3753 deletions

View File

@@ -112,6 +112,8 @@ Add a YAML anchor definition in the `## metas` section (around line 2-300). Look
Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.
**Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.
## 4. Update the Makefile
The Makefile needs to be updated in several places to support building and testing the new backend:

View File

@@ -284,7 +284,17 @@ Also bump the expected-length count in `api_instructions_test.go` and add the na
### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)
If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), also declare the matching symbol in `core/http/react-ui/src/utils/capabilities.js`:
If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), you need to update **all** of:
- `Usecase<Name>` string constant in `core/config/backend_capabilities.go`
- `UsecaseInfoMap` entry mapping the string to its flag + gRPC method
- `FLAG_<NAME>` bitmask in `core/config/model_config.go`
- `GetAllModelConfigUsecases()` map entry (otherwise the YAML loader silently ignores the string)
- `ModalityGroups` membership if the flag should affect `IsMultimodal()` (e.g. realtime_audio is in both speech-input and audio-output groups so a lone flag still reads as multimodal)
- `GuessUsecases()` branch listing the backends that own this capability
- `usecaseFilters` in `core/http/routes/ui_api.go` (drives the gallery filter dropdown)
- `Models.jsx` `FILTERS` array + matching `filters.<camelCase>` i18n key in `core/http/react-ui/public/locales/en/models.json`
- `core/http/react-ui/src/utils/capabilities.js`:
```js
export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY'

126
.agents/backend-signing.md Normal file
View File

@@ -0,0 +1,126 @@
# Backend image signing & verification
LocalAI verifies backend OCI images against a per-gallery keyless-cosign
policy. This page documents the trust model, the producer side
(`.github/workflows/backend_merge.yml` in this repo), and the consumer
side (`pkg/oci/cosignverify` plus the gallery YAML).
## Trust model
- **Producer:** `.github/workflows/backend_merge.yml` signs each pushed
manifest list with `cosign sign --recursive` in keyless mode after
`docker buildx imagetools create`. The signing cert is issued by
Fulcio bound to the workflow's OIDC identity. There is no long-lived
signing key. `--recursive` signs both the manifest list and every
per-arch entry — needed because our consumer resolves a tag to a
per-arch manifest before checking signatures.
- **Storage:** Signatures are written as OCI 1.1 referrers
(`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
(current cosign releases do this by default; no `--new-bundle-format`
flag). No `:sha256-<hex>.sig` tag clutter.
- **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
referrers API, hands it to `sigstore-go`, and verifies it against the
policy declared in the gallery YAML (`Gallery.Verification`).
- **Revocation:** Keyless cosign certs are ephemeral (10-minute Fulcio
validity), so revocation is policy-side, not CA-side. The gallery's
`verification.not_before` (RFC3339) is the kill-switch — advance it to
invalidate every signature produced before a known compromise window.
## Producer setup
`backend_merge.yml` is the workflow that joins per-arch digests into the
multi-arch manifest list users actually pull, so it's also the right place
to sign. The job needs:
- `permissions: { id-token: write, contents: read }` at the job level so
the runner can exchange its GitHub OIDC token for a Fulcio cert.
- `sigstore/cosign-installer@v3` step (current cosign releases already
default to the new bundle format).
- After each `docker buildx imagetools create`, resolve the resulting
list digest with `docker buildx imagetools inspect <tag> --format
'{{.Manifest.Digest}}'` and sign:
```sh
cosign sign --yes --recursive \
--registry-referrers-mode=oci-1-1 \
"${REGISTRY_REPO}@${DIGEST}"
```
Sign by digest, never by tag — signing by tag binds the signature to
whatever the tag points at *now*, and a subsequent tag push orphans it.
`--registry-referrers-mode=oci-1-1` is still gated behind
`COSIGN_EXPERIMENTAL=1` in cosign v2.4.x (set at the job env level in
`backend_merge.yml`). Re-evaluate when bumping the pinned cosign release
— newer versions are expected to graduate this flag and the env var can
then be dropped.
`backend_build_darwin.yml` builds and pushes single-arch darwin images
that bypass the manifest-list merge. If/when those entries get a gallery
`verification:` policy, the equivalent cosign step has to land there
too.
## Consumer setup (in `mudler/LocalAI` gallery YAML)
Once CI is signing, add a `verification:` block to the backend gallery
entry (`backend/index.yaml`):
```yaml
- name: localai
url: github:mudler/LocalAI/backend/index.yaml@master
verification:
issuer: "https://token.actions.githubusercontent.com"
identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@refs/heads/master$"
# Optional revocation cutoff; advance during incident response.
# not_before: "2026-06-01T00:00:00Z"
```
Identity matching pins the OIDC subject Fulcio issued the signing cert
to. Without this, any image signed by *anyone* with a Fulcio cert would
pass — the regex is what makes a signature mean "produced by our CI".
## Strict mode
Default behaviour: OCI backends without a `verification:` block install
with a warning (logs include `installing OCI backend without signature
verification`). Tarball/HTTP backends without a `sha256` field log a
similar warning.
For production, set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=1` (or pass
`--require-backend-integrity` to `local-ai run` / `local-ai backends
install` / `local-ai models install`). The warning becomes a hard error
and unverifiable backends refuse to install.
## Revocation playbook
If `backend_merge.yml` (or any workflow with `id-token: write`) is
compromised and we've shipped malicious signed images:
1. **Identify the compromise window.** Find the earliest IntegratedTime
from the bad signatures (Rekor search by `subject` filter).
2. **Set `verification.not_before`** in `backend/index.yaml` to a
timestamp just *after* that window's start.
3. **Push the YAML.** Deployed LocalAI instances pick it up on next
gallery refresh (1-hour cache in `core/gallery/gallery.go`).
4. **Fix the underlying compromise** in the workflow and re-sign images
with the new build, which will have IntegratedTime > `not_before`.
5. **Optional:** for absolute decisiveness, also rotate to a new
workflow path (`backend_merge_v2.yml`) and update `identity_regex`.
## Where the code lives
- `pkg/oci/cosignverify/` — verifier, policy, OCI referrer fetch, NotBefore enforcement.
- `pkg/downloader/uri.go``WithImageVerifier` option threaded through `DownloadFileWithContext`.
- `core/gallery/backends.go``backendDownloadOptions` builds the verifier from the gallery's policy.
- `core/config/gallery.go``Gallery.Verification` YAML schema.
- `core/cli/run.go`, `core/cli/backends.go`, `core/cli/models.go``--require-backend-integrity` flag propagation.
- `.github/workflows/backend_merge.yml` — producer-side `cosign sign --recursive` after each multi-arch manifest list push.
## Out of scope (follow-ups)
- **Signing the gallery YAML itself.** The index is fetched over HTTPS
from GitHub; we trust the host. A cosign blob signature on the YAML
would close that gap but adds key-management overhead. Revisit this
page if/when added.
- **Tarball/HTTP backend signing.** Cosign can sign arbitrary blobs, but
for now non-OCI backends keep using the `sha256:` field in YAML.

View File

@@ -15,3 +15,35 @@ Let's say the user wants to build a particular backend for a given platform. For
- Unless the user specifies that they want you to run the command, then just print it because not all agent frontends handle long running jobs well and the output may overflow your context
- The user may say they want to build AMD or ROCM instead of hipblas, or Intel instead of SYCL or NVIDIA insted of l4t or cublas. Ask for confirmation if there is ambiguity.
- Sometimes the user may need extra parameters to be added to `docker build` (e.g. `--platform` for cross-platform builds or `--progress` to view the full logs), in which case you can generate the `docker build` command directly.
## Test coverage gate
The core Go suites (`./pkg`, `./core`, plus the in-process integration suite `./tests/e2e`) are covered by a **strict, monotonic coverage ratchet**:
- `make test-coverage` — runs the suites with `covermode=atomic` instrumentation and writes a merged profile to `coverage/coverage.out`. Uses the same prerequisites as `make test`.
- **`--coverpkg` (`COVERAGE_COVERPKG = core/...,pkg/...`):** coverage is attributed to the core+pkg packages, not just the package under test. This is what lets the in-process `tests/e2e` suite (which drives the real HTTP server over loopback via `application.New`) credit the `core/http/endpoints/...` handlers it exercises — folding it in roughly doubled endpoint coverage (e.g. `endpoints/openai` 13.6% → 52%). The denominator is therefore *all* of `core`+`pkg` (minus generated proto, dropped via `COVERAGE_EXCLUDE_RE`), so the number isn't comparable to a plain per-package figure.
- **Integration suites (`COVERAGE_E2E_ROOTS = ./tests/e2e`)** run non-recursively (excludes `tests/e2e/distributed`, which needs containers) with `--label-filter=!real-models` (those need a downloaded model) against the mock backend built by `prepare-test`. `tests/integration` is deliberately excluded — it needs `make backends/local-store`, which the coverage CI job doesn't build.
- **Flake note:** folding integration tests into a *strict* gate means a hard e2e failure (or a spec that silently stops running) can fail the coverage gate, not just the test. `--flake-attempts` absorbs transient retryable failures; covermode=atomic keeps line coverage deterministic otherwise.
- **Why one ginkgo run per root (`scripts/run-coverage.sh`):** passing several recursive roots to a *single* ginkgo invocation (e.g. `ginkgo -r ./pkg ./core`) only merges **one** root's coverprofile into `--output-dir`/`--coverprofile` — the others are silently dropped. Verified with ginkgo 2.29.0: `-r ./pkg ./core` yields only `./pkg` coverage, while `-r ./core` alone yields all 34 core packages. So the script runs each root separately and concatenates the (disjoint) profiles. Don't "simplify" it back to a single multi-root invocation — that's how `core/` (including all of `core/http`, ~7.4k statements) silently vanished from the number before.
- **Build tags (`COVERAGE_TAGS`, passed via `GINKGO_TAGS`):** defaults to `debug auth`. The `auth` tag is required to compile the real (sqlite-backed) auth implementation and its ~150 `//go:build auth` tests — without it those files aren't built, the tests don't run, and the gate scores auth against a stub (~3.7% instead of ~38%). If you add new tag-gated tests, extend `COVERAGE_TAGS` or they won't count (and likely won't run in CI at all).
- `make test-coverage-check` — runs `test-coverage`, then `scripts/coverage-check.sh` fails the build if total coverage is **below** the committed baseline in `coverage-baseline.txt`. The Linux job in `.github/workflows/test.yml` runs this instead of `make test`.
- `make test-coverage-baseline` — regenerates and overwrites `coverage-baseline.txt` from the current run.
- `make install-hooks` — sets `core.hooksPath` to the versioned `.githooks/`, whose `pre-commit` runs checks scoped to what's staged: Go changes → `make lint` + `make test-coverage-check`; `core/http/react-ui/` changes → `make test-ui-coverage-check` (Playwright e2e + UI coverage gate). A commit touching neither is skipped; bypass with `git commit --no-verify`. The hook resolves golangci-lint's new-from base to `upstream/master``origin/master``master`, so it works from a fork clone where `origin/master` is stale (passed to `make lint` via `LINT_NEW_FROM`).
### React UI coverage
The React UI (`core/http/react-ui/`) has **no component/unit tests** — its only tests are the Playwright e2e specs in `e2e/`, which run against the real app served by `tests/e2e-ui/ui-test-server` (the dist is `//go:embed`ed, so the server is rebuilt per coverage run). Those specs do genuinely exercise the UI (clicks, `fill`, `setInputFiles`, `getByRole`/`getByText`, visibility/value assertions).
- `make test-ui-coverage` — builds an istanbul-instrumented bundle (`COVERAGE=true`, via `vite-plugin-istanbul` with `forceBuildInstrument: true` — the plugin skips production builds otherwise), re-embeds it into `ui-test-server` (the dist is `//go:embed`ed), runs the Playwright specs, and writes an `nyc` report to `core/http/react-ui/coverage/`. The specs import `{ test, expect }` from `e2e/coverage-fixtures.js` (re-exports Playwright's, plus harvests `window.__coverage__` into `.nyc_output/` after each test). Instrumentation is off unless `COVERAGE=true`, so dev/prod builds and plain `make test-ui-e2e` are unaffected (the fixture no-ops when `window.__coverage__` is absent).
- **Browser:** the flake dev shell ships `chromium` and exports `PLAYWRIGHT_CHROMIUM_PATH`; `playwright.config.js` uses it via `launchOptions.executablePath`, and the Makefile skips `playwright install` when it's set. This avoids Playwright's downloaded browser, which can't resolve system libs (`libglib-2.0`, …) on NixOS. In CI (no `PLAYWRIGHT_CHROMIUM_PATH`) the Makefile falls back to `playwright install --with-deps chromium`.
- The app is a React SPA, so coverage accumulates across in-app navigation within a test; a full `page.goto`/reload resets it.
- `.nycrc.json` uses `all: true`, so **every `src/**` file is in the report**, including 0%-coverage ones — that's how you spot features with no test at all (sort the HTML report or `coverage-summary.json` by line% ascending).
- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. Runs in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.
- **Why it has a tolerance (unlike the strict Go gate):** UI e2e coverage is *non-deterministic*. Specs that assert on state and end while async/lazy render work is still in flight collect those lines only when the render beats the coverage teardown — so the total drifts with machine speed/load (a fast local box reads higher than a slow CI runner), diffusely across many specs. The tolerance absorbs that drift, so set the baseline *below* the slow-CI floor, never to a fast-local `make test-ui-coverage-baseline` number, or CI flaps.
- **Raising coverage is cheap:** a *render-smoke* spec (navigate to a route, assert its header renders) mounts a lazy page and runs its full render + initial effects, capturing most of its lines in a few lines of test — see `e2e/page-render-smoke.spec.js`. Auth is disabled in the test server (`isAdmin=true`), so `RequireAdmin`/`RequireFeature` routes render without a mock. The most *deterministic* win is removing a race: make a spec `await` a rendered element before ending (see `e2e/agents.spec.js` → AgentCreate) so its lines count every run.
Rules (both gates):
- **Install the hooks:** `make install-hooks` once per clone so lint + coverage run pre-commit. Don't lean on CI for what the hook catches.
- **Don't work around the gate:** never `git commit --no-verify`, and never hand-lower a baseline or widen a tolerance to turn a red gate green. The ratchet only moves up.
- If a change drops coverage, **add tests** (sort `coverage-summary.json` by line% ascending to find untested code) rather than editing the baseline. When coverage legitimately rises, commit the regenerated baseline (`make test-coverage-baseline` / `test-ui-coverage-baseline`).
- The Go gate is **strict — no tolerance**; `covermode=atomic` keeps it deterministic. The UI gate keeps a small tolerance only because its e2e coverage isn't.

View File

@@ -50,6 +50,17 @@ Do not mix styles within a package. If you are extending tests in a package that
This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).
## Outbound HTTP
All outbound HTTP must go through `github.com/mudler/LocalAI/pkg/httpclient` rather than the standard library's default client. Use `httpclient.New(...)` (no body deadline — safe for streaming/SSE) or `httpclient.NewWithTimeout(d, ...)` (simple request/response). Both **refuse redirects by default** and set a TLS 1.2 floor.
The reason is GHSA-3mj3-57v2-4636: the std default client follows redirects, and on a *cross-host* redirect Go forwards custom credential headers (e.g. Anthropic's `x-api-key`) to the redirect target, leaking the secret. `httpclient` fails closed instead.
- Need to follow redirects (download CDNs, registry blobs, GitHub asset URLs)? Pass `httpclient.WithFollowRedirects()` — it still strips credential headers on any cross-host hop.
- Have a custom transport (IP-pinned dialer, HTTP/2 tuning, a credential-injecting `RoundTripper`)? Pass `httpclient.WithTransport(rt)`, basing the transport on `httpclient.HardenedTransport()` to keep the TLS floor. Handed a `*http.Client` by a library? `httpclient.Harden(c)` applies the policy in place.
This is enforced by `forbidigo` (see `.golangci.yml`): `http.DefaultClient` and `http.Get`/`Post`/`PostForm`/`Head` are flagged. The `&http.Client{}` composite literal can't be matched precisely by forbidigo without also flagging legitimate `*http.Client` type references, so that form is caught by review — don't construct raw clients.
## Documentation
The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.

View File

@@ -68,6 +68,34 @@ go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
CI does not load the model; the suite is opt-in via env vars.
## Distributed mode
ds4 supports **layer-split** distributed inference (a model too big for one host,
split by transformer layer; the GGUF must be present on every machine, each loads
only its slice). Topology is **inverted** vs llama.cpp: the coordinator listens,
workers dial in.
- **`ds4-worker` binary**: built and packaged next to `grpc-server` (`package.sh`
copies it into `package/`). Links the same engine objects plus `ds4_distributed.o`;
**no gRPC/protobuf dependency** (speaks ds4's own TCP transport), so it builds
even where `grpc-server` can't. Runs the worker serving loop (`ds4_dist_run`).
- **Coordinator wiring**: the ds4 `grpc-server` acts as coordinator when `LoadModel`
`ModelOptions.Options` (from model-YAML `options:`) carry:
- `ds4_role:coordinator` (enables distributed mode; absent → single-node, back-compat)
- `ds4_layers:0:19` (coordinator's own slice, inclusive; `N:output` includes the head)
- `ds4_listen:0.0.0.0:1234` (address workers dial into)
- `ds4_route_timeout:60` (optional; seconds Predict/PredictStream wait for the route
to form before returning gRPC `UNAVAILABLE`; default 60)
- **Worker CLI**: `local-ai worker ds4-distributed -- <ds4-worker args>` resolves the
ds4 backend and execs the packaged `ds4-worker` (raw passthrough), e.g.
`--role worker --model /models/ds4flash.gguf --layers 20:output --coordinator <host> 1234`.
Opt-in e2e in `tests/e2e-backends/backend_test.go`, gated by
`BACKEND_TEST_DS4_DISTRIBUTED=1` (plus `BACKEND_TEST_DS4_WORKER_BINARY`,
`BACKEND_TEST_DS4_WORKER_LAYERS`, `BACKEND_TEST_DS4_COORDINATOR_LAYERS`,
`BACKEND_TEST_DS4_LISTEN`). Design spec:
`docs/superpowers/specs/2026-05-30-ds4-distributed-inference-design.md`.
## Importer
`core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by

View File

@@ -61,6 +61,12 @@ Always check `llama.cpp` for new model configuration options that should be supp
- `reasoning_format` - Reasoning format options
- Any new flags or parameters
### Speculative Decoding Types
The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.
`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
### Implementation Guidelines
1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation

View File

@@ -4,6 +4,7 @@
.devcontainer
models
backends
volumes
examples/chatbot-ui/models
backend/go/image/stablediffusion-ggml/build/
backend/go/*/build
@@ -21,3 +22,27 @@ __pycache__
# backend virtual environments
**/venv
backend/python/**/source
# In-place llama.cpp clone + per-variant build copies. The Makefile
# clones llama.cpp itself at the pinned LLAMA_VERSION; if a stale
# local checkout is COPY'd into the image, the `llama.cpp:` target
# sees the directory and skips re-cloning, so grpc-server.cpp ends
# up compiled against whatever (likely older) commit the host had.
backend/cpp/llama-cpp/llama.cpp
backend/cpp/llama-cpp-*-build
# Rust backend build output (sources are tracked; target/ is generated)
backend/rust/*/target
# Local-only artifacts that bloat the build context but the image never needs.
# Saved image tarballs, locally-installed backends, the host-built binary, and
# assorted tool/scratch dirs. None of these are git-tracked.
backend-images
local-backends
local-ai
.crush
protoc
tests
# Installed via npm inside the build stage; no need to ship the host copy.
**/node_modules

60
.githooks/pre-commit Executable file
View File

@@ -0,0 +1,60 @@
#!/usr/bin/env sh
#
# LocalAI pre-commit hook. Install it (once per clone) with:
#
# make install-hooks
#
# Runs only the checks relevant to what's staged:
# - Go files -> make lint + make test-coverage-check
# - core/http/react-ui -> make test-ui-coverage-check (Playwright e2e + gate)
# A commit touching neither is skipped entirely (docs/YAML/etc. can't change
# lint findings, Go coverage, or the UI).
#
# To bypass for a single commit (e.g. a WIP checkpoint): git commit --no-verify
set -eu
repo_root="$(git rev-parse --show-toplevel)"
cd "$repo_root"
staged="$(git diff --cached --name-only --diff-filter=ACMRD)"
go_changed=0
ui_changed=0
if echo "$staged" | grep -qE '\.go$'; then go_changed=1; fi
if echo "$staged" | grep -qE '^core/http/react-ui/'; then ui_changed=1; fi
if [ "$go_changed" -eq 0 ] && [ "$ui_changed" -eq 0 ]; then
echo "pre-commit: no Go or React UI changes staged — skipping."
exit 0
fi
if [ "$go_changed" -eq 1 ]; then
# Resolve the ref golangci-lint's new-from-merge-base should compare
# against. .golangci.yml pins origin/master, which is correct in CI
# (origin == the canonical repo) but wrong from a fork clone, where
# origin/master lags behind and lint would report the whole upstream
# backlog. Prefer upstream/master, then origin/master, then master.
lint_base=""
for ref in upstream/master origin/master master; do
if git rev-parse --verify --quiet "${ref}^{commit}" >/dev/null 2>&1; then
lint_base="$ref"
break
fi
done
echo "pre-commit ▶ golangci-lint (make lint${lint_base:+, new-from $lint_base})"
make lint LINT_NEW_FROM="$lint_base"
echo "pre-commit ▶ coverage gate (make test-coverage-check) — builds and runs the"
echo " pkg/core suites plus tests/e2e; can take a few minutes."
make test-coverage-check
fi
if [ "$ui_changed" -eq 1 ]; then
echo "pre-commit ▶ React UI e2e + coverage gate (make test-ui-coverage-check) —"
echo " rebuilds the UI + ui-test-server, runs the Playwright specs, and"
echo " fails if line coverage regressed; can take a couple of minutes."
make test-ui-coverage-check
fi
echo "pre-commit ✓ all relevant checks passed"

View File

@@ -278,6 +278,19 @@ include:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-liquid-audio'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -389,7 +402,12 @@ include:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-llama-cpp'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
runs-on: 'ubuntu-latest'
# bigger-runner: cold builds for this entry consistently take 5h+ on
# ubuntu-latest (observed 5h36m on v4.2.1). Move back to bigger-runner
# so the build finishes well within GHA's 6h job timeout. Phase 5.3 of
# the free-tier migration (PR #9730) flipped this to ubuntu-latest as
# a 'highest-risk batch' with explicit per-entry revert.
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "llama-cpp"
@@ -403,7 +421,9 @@ include:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-turboquant'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
runs-on: 'ubuntu-latest'
# bigger-runner: same rationale as -gpu-nvidia-cuda-12-llama-cpp above
# (observed 6h5m wall-clock on v4.2.1, just past the 6h job timeout).
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "turboquant"
@@ -670,6 +690,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-rfdetr-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -683,6 +716,32 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -801,6 +860,19 @@ include:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-liquid-audio'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -899,7 +971,9 @@ include:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
runs-on: 'ubuntu-latest'
# bigger-runner: cold builds for this entry take 5h+ on ubuntu-latest
# (observed 5h37m on v4.2.1). Same rationale as the cuda-12 variant.
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "llama-cpp"
@@ -913,7 +987,8 @@ include:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-turboquant'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
runs-on: 'ubuntu-latest'
# bigger-runner: observed 6h5m wall-clock on v4.2.1 — at the GHA timeout.
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "turboquant"
@@ -1078,6 +1153,19 @@ include:
backend: "vibevoice"
dockerfile: "./backend/Dockerfile.python"
context: "./"
- build-type: 'l4t'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-liquid-audio'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
ubuntu-version: '2404'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
- build-type: 'l4t'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1442,6 +1530,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-rfdetr-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1455,6 +1556,19 @@ include:
backend: "sam3-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-rfdetr-cpp'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1468,6 +1582,32 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1481,6 +1621,32 @@ include:
backend: "whisper"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-crispasr'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-parakeet-cpp'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1719,6 +1885,19 @@ include:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-liquid-audio'
runs-on: 'ubuntu-latest'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2167,6 +2346,19 @@ include:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-liquid-audio'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2560,6 +2752,74 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# rfdetr-cpp
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-rfdetr-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-rfdetr-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-rfdetr-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-rfdetr-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-rfdetr-cpp'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2640,6 +2900,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-rfdetr-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
# whisper
- build-type: ''
cuda-major-version: ""
@@ -2655,6 +2928,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -2669,6 +2956,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-crispasr'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2682,6 +2983,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-crispasr'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2695,6 +3009,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-crispasr'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2709,6 +3036,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2723,6 +3064,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-crispasr'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
@@ -2736,6 +3091,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-crispasr'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2749,6 +3117,128 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-crispasr'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
runs-on: 'ubuntu-latest'
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# parakeet-cpp
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-parakeet-cpp'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-parakeet-cpp'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-parakeet-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-parakeet-cpp'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
runs-on: 'ubuntu-latest'
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# acestep-cpp
- build-type: ''
cuda-major-version: ""
@@ -3493,6 +3983,20 @@ include:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-liquid-audio'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "liquid-audio"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -3767,6 +4271,14 @@ includeDarwin:
tag-suffix: "-metal-darwin-arm64-whisper"
build-type: "metal"
lang: "go"
- backend: "crispasr"
tag-suffix: "-metal-darwin-arm64-crispasr"
build-type: "metal"
lang: "go"
- backend: "parakeet-cpp"
tag-suffix: "-metal-darwin-arm64-parakeet-cpp"
build-type: "metal"
lang: "go"
- backend: "acestep-cpp"
tag-suffix: "-metal-darwin-arm64-acestep-cpp"
build-type: "metal"

View File

@@ -3,6 +3,7 @@ package main
import (
"context"
"encoding/json"
"errors"
"fmt"
"os"
"strconv"
@@ -113,6 +114,17 @@ func main() {
fmt.Println("Searching for trending models on HuggingFace...")
rawModels, err := client.GetTrending(searchTerm, limit)
if err != nil {
if errors.Is(err, hfapi.ErrRateLimited) {
fmt.Printf("HuggingFace API is rate limited after retries, skipping this run: %v\n", err)
writeSummary(AddedModelSummary{
SearchTerm: searchTerm,
TotalFound: 0,
ModelsAdded: 0,
Quantization: quantization,
ProcessingTime: time.Since(startTime).String(),
})
return
}
fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
os.Exit(1)
}
@@ -277,4 +289,3 @@ func truncateString(s string, maxLen int) string {
}
return s[:maxLen] + "..."
}

46
.github/scripts/anchor-digest-in-cache.sh vendored Executable file
View File

@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
# garbage collector won't reap the manifest before backend_merge.yml runs.
#
# Context: backend_build.yml pushes by canonical digest only
# (push-by-digest=true). Unreferenced manifests on quay can be reaped within
# ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
# matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
# anchoring tag, the earliest digests are gone by the time `imagetools create`
# tries to read them, producing "manifest not found" merge failures.
#
# We tag the digest under our internal ci-cache image; quay does not GC tagged
# manifests. The user-facing manifest list still references the original
# digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
# the user-facing manifest is published — see cleanup-keepalive-tags.sh.
#
# Required env:
# GITHUB_RUN_ID - current workflow run id (set automatically by GHA)
# TAG_SUFFIX - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
# PLATFORM_TAG - amd64 / arm64 / single (single = singleton matrix entry)
# DIGEST - canonical content digest from build step (sha256:...)
#
# Optional env:
# ANCHOR_IMAGE - target image (default: quay.io/go-skynet/ci-cache)
# SOURCE_IMAGE - source image (default: quay.io/go-skynet/local-ai-backends)
# GITHUB_STEP_SUMMARY - if set, an anchored-by line is appended to it
set -euo pipefail
: "${GITHUB_RUN_ID:?}"
: "${TAG_SUFFIX:?}"
: "${PLATFORM_TAG:?}"
: "${DIGEST:?}"
anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"
tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"
docker buildx imagetools create \
-t "${anchor_image}:${tag}" \
"${source_image}@${DIGEST}"
echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
fi

49
.github/scripts/cleanup-keepalive-tags.sh vendored Executable file
View File

@@ -0,0 +1,49 @@
#!/usr/bin/env bash
# Best-effort cleanup of the keepalive anchor tags written by
# anchor-digest-in-cache.sh. Called from backend_merge.yml after the
# user-facing manifest list has been published.
#
# Quay's docker registry v2 doesn't allow tag deletes — only digest deletes.
# The proper delete is the quay REST API, which requires an OAuth-scoped
# token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
# token (typical for service accounts) the delete succeeds; otherwise this
# is a soft no-op and the tag persists until manually pruned.
#
# Cleanup failure MUST NOT fail the merge — the merge has already produced
# the user-facing manifest list at this point and the keepalive tags are
# pure overhead. We always exit 0.
#
# Required env:
# GITHUB_RUN_ID - current workflow run id (set automatically by GHA)
# TAG_SUFFIX - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
# QUAY_TOKEN - bearer token for quay's REST API
#
# Optional env:
# QUAY_REPO - target repo (default: go-skynet/ci-cache)
# PLATFORM_TAGS - space-separated list of platform-tag values to try
# (default: "amd64 arm64 single")
# We don't know which platform-tag(s) exist for this
# tag-suffix without an extra API call, so we just try
# all three and ignore 404s for the ones that don't.
set -uo pipefail
: "${GITHUB_RUN_ID:?}"
: "${TAG_SUFFIX:?}"
: "${QUAY_TOKEN:?}"
quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"
for plat in $platform_tags; do
tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
http=$(curl -sS -o /dev/null -w '%{http_code}' \
-X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || echo "000")
case "$http" in
204|200) echo "deleted $tag" ;;
404) echo "not present: $tag" ;;
401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
*) echo "unexpected http $http deleting $tag - skipping" ;;
esac
done
exit 0

View File

@@ -154,7 +154,13 @@ jobs:
# digest only — no tags are applied at build time.
backend-merge-jobs-multiarch:
needs: [generate-matrix, backend-jobs-multiarch]
if: needs.generate-matrix.outputs['has-merges-multiarch'] == 'true'
# !cancelled() lets the merge run even when a few build legs failed.
# Without it, GHA's default `needs:` cascade skips the entire merge
# matrix on a single failed/cancelled cell. We still want to publish
# the manifest lists for tag-suffixes whose legs all succeeded.
# Observed in v4.2.1: 2 singlearch build failures cascade-skipped all
# ~199 singlearch merge entries.
if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
uses: ./.github/workflows/backend_merge.yml
with:
tag-latest: ${{ matrix.tag-latest }}
@@ -170,7 +176,8 @@ jobs:
backend-merge-jobs-singlearch:
needs: [generate-matrix, backend-jobs-singlearch]
if: needs.generate-matrix.outputs['has-merges-singlearch'] == 'true'
# See note on backend-merge-jobs-multiarch above for !cancelled().
if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
uses: ./.github/workflows/backend_merge.yml
with:
tag-latest: ${{ matrix.tag-latest }}

View File

@@ -228,6 +228,16 @@ jobs:
digest="${{ steps.build.outputs.digest }}"
touch "/tmp/digests/${digest#sha256:}"
# See .github/scripts/anchor-digest-in-cache.sh for why this is needed
# and how it interacts with backend_merge.yml's cleanup step.
- name: Anchor digest in ci-cache so quay GC won't reap before merge
if: github.event_name != 'pull_request'
env:
TAG_SUFFIX: ${{ inputs.tag-suffix }}
PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
DIGEST: ${{ steps.build.outputs.digest }}
run: .github/scripts/anchor-digest-in-cache.sh
# Artifact name uses a `--` separator between tag-suffix and platform-tag
# to avoid prefix collisions during the merge job's pattern-based download.
# Tag-suffixes are not prefix-disjoint (e.g. -gpu-nvidia-cuda-12-vllm is a
@@ -237,7 +247,7 @@ jobs:
# platform-tag (single-arch entries) keeps the artifact name non-trailing.
- name: Upload digest artifact
if: github.event_name != 'pull_request'
uses: actions/upload-artifact@v4
uses: actions/upload-artifact@v7
with:
name: digests${{ inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
path: /tmp/digests/*

View File

@@ -116,6 +116,13 @@ jobs:
# already), we don't have to chase missing dylibs one at a time.
# The downloads cache makes the reinstall fast (~5s on a hit).
brew reinstall ccache
# Same pattern for grpc: its CMake config (used by the llama-cpp
# `grpc-server` target) does find_package(absl). The cache restores
# /opt/homebrew/Cellar/grpc so brew above no-ops the install, but
# abseil isn't in our Cellar cache list and never gets installed
# alongside, leaving grpc's CMake unable to resolve it. Reinstalling
# grpc re-validates and pulls abseil in, mirroring the ccache fix.
brew reinstall grpc
# The brew cache restores the Cellar dirs but NOT the bin symlinks
# at /opt/homebrew/bin/*. brew install above sees the Cellar present
# and decides "already installed" without re-linking, so on a cache-

View File

@@ -31,15 +31,36 @@ on:
jobs:
merge:
runs-on: ubuntu-latest
# id-token: write is required for keyless cosign — the workflow
# exchanges the GitHub OIDC token for a short-lived Fulcio cert that
# signs each pushed manifest. Without this permission the runner
# cannot mint the token, and `cosign sign` fails with "no token".
permissions:
contents: read
id-token: write
env:
quay_username: ${{ secrets.quayUsername }}
# cosign v2.4.x still gates --registry-referrers-mode=oci-1-1 behind
# this flag. Without it, signing fails with:
# invalid argument "oci-1-1" for "--registry-referrers-mode" flag:
# in order to use mode "oci-1-1", you must set COSIGN_EXPERIMENTAL=1
COSIGN_EXPERIMENTAL: '1'
steps:
# Sparse checkout: the merge job needs `.github/scripts/` (for the
# keepalive cleanup script) but none of the source tree.
- name: Checkout (.github/scripts only)
uses: actions/checkout@v6
with:
sparse-checkout: |
.github/scripts
sparse-checkout-cone-mode: false
# `--` separator anchors the glob so we don't over-match sibling
# backends whose tag-suffix happens to be a prefix of ours
# (e.g. -cpu-vllm vs -cpu-vllm-omni). Must stay in sync with the
# upload-artifact name in backend_build.yml.
- name: Download digests
uses: actions/download-artifact@v4
uses: actions/download-artifact@v8
with:
pattern: digests${{ inputs.tag-suffix }}--*
merge-multiple: true
@@ -48,6 +69,16 @@ jobs:
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@master
# cosign signs each pushed manifest list with --recursive so the
# index and every per-arch entry get an attached Sigstore bundle.
# Recent cosign releases always emit the new bundle format, so
# there's no extra CLI flag to opt into it.
- name: Install cosign
if: github.event_name != 'pull_request'
uses: sigstore/cosign-installer@v3
with:
cosign-release: 'v2.4.1'
- name: Login to DockerHub
if: github.event_name != 'pull_request'
uses: docker/login-action@v4
@@ -79,6 +110,25 @@ jobs:
latest=${{ inputs.tag-latest }}
suffix=${{ inputs.tag-suffix }},onlatest=true
# Source from ci-cache, not local-ai-backends.
#
# The build job pushes per-arch manifests to local-ai-backends with
# push-by-digest=true (no tag), then anchors a tagged copy into
# ci-cache so the manifest can be retrieved hours later when this
# merge runs. Quay's manifest GC, however, is per-repository: the
# anchor tag in ci-cache protects the manifest there, but the same
# digest in local-ai-backends has no tag in *that* repo and gets
# reaped independently. Sourcing local-ai-backends@<digest> here
# then fails with "manifest not found" — exactly the regression
# we hit on v4.2.2 (19/37 multiarch merges failed).
#
# ci-cache@<digest> resolves because we anchored it there. buildx
# imagetools create copies the manifest into local-ai-backends
# (cross-repo within the same registry, blobs already cross-mounted
# from the original push so no transfer needed) and publishes the
# manifest list with the user-facing tags. The resulting manifest
# list is fully self-contained in local-ai-backends — child digests
# only, no embedded references to ci-cache.
- name: Create manifest list and push (quay)
if: github.event_name != 'pull_request'
working-directory: /tmp/digests
@@ -92,11 +142,25 @@ jobs:
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
if [ -z "$tags" ]; then
echo "No quay.io tags from docker/metadata-action; skipping quay merge"
else
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'quay.io/go-skynet/local-ai-backends@sha256:%s ' *)
exit 0
fi
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
# Resolve the manifest-list digest (any tag points at it) so
# cosign can sign by digest. Signing by tag would leave the
# signature orphaned the next time the tag moves.
first_tag=$(jq -cr '
.tags | map(select(startswith("quay.io/"))) | .[0]
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
# --recursive walks the list and signs every per-arch entry
# too — clients that resolve a tag to a platform-specific
# manifest before checking signatures need the per-arch
# signatures, not just the list-level one.
cosign sign --yes --recursive \
--registry-referrers-mode=oci-1-1 \
"quay.io/go-skynet/local-ai-backends@${digest}"
- name: Create manifest list and push (dockerhub)
if: github.event_name != 'pull_request'
@@ -111,11 +175,18 @@ jobs:
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
if [ -z "$tags" ]; then
echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
else
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'localai/localai-backends@sha256:%s ' *)
exit 0
fi
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'localai/localai-backends@sha256:%s ' *)
first_tag=$(jq -cr '
.tags | map(select(startswith("localai/"))) | .[0]
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
cosign sign --yes --recursive \
--registry-referrers-mode=oci-1-1 \
"localai/localai-backends@${digest}"
- name: Inspect manifest
if: github.event_name != 'pull_request'
@@ -126,6 +197,15 @@ jobs:
docker buildx imagetools inspect "$first_tag"
fi
# See .github/scripts/cleanup-keepalive-tags.sh for why this is
# best-effort and what the failure modes are.
- name: Cleanup keepalive tags in ci-cache
if: github.event_name != 'pull_request' && success()
env:
TAG_SUFFIX: ${{ inputs.tag-suffix }}
QUAY_TOKEN: ${{ secrets.quayPassword }}
run: .github/scripts/cleanup-keepalive-tags.sh
- name: Job summary
if: github.event_name != 'pull_request'
run: |

View File

@@ -104,7 +104,9 @@ jobs:
# backend_merge.yml's push-side steps are all gated on
# github.event_name != 'pull_request', so on a PR the merge job would
# do nothing. Skip it entirely to avoid spinning up an empty runner.
if: github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true'
# !cancelled() lets the merge run even when a few build legs fail —
# see the matching note in backend.yml.
if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
uses: ./.github/workflows/backend_merge.yml
with:
tag-latest: ${{ matrix.tag-latest }}
@@ -118,7 +120,7 @@ jobs:
backend-merge-jobs-singlearch:
needs: [generate-matrix, backend-jobs-singlearch]
if: github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true'
if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
uses: ./.github/workflows/backend_merge.yml
with:
tag-latest: ${{ matrix.tag-latest }}

View File

@@ -30,6 +30,14 @@ jobs:
variable: "WHISPER_CPP_VERSION"
branch: "master"
file: "backend/go/whisper/Makefile"
- repository: "CrispStrobe/CrispASR"
variable: "CRISPASR_VERSION"
branch: "main"
file: "backend/go/crispasr/Makefile"
- repository: "mudler/parakeet.cpp"
variable: "PARAKEET_VERSION"
branch: "master"
file: "backend/go/parakeet-cpp/Makefile"
- repository: "leejet/stable-diffusion.cpp"
variable: "STABLEDIFFUSION_GGML_VERSION"
branch: "master"
@@ -50,6 +58,10 @@ jobs:
variable: "SAM3_VERSION"
branch: "main"
file: "backend/go/sam3-cpp/Makefile"
- repository: "mudler/rf-detr.cpp"
variable: "RFDETR_VERSION"
branch: "main"
file: "backend/go/rfdetr-cpp/Makefile"
- repository: "predict-woo/qwen3-tts.cpp"
variable: "QWEN3TTS_CPP_VERSION"
branch: "main"

View File

@@ -151,7 +151,11 @@
ubuntu-codename: 'noble'
core-image-merge:
if: github.repository == 'mudler/LocalAI'
# !cancelled(): without it, GHA's default `needs:` cascade skips the
# merge whenever any matrix cell of the parent build fails or is
# cancelled. Same fix as backend.yml's merge jobs — we still want to
# publish the manifest list for tag-suffixes whose legs all succeeded.
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
@@ -164,7 +168,7 @@
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-vulkan-image-merge:
if: github.repository == 'mudler/LocalAI'
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
@@ -175,7 +179,91 @@
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
# Single-arch server-image merges. Same conceptual fix as the backend
# singletons in PR #9781: image_build.yml pushes by canonical digest
# only, so without a downstream merge step there's no tag for consumers
# (no :latest-gpu-nvidia-cuda-12, no :v<X>-gpu-nvidia-cuda-12, etc.).
# Each merge job needs only its parent build matrix and is filtered by
# tag-suffix in image_merge.yml's artifact-download pattern.
gpu-nvidia-cuda-12-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-nvidia-cuda-13-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-intel-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-intel'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-hipblas-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: hipblas-jobs
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-hipblas'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
nvidia-l4t-arm64-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: gh-runner
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
nvidia-l4t-arm64-cuda-13-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: gh-runner
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-cuda-13'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gh-runner:
if: github.repository == 'mudler/LocalAI'
uses: ./.github/workflows/image_build.yml

View File

@@ -106,6 +106,7 @@ jobs:
type=ref,event=branch
type=semver,pattern={{raw}}
type=sha
type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
flavor: |
latest=${{ inputs.tag-latest }}
suffix=${{ inputs.tag-suffix }},onlatest=true
@@ -185,11 +186,28 @@ jobs:
digest="${{ steps.build.outputs.digest }}"
touch "/tmp/digests/${digest#sha256:}"
# See .github/scripts/anchor-digest-in-cache.sh for why this is needed
# and how it interacts with image_merge.yml's cleanup step. Mirrors the
# same anchor in backend_build.yml — quay's per-repo manifest GC reaps
# untagged manifests in local-ai before the merge runs.
- name: Anchor digest in ci-cache so quay GC won't reap before merge
if: github.event_name != 'pull_request'
env:
TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
DIGEST: ${{ steps.build.outputs.digest }}
SOURCE_IMAGE: quay.io/go-skynet/local-ai
run: .github/scripts/anchor-digest-in-cache.sh
- name: Upload digest artifact
if: github.event_name != 'pull_request'
uses: actions/upload-artifact@v4
uses: actions/upload-artifact@v7
with:
name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-${{ inputs.platform-tag }}
# `--` separator + 'single' placeholder for empty platform-tag
# same pattern as backend_build.yml. Prevents prefix collisions
# in the merge-side glob (e.g. -nvidia-l4t-arm64 is a prefix of
# -nvidia-l4t-arm64-cuda-13).
name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
path: /tmp/digests/*
if-no-files-found: error
retention-days: 1

View File

@@ -33,10 +33,22 @@ jobs:
env:
quay_username: ${{ secrets.quayUsername }}
steps:
- name: Download digests
uses: actions/download-artifact@v4
# Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
# script). Skips the rest of the source tree.
- name: Checkout (.github/scripts only)
uses: actions/checkout@v6
with:
pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-*
sparse-checkout: |
.github/scripts
sparse-checkout-cone-mode: false
- name: Download digests
uses: actions/download-artifact@v8
with:
# `--` separator anchors the glob so we don't over-match sibling
# tag-suffixes (e.g. -nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13).
# Must stay in sync with image_build.yml's upload-artifact name.
pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--*
merge-multiple: true
path: /tmp/digests
@@ -68,10 +80,18 @@ jobs:
type=ref,event=branch
type=semver,pattern={{raw}}
type=sha
type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
flavor: |
latest=${{ inputs.tag-latest }}
suffix=${{ inputs.tag-suffix }},onlatest=true
# Source from ci-cache, not local-ai. See backend_merge.yml for the
# detailed rationale — quay's manifest GC is per-repository, so the
# untagged digest in local-ai gets reaped while the same content lives
# tagged under ci-cache (anchored by image_build.yml). buildx imagetools
# create copies the manifest into local-ai (blobs already cross-mounted)
# and publishes the manifest list with user-facing tags. End state in
# local-ai is self-contained; no embedded reference to ci-cache.
- name: Create manifest list and push (quay)
working-directory: /tmp/digests
run: |
@@ -82,7 +102,7 @@ jobs:
else
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'quay.io/go-skynet/local-ai@sha256:%s ' *)
$(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
fi
- name: Create manifest list and push (dockerhub)
@@ -107,6 +127,15 @@ jobs:
docker buildx imagetools inspect "$first_tag"
fi
# See .github/scripts/cleanup-keepalive-tags.sh for the best-effort
# semantics — fails soft when the registry credential isn't OAuth-scoped.
- name: Cleanup keepalive tags in ci-cache
if: github.event_name != 'pull_request' && success()
env:
TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
QUAY_TOKEN: ${{ secrets.quayPassword }}
run: .github/scripts/cleanup-keepalive-tags.sh
- name: Job summary
run: |
set -euo pipefail

View File

@@ -18,7 +18,7 @@ jobs:
if: ${{ github.actor != 'dependabot[bot]' }}
- name: Run Gosec Security Scanner
if: ${{ github.actor != 'dependabot[bot]' }}
uses: securego/gosec@v2.22.9
uses: securego/gosec@v2.27.1
with:
# we let the report trigger content trigger a failure using the GitHub Security features.
args: '-no-fail -fmt sarif -out results.sarif ./...'

View File

@@ -11,7 +11,7 @@ jobs:
if: github.repository == 'mudler/LocalAI'
runs-on: ubuntu-latest
steps:
- uses: actions/stale@b5d41d4e1d5dceea10e7104786b73624c18a190f # v9
- uses: actions/stale@eb5cf3af3ac0a1aa4c9c45633dd1ae542a27a899 # v9
with:
stale-issue-message: 'This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.'
stale-pr-message: 'This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 10 days.'

View File

@@ -28,6 +28,7 @@ jobs:
qwen-asr: ${{ steps.detect.outputs.qwen-asr }}
nemo: ${{ steps.detect.outputs.nemo }}
voxcpm: ${{ steps.detect.outputs.voxcpm }}
liquid-audio: ${{ steps.detect.outputs.liquid-audio }}
llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
@@ -36,6 +37,7 @@ jobs:
sglang: ${{ steps.detect.outputs.sglang }}
acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
rfdetr-cpp: ${{ steps.detect.outputs.rfdetr-cpp }}
vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
localvqe: ${{ steps.detect.outputs.localvqe }}
voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -44,6 +46,7 @@ jobs:
speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
whisper: ${{ steps.detect.outputs.whisper }}
parakeet-cpp: ${{ steps.detect.outputs.parakeet-cpp }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
@@ -447,6 +450,32 @@ jobs:
run: |
make --jobs=5 --output-sync=target -C backend/python/voxcpm
make --jobs=5 --output-sync=target -C backend/python/voxcpm test
# liquid-audio: LFM2.5-Audio any-to-any backend. The CI smoke test
# exercises Health() and LoadModel(mode:finetune) — fine-tune mode
# short-circuits before pulling weights (backend.py:192), so no
# HuggingFace download or GPU is needed. The full-inference path is
# gated on LIQUID_AUDIO_MODEL_ID, which we don't set here.
tests-liquid-audio:
needs: detect-changes
if: needs.detect-changes.outputs.liquid-audio == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential ffmpeg
sudo apt-get install -y ca-certificates cmake curl patch python3-pip
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh
pip install --user --no-cache-dir grpcio-tools==1.64.1
- name: Test liquid-audio
run: |
make --jobs=5 --output-sync=target -C backend/python/liquid-audio
make --jobs=5 --output-sync=target -C backend/python/liquid-audio test
tests-llama-cpp-quantization:
needs: detect-changes
if: needs.detect-changes.outputs.llama-cpp-quantization == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -605,6 +634,26 @@ jobs:
- name: Build whisper backend image and run transcription gRPC e2e tests
run: |
make test-extra-backend-whisper-transcription
# Parakeet ASR via the parakeet-cpp backend (C++/ggml port of NeMo
# Parakeet). Drives AudioTranscription (offline, with word timestamps) on
# tdt_ctc-110m + the JFK 11s clip.
tests-parakeet-cpp-grpc-transcription:
needs: detect-changes
if: needs.detect-changes.outputs.parakeet-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build parakeet-cpp backend image and run transcription gRPC e2e tests
run: |
make test-extra-backend-parakeet-cpp-transcription
# VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
# TTSStream (PCM chunks) on the e2e-backends harness.
tests-sherpa-onnx-grpc-tts:
@@ -816,6 +865,42 @@ jobs:
- name: Test qwen3-tts-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test
# Per-backend smoke for rfdetr-cpp: builds the .so + Go binary and runs
# `make -C backend/go/rfdetr-cpp test`. test.sh fetches the small (~20 MB)
# rfdetr-nano-q8_0 GGUF from the published mudler/rfdetr-cpp-nano HF repo
# via curl and synthesises a tiny PNG to exercise the wire protocol.
tests-rfdetr-cpp:
needs: detect-changes
if: needs.detect-changes.outputs.rfdetr-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake curl libopenblas-dev
- name: Setup Go
uses: actions/setup-go@v5
- name: Display Go version
run: go version
- name: Proto Dependencies
run: |
# Install protoc
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Build rfdetr-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp
- name: Test rfdetr-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp test
# Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
# runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
# the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K

View File

@@ -53,9 +53,22 @@ jobs:
node-version: '22'
- name: Build React UI
run: make react-ui
- name: Test
# Runs the core suite with coverage and fails if total coverage dropped
# below the committed baseline (coverage-baseline.txt). The gate is
# strict — any decrease fails. Raise the baseline with
# `make test-coverage-baseline` and commit it when coverage rises.
- name: Test (with coverage gate)
run: |
PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test
PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test-coverage-check
- name: Upload coverage report
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: coverage-linux
path: |
coverage/coverage.out
coverage/coverage.html
if-no-files-found: ignore
- name: Setup tmate session if tests fail
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3.23

View File

@@ -37,6 +37,10 @@ jobs:
uses: actions/setup-node@v6
with:
node-version: '22'
- name: Setup Bun
uses: oven-sh/setup-bun@v2
with:
bun-version: '1.3.11'
- name: Proto Dependencies
run: |
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
@@ -48,16 +52,12 @@ jobs:
run: |
sudo apt-get update
sudo apt-get install -y build-essential libopus-dev
- name: Build UI test server
run: PATH="$PATH:$HOME/go/bin" make build-ui-test-server
- name: Install Playwright
working-directory: core/http/react-ui
run: |
npm install
npx playwright install --with-deps chromium
- name: Run Playwright tests
working-directory: core/http/react-ui
run: npx playwright test
# Builds an instrumented UI bundle, runs the Playwright specs, and fails
# if line coverage regressed beyond the jitter tolerance (the gate is
# in `make test-ui-coverage-check`). PLAYWRIGHT_CHROMIUM_PATH is unset
# here, so scripts/ensure-playwright-browser.sh installs Chromium via apt.
- name: Run UI e2e + coverage gate
run: PATH="$PATH:$HOME/go/bin" make test-ui-coverage-check
- name: Upload Playwright report
if: ${{ failure() }}
uses: actions/upload-artifact@v7
@@ -65,6 +65,14 @@ jobs:
name: playwright-report
path: core/http/react-ui/playwright-report/
retention-days: 7
- name: Upload UI coverage report
if: ${{ always() }}
uses: actions/upload-artifact@v7
with:
name: ui-coverage
path: core/http/react-ui/coverage/
if-no-files-found: ignore
retention-days: 7
- name: Setup tmate session if tests fail
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3.23

14
.gitignore vendored
View File

@@ -26,6 +26,10 @@ go-bert
LocalAI
/local-ai
/local-ai-launcher
# Root-level build artifacts when running `go build ./...` against
# Go backend packages whose main lives under backend/go/.
/cloud-proxy
/local-store
# prevent above rules from omitting the helm chart
!charts/*
# prevent above rules from omitting the api/localai folder
@@ -66,10 +70,17 @@ docs/static/gallery.html
# per-developer customization files for the development container
.devcontainer/customization/*
# Coverage profiles (the committed baseline is coverage-baseline.txt)
/coverage/
# React UI build artifacts (keep placeholder dist/index.html)
core/http/react-ui/node_modules/
core/http/react-ui/dist
# React UI coverage (vite-plugin-istanbul + nyc, via `make test-ui-coverage`)
core/http/react-ui/.nyc_output/
core/http/react-ui/coverage/
# Extracted backend binaries for container-based testing
local-backends/
@@ -77,3 +88,6 @@ local-backends/
tests/e2e-ui/ui-test-server
core/http/react-ui/playwright-report/
core/http/react-ui/test-results/
# Local worktrees
.worktrees/

View File

@@ -46,8 +46,81 @@ linters:
msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
- pattern: '^t\.FailNow$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
# In-process config should flow through ApplicationConfig / kong-bound
# CLI flags, not via os.Getenv. The CLI layer is the legitimate
# env→struct boundary (kong's `env:"..."` tag); anything deeper that
# reads env directly leaks process state into business logic and
# makes flags impossible to test or override per-request. Backend
# subprocesses, the system/capabilities probe, and a few places that
# read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
# are exempt — see linters.exclusions.rules below.
- pattern: '^os\.(Getenv|LookupEnv|Environ)$'
msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
# Outbound HTTP must go through pkg/httpclient, which refuses redirects
# by default and sets a TLS floor. The std-library default client and
# the http.Get/Post/... convenience helpers follow redirects (up to 10)
# and, on a cross-host redirect, forward custom credential headers such
# as Anthropic's x-api-key to the redirect target — leaking the secret
# (GHSA-3mj3-57v2-4636). forbidigo can't precisely match the
# `&http.Client{}` composite literal without also flagging legitimate
# `*http.Client` type references, so that form is enforced by
# convention + review; these two patterns catch the implicit-default
# client, which is the common footgun.
- pattern: '^http\.DefaultClient$'
msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.DefaultClient — the std client follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
- pattern: '^http\.(Get|Post|PostForm|Head)$'
msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.Get/Post/PostForm/Head — these use http.DefaultClient, which follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
exclusions:
paths:
# Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
- 'backend/go/whisper/sources'
- 'docs/'
rules:
# CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
# boundary, and a handful of subcommands legitimately propagate values
# to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
- path: ^core/cli/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# Backend subprocesses are independent binaries with their own env
# surface; they're not "in-process config" of the LocalAI server.
- path: ^backend/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# System capability probe reads HOME, PATH-style vars to discover
# GPUs, default paths, etc. — not LocalAI config.
- path: ^pkg/system/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
# time; model.Loader sets/inherits env to communicate with subprocesses.
- path: ^pkg/grpc/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
- path: ^pkg/model/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# Top-level main binaries (local-ai, launcher) are entry points.
- path: ^cmd/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# Tests legitimately read $HOME, $TMPDIR, and gating env vars
# (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
- path: _test\.go$
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# pkg/httpclient is the sanctioned home for outbound HTTP clients; it
# necessarily references net/http directly.
- path: ^pkg/httpclient/
text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
linters: [forbidigo]
# Tests drive local httptest servers where redirect/TLS hardening is
# irrelevant; the std client is fine there.
- path: _test\.go$
text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
linters: [forbidigo]
# Vendored upstream whisper.cpp Go bindings are a separate module and
# cannot import pkg/httpclient.
- path: ^backend/go/whisper/sources/
text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
linters: [forbidigo]

View File

@@ -31,9 +31,11 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
| [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
| [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
| [.agents/backend-signing.md](.agents/backend-signing.md) | Backend OCI image signing (keyless cosign + sigstore-go) — producer-side CI setup, consumer-side gallery `verification:` block, strict mode (`LOCALAI_REQUIRE_BACKEND_INTEGRITY`), revocation via `not_before` |
## Quick Reference
- **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
- **Logging**: Use `github.com/mudler/xlog` (same API as slog)
- **Go style**: Prefer `any` over `interface{}`
- **Comments**: Explain *why*, not *what*

View File

@@ -198,6 +198,7 @@ For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`C
- Prefer modern Go idioms — for example, use `any` instead of `interface{}`.
- Use [`golangci-lint`](https://golangci-lint.run) to catch common issues before submitting a PR.
- Run `make install-hooks` once per clone to enable the pre-commit hook: Go changes run `make lint` + the coverage gate (`make test-coverage-check`); `core/http/react-ui/` changes run the Playwright e2e suite (`make test-ui`). Bypass a single commit with `git commit --no-verify`.
- Use [`github.com/mudler/xlog`](https://github.com/mudler/xlog) for logging (same API as `slog`). Do not use `fmt.Println` or the standard `log` package for operational logging.
- Use tab indentation for Go files (as defined in `.editorconfig`).
@@ -265,6 +266,12 @@ The e2e tests run LocalAI in a Docker container and exercise the API:
make test-e2e
```
### React UI tests and coverage
The React UI (`core/http/react-ui/`) is covered by Playwright e2e specs, gated by a **monotonic line-coverage ratchet** (`make test-ui-coverage-check`, run in CI and pre-commit). The metric is non-deterministic — a fast local box reads higher than a slow CI runner for the same code — so a small tolerance is unavoidable.
**If your change lowers UI coverage, raise it back by adding specs — do not widen the tolerance or hand-lower the baseline.** A *render-smoke* spec (navigate to a page, assert its header is visible) cheaply covers an entire lazy page. See `core/http/react-ui/e2e/page-render-smoke.spec.js` and the full policy in [.agents/building-and-testing.md](.agents/building-and-testing.md#react-ui-coverage).
### Running E2E container tests
These tests build a standard LocalAI Docker image and run it with pre-configured model configs to verify that most endpoints work correctly:

View File

@@ -305,7 +305,7 @@ EOT
###################################
# Build React UI
FROM node:25-slim AS react-ui-builder
FROM node:26-slim AS react-ui-builder
WORKDIR /app
COPY core/http/react-ui/package*.json ./
RUN npm install

176
Makefile
View File

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
GOCMD=go
GOTEST=$(GOCMD) test
@@ -69,10 +69,41 @@ else
GORELEASER=$(shell which goreleaser)
endif
TEST_PATHS?=./api/... ./pkg/... ./core/...
TEST_PATHS?=./api/... ./pkg/... ./core/... ./backend/go/cloud-proxy/... ./backend/go/local-store/...
## Coverage output and the committed baseline that CI compares against.
## The gate is strict: total coverage must never decrease (no tolerance).
## covermode=atomic makes line coverage deterministic regardless of test
## ordering or flake retries, so there is no run-to-run jitter to absorb.
COVERAGE_DIR?=$(abspath ./coverage)
COVERAGE_PROFILE?=$(COVERAGE_DIR)/coverage.out
COVERAGE_BASELINE?=coverage-baseline.txt
## Coverage is collected one recursive root at a time and merged (see
## scripts/run-coverage.sh): passing several recursive roots to a single
## ginkgo invocation only keeps one root's coverprofile. Mirrors TEST_PATHS
## minus ./api (which doesn't exist).
COVERAGE_ROOTS?=./pkg ./core
## Build tags for the coverage build. `auth` is required to compile the real
## auth implementation and its ~150 `//go:build auth` tests (otherwise they're
## invisible and the gate scores auth against a stub). `debug` matches `test`.
COVERAGE_TAGS?=debug auth
## Coverage is attributed to these packages via --coverpkg, so the in-process
## integration suites (COVERAGE_E2E_ROOTS) credit the core/http handlers they
## drive over HTTP — not just their own test package.
COVERAGE_COVERPKG?=github.com/mudler/LocalAI/core/...,github.com/mudler/LocalAI/pkg/...
## In-process integration suites folded into coverage. Run non-recursively
## (excludes tests/e2e/distributed, which needs containers) with the mock
## backend built by prepare-test. real-models specs need a downloaded model,
## so they're filtered out. NOTE: tests/integration is intentionally NOT here —
## it needs the local-store backend built (`make backends/local-store`), which
## the coverage CI job doesn't do.
COVERAGE_E2E_ROOTS?=./tests/e2e
COVERAGE_E2E_LABELS?=!real-models
## Drop generated protobuf from the denominator (it has no tests by design).
COVERAGE_EXCLUDE_RE?=grpc/proto/.*[.]pb[.]go
.PHONY: all test build vendor lint lint-all
.PHONY: all test test-coverage test-coverage-baseline test-coverage-check test-ui test-ui-coverage-baseline test-ui-coverage-check install-hooks build vendor lint lint-all
all: help
@@ -170,6 +201,36 @@ test: prepare-test
OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)
## Runs the core suite ($(TEST_PATHS)) with statement-coverage instrumentation
## and writes a merged profile to $(COVERAGE_PROFILE). Deliberately omits
## --fail-fast so a single failure doesn't truncate the coverage number, and
## uses covermode=atomic so the result is deterministic. Prints the total.
test-coverage: prepare-test
@echo 'Running tests with coverage'
GINKGO_TAGS="$(COVERAGE_TAGS)" \
COVERAGE_COVERPKG="$(COVERAGE_COVERPKG)" \
COVERAGE_E2E_ROOTS="$(COVERAGE_E2E_ROOTS)" \
COVERAGE_E2E_LABELS="$(COVERAGE_E2E_LABELS)" \
COVERAGE_EXCLUDE_RE='$(COVERAGE_EXCLUDE_RE)' \
OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
scripts/run-coverage.sh $(COVERAGE_DIR) $(COVERAGE_PROFILE) $(TEST_FLAKES) $(COVERAGE_ROOTS)
@$(GOCMD) tool cover -html=$(COVERAGE_PROFILE) -o $(COVERAGE_DIR)/coverage.html
@$(GOCMD) tool cover -func=$(COVERAGE_PROFILE) | tail -n1
## Writes the current total coverage to $(COVERAGE_BASELINE). Run this (and
## commit the result) whenever a change legitimately raises coverage so the
## ratchet moves up. Never lower it by hand.
test-coverage-baseline: test-coverage
@$(GOCMD) tool cover -func=$(COVERAGE_PROFILE) | awk '/^total:/{gsub(/%/,"",$$NF); print $$NF}' > $(COVERAGE_BASELINE)
@echo "Saved coverage baseline: $$(cat $(COVERAGE_BASELINE))%"
## CI gate: fails if total coverage dropped more than COVERAGE_TOLERANCE
## (default 0.5pp) below the committed baseline. A small tolerance absorbs the
## run-to-run jitter from the in-process tests/e2e suite folded in via
## --coverpkg (timing-dependent which handler lines execute).
test-coverage-check: test-coverage
@scripts/coverage-check.sh $(COVERAGE_PROFILE) $(COVERAGE_BASELINE)
########################################################
## Lint
########################################################
@@ -185,12 +246,17 @@ test: prepare-test
## everything else automatically, so new packages are scanned by default.
LINT_EXCLUDE_DIRS_RE=/(backend/go/(piper|silero-vad|llm)|cmd/launcher)(/|$$)
## Set LINT_NEW_FROM to a git ref to override .golangci.yml's
## new-from-merge-base (origin/master). Useful from a fork clone where
## origin/master is stale relative to the canonical repo — the pre-commit
## hook passes the resolved upstream ref here so local lint matches CI.
LINT_NEW_FROM?=
lint:
@command -v golangci-lint >/dev/null 2>&1 || { \
echo 'golangci-lint not installed. Install: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest'; \
exit 1; \
}
golangci-lint run $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
golangci-lint run $(if $(LINT_NEW_FROM),--new-from-merge-base=$(LINT_NEW_FROM),) $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
## Like `lint` but reports every issue, including the pre-existing baseline
## that `lint` ignores via .golangci.yml's new-from-merge-base. Use this to
@@ -202,6 +268,17 @@ lint-all:
}
golangci-lint run --new=false --new-from-merge-base= --new-from-rev= $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
########################################################
## Git hooks
########################################################
## Points git at the versioned .githooks/ directory so the pre-commit hook
## (lint + coverage gate) runs locally. Run once per clone. Undo with:
## `git config --unset core.hooksPath`. Skip a single commit with
## `git commit --no-verify`.
install-hooks:
git config core.hooksPath .githooks
@echo 'Installed git hooks: core.hooksPath -> .githooks (pre-commit runs lint + test-coverage-check on Go changes)'
########################################################
## E2E AIO tests (uses standard image with pre-configured models)
########################################################
@@ -232,13 +309,20 @@ run-e2e-aio: protogen-go
@echo 'Running e2e AIO tests'
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio
# Distributed architecture e2e (PostgreSQL + NATS via testcontainers).
# Includes NatsJWT specs (JWT-enabled NATS). Requires Docker.
# VLLMMultinode is excluded here; use test-e2e-vllm-multinode for that.
test-e2e-distributed: protogen-go
@echo 'Running distributed e2e tests (label Distributed, incl. NatsJWT)'
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='Distributed && !VLLMMultinode' --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e/distributed
# vLLM multi-node DP smoke (CPU). Builds local-ai:tests and the
# cpu-vllm backend from the current working tree, then drives a
# head + headless follower via testcontainers-go and asserts a chat
# completion. BuildKit caches both images, so re-runs only rebuild
# what changed. The test lives under tests/e2e/distributed and is
# selected by the VLLMMultinode label so it doesn't run alongside
# the other distributed-suite tests by default.
# test-e2e-distributed.
test-e2e-vllm-multinode: docker-build-e2e extract-backend-vllm protogen-go
@echo 'Running e2e vLLM multi-node DP test'
LOCALAI_IMAGE=local-ai \
@@ -268,12 +352,13 @@ prepare-e2e:
run-e2e-image:
docker run -p 5390:8080 -e MODELS_PATH=/models -e THREADS=1 -e DEBUG=true -d --rm -v $(TEST_DIR):/models --name e2e-tests-$(RANDOM) localai-tests
test-e2e: build-mock-backend prepare-e2e run-e2e-image
test-e2e: build-mock-backend build-cloud-proxy-backend prepare-e2e run-e2e-image
@echo 'Running e2e tests'
BUILD_TYPE=$(BUILD_TYPE) \
LOCALAI_API=http://$(E2E_BRIDGE_IP):5390 \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
$(MAKE) clean-mock-backend
$(MAKE) clean-cloud-proxy-backend
$(MAKE) teardown-e2e
docker rmi localai-tests
@@ -463,6 +548,7 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/vllm-omni
$(MAKE) -C backend/python/sglang
$(MAKE) -C backend/python/vibevoice
$(MAKE) -C backend/python/liquid-audio
$(MAKE) -C backend/python/moonshine
$(MAKE) -C backend/python/pocket-tts
$(MAKE) -C backend/python/qwen-tts
@@ -479,6 +565,7 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/insightface
$(MAKE) -C backend/python/speaker-recognition
$(MAKE) -C backend/rust/kokoros kokoros-grpc
$(MAKE) -C backend/go/rfdetr-cpp
test-extra: prepare-test-extra
$(MAKE) -C backend/python/transformers test
@@ -488,6 +575,7 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/vllm test
$(MAKE) -C backend/python/vllm-omni test
$(MAKE) -C backend/python/vibevoice test
$(MAKE) -C backend/python/liquid-audio test
$(MAKE) -C backend/python/moonshine test
$(MAKE) -C backend/python/pocket-tts test
$(MAKE) -C backend/python/qwen-tts test
@@ -504,6 +592,7 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/insightface test
$(MAKE) -C backend/python/speaker-recognition test
$(MAKE) -C backend/rust/kokoros test
$(MAKE) -C backend/go/rfdetr-cpp test
##
## End-to-end gRPC tests that exercise a built backend container image.
@@ -909,6 +998,19 @@ test-extra-backend-whisper-transcription: docker-build-whisper
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
## Audio transcription wrapper for the parakeet-cpp (parakeet.cpp ggml port)
## backend. Mirrors test-extra-backend-whisper-transcription: drives the
## AudioTranscription / AudioTranscriptionStream RPCs against a published
## Parakeet GGUF using the JFK 11s clip from whisper.cpp's CI samples. Not
## part of the default test suite - run explicitly once the pinned model URL
## is reachable.
test-extra-backend-parakeet-cpp-transcription: docker-build-parakeet-cpp
BACKEND_IMAGE=local-ai-backend:parakeet-cpp \
BACKEND_TEST_MODEL_URL=https://huggingface.co/mudler/parakeet-cpp-gguf/resolve/main/tdt_ctc-110m-f16.gguf \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
## LocalVQE audio transform (joint AEC + noise suppression + dereverb).
## Exercises the audio_transform capability end-to-end: batch transform
## of a real WAV fixture and bidi streaming of synthetic silent frames.
@@ -1062,10 +1164,13 @@ BACKEND_DS4 = ds4|ds4|.|false|false
# Golang backends
BACKEND_PIPER = piper|golang|.|false|true
BACKEND_LOCAL_STORE = local-store|golang|.|false|true
BACKEND_CLOUD_PROXY = cloud-proxy|golang|.|false|true
BACKEND_HUGGINGFACE = huggingface|golang|.|false|true
BACKEND_SILERO_VAD = silero-vad|golang|.|false|true
BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|true
BACKEND_WHISPER = whisper|golang|.|false|true
BACKEND_CRISPASR = crispasr|golang|.|false|true
BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
@@ -1092,6 +1197,7 @@ BACKEND_SGLANG = sglang|python|.|false|true
BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true
BACKEND_CHATTERBOX = chatterbox|python|.|false|true
BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true
BACKEND_LIQUID_AUDIO = liquid-audio|python|.|--progress=plain|true
BACKEND_MOONSHINE = moonshine|python|.|false|true
BACKEND_POCKET_TTS = pocket-tts|python|.|false|true
BACKEND_QWEN_TTS = qwen-tts|python|.|false|true
@@ -1114,6 +1220,7 @@ BACKEND_KOKOROS = kokoros|rust|.|false|true
# C++ backends (Go wrapper with purego)
BACKEND_SAM3_CPP = sam3-cpp|golang|.|false|true
BACKEND_RFDETR_CPP = rfdetr-cpp|golang|.|false|true
# Helper function to build docker image for a backend
# Usage: $(call docker-build-backend,BACKEND_NAME,DOCKERFILE_TYPE,BUILD_CONTEXT,PROGRESS_FLAG,NEEDS_BACKEND_ARG)
@@ -1146,10 +1253,13 @@ $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
$(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
$(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
$(eval $(call generate-docker-build-target,$(BACKEND_CLOUD_PROXY)))
$(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
$(eval $(call generate-docker-build-target,$(BACKEND_SILERO_VAD)))
$(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
$(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
$(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
$(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
$(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
@@ -1169,6 +1279,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SGLANG)))
$(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS)))
$(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX)))
$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE)))
$(eval $(call generate-docker-build-target,$(BACKEND_LIQUID_AUDIO)))
$(eval $(call generate-docker-build-target,$(BACKEND_MOONSHINE)))
$(eval $(call generate-docker-build-target,$(BACKEND_POCKET_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN_TTS)))
@@ -1191,13 +1302,14 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
$(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
$(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_RFDETR_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
# Pattern rule for docker-save targets
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
########################################################
### Mock Backend for E2E Tests
@@ -1209,6 +1321,12 @@ build-mock-backend: protogen-go
clean-mock-backend:
rm -f tests/e2e/mock-backend/mock-backend
build-cloud-proxy-backend: protogen-go
$(GOCMD) build -o tests/e2e/mock-backend/cloud-proxy ./backend/go/cloud-proxy
clean-cloud-proxy-backend:
rm -f tests/e2e/mock-backend/cloud-proxy
########################################################
### UI E2E Test Server
########################################################
@@ -1219,6 +1337,50 @@ build-ui-test-server: build-mock-backend react-ui protogen-go
test-ui-e2e: build-ui-test-server
cd core/http/react-ui && npm install && npx playwright install --with-deps chromium && npx playwright test
## Optional Playwright worker count for the UI e2e targets below. Pass
## UI_TEST_WORKERS=N (e.g. `make test-ui-coverage UI_TEST_WORKERS=20`) to
## override Playwright's default (cores/2). Empty by default so Playwright
## picks its own worker count.
UI_TEST_WORKERS ?=
PLAYWRIGHT_WORKERS_FLAG = $(if $(UI_TEST_WORKERS),--workers=$(UI_TEST_WORKERS),)
## Fast Playwright e2e run used by the pre-commit hook on React UI changes.
## Force-rebuilds the (non-instrumented) dist so the suite tests the working
## tree — not a stale dist the `react-ui` skip-guard would leave — re-embeds
## it into ui-test-server, and runs the specs. Uses the nix-provided browser
## when PLAYWRIGHT_CHROMIUM_PATH is set (flake dev shell), else falls back to
## downloading it as `test-ui-e2e` does.
test-ui: build-mock-backend protogen-go
cd core/http/react-ui && bun install && bun run build
$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui
cd core/http/react-ui && sh $(CURDIR)/scripts/ensure-playwright-browser.sh && bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG)
## React UI code coverage from the Playwright e2e suite. Builds a
## NON-instrumented bundle with source maps (COVERAGE_V8=true), re-embeds it
## into the ui-test-server (the dist is //go:embed'ed at compile time), runs the
## Playwright specs which collect native Chromium V8 coverage (PW_V8_COVERAGE=1)
## — far cheaper than istanbul's build-time counters (~40% faster end-to-end) —
## convert it to istanbul via v8-to-istanbul in the coverage fixture, and write
## an nyc report to core/http/react-ui/coverage/. Removes the dist afterwards so
## normal builds aren't served source-mapped assets. (The legacy istanbul path
## still exists: `bun run build:coverage` + unset PW_V8_COVERAGE.)
test-ui-coverage: build-mock-backend protogen-go
trap 'rm -rf "$(CURDIR)/core/http/react-ui/dist"' EXIT; \
( cd core/http/react-ui && bun install && bun run build:coverage-v8 ) && \
$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui && \
( cd core/http/react-ui && rm -rf .nyc_output coverage && \
sh $(CURDIR)/scripts/ensure-playwright-browser.sh && \
PW_V8_COVERAGE=1 bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG) && bun run coverage:report )
## UI coverage baseline (committed) and the strict gate that compares against
## it — the React mirror of test-coverage-baseline / test-coverage-check.
test-ui-coverage-baseline: test-ui-coverage
@node -e 'const fs=require("fs");process.stdout.write(String(JSON.parse(fs.readFileSync("core/http/react-ui/coverage/coverage-summary.json")).total.lines.pct))' > core/http/react-ui/coverage-baseline.txt
@echo "Saved UI coverage baseline: $$(cat core/http/react-ui/coverage-baseline.txt)% lines"
test-ui-coverage-check: test-ui-coverage
sh $(CURDIR)/scripts/ui-coverage-check.sh core/http/react-ui/coverage/coverage-summary.json core/http/react-ui/coverage-baseline.txt
test-ui-e2e-docker:
docker build -t localai-ui-e2e -f tests/e2e-ui/Dockerfile .
docker run --rm localai-ui-e2e

View File

@@ -31,12 +31,18 @@
**LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
- **Drop-in API compatibility** — OpenAI, Anthropic, ElevenLabs APIs
- **36+ backends** — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
- **Any hardware** — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
- **Multi-user ready** — API key auth, user quotas, role-based access
- **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
- **Privacy-first** — your data never leaves your infrastructure
**A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
- **Composable by design**: backends are separate and pulled on demand, so you install only what your model needs
- **Open and extensible**: load any model, or build your own backend in any language against an open interface
- **Drop-in API compatibility**: OpenAI, Anthropic, and ElevenLabs APIs across every backend
- **Any model, any modality**: LLMs, vision, voice, image, and video behind one API
- **Any hardware**: NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
- **Multi-user ready**: API key auth, user quotas, role-based access
- **Built-in AI agents**: autonomous agents with tool use, RAG, MCP, and skills
- **Privacy-first**: your data never leaves your infrastructure
![A small LocalAI core with backends (llama.cpp, vLLM, MLX, whisper.cpp, stable-diffusion, kokoro, parakeet.cpp...) plugged in as separate on-demand images](docs/static/images/diagrams/composable-core.png)
Created by [Ettore Di Giacinto](https://github.com/mudler) and maintained by the [LocalAI team](#team).
@@ -149,8 +155,10 @@ For more details, see the [Getting Started guide](https://localai.io/basics/gett
## Latest News
- **April 2026**: [Voice recognition](https://github.com/mudler/LocalAI/pull/9500), [Face recognition, identification & liveness detection](https://github.com/mudler/LocalAI/pull/9480), [Ollama API compatibility](https://github.com/mudler/LocalAI/pull/9284), [Video generation in stable-diffusion.ggml](https://github.com/mudler/LocalAI/pull/9420), [Backend versioning with auto-upgrade](https://github.com/mudler/LocalAI/pull/9315), [Pin models & load-on-demand toggle](https://github.com/mudler/LocalAI/pull/9309), [Universal model importer](https://github.com/mudler/LocalAI/pull/9466), new backends: [sglang](https://github.com/mudler/LocalAI/pull/9359), [ik-llama-cpp](https://github.com/mudler/LocalAI/pull/9326), [TurboQuant](https://github.com/mudler/LocalAI/pull/9355), [sam.cpp](https://github.com/mudler/LocalAI/pull/9288), [Kokoros](https://github.com/mudler/LocalAI/pull/9212), [qwen3tts.cpp](https://github.com/mudler/LocalAI/pull/9316), [tinygrad multimodal](https://github.com/mudler/LocalAI/pull/9364)
- **March 2026**: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
- **May 2026**: **LocalAI 4.3.0** - `llama.cpp` [prompt cache on by default](https://github.com/mudler/LocalAI/pull/9925) (repeated system prompts collapse from minutes to seconds), [keyless cosign signing of backend OCI images](https://github.com/mudler/LocalAI/pull/9823), [per-API-key + per-user usage attribution](https://github.com/mudler/LocalAI/pull/9920), Distributed v3 with [per-request replica routing](https://github.com/mudler/LocalAI/pull/9968). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.3.0)
- **May 2026**: **LocalAI 4.2.0** - LocalAI sees and hears: [voice recognition](https://github.com/mudler/LocalAI/pull/9500), [face recognition + antispoofing liveness](https://github.com/mudler/LocalAI/pull/9480), speaker diarization. Plus [drop-in Ollama API](https://github.com/mudler/LocalAI/pull/9284), [video generation](https://github.com/mudler/LocalAI/pull/9420), redesigned UI with i18n + admin-configurable branding, vLLM at feature parity with llama.cpp, and 11 new backends. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.2.0)
- **April 2026**: **LocalAI 4.1.0** - LocalAI becomes a control tower: distributed cluster mode with VRAM-aware smart routing + autoscaling, multi-user platform with OIDC and API keys, per-user quotas with predictive analytics, in-UI fine-tuning with TRL (auto-export to GGUF), on-the-fly quantization backend, visual pipeline editor. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.1.0)
- **March 2026**: **LocalAI 4.0.0** - native agentic orchestration with the new [Agenthub](https://agenthub.localai.io) community hub, full React UI rewrite with Canvas mode, [MCP Apps + client-side](https://github.com/mudler/LocalAI/pull/8947) with tool streaming, [WebRTC realtime audio](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed](https://github.com/mudler/LocalAI/pull/8801). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.0.0)
- **February 2026**: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
- **January 2026**: **LocalAI 3.10.0** — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0)
- **December 2025**: [Dynamic Memory Resource reclaimer](https://github.com/mudler/LocalAI/pull/7583), [Automatic multi-GPU model fitting (llama.cpp)](https://github.com/mudler/LocalAI/pull/7584), [Vibevoice backend](https://github.com/mudler/LocalAI/pull/7494)
@@ -236,11 +244,22 @@ A huge thank you to our generous sponsors who support this project covering CI e
<a href="https://www.spectrocloud.com/" target="blank">
<img height="200" src="https://github.com/user-attachments/assets/72eab1dd-8b93-4fc0-9ade-84db49f24962">
</a>
</p>
<details>
<summary>
Past sponsors
</summary>
<p align="center">
<a href="https://www.premai.io/" target="blank">
<img height="200" src="https://github.com/mudler/LocalAI/assets/2420543/42e4ca83-661e-4f79-8e46-ae43689683d6"> <br>
</a>
</p>
</details>
### Individual sponsors
A special thanks to individual sponsors, a full list is on [GitHub](https://github.com/sponsors/mudler) and [buymeacoffee](https://buymeacoffee.com/mudler). Special shout out to [drikster80](https://github.com/drikster80) for being generous. Thank you everyone!

View File

@@ -117,6 +117,12 @@ ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile
# (which the turboquant Makefile reuses via a sibling build dir) errors out when the var
# is empty on a hipblas build, and the prebuilt path is what CI exercises most of the
# time. The builder-fromsource stage above already does this; mirror it here.
ARG AMDGPU_TARGETS
ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
ARG TARGETARCH
ARG TARGETVARIANT

View File

@@ -37,6 +37,22 @@ service Backend {
rpc Rerank(RerankRequest) returns (RerankResult) {}
// TokenClassify runs a token-classification (NER) model on the
// supplied text and returns each detected entity span. Used by the
// PII redactor's optional NER tier — the regex tier still handles
// formatted hits cheaply, while this catches names, locations, and
// other unformatted PII that regex misses.
rpc TokenClassify(TokenClassifyRequest) returns (TokenClassifyResponse) {}
// Score evaluates the model's joint log-probability of each
// supplied candidate continuation given a shared prompt. The
// prompt's KV cache is computed once and reused across candidates.
// Used for routing-policy multi-label classification, reranking,
// calibrated confidence, and reward-model scoring — any task where
// the consumer wants the model's confidence in a pre-specified
// continuation rather than a generated one.
rpc Score(ScoreRequest) returns (ScoreResponse) {}
rpc GetMetrics(MetricsRequest) returns (MetricsResponse);
rpc VAD(VADRequest) returns (VADResponse) {}
@@ -48,6 +64,11 @@ service Backend {
rpc AudioTransform(AudioTransformRequest) returns (AudioTransformResult) {}
rpc AudioTransformStream(stream AudioTransformFrameRequest) returns (stream AudioTransformFrameResponse) {}
// AudioToAudioStream is the bidirectional any-to-any S2S RPC. Backends
// that load a speech-to-speech model consume input audio frames and emit
// interleaved audio + transcript + tool-call deltas as typed events.
// Backends without S2S support return UNIMPLEMENTED.
rpc AudioToAudioStream(stream AudioToAudioRequest) returns (stream AudioToAudioResponse) {}
rpc ModelMetadata(ModelOptions) returns (ModelMetadataResponse) {}
@@ -63,6 +84,23 @@ service Backend {
rpc QuantizationProgress(QuantizationProgressRequest) returns (stream QuantizationProgressUpdate) {}
rpc StopQuantization(QuantizationStopRequest) returns (Result) {}
// Forward proxies a raw HTTP request to an upstream provider. The
// cloud-proxy backend implements this for passthrough-mode model
// configs: the client wire format is preserved end-to-end (no
// translation through internal proto), which means new provider
// fields work the day they ship. Translation-mode proxies use the
// standard Predict/PredictStream RPCs instead. Backends that don't
// support this return UNIMPLEMENTED.
//
// The request is bidirectionally streamed so large bodies can flow
// without buffering. In practice the first ForwardRequest carries
// path, method, headers, and the initial body chunk; subsequent
// messages append body chunks. The first ForwardReply carries the
// upstream status and response headers; subsequent messages stream
// body chunks (SSE frames or chunked transfer). Cancellation of the
// gRPC context closes the upstream connection.
rpc Forward(stream ForwardRequest) returns (stream ForwardReply) {}
}
// Define the empty request
@@ -76,6 +114,76 @@ message MetricsResponse {
int32 prompt_tokens_processed = 5;
}
// TokenClassifyRequest carries the text to classify plus an optional
// score threshold. The transformers backend interprets threshold as
// the minimum confidence to include in the response; 0 = include all.
message TokenClassifyRequest {
string text = 1;
float threshold = 2;
}
// TokenClassifyEntity is one detected entity span. Byte offsets are
// into the original UTF-8 text — start..end is a half-open range that
// addresses the substring corresponding to entity_group.
//
// entity_group follows HuggingFace's aggregated-tag convention (e.g.
// "PER", "LOC", "ORG", or a PII-specific label like "EMAIL" /
// "SSN" depending on the model). The redactor's per-pattern action
// map keys off this string.
message TokenClassifyEntity {
string entity_group = 1;
int32 start = 2;
int32 end = 3;
float score = 4;
string text = 5;
}
message TokenClassifyResponse {
repeated TokenClassifyEntity entities = 1;
}
// ScoreRequest carries one shared prompt and one or more continuations
// to score against it. The backend tokenises the prompt once and reuses
// the resulting KV cache across all candidates in this request.
message ScoreRequest {
string prompt = 1;
repeated string candidates = 2;
// Return per-token logprobs for each candidate when true. Default
// false to keep the wire response small; the joint log_prob field
// covers the common ranking case.
bool include_token_logprobs = 3;
// When true, the response also populates length_normalized_log_prob
// (joint log-prob divided by candidate token count). Useful when
// candidates differ in length and the consumer wants a per-token
// measure comparable across them (PMI-style scoring).
bool length_normalize = 4;
}
// CandidateScore is one row in the ScoreResponse, matching by index
// the candidate in ScoreRequest.candidates.
message CandidateScore {
// Sum of log P(token_i | prompt, candidate_token_<i) across the
// candidate's tokens. The primary ranking signal.
double log_prob = 1;
// log_prob / num_tokens — populated when length_normalize=true on
// the request.
double length_normalized_log_prob = 2;
// Per-token detail — populated when include_token_logprobs=true.
repeated TokenLogProb tokens = 3;
// Number of tokens the backend tokenised this candidate into, after
// any backend-specific normalisation (e.g. leading-space handling).
int32 num_tokens = 4;
}
message TokenLogProb {
string token = 1;
double log_prob = 2;
}
message ScoreResponse {
repeated CandidateScore candidates = 1;
}
message RerankRequest {
string query = 1;
repeated string documents = 2;
@@ -320,6 +428,25 @@ message ModelOptions {
// applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
// Unknown keys produce an error at LoadModel time.
string EngineArgs = 73;
// Proxy carries the cloud-proxy backend's per-model configuration.
// Empty for non-proxy backends.
ProxyOptions Proxy = 74;
}
// ProxyOptions configures the cloud-proxy backend. UpstreamURL and
// Mode are always meaningful; Provider only matters in translate mode.
// The two api_key_* fields are mutually exclusive and resolved by the
// backend at LoadModel — core forwards the references rather than the
// plaintext key.
message ProxyOptions {
string upstream_url = 1;
string mode = 2;
string provider = 3;
string api_key_env = 4;
string api_key_file = 5;
string upstream_model = 6;
int32 request_timeout_seconds = 7;
}
message Result {
@@ -410,6 +537,15 @@ message TTSRequest {
string dst = 3;
string voice = 4;
optional string language = 5;
// instructions is a free-form, per-request style/voice description (maps to
// the OpenAI `instructions` field). Backends that support expressive synthesis
// (e.g. Qwen3-TTS CustomVoice/VoiceDesign) prefer this over the static YAML
// option when set; backends that don't simply ignore it.
optional string instructions = 6;
// params carries optional, backend-specific per-request generation parameters
// (e.g. Chatterbox exaggeration/cfg_weight/temperature). Values are strings and
// coerced by the backend; unset leaves the backend's configured defaults.
map<string, string> params = 7;
}
message VADRequest {
@@ -768,6 +904,93 @@ message AudioTransformFrameResponse {
int64 frame_index = 2;
}
// === AudioToAudioStream messages =========================================
//
// Bidirectional stream between the LocalAI core and an any-to-any audio
// model. The client opens the stream with a Config payload, then alternates
// Frame (input audio) and Control (turn boundaries, function-call results,
// session updates) payloads. The server streams back typed events: audio
// frames carry PCM in `pcm`; transcript / tool-call deltas carry JSON in
// `meta`; the stream ends with a `response.done` (success) or `error` event.
message AudioToAudioRequest {
oneof payload {
AudioToAudioConfig config = 1;
AudioToAudioFrame frame = 2;
AudioToAudioControl control = 3;
}
}
message AudioToAudioConfig {
// PCM format for client→server audio. 0 => backend default
// (16 kHz for the LFM2-Audio Conformer encoder).
int32 input_sample_rate = 1;
// Preferred server→client audio rate. 0 => backend default
// (24 kHz for the LFM2-Audio vocoder).
int32 output_sample_rate = 2;
// Optional system prompt override. Empty => backend chooses based on
// mode (e.g. "Respond with interleaved text and audio.").
string system_prompt = 3;
// Optional baked-voice id. Models that only ship a fixed set of
// voices (e.g. LFM2-Audio: us_male/us_female/uk_male/uk_female) match
// this against their voice table; an empty string keeps the default.
string voice = 4;
// JSON-encoded array of tool definitions in OpenAI Chat Completions
// format. Empty => no tools.
string tools = 5;
// Free-form sampling / decoding parameters (temperature, top_k,
// max_new_tokens, audio_top_k, etc).
map<string, string> params = 6;
// True => reset any session-scoped state before processing further
// frames on this stream. The first Config implicitly resets.
bool reset = 7;
}
message AudioToAudioFrame {
// Raw PCM s16le mono at config.input_sample_rate. Empty pcm + end_of_input
// is a valid "user finished speaking" marker without trailing audio.
bytes pcm = 1;
// Marks the last frame of a user turn. The backend may begin emitting
// a response immediately after seeing this.
bool end_of_input = 2;
}
message AudioToAudioControl {
// Free-form control event names. Initial set:
// "input_audio_buffer.commit" — user finished speaking
// "response.cancel" — abort in-flight generation
// "conversation.item.create" — inject a non-audio item (e.g.
// function_call_output as JSON in
// `payload`)
// "session.update" — re-configure mid-stream
string event = 1;
// Event-specific JSON payload.
bytes payload = 2;
}
message AudioToAudioResponse {
// Event identifies what this frame carries. Mirrors the OpenAI Realtime
// API server-event names where applicable. Initial set:
// "response.audio.delta"
// "response.audio_transcript.delta"
// "response.function_call_arguments.delta"
// "response.function_call_arguments.done"
// "response.done"
// "error"
string event = 1;
// Populated when event = response.audio.delta.
bytes pcm = 2;
// Populated alongside pcm to identify its rate. 0 => same as the
// session's negotiated output_sample_rate.
int32 sample_rate = 3;
// JSON payload for non-PCM events (transcript chunk, tool args, error
// body).
bytes meta = 4;
// Monotonic per-stream counter, useful for client reordering and
// debugging.
int64 sequence = 5;
}
message ModelMetadataResponse {
bool supports_thinking = 1;
string rendered_template = 2; // The rendered chat template with enable_thinking=true (empty if not applicable)
@@ -910,3 +1133,32 @@ message QuantizationStopRequest {
string job_id = 1;
}
// ForwardHeader is one HTTP header on the request or response. Headers
// like Authorization are typically injected by the backend (from the
// resolved API key) rather than passed through from the client.
message ForwardHeader {
string name = 1;
string value = 2;
}
// ForwardRequest is a streamed HTTP request to the upstream. First
// message carries path/method/headers; subsequent messages carry
// body_chunk only. All fields except body_chunk are honoured on the
// first message and ignored thereafter.
message ForwardRequest {
string path = 1; // e.g. "/v1/chat/completions" — appended to the model's upstream_url
string method = 2; // usually "POST"
repeated ForwardHeader headers = 3;
bytes body_chunk = 4;
}
// ForwardReply is a streamed HTTP response from the upstream. First
// message carries status/headers; subsequent messages carry body_chunk
// only. SSE responses arrive as a sequence of body_chunk frames; the
// caller is responsible for any parsing.
message ForwardReply {
int32 status = 1;
repeated ForwardHeader headers = 2;
bytes body_chunk = 3;
}

View File

@@ -2,6 +2,7 @@ ds4/
build/
package/
grpc-server
ds4-worker
*.o
backend.pb.cc
backend.pb.h

View File

@@ -60,6 +60,11 @@ elseif(DS4_GPU STREQUAL "cpu")
set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
endif()
# ds4.c now references ds4_distributed.c (distributed inference was split into
# its own translation unit upstream). It is a single GPU-agnostic object shared
# by every GPU mode, so link it in regardless of DS4_GPU.
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_distributed.o")
add_executable(${TARGET}
grpc-server.cpp
dsml_parser.cpp
@@ -99,3 +104,36 @@ if(DS4_NATIVE)
target_compile_options(${TARGET} PRIVATE -march=native)
endif()
endif()
# ds4-worker: standalone distributed worker. Links the same ds4 engine objects
# (including ds4_distributed.o) but has NO gRPC/protobuf dependency - it speaks
# ds4's own TCP transport via ds4_dist_run(). Buildable wherever the engine
# objects build, even on hosts without protobuf/grpc dev headers.
add_executable(ds4-worker worker_main.c)
target_include_directories(ds4-worker PRIVATE ${DS4_DIR})
foreach(obj ${DS4_OBJS})
target_sources(ds4-worker PRIVATE ${obj})
set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
endforeach()
# worker_main.c is C, but the engine objects built by nvcc (ds4_cuda.o) and the
# Metal path (ds4_metal.o, Obj-C++) reference the C++ runtime (libstdc++). Force
# the C++ linker driver so those symbols resolve; the C driver would not link
# libstdc++ and the CUDA/Metal builds fail with undefined std:: references.
set_target_properties(ds4-worker PROPERTIES LINKER_LANGUAGE CXX)
target_link_libraries(ds4-worker PRIVATE Threads::Threads m)
if(DS4_GPU STREQUAL "cuda")
target_link_libraries(ds4-worker PRIVATE CUDA::cudart CUDA::cublas)
elseif(DS4_GPU STREQUAL "metal")
target_link_libraries(ds4-worker PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
elseif(DS4_GPU STREQUAL "cpu")
target_compile_definitions(ds4-worker PRIVATE DS4_NO_GPU)
endif()
if(DS4_NATIVE)
if(APPLE)
target_compile_options(ds4-worker PRIVATE -mcpu=native)
else()
target_compile_options(ds4-worker PRIVATE -march=native)
endif()
endif()

View File

@@ -1,10 +1,10 @@
# ds4 backend Makefile.
#
# Upstream pin lives below as DS4_VERSION?= so the bump-deps bot
# Upstream pin lives below as DS4_VERSION?=477c0e82e2699b35a65fd0a1ed6fe66b41087dfe
# (.github/bump_deps.sh) can find and update it - matches the
# llama-cpp / ik-llama-cpp / turboquant convention.
DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
DS4_VERSION?=477c0e82e2699b35a65fd0a1ed6fe66b41087dfe
DS4_REPO?=https://github.com/antirez/ds4
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
@@ -18,16 +18,19 @@ UNAME_S := $(shell uname -s)
CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
# ds4_distributed.o is a GPU-agnostic translation unit that ds4.c/ds4_cpu.o now
# reference (upstream split distributed inference into its own .c). The same
# object is shared by every GPU mode, so it is appended unconditionally below.
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS += -DDS4_GPU=cuda
DS4_OBJ_TARGET := ds4.o ds4_cuda.o
DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o
else ifeq ($(UNAME_S),Darwin)
CMAKE_ARGS += -DDS4_GPU=metal
DS4_OBJ_TARGET := ds4.o ds4_metal.o
DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o
else
# CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
CMAKE_ARGS += -DDS4_GPU=cpu
DS4_OBJ_TARGET := ds4_cpu.o
DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o
endif
ifneq ($(NATIVE),true)
@@ -52,17 +55,18 @@ ds4:
# the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
ds4/ds4.o: ds4
ifeq ($(BUILD_TYPE),cublas)
+$(MAKE) -C ds4 ds4.o ds4_cuda.o
+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o
else ifeq ($(UNAME_S),Darwin)
+$(MAKE) -C ds4 ds4.o ds4_metal.o
+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o
else
+$(MAKE) -C ds4 ds4_cpu.o
+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o
endif
grpc-server: ds4/ds4.o
mkdir -p $(BUILD_DIR)
cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
cp $(BUILD_DIR)/grpc-server grpc-server
cp $(BUILD_DIR)/ds4-worker ds4-worker
package: grpc-server
bash package.sh
@@ -71,7 +75,7 @@ test:
@echo "ds4 backend: e2e coverage at tests/e2e-backends/ (BACKEND_BINARY mode)"
clean:
rm -rf $(BUILD_DIR) grpc-server package
rm -rf $(BUILD_DIR) grpc-server ds4-worker package
if [ -d ds4 ]; then $(MAKE) -C ds4 clean; fi
purge: clean

View File

@@ -23,8 +23,11 @@ extern "C" {
#include <atomic>
#include <chrono>
#include <climits>
#include <csignal>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>
#include <memory>
#include <mutex>
@@ -51,6 +54,12 @@ ds4_session *g_session = nullptr;
int g_ctx_size = 32768;
std::string g_kv_cache_dir; // empty disables disk cache
// Distributed coordinator state. g_distributed is set true when LoadModel is
// given 'ds4_role:coordinator'; generation then waits for the worker route to
// form before running. Single-node behavior is unchanged when unset.
bool g_distributed = false;
int g_route_timeout_sec = 60;
std::atomic<Server *> g_server{nullptr};
// Parse a "key:value" option string. Returns empty when no colon.
@@ -60,6 +69,77 @@ static std::pair<std::string, std::string> split_option(const std::string &opt)
return {opt.substr(0, colon), opt.substr(colon + 1)};
}
// Parse a positive base-10 integer. Returns false (without throwing) on empty,
// trailing garbage, non-positive, or overflow - unlike std::stoi.
static bool parse_positive_int(const std::string &s, int *out) {
if (s.empty()) return false;
char *end = nullptr;
long v = std::strtol(s.c_str(), &end, 10);
if (!end || *end != '\0' || v <= 0 || v > INT_MAX) return false;
*out = static_cast<int>(v);
return true;
}
// Parse a ds4 layer spec "START:END" or "START:output" into the engine's
// distributed layer fields. Returns false on malformed input.
static bool parse_layers_spec(const std::string &spec, ds4_distributed_layers *out) {
auto colon = spec.find(':');
if (colon == std::string::npos) return false;
std::string lhs = spec.substr(0, colon);
std::string rhs = spec.substr(colon + 1);
if (lhs.empty() || rhs.empty()) return false;
char *end = nullptr;
long start = std::strtol(lhs.c_str(), &end, 10);
if (!end || *end != '\0' || start < 0) return false;
out->start = static_cast<uint32_t>(start);
out->has_output = false;
if (rhs == "output") {
out->has_output = true;
out->end = out->start; // engine treats has_output as "through final layer"
} else {
long e = std::strtol(rhs.c_str(), &end, 10);
if (!end || *end != '\0' || e < start) return false;
out->end = static_cast<uint32_t>(e);
}
out->set = true;
return true;
}
// When acting as a distributed coordinator, block until the worker route
// covers all layers (ds4_session_distributed_route_ready == 1) or the timeout
// elapses. Returns an empty string on success, or an error message to return
// to the client. No-op when not distributed.
//
// Takes the g_engine_mu lock by reference and RELEASES it during each poll
// sleep. The wait can span up to g_route_timeout_sec seconds while workers
// connect; holding g_engine_mu the whole time would block the Status/Health
// readiness probes (they also lock g_engine_mu), making LocalAI's loader treat
// a still-starting worker as hung.
static std::string wait_route_ready(std::unique_lock<std::mutex> &lock) {
if (!g_distributed) return "";
char err[256] = {0};
const int deadline_polls = g_route_timeout_sec * 10; // 100ms per poll
for (int i = 0; i <= deadline_polls; ++i) {
int ready = ds4_session_distributed_route_ready(g_session, err, sizeof(err));
if (ready == 1) return "";
if (ready < 0) {
return std::string("ds4 distributed route error: ") +
(err[0] ? err : "unknown");
}
// Release the lock while sleeping so Status/Health and other RPCs can
// interleave during worker startup.
lock.unlock();
struct timespec ts = {0, 100L * 1000L * 1000L}; // 100ms
nanosleep(&ts, nullptr);
lock.lock();
// A concurrent Free() may have torn down the engine while we slept.
if (!g_engine || !g_session) {
return "ds4: model unloaded while waiting for distributed route";
}
}
return "ds4 distributed route incomplete: workers not connected (layers uncovered)";
}
static void append_token_text(ds4_engine *engine, int token, std::string &out) {
size_t len = 0;
const char *text = ds4_token_text(engine, token, &len);
@@ -377,6 +457,11 @@ public:
backend::Result *result) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
// Reset distributed state so a model swap (a second LoadModel without
// ds4_role) doesn't inherit a stale coordinator configuration.
g_distributed = false;
g_route_timeout_sec = 60;
if (g_engine) {
if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
ds4_engine_close(g_engine);
@@ -394,12 +479,23 @@ public:
std::string mtp_path;
int mtp_draft = 0;
float mtp_margin = 3.0f;
std::string ds4_role, ds4_layers, ds4_listen;
for (const auto &opt : request->options()) {
auto [k, v] = split_option(opt);
if (k == "mtp_path") mtp_path = v;
else if (k == "mtp_draft") mtp_draft = std::stoi(v);
else if (k == "mtp_margin") mtp_margin = std::stof(v);
else if (k == "kv_cache_dir") g_kv_cache_dir = v;
else if (k == "ds4_role") ds4_role = v;
else if (k == "ds4_layers") ds4_layers = v;
else if (k == "ds4_listen") ds4_listen = v;
else if (k == "ds4_route_timeout") {
if (!parse_positive_int(v, &g_route_timeout_sec)) {
result->set_success(false);
result->set_message("ds4: ds4_route_timeout must be a positive integer");
return GStatus::OK;
}
}
}
g_kv_cache.SetDir(g_kv_cache_dir);
@@ -422,6 +518,49 @@ public:
opt.backend = DS4_BACKEND_CUDA;
#endif
// Coordinator wiring. 'ds4_role:coordinator' enables layer-split
// distributed inference: this process listens on ds4_listen and owns
// the ds4_layers slice; workers dial in (see `local-ai worker
// ds4-distributed`). Absent ds4_role => unchanged single-node path.
// Must be static: opt.distributed.listen_host is a const char* the
// engine retains past this call, so it cannot point at a local that
// goes out of scope (otherwise a future "simplify to local" refactor
// reintroduces a dangling pointer).
static std::string s_listen_host;
if (ds4_role == "coordinator") {
if (ds4_layers.empty() || ds4_listen.empty()) {
result->set_success(false);
result->set_message("ds4: ds4_role:coordinator requires ds4_layers and ds4_listen");
return GStatus::OK;
}
// host:port for IPv4/hostname; IPv6 literals are unsupported (the
// first colon would split inside the address).
auto host_port = split_option(ds4_listen); // "host:port" -> {host, port}
if (host_port.second.empty()) {
result->set_success(false);
result->set_message("ds4: ds4_listen must be host:port");
return GStatus::OK;
}
int listen_port = 0;
if (!parse_positive_int(host_port.second, &listen_port)) {
result->set_success(false);
result->set_message("ds4: ds4_listen port must be a positive integer");
return GStatus::OK;
}
ds4_distributed_layers layers = {};
if (!parse_layers_spec(ds4_layers, &layers)) {
result->set_success(false);
result->set_message("ds4: invalid ds4_layers (want START:END or START:output)");
return GStatus::OK;
}
s_listen_host = host_port.first;
opt.distributed.role = DS4_DISTRIBUTED_COORDINATOR;
opt.distributed.layers = layers;
opt.distributed.listen_host = s_listen_host.c_str();
opt.distributed.listen_port = listen_port;
g_distributed = true;
}
int rc = ds4_engine_open(&g_engine, &opt);
if (rc != 0 || !g_engine) {
result->set_success(false);
@@ -458,10 +597,13 @@ public:
GStatus Predict(ServerContext *, const backend::PredictOptions *request,
backend::Reply *reply) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
std::unique_lock<std::mutex> lock(g_engine_mu);
if (!g_engine || !g_session) {
return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
}
if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
return GStatus(StatusCode::UNAVAILABLE, route_err);
}
ds4_tokens prompt = {};
build_prompt(g_engine, request, &prompt);
int n_predict = request->tokens() > 0 ? request->tokens() : 256;
@@ -554,10 +696,13 @@ public:
GStatus PredictStream(ServerContext *, const backend::PredictOptions *request,
ServerWriter<backend::Reply> *writer) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
std::unique_lock<std::mutex> lock(g_engine_mu);
if (!g_engine || !g_session) {
return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
}
if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
return GStatus(StatusCode::UNAVAILABLE, route_err);
}
ds4_tokens prompt = {};
build_prompt(g_engine, request, &prompt);
int n_predict = request->tokens() > 0 ? request->tokens() : 256;

View File

@@ -5,7 +5,8 @@ REPO_ROOT="${CURDIR}/../../.."
mkdir -p "$CURDIR/package/lib"
cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
cp -rfv "$CURDIR/run.sh" "$CURDIR/package/"
cp -avf "$CURDIR/ds4-worker" "$CURDIR/package/"
cp -rfv "$CURDIR/run.sh" "$CURDIR/package/"
UNAME_S=$(uname -s)
if [ "$UNAME_S" = "Darwin" ]; then

View File

@@ -0,0 +1,126 @@
// ds4-worker: standalone distributed worker for the LocalAI ds4 backend.
//
// A ds4 distributed worker owns a slice of the model's transformer layers,
// dials the coordinator, and serves activations for its slice. It does NOT
// speak backend.proto - it speaks ds4's own TCP transport via ds4_dist_run().
// This binary is intentionally minimal (no HTTP/web/kvstore/linenoise): it
// only needs the engine objects + ds4_distributed.o, which the backend already
// builds. It is launched by `local-ai worker ds4-distributed`.
//
// Usage:
// ds4-worker --role worker --model <gguf> --layers 20:output \
// --coordinator <host> <port> [--cpu|--cuda|--metal] [-c CTX] [-t N]
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <limits.h>
#include "ds4.h"
#include "ds4_distributed.h"
static const char *need_arg(int *i, int argc, char **argv, const char *flag) {
if (*i + 1 >= argc) {
fprintf(stderr, "ds4-worker: missing value for %s\n", flag);
exit(2);
}
return argv[++(*i)];
}
static int parse_int_arg(const char *s, const char *flag) {
char *end = NULL;
long v = strtol(s, &end, 10);
if (!s[0] || *end || v <= 0 || v > INT_MAX) {
fprintf(stderr, "ds4-worker: invalid value for %s: %s\n", flag, s);
exit(2);
}
return (int)v;
}
static ds4_backend default_backend(void) {
#if defined(DS4_NO_GPU)
return DS4_BACKEND_CPU;
#elif defined(__APPLE__)
return DS4_BACKEND_METAL;
#else
return DS4_BACKEND_CUDA;
#endif
}
int main(int argc, char **argv) {
signal(SIGPIPE, SIG_IGN);
ds4_engine_options opt = {0};
opt.backend = default_backend();
int ctx_size = 32768;
for (int i = 1; i < argc; i++) {
const char *arg = argv[i];
if (!strcmp(arg, "-h") || !strcmp(arg, "--help")) {
fprintf(stdout, "ds4-worker: standalone ds4 distributed worker\n");
ds4_dist_usage(stdout);
fprintf(stdout, " -m, --model PATH model GGUF (the worker loads only its --layers slice)\n");
fprintf(stdout, " -c, --ctx N context size (default 32768)\n");
fprintf(stdout, " -t, --threads N CPU threads\n");
fprintf(stdout, " --cpu|--cuda|--metal backend override\n");
return 0;
}
char dist_err[256] = {0};
ds4_dist_cli_parse_result dist_parse =
ds4_dist_parse_cli_arg(arg, &i, argc, argv, &opt.distributed,
dist_err, sizeof(dist_err));
if (dist_parse == DS4_DIST_CLI_ERROR) {
fprintf(stderr, "ds4-worker: %s\n",
dist_err[0] ? dist_err : "invalid distributed option");
return 2;
}
if (dist_parse == DS4_DIST_CLI_MATCHED) continue;
if (!strcmp(arg, "-m") || !strcmp(arg, "--model")) {
opt.model_path = need_arg(&i, argc, argv, arg);
} else if (!strcmp(arg, "-c") || !strcmp(arg, "--ctx")) {
ctx_size = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
} else if (!strcmp(arg, "-t") || !strcmp(arg, "--threads")) {
opt.n_threads = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
} else if (!strcmp(arg, "--cpu")) {
opt.backend = DS4_BACKEND_CPU;
} else if (!strcmp(arg, "--cuda")) {
opt.backend = DS4_BACKEND_CUDA;
} else if (!strcmp(arg, "--metal")) {
opt.backend = DS4_BACKEND_METAL;
} else {
fprintf(stderr, "ds4-worker: unknown option: %s\n", arg);
return 2;
}
}
if (opt.distributed.role != DS4_DISTRIBUTED_WORKER) {
fprintf(stderr, "ds4-worker: --role worker is required\n");
return 2;
}
if (!opt.model_path) {
fprintf(stderr, "ds4-worker: --model is required\n");
return 2;
}
char prep_err[256] = {0};
if (ds4_dist_prepare_engine_options(&opt.distributed, &opt,
prep_err, sizeof(prep_err)) != 0) {
fprintf(stderr, "ds4-worker: %s\n", prep_err);
return 2;
}
ds4_engine *engine = NULL;
if (ds4_engine_open(&engine, &opt) != 0 || !engine) {
fprintf(stderr, "ds4-worker: failed to open engine\n");
return 1;
}
ds4_dist_generation_options gen = {0};
gen.ctx_size = ctx_size;
int rc = ds4_dist_run(engine, &opt.distributed, &gen);
ds4_engine_close(engine);
return rc;
}

View File

@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=eb570eb96689c235933b813693ca28ab9d3d26de
IK_LLAMA_VERSION?=1520eda980564241434b791ce2bbbd128c4be9ea
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=

View File

@@ -1,5 +1,5 @@
LLAMA_VERSION?=389ff61d77b5c71cec0cf92fe4e5d01ace80b797
LLAMA_VERSION?=7c158fbb4aec1bdc9c81d6ca0e785139f4826fae
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=

View File

@@ -32,10 +32,14 @@
#include <grpcpp/health_check_service_interface.h>
#include <grpcpp/security/server_credentials.h>
#include <regex>
#include <algorithm>
#include <atomic>
#include <cmath>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <list>
#include <map>
#include <mutex>
#include <signal.h>
#include <thread>
@@ -118,6 +122,40 @@ static std::string base64_encode_bytes(const unsigned char* data, size_t len) {
bool loaded_model; // TODO: add a mutex for this, but happens only once loading the model
// Score bypasses the slot loop (see the comment on Score below) so it
// must not run concurrently with any slot-loop RPC. These counters
// are a defence-in-depth tripwire — ModelConfig.Validate already
// rejects llama-cpp configs that mix score with chat/completion/
// embeddings, so a healthy deployment never trips them. seq_cst is
// load-bearing for the increment-then-check pattern below.
static std::atomic<int> slot_loop_inflight{0};
static std::atomic<int> score_inflight{0};
// Increment-then-check, not check-then-increment: two simultaneous
// racers both observe the other's increment and both abort cleanly.
// Reversed, both could see zero and proceed.
struct conflict_guard {
std::atomic<int>& self;
conflict_guard(const char* rpc, std::atomic<int>& self_, std::atomic<int>& other, const char* other_name)
: self(self_) {
self.fetch_add(1, std::memory_order_seq_cst);
int o = other.load(std::memory_order_seq_cst);
if (o > 0) {
fprintf(stderr,
"FATAL: %s called with %s=%d. The llama-cpp backend cannot "
"service Score and slot-loop RPCs concurrently — Score "
"bypasses the slot loop and races the llama_context. Bind "
"Score-using features to a model dedicated to scoring "
"(known_usecases: [score] with no chat/completion/embeddings).\n",
rpc, other_name, o);
std::abort();
}
}
~conflict_guard() {
self.fetch_sub(1, std::memory_order_seq_cst);
}
};
static std::function<void(int)> shutdown_handler;
static std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;
@@ -443,10 +481,24 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// Draft model for speculative decoding
if (!request->draftmodel().empty()) {
params.speculative.draft.mparams.path = request->draftmodel();
// Default to draft type if a draft model is set but no explicit type
// Default to draft type if a draft model is set but no explicit type.
// Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
// vector; the turboquant fork still uses the legacy scalar. The
// LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
// backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
// Upstream renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE
// in ggml-org/llama.cpp#22964; the fork still uses the old name.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
}
#else
const bool no_spec_type = params.speculative.types.empty() ||
(params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
if (no_spec_type) {
params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
}
#endif
}
// params.model_alias ??
@@ -500,16 +552,33 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.warmup = true;
// no_op_offload: disable host tensor op offload (default: false)
params.no_op_offload = false;
// kv_unified: enable unified KV cache (default: false)
params.kv_unified = false;
// n_ctx_checkpoints: max context checkpoints per slot (default: 8)
params.n_ctx_checkpoints = 8;
// llama memory fit fails if we don't provide a buffer for tensor overrides
const size_t ntbo = llama_max_tensor_buft_overrides();
while (params.tensor_buft_overrides.size() < ntbo) {
params.tensor_buft_overrides.push_back({nullptr, nullptr});
}
// kv_unified: enable unified KV cache. Upstream's server auto-enables this
// when the slot count is auto (-np <0), bumping n_parallel to 4 alongside.
// LocalAI keeps n_parallel=1 by default, which would skip that auto path
// and leave kv_unified=false. We flip the default to true here so the
// server-side prompt cache (cache_idle_slots) is actually usable on the
// single-slot path that LocalAI ships with: without it, idle slots are
// never persisted across requests and the prompt cache is dead weight.
// Users can opt out with `options: [ "kv_unified:false" ]`.
params.kv_unified = true;
// n_ctx_checkpoints: max context checkpoints per slot. Match upstream's
// default (32); the previous LocalAI-specific 8 was unnecessarily tight
// and limits partial-prefix recovery without a clear memory rationale.
params.n_ctx_checkpoints = 32;
// cache_idle_slots: save and clear idle slot KV to the prompt cache on
// task switch. Upstream default is true; the server auto-disables it if
// kv_unified=false or cache_ram_mib=0, so flipping kv_unified above is
// what actually unlocks it.
params.cache_idle_slots = true;
// checkpoint_min_step: minimum spacing between context checkpoints in
// tokens (0 disables the minimum). Match upstream's default (256). This
// field was renamed from `checkpoint_every_nt` in llama.cpp; the semantics
// also shifted from a fixed cadence to a minimum spacing. The turboquant
// fork branched before the field existed, so skip it on the legacy path
// (LOCALAI_LEGACY_LLAMA_CPP_SPEC is injected by patch-grpc-server.sh).
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
params.checkpoint_min_step = 256;
#endif
// decode options. Options are in form optname:optvale, or if booleans only optname.
for (int i = 0; i < request->options_size(); i++) {
@@ -668,15 +737,215 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
try {
params.n_ctx_checkpoints = std::stoi(optval_str);
} catch (const std::exception& e) {
// If conversion fails, keep default value (8)
// If conversion fails, keep default value (32)
}
}
// --- server-side idle-slot prompt cache toggle (upstream --cache-idle-slots) ---
// Saves the slot's KV state into the host-side prompt cache on task
// switch so a later request with the same prefix can warm-load it.
// Auto-disabled by the server if kv_unified=false or cache_ram=0.
} else if (!strcmp(optname, "cache_idle_slots") || !strcmp(optname, "idle_slots_cache")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.cache_idle_slots = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.cache_idle_slots = false;
}
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// --- minimum context-checkpoint spacing (upstream -cms / --checkpoint-min-step) ---
// 0 disables the minimum-spacing gate. Old option names (`checkpoint_every_nt`,
// `checkpoint_every_n_tokens`) are kept as aliases for backward compatibility
// with existing user configs: upstream renamed the field and shifted its
// semantics from a fixed cadence to a minimum spacing.
//
// Gated out for the turboquant fork, which lacks common_params::
// checkpoint_min_step. The leading `}` closing the cache_idle_slots
// branch is removed with this block; the next `} else if` (n_ubatch)
// then closes cache_idle_slots, so braces stay balanced under both
// preprocessor branches.
} else if (!strcmp(optname, "checkpoint_min_step") || !strcmp(optname, "checkpoint_min_spacing") ||
!strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
if (optval != NULL) {
try {
params.checkpoint_min_step = std::stoi(optval_str);
} catch (const std::exception& e) {
// If conversion fails, keep default value (256)
}
}
#endif
// --- physical batch size (upstream -ub / --ubatch-size) ---
// Note: line ~482 already aliases n_ubatch to n_batch as a default; this
// option lets users decouple the two (useful for embeddings/rerank).
} else if (!strcmp(optname, "n_ubatch") || !strcmp(optname, "ubatch")) {
if (optval != NULL) {
try { params.n_ubatch = std::stoi(optval_str); } catch (...) {}
}
// --- main-model batch threads (upstream -tb / --threads-batch) ---
} else if (!strcmp(optname, "threads_batch") || !strcmp(optname, "n_threads_batch")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n <= 0) n = (int)std::thread::hardware_concurrency();
params.cpuparams_batch.n_threads = n;
} catch (...) {}
}
// --- pooling type for embeddings (upstream --pooling) ---
} else if (!strcmp(optname, "pooling_type") || !strcmp(optname, "pooling")) {
if (optval != NULL) {
if (optval_str == "none") params.pooling_type = LLAMA_POOLING_TYPE_NONE;
else if (optval_str == "mean") params.pooling_type = LLAMA_POOLING_TYPE_MEAN;
else if (optval_str == "cls") params.pooling_type = LLAMA_POOLING_TYPE_CLS;
else if (optval_str == "last") params.pooling_type = LLAMA_POOLING_TYPE_LAST;
else if (optval_str == "rank") params.pooling_type = LLAMA_POOLING_TYPE_RANK;
// unknown values silently leave UNSPECIFIED (auto-detect)
}
// --- llama log verbosity threshold (upstream -lv / --verbosity) ---
} else if (!strcmp(optname, "verbosity")) {
if (optval != NULL) {
try { params.verbosity = std::stoi(optval_str); } catch (...) {}
}
// --- O_DIRECT model loading (upstream --direct-io) ---
} else if (!strcmp(optname, "direct_io") || !strcmp(optname, "use_direct_io")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.use_direct_io = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.use_direct_io = false;
}
// --- embedding normalization (upstream --embd-normalize) ---
// -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm
} else if (!strcmp(optname, "embd_normalize") || !strcmp(optname, "embedding_normalize")) {
if (optval != NULL) {
try { params.embd_normalize = std::stoi(optval_str); } catch (...) {}
}
// --- reasoning parser (upstream --reasoning-format) ---
// Picks the parser for <think> blocks emitted by reasoning models.
// none / auto / deepseek / deepseek-legacy
} else if (!strcmp(optname, "reasoning_format")) {
if (optval != NULL) {
if (optval_str == "none") params.reasoning_format = COMMON_REASONING_FORMAT_NONE;
else if (optval_str == "auto") params.reasoning_format = COMMON_REASONING_FORMAT_AUTO;
else if (optval_str == "deepseek") params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
else if (optval_str == "deepseek-legacy" || optval_str == "deepseek_legacy")
params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY;
// unknown values silently keep the upstream default (DEEPSEEK)
}
// --- reasoning budget (upstream --reasoning-budget) ---
// -1 unlimited, 0 disabled, >0 token budget for thinking blocks.
// Distinct from per-request `enable_thinking` (chat_template_kwargs).
} else if (!strcmp(optname, "enable_reasoning") || !strcmp(optname, "reasoning_budget")) {
if (optval != NULL) {
try { params.enable_reasoning = std::stoi(optval_str); } catch (...) {}
}
// --- prefill assistant turn (upstream --no-prefill-assistant) ---
} else if (!strcmp(optname, "prefill_assistant")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.prefill_assistant = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.prefill_assistant = false;
}
// --- mmproj GPU offload (upstream --no-mmproj-offload, inverted) ---
} else if (!strcmp(optname, "mmproj_use_gpu") || !strcmp(optname, "mmproj_offload")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.mmproj_use_gpu = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.mmproj_use_gpu = false;
}
// --- per-image vision token budget (upstream --image-min/max-tokens) ---
} else if (!strcmp(optname, "image_min_tokens")) {
if (optval != NULL) {
try { params.image_min_tokens = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "image_max_tokens")) {
if (optval != NULL) {
try { params.image_max_tokens = std::stoi(optval_str); } catch (...) {}
}
// --- main-model tensor buffer overrides (upstream --override-tensor) ---
// Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
// Mirrors the existing `draft_override_tensor` parser below.
} else if (!strcmp(optname, "override_tensor") || !strcmp(optname, "tensor_buft_overrides")) {
ggml_backend_load_all();
std::map<std::string, ggml_backend_buffer_type_t> buft_list;
for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
auto * dev = ggml_backend_dev_get(i);
auto * buft = ggml_backend_dev_buffer_type(dev);
if (buft) {
buft_list[ggml_backend_buft_name(buft)] = buft;
}
}
static std::list<std::string> override_names;
std::string cur;
auto flush = [&](const std::string & spec) {
auto pos = spec.find('=');
if (pos == std::string::npos) return;
const std::string name = spec.substr(0, pos);
const std::string type = spec.substr(pos + 1);
auto it = buft_list.find(type);
if (it == buft_list.end()) return; // unknown buffer type: ignore
override_names.push_back(name);
params.tensor_buft_overrides.push_back(
{override_names.back().c_str(), it->second});
};
for (char c : optval_str) {
if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
else { cur.push_back(c); }
}
if (!cur.empty()) flush(cur);
// Speculative decoding options
} else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
auto type = common_speculative_type_from_name(optval_str);
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// Fork only knows a single scalar `type`. Take the first comma-
// separated value and assign it via the singular helper.
std::string first = optval_str;
const auto comma = first.find(',');
if (comma != std::string::npos) first = first.substr(0, comma);
auto type = common_speculative_type_from_name(first);
if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
params.speculative.type = type;
}
#else
// Upstream switched to a vector of types (comma-separated for multi-type
// chaining via common_speculative_types_from_names). We keep accepting a
// single value here, but also tolerate comma-separated lists.
//
// ggml-org/llama.cpp#22964 also renamed the registered names from
// underscore- to dash-separated form, and replaced the bare
// `draft`/`eagle3` aliases with `draft-simple`/`draft-eagle3`. We
// normalize each token here so existing model configs keep working.
auto normalize_spec_name = [](std::string s) -> std::string {
std::replace(s.begin(), s.end(), '_', '-');
if (s == "draft") return "draft-simple";
if (s == "eagle3") return "draft-eagle3";
return s;
};
std::vector<std::string> names;
std::string item;
for (char c : optval_str) {
if (c == ',') {
if (!item.empty()) { names.push_back(normalize_spec_name(item)); item.clear(); }
} else {
item.push_back(c);
}
}
if (!item.empty()) names.push_back(normalize_spec_name(item));
auto parsed = common_speculative_types_from_names(names);
if (!parsed.empty()) {
params.speculative.types = parsed;
}
#endif
} else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
if (optval != NULL) {
try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
@@ -710,10 +979,155 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
try { params.speculative.draft.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "draft_ctx_size")) {
if (optval != NULL) {
try { params.speculative.draft.n_ctx = std::stoi(optval_str); } catch (...) {}
}
// The draft context size is no longer a separate field upstream: the draft
// shares the target context size. Accept the option for backward
// compatibility but silently ignore it.
// Everything below relies on struct shape introduced in ggml-org/llama.cpp#22838
// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
// fields. The turboquant fork branched before that, so its build defines
// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
// keys become unrecognized (silently dropped, like any unknown opt) for it.
//
// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
// closing-brace position of the `draft_ctx_size` branch on purpose: in the
// legacy build the chain ends here (the brace closes draft_ctx_size), and in
// the modern build the chain continues with `} else if (...)` instead, so the
// brace count stays balanced under both branches of the preprocessor.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
}
#else
// --- ngram_mod family (upstream --spec-ngram-mod-*) ---
} else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
if (optval != NULL) {
try { params.speculative.ngram_mod.n_min = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_mod_n_max")) {
if (optval != NULL) {
try { params.speculative.ngram_mod.n_max = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_mod_n_match")) {
if (optval != NULL) {
try { params.speculative.ngram_mod.n_match = std::stoi(optval_str); } catch (...) {}
}
// --- ngram_map_k family (upstream --spec-ngram-map-k-*) ---
} else if (!strcmp(optname, "spec_ngram_map_k_size_n")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k_size_m")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k_min_hits")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
// --- ngram_map_k4v family (upstream --spec-ngram-map-k4v-*) ---
} else if (!strcmp(optname, "spec_ngram_map_k4v_size_n")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k4v.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k4v_size_m")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k4v.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k4v_min_hits")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k4v.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
// --- ngram lookup caches (upstream --lookup-cache-static / -dynamic) ---
} else if (!strcmp(optname, "spec_lookup_cache_static") || !strcmp(optname, "lookup_cache_static")) {
params.speculative.ngram_cache.lookup_cache_static = optval_str;
} else if (!strcmp(optname, "spec_lookup_cache_dynamic") || !strcmp(optname, "lookup_cache_dynamic")) {
params.speculative.ngram_cache.lookup_cache_dynamic = optval_str;
// --- draft model KV cache types (upstream --spec-draft-type-k / -v) ---
} else if (!strcmp(optname, "draft_cache_type_k") || !strcmp(optname, "spec_draft_cache_type_k")) {
params.speculative.draft.cache_type_k = kv_cache_type_from_str(optval_str);
} else if (!strcmp(optname, "draft_cache_type_v") || !strcmp(optname, "spec_draft_cache_type_v")) {
params.speculative.draft.cache_type_v = kv_cache_type_from_str(optval_str);
// --- draft model thread counts (upstream --spec-draft-threads / -batch) ---
} else if (!strcmp(optname, "draft_threads") || !strcmp(optname, "spec_draft_threads")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n <= 0) n = (int)std::thread::hardware_concurrency();
params.speculative.draft.cpuparams.n_threads = n;
} catch (...) {}
}
} else if (!strcmp(optname, "draft_threads_batch") || !strcmp(optname, "spec_draft_threads_batch")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n <= 0) n = (int)std::thread::hardware_concurrency();
params.speculative.draft.cpuparams_batch.n_threads = n;
} catch (...) {}
}
// --- draft model MoE on CPU (upstream --spec-draft-cpu-moe / --spec-draft-n-cpu-moe) ---
} else if (!strcmp(optname, "draft_cpu_moe") || !strcmp(optname, "spec_draft_cpu_moe")) {
// Bool-style flag: optval may be missing, "true"/"1"/"yes" enables.
const bool enable = (optval == NULL) ||
optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
optval_str == "on" || optval_str == "enabled";
if (enable) {
params.speculative.draft.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
}
} else if (!strcmp(optname, "draft_n_cpu_moe") || !strcmp(optname, "spec_draft_n_cpu_moe")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n < 0) n = 0;
// Keep override-name storage alive for the lifetime of the params struct
// (mirrors upstream arg.cpp behavior with a function-local static).
static std::list<std::string> buft_overrides_draft;
for (int i = 0; i < n; ++i) {
buft_overrides_draft.push_back(llm_ffn_exps_block_regex(i));
params.speculative.draft.tensor_buft_overrides.push_back(
{buft_overrides_draft.back().c_str(), ggml_backend_cpu_buffer_type()});
}
} catch (...) {}
}
// --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
} else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
// Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
// We replicate upstream's parse_tensor_buffer_overrides (static in arg.cpp).
ggml_backend_load_all();
std::map<std::string, ggml_backend_buffer_type_t> buft_list;
for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
auto * dev = ggml_backend_dev_get(i);
auto * buft = ggml_backend_dev_buffer_type(dev);
if (buft) {
buft_list[ggml_backend_buft_name(buft)] = buft;
}
}
static std::list<std::string> draft_override_names;
std::string cur;
auto flush = [&](const std::string & spec) {
auto pos = spec.find('=');
if (pos == std::string::npos) return;
const std::string name = spec.substr(0, pos);
const std::string type = spec.substr(pos + 1);
auto it = buft_list.find(type);
if (it == buft_list.end()) return; // unknown buffer type: ignore
draft_override_names.push_back(name);
params.speculative.draft.tensor_buft_overrides.push_back(
{draft_override_names.back().c_str(), it->second});
};
for (char c : optval_str) {
if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
else { cur.push_back(c); }
}
if (!cur.empty()) flush(cur);
}
#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
}
// Set params.n_parallel from environment variable if not set via options (fallback)
@@ -753,6 +1167,26 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.kv_overrides.back().key[0] = 0;
}
// tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
// Real entries are pushed during option parsing; here we pad/terminate so the
// model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
// and so llama_params_fit has the placeholder slots it requires.
{
const size_t ntbo = llama_max_tensor_buft_overrides();
while (params.tensor_buft_overrides.size() < ntbo) {
params.tensor_buft_overrides.push_back({nullptr, nullptr});
}
}
// The draft tensor_buft_overrides are only populated under the modern
// (post-#22838) layout, whose population code is itself gated by
// LOCALAI_LEGACY_LLAMA_CPP_SPEC above. The turboquant fork lacks
// common_params_speculative::draft entirely, so skip the sentinel there too.
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
if (!params.speculative.draft.tensor_buft_overrides.empty()) {
params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
}
#endif
// TODO: Add yarn
if (!request->tensorsplit().empty()) {
@@ -1071,6 +1505,7 @@ public:
if (params_base.model.path.empty()) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
conflict_guard guard("PredictStream", slot_loop_inflight, score_inflight, "score_inflight");
json data = parse_options(true, request, params_base, ctx_server.get_llama_context());
@@ -1509,6 +1944,17 @@ public:
body_json["chat_template_kwargs"]["enable_thinking"] = (et_it->second == "true");
}
// Pass reasoning_effort via chat_template_kwargs too: the lever
// jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
// from enable_thinking which those templates ignore.
auto re_it = metadata.find("reasoning_effort");
if (re_it != metadata.end() && !re_it->second.empty()) {
if (!body_json.contains("chat_template_kwargs")) {
body_json["chat_template_kwargs"] = json::object();
}
body_json["chat_template_kwargs"]["reasoning_effort"] = re_it->second;
}
// Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
SRV_DBG("[CONVERSATION DEBUG] PredictStream: Full body_json before oaicompat_chat_params_parse:\n%s\n", body_json.dump(2).c_str());
@@ -1769,7 +2215,15 @@ public:
// content element — attaching to both would duplicate the first
// token since oaicompat_msg_diffs is the same for both.
json first_res_json = first_result->to_json();
if (first_res_json.is_array()) {
// Upstream llama.cpp (ggml-org/llama.cpp#23884) now emits an initial
// "begin" partial whose to_json() returns null, used only to signal the
// HTTP layer to flush 200 status headers before any token. gRPC has no
// such concept, so there is nothing to emit — the real tokens arrive in
// the loop below. Feeding this null into build_reply_from_json would
// throw (uncaught) and surface as a generic RPC error.
if (first_res_json.is_null()) {
// skip the begin-of-stream marker
} else if (first_res_json.is_array()) {
for (const auto & res : first_res_json) {
auto reply = build_reply_from_json(res, first_result.get());
// Skip chat deltas for role-init elements (have "role" in
@@ -1799,7 +2253,10 @@ public:
}
json res_json = result->to_json();
if (res_json.is_array()) {
if (res_json.is_null()) {
// begin-of-stream marker (see note above) — nothing to emit
continue;
} else if (res_json.is_array()) {
for (const auto & res : res_json) {
auto reply = build_reply_from_json(res, result.get());
bool is_role_init = res.contains("choices") && !res["choices"].empty() &&
@@ -1830,6 +2287,7 @@ public:
if (params_base.model.path.empty()) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
conflict_guard guard("Predict", slot_loop_inflight, score_inflight, "score_inflight");
json data = parse_options(true, request, params_base, ctx_server.get_llama_context());
data["stream"] = false;
@@ -2290,6 +2748,17 @@ public:
body_json["chat_template_kwargs"]["enable_thinking"] = (predict_et_it->second == "true");
}
// Pass reasoning_effort via chat_template_kwargs too: the lever
// jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
// from enable_thinking which those templates ignore.
auto predict_re_it = predict_metadata.find("reasoning_effort");
if (predict_re_it != predict_metadata.end() && !predict_re_it->second.empty()) {
if (!body_json.contains("chat_template_kwargs")) {
body_json["chat_template_kwargs"] = json::object();
}
body_json["chat_template_kwargs"]["reasoning_effort"] = predict_re_it->second;
}
// Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
SRV_DBG("[CONVERSATION DEBUG] Predict: Full body_json before oaicompat_chat_params_parse:\n%s\n", body_json.dump(2).c_str());
@@ -2588,6 +3057,7 @@ public:
if (params_base.model.path.empty()) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
conflict_guard guard("Embedding", slot_loop_inflight, score_inflight, "score_inflight");
json body = parse_options(false, request, params_base, ctx_server.get_llama_context());
body["stream"] = false;
@@ -2610,7 +3080,9 @@ public:
}
}
int embd_normalize = 2; // default to Euclidean/L2 norm
// Honor the load-time embd_normalize set via options:embd_normalize.
// -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm.
int embd_normalize = params_base.embd_normalize;
// create and queue the task
auto rd = ctx_server.get_response_reader();
{
@@ -2693,6 +3165,8 @@ public:
return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "\"documents\" must be a non-empty string array");
}
conflict_guard guard("Rerank", slot_loop_inflight, score_inflight, "score_inflight");
// Create and queue the task
auto rd = ctx_server.get_response_reader();
{
@@ -2704,7 +3178,7 @@ public:
tasks.reserve(documents.size());
for (size_t i = 0; i < documents.size(); i++) {
auto tmp = format_prompt_rerank(ctx_server.impl->model, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
auto tmp = format_prompt_rerank(ctx_server.impl->model_tgt, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
server_task task = server_task(SERVER_TASK_TYPE_RERANK);
task.id = rd.queue_tasks.get_new_id();
task.index = i;
@@ -2765,12 +3239,218 @@ public:
return grpc::Status::OK;
}
// Score returns the model's joint log-probability of each candidate
// continuation given a shared prompt.
//
// WHY bypass the slot/task queue: upstream server_context exposes
// get_llama_context as "main thread only" and the slot loop's
// update_slots() owns the context whenever a task is in flight.
// No public synchronization primitive is available — so Score is
// unsafe to call concurrently with active generation through this
// backend. In practice routing-classifier calls happen before the
// request is routed to a generation backend, so the model used
// for Score is typically idle. Concurrent Score calls are
// serialised by a local mutex; KV-cache state is isolated behind
// a dedicated sequence ID cleared between candidates.
//
// A patch to server-context.cpp that adds SERVER_TASK_TYPE_SCORE
// and routes scoring through the slot loop would be the correct
// long-term fix; tracked as a follow-up.
//
// Perf TODO (measured: ~450 ms warm for 3 candidates on Arch-
// Router-1.5B Q4_K_M + Intel SYCL): the current loop re-decodes
// `prompt + candidate` from scratch for every candidate, throwing
// away the prompt's KV cache between iterations. A smarter
// version would:
// 1. Decode just the prompt once into score_seq_id.
// 2. Snapshot/cp that sequence (llama_memory_seq_cp) into a
// per-candidate sequence id.
// 3. For each candidate, decode only its tokens onto the copy
// (continuing from the saved prompt state), read logits.
// 4. llama_memory_seq_rm the copy.
// Estimated speedup: 3-candidate calls 450 ms -> ~150-200 ms,
// 6-candidate calls 630 ms -> ~220 ms. Single source-file change,
// no proto / Go-side changes needed. Worth doing once routing is
// wired into the middleware and Score is on the hot path of every
// chat request.
grpc::Status Score(ServerContext* context, const backend::ScoreRequest* request, backend::ScoreResponse* response) override {
auto auth = checkAuth(context);
if (!auth.ok()) return auth;
if (params_base.model.path.empty()) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
if (request->candidates_size() == 0) {
return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "candidates must be non-empty");
}
// Tripwire against the slot loop. Acquired before score_mutex
// so it fires even when this Score is queued behind another.
conflict_guard guard("Score", score_inflight, slot_loop_inflight, "slot_loop_inflight");
// Serialise concurrent Score calls. The slot loop is still
// free to race with us — see the class comment above.
static std::mutex score_mutex;
std::lock_guard<std::mutex> score_lock(score_mutex);
llama_context * lctx = ctx_server.get_llama_context();
if (lctx == nullptr) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "llama context unavailable (sleeping?)");
}
const llama_vocab * vocab = ctx_server.impl->vocab;
const int32_t n_vocab = llama_vocab_n_tokens(vocab);
const int32_t n_ctx = llama_n_ctx(lctx);
llama_memory_t mem = llama_get_memory(lctx);
// The KV-cache is sized to seq_to_stream.size() at load
// (typically equal to n_slots, often 1). Sequence IDs must
// be in [0, n_seq_max), so we can't pick a high-value
// "private" ID — we have to share with the slot. We clear
// the cache before AND after each candidate to keep
// scoring isolated from whatever state the slot held, and
// the static mutex above guarantees no other Score call is
// racing in the meantime. The slot loop is still free to
// race (see comment on this method) — Score must not run
// concurrently with generation through this backend.
const llama_seq_id score_seq_id = 0;
llama_memory_seq_rm(mem, score_seq_id, -1, -1);
// Tokenize the shared prompt once with add_special=true so
// BOS is prepended when the model requires it. parse_special
// keeps chat-template markers in the prompt intact.
const std::string prompt = request->prompt();
std::vector<llama_token> prompt_tokens = common_tokenize(vocab, prompt, /*add_special=*/true, /*parse_special=*/true);
const int32_t prompt_len = (int32_t) prompt_tokens.size();
for (int ci = 0; ci < request->candidates_size(); ci++) {
const std::string & candidate_text = request->candidates(ci);
// Re-tokenize prompt + candidate as a single string. BPE
// merges across the boundary can shift the tokenization
// versus tokenize(prompt) ++ tokenize(candidate), so we
// find the divergence point against prompt_tokens.
std::vector<llama_token> full_tokens = common_tokenize(vocab, prompt + candidate_text, /*add_special=*/true, /*parse_special=*/true);
int32_t divergence = prompt_len;
const int32_t min_len = std::min<int32_t>(prompt_len, (int32_t) full_tokens.size());
for (int32_t i = 0; i < min_len; i++) {
if (prompt_tokens[i] != full_tokens[i]) {
divergence = i;
break;
}
}
const int32_t cand_len = (int32_t) full_tokens.size() - divergence;
backend::CandidateScore * cs = response->add_candidates();
cs->set_num_tokens(cand_len);
if (cand_len <= 0) {
cs->set_log_prob(0.0);
if (request->length_normalize()) {
cs->set_length_normalized_log_prob(0.0);
}
continue;
}
if (divergence < 1) {
// Need at least one prior token (typically BOS) to
// predict the first candidate token's logit. Tokeniser
// models without BOS + an empty prompt fall in here.
return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT,
"Score: prompt produced no leading tokens; need at least one (e.g. BOS) to predict candidate");
}
if ((int32_t) full_tokens.size() > n_ctx) {
return grpc::Status(grpc::StatusCode::OUT_OF_RANGE,
"Score: prompt+candidate exceeds context size (got " +
std::to_string(full_tokens.size()) + ", n_ctx=" + std::to_string(n_ctx) + ")");
}
// Build a batch covering the entire prompt+candidate. We
// need logits at (divergence-1) onward — those are the
// predictions for each candidate token.
llama_batch batch = llama_batch_init((int32_t) full_tokens.size(), 0, 1);
for (int32_t i = 0; i < (int32_t) full_tokens.size(); i++) {
batch.token[i] = full_tokens[i];
batch.pos[i] = i;
batch.n_seq_id[i] = 1;
batch.seq_id[i][0] = score_seq_id;
// logits[i] is "do we want the prediction *for the
// next token*, computed from this position?"
// We want predictions for candidate tokens at
// positions divergence .. full_tokens.size()-1, which
// come from logits at positions (divergence-1) ..
// (full_tokens.size()-2).
bool need_logit = (i >= divergence - 1) && (i < (int32_t) full_tokens.size() - 1);
batch.logits[i] = need_logit ? 1 : 0;
}
batch.n_tokens = (int32_t) full_tokens.size();
// Decode the batch. If decode fails (e.g. KV slot
// exhaustion), surface as INTERNAL — the caller will
// typically fall back to a sampling-based classifier.
int decode_err = llama_decode(lctx, batch);
if (decode_err != 0) {
llama_batch_free(batch);
llama_memory_seq_rm(mem, score_seq_id, -1, -1);
return grpc::Status(grpc::StatusCode::INTERNAL,
"llama_decode failed during Score: " + std::to_string(decode_err));
}
// Sum log-probabilities of the actual candidate tokens.
double total_log_prob = 0.0;
for (int32_t k = 0; k < cand_len; k++) {
// The k-th candidate token sits at full_tokens index
// (divergence + k). Its predicting logit is at batch
// position (divergence + k - 1).
int32_t logit_pos = divergence + k - 1;
const float * logits = llama_get_logits_ith(lctx, logit_pos);
if (logits == nullptr) {
llama_batch_free(batch);
llama_memory_seq_rm(mem, score_seq_id, -1, -1);
return grpc::Status(grpc::StatusCode::INTERNAL,
"llama_get_logits_ith returned null at position " + std::to_string(logit_pos));
}
llama_token target_token = full_tokens[divergence + k];
// Compute log_softmax(logits)[target_token] with the
// max-subtraction stability trick.
float max_logit = logits[0];
for (int32_t v = 1; v < n_vocab; v++) {
if (logits[v] > max_logit) max_logit = logits[v];
}
double sum_exp = 0.0;
for (int32_t v = 0; v < n_vocab; v++) {
sum_exp += std::exp((double)(logits[v] - max_logit));
}
double token_log_prob = (double)(logits[target_token] - max_logit) - std::log(sum_exp);
total_log_prob += token_log_prob;
if (request->include_token_logprobs()) {
backend::TokenLogProb * tlp = cs->add_tokens();
std::string piece = common_token_to_piece(lctx, target_token);
tlp->set_token(piece);
tlp->set_log_prob(token_log_prob);
}
}
cs->set_log_prob(total_log_prob);
if (request->length_normalize() && cand_len > 0) {
cs->set_length_normalized_log_prob(total_log_prob / (double) cand_len);
}
llama_batch_free(batch);
// Drop this candidate's KV-cache contribution so the next
// candidate starts from a clean state. Without this, the
// next decode would conflict at positions 0..N-1 for our
// sequence ID.
llama_memory_seq_rm(mem, score_seq_id, -1, -1);
}
return grpc::Status::OK;
}
grpc::Status TokenizeString(ServerContext* context, const backend::PredictOptions* request, backend::TokenizationResponse* response) override {
auto auth = checkAuth(context);
if (!auth.ok()) return auth;
if (params_base.model.path.empty()) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
conflict_guard guard("TokenizeString", slot_loop_inflight, score_inflight, "score_inflight");
json body = parse_options(false, request, params_base, ctx_server.get_llama_context());
body["stream"] = false;
@@ -2792,6 +3472,8 @@ public:
grpc::Status GetMetrics(ServerContext* /*context*/, const backend::MetricsRequest* /*request*/, backend::MetricsResponse* response) override {
conflict_guard guard("GetMetrics", slot_loop_inflight, score_inflight, "score_inflight");
// request slots data using task queue
auto rd = ctx_server.get_response_reader();
int task_id = rd.queue_tasks.get_new_id();
@@ -2882,7 +3564,7 @@ public:
// Get template source and reconstruct a common_chat_template for analysis
std::string tmpl_src = common_chat_templates_source(ctx_server.impl->chat_params.tmpls.get());
if (!tmpl_src.empty()) {
const auto * vocab = llama_model_get_vocab(ctx_server.impl->model);
const auto * vocab = llama_model_get_vocab(ctx_server.impl->model_tgt);
std::string token_bos, token_eos;
if (vocab) {
auto bos_id = llama_vocab_bos(vocab);

View File

@@ -1,7 +1,7 @@
# Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
TURBOQUANT_VERSION?=69d8e4be47243e83b3d0d71e932bc7aa61c644dc
TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
CMAKE_ARGS?=

View File

@@ -108,4 +108,50 @@ else
echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
fi
# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
# ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
# exposes the field as `model` on `server_context_impl`. The two call sites
# are in the Rerank and ModelMetadata RPC handlers.
if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> model_tgt rename OK"
else
echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
fi
# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
# grpc-server option parser skips the new option-handler blocks (ngram_mod,
# ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
# draft.tensor_buft_overrides) introduced for the post-#22838 layout, the
# draft.tensor_buft_overrides sentinel termination, and the
# common_params::checkpoint_min_step default/option (added with the
# 35c9b1f3 bump). Those blocks reference struct fields that simply do not
# exist in the fork.
if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
else
echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
# Insert the define before the very first `#include` so it precedes all the
# speculative-decoding code paths.
awk '
!done && /^#include/ {
print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
print ""
done = 1
}
{ print }
END {
if (!done) {
print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
fi
echo "==> all patches applied"

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# acestep.cpp version
ACESTEP_REPO?=https://github.com/ace-step/acestep.cpp
ACESTEP_CPP_VERSION?=e0c8d75a672fca5684c88c68dbf6d12f58754258
ACESTEP_CPP_VERSION?=ed53caf164e4492a5620b2e3f2264629cf66da24
SO_TARGET?=libgoacestepcpp.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -22,12 +22,11 @@
#include <vector>
// Global model contexts (loaded once, reused across requests)
static DiTGGML g_dit = {};
static DiTGGMLConfig g_dit_cfg;
static VAEGGML g_vae = {};
static bool g_dit_loaded = false;
static bool g_vae_loaded = false;
static bool g_is_turbo = false;
static DiTGGML g_dit = {};
static VAEGGML g_vae = {};
static bool g_dit_loaded = false;
static bool g_vae_loaded = false;
static bool g_is_turbo = false;
// Silence latent [15000, 64] — read once from DiT GGUF
static std::vector<float> g_silence_full;
@@ -72,10 +71,9 @@ int load_model(const char * lm_model_path, const char * text_encoder_path,
g_text_enc_path = text_encoder_path;
g_dit_path = dit_model_path;
// Load DiT model
// Load DiT model (backend init + config are handled inside dit_ggml_load)
fprintf(stderr, "[acestep-cpp] Loading DiT from %s\n", dit_model_path);
dit_ggml_init_backend(&g_dit);
if (!dit_ggml_load(&g_dit, dit_model_path, g_dit_cfg, nullptr, 0.0f)) {
if (!dit_ggml_load(&g_dit, dit_model_path)) {
fprintf(stderr, "[acestep-cpp] FATAL: failed to load DiT from %s\n", dit_model_path);
return 1;
}
@@ -149,16 +147,16 @@ int generate_music(const char * caption, const char * lyrics, int bpm,
// Compute T (latent frames at 25Hz)
int T = (int)(duration * FRAMES_PER_SECOND);
T = ((T + g_dit_cfg.patch_size - 1) / g_dit_cfg.patch_size) * g_dit_cfg.patch_size;
int S = T / g_dit_cfg.patch_size;
T = ((T + g_dit.cfg.patch_size - 1) / g_dit.cfg.patch_size) * g_dit.cfg.patch_size;
int S = T / g_dit.cfg.patch_size;
if (T > 15000) {
fprintf(stderr, "[acestep-cpp] ERROR: T=%d exceeds max 15000\n", T);
return 2;
}
int Oc = g_dit_cfg.out_channels; // 64
int ctx_ch = g_dit_cfg.in_channels - Oc; // 128
int Oc = g_dit.cfg.out_channels; // 64
int ctx_ch = g_dit.cfg.in_channels - Oc; // 128
fprintf(stderr, "[acestep-cpp] T=%d, S=%d, duration=%.1fs, seed=%d\n", T, S, duration, seed);
@@ -191,9 +189,8 @@ int generate_music(const char * caption, const char * lyrics, int bpm,
fprintf(stderr, "[acestep-cpp] caption: %d tokens, lyrics: %d tokens\n", S_text, S_lyric);
// 4. Text encoder forward
// 4. Text encoder forward (backend init handled inside qwen3_load_text_encoder)
Qwen3GGML text_enc = {};
qwen3_init_backend(&text_enc);
if (!qwen3_load_text_encoder(&text_enc, g_text_enc_path.c_str())) {
fprintf(stderr, "[acestep-cpp] FATAL: failed to load text encoder\n");
return 4;
@@ -209,9 +206,8 @@ int generate_music(const char * caption, const char * lyrics, int bpm,
std::vector<float> lyric_embed(H_text * S_lyric);
qwen3_embed_lookup(&text_enc, lyric_ids.data(), S_lyric, lyric_embed.data());
// 6. Condition encoder
// 6. Condition encoder (backend init handled inside cond_ggml_load)
CondGGML cond = {};
cond_ggml_init_backend(&cond);
if (!cond_ggml_load(&cond, g_dit_path.c_str())) {
fprintf(stderr, "[acestep-cpp] FATAL: failed to load condition encoder\n");
qwen3_free(&text_enc);

View File

@@ -0,0 +1,12 @@
GOCMD=go
cloud-proxy:
CGO_ENABLED=0 $(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o cloud-proxy ./
package:
bash package.sh
build: cloud-proxy package
clean:
rm -f cloud-proxy

View File

@@ -0,0 +1,16 @@
package main
import (
"testing"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// Ginkgo bootstrap. The other Test* functions in this package use
// raw testing.T and run independently; they coexist with Ginkgo
// specs registered via Describe / Context.
func TestCloudProxySpecs(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "cloud-proxy specs")
}

View File

@@ -0,0 +1,39 @@
package main
// cloud-proxy is a LocalAI backend that forwards request traffic to an
// external HTTP provider (OpenAI, Anthropic, etc.). Two modes:
//
// - passthrough: serves the Forward RPC; the client wire format is
// preserved end-to-end, no translation.
// - translate: serves Predict/PredictStream; the backend converts
// internal proto to the provider's wire format. (Phases 56.)
//
// LoadModel reads UpstreamURL/Mode/Provider/key references from
// ProxyOptions and resolves the API key once at load time.
import (
"flag"
"os"
grpc "github.com/mudler/LocalAI/pkg/grpc"
"github.com/mudler/xlog"
"golang.org/x/term"
)
var addr = flag.String("addr", "localhost:50051", "the address to listen on")
func main() {
// xlog's default handler emits ANSI color codes; that's fine for an
// interactive shell but unreadable when the backend's stdout is
// captured by LocalAI and tee'd to a log file. Force plain text when
// LOCALAI_LOG_FORMAT is unset and stdout isn't a terminal.
format := os.Getenv("LOCALAI_LOG_FORMAT")
if format == "" && !term.IsTerminal(int(os.Stdout.Fd())) {
format = xlog.TextFormat
}
xlog.SetLogger(xlog.NewLogger(xlog.LogLevel(os.Getenv("LOCALAI_LOG_LEVEL")), format))
flag.Parse()
if err := grpc.StartServer(*addr, NewCloudProxy()); err != nil {
panic(err)
}
}

View File

@@ -0,0 +1,13 @@
#!/bin/bash
# Script to copy the cloud-proxy binary into the package dir for the
# final Dockerfile stage. Mirrors backend/go/local-store/package.sh —
# no extra runtime libs needed since the backend is pure Go.
set -e
CURDIR=$(dirname "$(realpath $0)")
mkdir -p $CURDIR/package
cp -avf $CURDIR/cloud-proxy $CURDIR/package/
cp -rfv $CURDIR/run.sh $CURDIR/package/

View File

@@ -0,0 +1,325 @@
package main
import (
"context"
"errors"
"io"
"net/http"
"net/http/httptest"
"strconv"
"sync"
grpc "github.com/mudler/LocalAI/pkg/grpc"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("composeURL", func() {
// Upstream URL convention: gallery configs put the canonical path
// in upstream_url, so per-request Path is ignored. A bare-host
// upstream_url accepts the per-request path.
DescribeTable("path resolution",
func(upstream, reqPath, want string) {
got, err := composeURL(upstream, reqPath)
Expect(err).NotTo(HaveOccurred())
Expect(got).To(Equal(want))
},
Entry("full path wins", "https://api.openai.com/v1/chat/completions", "/v1/something-else", "https://api.openai.com/v1/chat/completions"),
Entry("bare host accepts path", "https://api.openai.com", "/v1/chat/completions", "https://api.openai.com/v1/chat/completions"),
Entry("root slash treated as bare", "https://api.openai.com/", "/v1/chat/completions", "https://api.openai.com/v1/chat/completions"),
Entry("bare host + empty path", "https://api.openai.com", "", "https://api.openai.com"),
)
It("returns an error on invalid upstream URL", func() {
_, err := composeURL("://garbage", "")
Expect(err).To(HaveOccurred())
})
})
var _ = Describe("applyAuthHeader", func() {
It("sets x-api-key and anthropic-version for Anthropic, no Authorization", func() {
req, _ := http.NewRequest("POST", "https://example.com", nil)
applyAuthHeader(req, providerAnthropic, "ant-key")
Expect(req.Header.Get("x-api-key")).To(Equal("ant-key"))
Expect(req.Header.Get("anthropic-version")).NotTo(BeEmpty())
Expect(req.Header.Get("Authorization")).To(BeEmpty(), "Authorization must not leak on Anthropic backend")
})
It("sets Bearer Authorization for OpenAI, no x-api-key", func() {
req, _ := http.NewRequest("POST", "https://example.com", nil)
applyAuthHeader(req, providerOpenAI, "sk-key")
Expect(req.Header.Get("Authorization")).To(Equal("Bearer sk-key"))
Expect(req.Header.Get("x-api-key")).To(BeEmpty(), "x-api-key must not leak on OpenAI backend")
})
It("defaults to Bearer when provider is empty", func() {
// Passthrough mode often has provider == "" because the operator
// doesn't claim a specific upstream wire format. Most providers
// (including OpenAI-compatible ones) accept Bearer, so default to it.
req, _ := http.NewRequest("POST", "https://example.com", nil)
applyAuthHeader(req, "", "some-key")
Expect(req.Header.Get("Authorization")).To(Equal("Bearer some-key"))
})
It("preserves an existing anthropic-version header", func() {
// If the client supplied anthropic-version (rare but legitimate
// for an upstream pinned to a specific date), the proxy must not
// clobber it.
req, _ := http.NewRequest("POST", "https://example.com", nil)
req.Header.Set("anthropic-version", "2024-10-01")
applyAuthHeader(req, providerAnthropic, "k")
Expect(req.Header.Get("anthropic-version")).To(Equal("2024-10-01"))
})
})
var _ = Describe("isHopByHopHeader", func() {
DescribeTable("hop-by-hop classification",
func(header string, want bool) {
Expect(isHopByHopHeader(header)).To(Equal(want))
},
Entry("Connection is hop-by-hop", "Connection", true),
Entry("Keep-Alive is hop-by-hop", "Keep-Alive", true),
Entry("Proxy-Connection is hop-by-hop", "Proxy-Connection", true),
Entry("Transfer-Encoding is hop-by-hop", "Transfer-Encoding", true),
Entry("TE is hop-by-hop", "TE", true),
Entry("Trailer is hop-by-hop", "Trailer", true),
Entry("Upgrade is hop-by-hop", "Upgrade", true),
Entry("Host is hop-by-hop", "Host", true),
Entry("Content-Length is hop-by-hop", "Content-Length", true),
// Case-insensitive — RFC 7230 doesn't constrain header case.
Entry("lowercase connection is hop-by-hop", "connection", true),
Entry("uppercase HOST is hop-by-hop", "HOST", true),
// Non hop-by-hop — must NOT be stripped.
Entry("Authorization is end-to-end", "Authorization", false),
Entry("Content-Type is end-to-end", "Content-Type", false),
Entry("Accept is end-to-end", "Accept", false),
Entry("X-Custom is end-to-end", "X-Custom", false),
)
})
var _ = Describe("Forward", func() {
It("strips hop-by-hop and Connection headers before upstream, preserves custom headers", func() {
gotConnection := make(chan string, 1)
gotXCustom := make(chan string, 1)
gotHost := make(chan string, 1)
upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
gotConnection <- r.Header.Get("Connection")
gotXCustom <- r.Header.Get("X-Custom")
gotHost <- r.Header.Get("Host")
w.WriteHeader(http.StatusOK)
}))
defer upstream.Close()
cp := NewCloudProxy()
Expect(cp.Load(&pb.ModelOptions{
Proxy: &pb.ProxyOptions{
UpstreamUrl: upstream.URL,
Mode: modePassthrough,
},
})).To(Succeed())
addr := "test://forward-hopbyhop"
grpc.Provide(addr, cp)
c := grpc.NewClient(addr, true, nil, false)
stream, err := c.Forward(context.Background())
Expect(err).NotTo(HaveOccurred())
Expect(stream.Send(&pb.ForwardRequest{
Path: "/v1/chat/completions",
Method: "POST",
Headers: []*pb.ForwardHeader{
{Name: "Connection", Value: "keep-alive"},
{Name: "Host", Value: "spoofed.example.com"},
{Name: "X-Custom", Value: "preserved"},
},
})).To(Succeed())
Expect(stream.CloseSend()).To(Succeed())
_, _ = stream.Recv()
for {
if _, err := stream.Recv(); errors.Is(err, io.EOF) || err != nil {
break
}
}
Expect(<-gotConnection).To(BeEmpty(), "Connection must not leak to upstream")
Expect(<-gotHost).NotTo(Equal("spoofed.example.com"), "Host header must not be spoofed through")
Expect(<-gotXCustom).To(Equal("preserved"), "X-Custom header must survive")
})
It("replaces caller-supplied Authorization with the configured key", func() {
// The proxy must overwrite a client-supplied Authorization header
// so a downstream caller can't smuggle stale or wrong credentials.
gotAuth := make(chan string, 1)
upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
gotAuth <- r.Header.Get("Authorization")
w.WriteHeader(http.StatusOK)
}))
defer upstream.Close()
GinkgoT().Setenv("CLOUD_PROXY_AUTH_REPLACE_KEY", "sk-real")
cp := NewCloudProxy()
Expect(cp.Load(&pb.ModelOptions{
Proxy: &pb.ProxyOptions{
UpstreamUrl: upstream.URL,
Mode: modePassthrough,
ApiKeyEnv: "CLOUD_PROXY_AUTH_REPLACE_KEY",
},
})).To(Succeed())
addr := "test://forward-replaces-auth"
grpc.Provide(addr, cp)
c := grpc.NewClient(addr, true, nil, false)
stream, err := c.Forward(context.Background())
Expect(err).NotTo(HaveOccurred())
Expect(stream.Send(&pb.ForwardRequest{
Path: "/v1/chat/completions",
Method: "POST",
Headers: []*pb.ForwardHeader{
// Client-supplied Authorization with the wrong scheme / key.
{Name: "Authorization", Value: "Basic Zm9vOmJhcg=="},
},
})).To(Succeed())
Expect(stream.CloseSend()).To(Succeed())
_, _ = stream.Recv()
for {
if _, err := stream.Recv(); errors.Is(err, io.EOF) || err != nil {
break
}
}
Expect(<-gotAuth).To(Equal("Bearer sk-real"), "caller-supplied Basic header must be replaced")
})
It("refuses to follow upstream redirects and never leaks the key to the redirect target", func() {
// A 3xx from the configured upstream means misconfiguration or a
// hijacked/spoofed host. Following it would replay the request —
// and the injected API key — to the Location host. Anthropic's
// x-api-key is NOT stripped by Go on cross-host redirects, so this
// would be a credential leak. The proxy must refuse the redirect.
sinkHit := make(chan string, 1)
sink := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sinkHit <- r.Header.Get("x-api-key")
w.WriteHeader(http.StatusOK)
}))
defer sink.Close()
redirector := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
http.Redirect(w, r, sink.URL, http.StatusFound)
}))
defer redirector.Close()
GinkgoT().Setenv("CLOUD_PROXY_REDIRECT_KEY", "ant-secret")
cp := NewCloudProxy()
Expect(cp.Load(&pb.ModelOptions{
Proxy: &pb.ProxyOptions{
UpstreamUrl: redirector.URL,
Mode: modePassthrough,
Provider: providerAnthropic,
ApiKeyEnv: "CLOUD_PROXY_REDIRECT_KEY",
},
})).To(Succeed())
addr := "test://forward-no-redirect"
grpc.Provide(addr, cp)
c := grpc.NewClient(addr, true, nil, false)
stream, err := c.Forward(context.Background())
Expect(err).NotTo(HaveOccurred())
Expect(stream.Send(&pb.ForwardRequest{
Path: "/v1/messages",
Method: "POST",
})).To(Succeed())
Expect(stream.CloseSend()).To(Succeed())
// Drain the stream; a refused redirect surfaces as a non-EOF error.
var streamErr error
for {
if _, err := stream.Recv(); err != nil {
if !errors.Is(err, io.EOF) {
streamErr = err
}
break
}
}
Expect(streamErr).To(HaveOccurred(), "refused redirect must surface as an error")
Expect(sinkHit).NotTo(Receive(), "the redirect target must never be contacted")
})
It("handles concurrent calls without interference", func() {
// CloudProxy explicitly omits base.SingleThread — independent
// Forward streams must not block each other or leak state.
upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
body, _ := io.ReadAll(r.Body)
w.WriteHeader(http.StatusOK)
_, _ = w.Write(body)
}))
defer upstream.Close()
cp := NewCloudProxy()
Expect(cp.Load(&pb.ModelOptions{
Proxy: &pb.ProxyOptions{
UpstreamUrl: upstream.URL,
Mode: modePassthrough,
},
})).To(Succeed())
addr := "test://forward-concurrent"
grpc.Provide(addr, cp)
c := grpc.NewClient(addr, true, nil, false)
const N = 8
var wg sync.WaitGroup
errs := make(chan error, N)
for i := 0; i < N; i++ {
wg.Add(1)
go func(idx int) {
defer wg.Done()
stream, err := c.Forward(context.Background())
if err != nil {
errs <- err
return
}
payload := "request-" + string(rune('A'+idx))
if err := stream.Send(&pb.ForwardRequest{
Path: "/v1/chat/completions",
Method: "POST",
BodyChunk: []byte(payload),
}); err != nil {
errs <- err
return
}
_ = stream.CloseSend()
_, _ = stream.Recv()
var body []byte
for {
r, err := stream.Recv()
if errors.Is(err, io.EOF) {
break
}
if err != nil {
errs <- err
return
}
body = append(body, r.GetBodyChunk()...)
}
if string(body) != payload {
errs <- &echoMismatch{want: payload, got: string(body)}
}
}(i)
}
wg.Wait()
close(errs)
var collected []error
for err := range errs {
collected = append(collected, err)
}
Expect(collected).To(BeEmpty(), "no concurrent Forward call should fail")
})
})
type echoMismatch struct{ want, got string }
func (e *echoMismatch) Error() string {
return "echo mismatch: want " + strconv.Quote(e.want) + " got " + strconv.Quote(e.got)
}

View File

@@ -0,0 +1,508 @@
package main
import (
"bufio"
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"net/http"
"strings"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/xlog"
)
// Anthropic Messages API wire-format types. Narrowed to what translate
// mode preserves through the Reply proto: text + tool_use blocks +
// usage tokens. Image blocks, prompt caching, metadata, and stop
// sequence metadata are not modelled — passthrough mode covers those.
//
// Notable differences from OpenAI:
// - max_tokens is REQUIRED. Anthropic 400s without it.
// - Roles are user/assistant only — system messages move to a
// top-level `system` string field.
// - Streaming SSE uses event: lines alongside data: lines. The
// events we care about: content_block_start (carries tool_use
// init: id + name), content_block_delta (text_delta with text;
// input_json_delta with partial_json for tool arguments), and
// message_stop (terminates the stream). Others are ignored.
type anthropicRequest struct {
Model string `json:"model"`
MaxTokens int32 `json:"max_tokens"`
System string `json:"system,omitempty"`
Messages []anthropicMessage `json:"messages"`
Stream bool `json:"stream,omitempty"`
Temperature *float64 `json:"temperature,omitempty"`
TopP *float64 `json:"top_p,omitempty"`
StopSequences []string `json:"stop_sequences,omitempty"`
Tools []anthropicTool `json:"tools,omitempty"`
ToolChoice *anthropicToolChoice `json:"tool_choice,omitempty"`
}
// Content is `any` because Anthropic accepts a bare string OR a
// list of content blocks. Use the string form for plain user/
// assistant turns; switch to []anthropicContentBlock when the
// turn needs tool_use (assistant) or tool_result (user) blocks.
type anthropicMessage struct {
Role string `json:"role"`
Content any `json:"content"`
}
type anthropicTool struct {
Name string `json:"name"`
Description string `json:"description,omitempty"`
InputSchema json.RawMessage `json:"input_schema"`
}
// anthropicToolChoice mirrors the four shapes Anthropic accepts:
// {"type":"auto"} | {"type":"any"} | {"type":"tool","name":"X"} |
// {"type":"none"} (newer models). OpenAI's "auto"/"none"/
// "required"/{"function":{"name":"X"}} all map here.
type anthropicToolChoice struct {
Type string `json:"type"`
Name string `json:"name,omitempty"`
}
// anthropicContentBlock is the union shape used both for response
// blocks (text/tool_use we read off the wire) and outbound request
// blocks (tool_use/tool_result we emit in the conversation history).
// Anthropic encodes tool calls inline rather than as a separate field,
// so we walk Content[] looking for type=="tool_use" on responses and
// produce equivalent blocks when serialising prior-turn tool calls.
type anthropicContentBlock struct {
Type string `json:"type"`
Text string `json:"text,omitempty"`
ID string `json:"id,omitempty"`
Name string `json:"name,omitempty"`
Input json.RawMessage `json:"input,omitempty"`
// Tool-result block fields. tool_result uses `content` (not
// `text`) and pairs with `tool_use_id`; modelling them as
// distinct fields avoids ambiguity at marshal time.
ToolUseID string `json:"tool_use_id,omitempty"`
ResultContent string `json:"content,omitempty"`
}
type anthropicResponse struct {
ID string `json:"id"`
Type string `json:"type"`
Role string `json:"role"`
Content []anthropicContentBlock `json:"content"`
Model string `json:"model"`
Usage *anthropicUsage `json:"usage,omitempty"`
}
type anthropicUsage struct {
InputTokens int `json:"input_tokens"`
OutputTokens int `json:"output_tokens"`
}
// anthropicStreamEvent is the union shape used for every event type we
// process. Type discriminates; only the matching fields are populated.
// content_block_start carries ContentBlock (with id/name for tool_use);
// content_block_delta carries Delta (text or partial_json).
type anthropicStreamEvent struct {
Type string `json:"type"`
Index int `json:"index,omitempty"`
ContentBlock *anthropicContentBlock `json:"content_block,omitempty"`
Delta *anthropicStreamDelta `json:"delta,omitempty"`
Message *anthropicResponse `json:"message,omitempty"`
Usage *anthropicUsage `json:"usage,omitempty"`
}
type anthropicStreamDelta struct {
Type string `json:"type,omitempty"`
Text string `json:"text,omitempty"`
PartialJSON string `json:"partial_json,omitempty"`
}
// Anthropic requires max_tokens. If the caller didn't set it, use a
// generous-but-bounded default so the request doesn't 400.
const anthropicDefaultMaxTokens int32 = 4096
const anthropicToolChoiceNone = "none"
// Reused JSON-Schema defaults for malformed inputs. Anthropic requires
// input_schema to be a JSON object and tool_use.input to be a JSON
// object; clients that omit them must not 400 the entire request.
var (
emptyJSONObject = json.RawMessage(`{}`)
emptyObjectSchema = json.RawMessage(`{"type":"object","properties":{}}`)
)
func buildAnthropicRequest(opts *pb.PredictOptions, cfg *proxyConfig, stream bool) ([]byte, error) {
req := anthropicRequest{
Model: modelName(cfg, opts),
MaxTokens: opts.GetTokens(),
Stream: stream,
StopSequences: opts.GetStopPrompts(),
}
if req.MaxTokens <= 0 {
req.MaxTokens = anthropicDefaultMaxTokens
}
// Newer Anthropic models 400 when both temperature and top_p are
// set ("`temperature` and `top_p` cannot both be specified for
// this model. Please use only one.") even though their docs only
// "recommend" picking one. The OpenAI-compatible chat UI almost
// always sends both with default values, so prefer temperature
// and drop top_p when both are present.
if t := opts.GetTemperature(); t != 0 {
v := float64(t)
req.Temperature = &v
} else if t := opts.GetTopP(); t != 0 {
v := float64(t)
req.TopP = &v
}
req.Tools = convertOpenAITools(opts.GetTools())
req.ToolChoice = convertOpenAIToolChoice(opts.GetToolChoice())
// Anthropic rejects tool_choice without tools and older models
// don't accept {"type":"none"} — collapse to a no-tools request.
if req.ToolChoice != nil && req.ToolChoice.Type == anthropicToolChoiceNone {
req.Tools, req.ToolChoice = nil, nil
}
var systemParts []string
for _, m := range opts.GetMessages() {
role := m.GetRole()
if role == "system" {
if c := m.GetContent(); c != "" {
systemParts = append(systemParts, c)
}
continue
}
switch role {
case "user":
req.Messages = append(req.Messages, anthropicMessage{
Role: "user",
Content: m.GetContent(),
})
case "assistant":
if blocks := assistantBlocks(m); blocks != nil {
req.Messages = append(req.Messages, anthropicMessage{Role: "assistant", Content: blocks})
continue
}
req.Messages = append(req.Messages, anthropicMessage{
Role: "assistant",
Content: m.GetContent(),
})
case "tool", "function":
req.Messages = appendToolResult(req.Messages, anthropicContentBlock{
Type: "tool_result",
ToolUseID: m.GetToolCallId(),
ResultContent: m.GetContent(),
})
}
}
req.System = strings.Join(systemParts, "\n\n")
if len(req.Messages) == 0 && opts.GetPrompt() != "" {
req.Messages = []anthropicMessage{{Role: "user", Content: opts.GetPrompt()}}
}
return json.Marshal(req)
}
// appendToolResult appends a tool_result block as a user message,
// merging into a preceding user message that already carries blocks.
// Anthropic concatenates consecutive same-role messages on its end,
// but explicit merging keeps the body smaller and the conversation
// strictly alternating — which some upstream filters require.
func appendToolResult(msgs []anthropicMessage, block anthropicContentBlock) []anthropicMessage {
if n := len(msgs); n > 0 && msgs[n-1].Role == "user" {
if existing, ok := msgs[n-1].Content.([]anthropicContentBlock); ok {
msgs[n-1].Content = append(existing, block)
return msgs
}
}
return append(msgs, anthropicMessage{
Role: "user",
Content: []anthropicContentBlock{block},
})
}
func convertOpenAITools(toolsJSON string) []anthropicTool {
if toolsJSON == "" {
return nil
}
var raw []openAITool
if err := json.Unmarshal([]byte(toolsJSON), &raw); err != nil {
xlog.Warn("cloud-proxy: anthropic translate: unparseable tools JSON, dropping", "error", err)
return nil
}
tools := make([]anthropicTool, 0, len(raw))
for _, t := range raw {
if t.Function.Name == "" {
continue
}
schema := t.Function.Parameters
if len(schema) == 0 {
schema = emptyObjectSchema
}
tools = append(tools, anthropicTool{
Name: t.Function.Name,
Description: t.Function.Description,
InputSchema: schema,
})
}
return tools
}
// convertOpenAIToolChoice accepts the spec form
// ({type:function, function:{name:X}}) and the flat legacy form
// ({type:function, name:X}) some clients send. Unknown object shapes
// are warned and dropped rather than silently treated as auto.
func convertOpenAIToolChoice(toolChoiceJSON string) *anthropicToolChoice {
if toolChoiceJSON == "" {
return nil
}
var asString string
if err := json.Unmarshal([]byte(toolChoiceJSON), &asString); err == nil {
switch asString {
case "auto":
return &anthropicToolChoice{Type: "auto"}
case "none":
return &anthropicToolChoice{Type: anthropicToolChoiceNone}
case "required":
return &anthropicToolChoice{Type: "any"}
}
return nil
}
var asObj struct {
Type string `json:"type"`
Name string `json:"name"`
Function struct {
Name string `json:"name"`
} `json:"function"`
}
if err := json.Unmarshal([]byte(toolChoiceJSON), &asObj); err != nil {
xlog.Warn("cloud-proxy: anthropic translate: unparseable tool_choice, dropping", "error", err)
return nil
}
if name := asObj.Function.Name; name != "" {
return &anthropicToolChoice{Type: "tool", Name: name}
}
if asObj.Name != "" {
return &anthropicToolChoice{Type: "tool", Name: asObj.Name}
}
xlog.Warn("cloud-proxy: anthropic translate: unrecognised tool_choice shape, dropping", "shape", toolChoiceJSON)
return nil
}
// openAITool mirrors pkg/functions.Tool but keeps Parameters as
// json.RawMessage so the input_schema passes through verbatim — no
// re-marshal cost, no fidelity loss on exotic schemas.
type openAITool struct {
Type string `json:"type"`
Function struct {
Name string `json:"name"`
Description string `json:"description"`
Parameters json.RawMessage `json:"parameters"`
} `json:"function"`
}
func assistantBlocks(m *pb.Message) []anthropicContentBlock {
toolCallsJSON := m.GetToolCalls()
if toolCallsJSON == "" {
return nil
}
var toolCalls []openAIToolCall
if err := json.Unmarshal([]byte(toolCallsJSON), &toolCalls); err != nil || len(toolCalls) == 0 {
return nil
}
blocks := make([]anthropicContentBlock, 0, len(toolCalls)+1)
if text := m.GetContent(); text != "" {
blocks = append(blocks, anthropicContentBlock{Type: "text", Text: text})
}
for _, tc := range toolCalls {
// OpenAI's arguments are a JSON-encoded string; pass through
// as RawMessage so a non-JSON string from a poorly-formed
// local model doesn't crash the marshaller downstream.
args := json.RawMessage(tc.Function.Arguments)
if len(args) == 0 {
args = emptyJSONObject
}
blocks = append(blocks, anthropicContentBlock{
Type: "tool_use",
ID: tc.ID,
Name: tc.Function.Name,
Input: args,
})
}
return blocks
}
// doAnthropicRequest is the Anthropic counterpart of doOpenAIRequest.
// applyAuthHeader sets x-api-key and anthropic-version when provider
// is anthropic, so this method doesn't need to duplicate that.
func (c *CloudProxy) doAnthropicRequest(ctx context.Context, cfg *proxyConfig, body []byte) (*http.Response, error) {
req, err := http.NewRequestWithContext(ctx, http.MethodPost, cfg.upstreamURL, bytes.NewReader(body))
if err != nil {
return nil, fmt.Errorf("cloud-proxy: build request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Accept", "*/*")
if cfg.apiKey != "" {
applyAuthHeader(req, cfg.provider, cfg.apiKey)
}
resp, err := c.client.Do(req)
if err != nil {
return nil, fmt.Errorf("cloud-proxy: upstream request: %w", err)
}
return resp, nil
}
// predictAnthropicRich returns the full Reply: joined text from all
// text blocks, tool_use blocks mapped to ToolCallDelta, and usage
// tokens.
func (c *CloudProxy) predictAnthropicRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions) (*pb.Reply, error) {
body, err := buildAnthropicRequest(opts, cfg, false)
if err != nil {
return nil, fmt.Errorf("cloud-proxy: marshal request: %w", err)
}
resp, err := c.doAnthropicRequest(ctx, cfg, body)
if err != nil {
return nil, err
}
defer func() { _ = resp.Body.Close() }()
if resp.StatusCode >= 400 {
errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
return nil, fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
}
var parsed anthropicResponse
if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
return nil, fmt.Errorf("cloud-proxy: decode response: %w", err)
}
reply := &pb.Reply{}
if parsed.Usage != nil {
reply.PromptTokens = int32(parsed.Usage.InputTokens)
reply.Tokens = int32(parsed.Usage.OutputTokens)
}
var content strings.Builder
var toolCalls []*pb.ToolCallDelta
toolIdx := 0
for _, b := range parsed.Content {
switch b.Type {
case "text":
content.WriteString(b.Text)
case "tool_use":
// Input is a structured JSON object; we serialise to a
// string so it fits the OpenAI-shaped arguments field
// downstream consumers expect.
args := ""
if len(b.Input) > 0 {
args = string(b.Input)
}
toolCalls = append(toolCalls, newToolCallDelta(toolIdx, b.ID, b.Name, args))
toolIdx++
}
}
reply.Message = []byte(content.String())
if len(toolCalls) > 0 {
reply.ChatDeltas = []*pb.ChatDelta{{ToolCalls: toolCalls}}
}
return reply, nil
}
// predictAnthropicStreamRich streams Reply chunks from Anthropic's SSE.
// Three event types matter: content_block_start (initialises tool_use
// id+name), content_block_delta (carries text or input_json_delta),
// message_stop (terminates). The block index from the wire feeds
// straight into ToolCallDelta.Index so downstream consumers can
// reassemble multiple parallel tool calls.
func (c *CloudProxy) predictAnthropicStreamRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions, results chan<- *pb.Reply) error {
body, err := buildAnthropicRequest(opts, cfg, true)
if err != nil {
return fmt.Errorf("cloud-proxy: marshal request: %w", err)
}
resp, err := c.doAnthropicRequest(ctx, cfg, body)
if err != nil {
return err
}
defer func() { _ = resp.Body.Close() }()
if resp.StatusCode >= 400 {
errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
return fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
}
scanner := bufio.NewScanner(resp.Body)
scanner.Buffer(make([]byte, 0, 64*1024), 1<<20)
for scanner.Scan() {
line := scanner.Text()
if !strings.HasPrefix(line, "data:") {
continue
}
payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
if payload == "" {
continue
}
var ev anthropicStreamEvent
if err := json.Unmarshal([]byte(payload), &ev); err != nil {
xlog.Debug("cloud-proxy: skip malformed SSE chunk", "error", err)
continue
}
switch ev.Type {
case "content_block_start":
// tool_use blocks announce id + name here; arguments arrive
// in subsequent input_json_delta events. Emit a Reply with
// just the tool_call init fields so consumers can allocate
// a slot at this index.
if ev.ContentBlock != nil && ev.ContentBlock.Type == "tool_use" {
if !sendReply(ctx, results, &pb.Reply{
ChatDeltas: []*pb.ChatDelta{{ToolCalls: []*pb.ToolCallDelta{
newToolCallDelta(ev.Index, ev.ContentBlock.ID, ev.ContentBlock.Name, ""),
}}},
}) {
return ctx.Err()
}
}
case "content_block_delta":
if ev.Delta == nil {
continue
}
switch ev.Delta.Type {
case "text_delta":
if ev.Delta.Text == "" {
continue
}
if !sendReply(ctx, results, &pb.Reply{
Message: []byte(ev.Delta.Text),
ChatDeltas: []*pb.ChatDelta{{Content: ev.Delta.Text}},
}) {
return ctx.Err()
}
case "input_json_delta":
if ev.Delta.PartialJSON == "" {
continue
}
if !sendReply(ctx, results, &pb.Reply{
ChatDeltas: []*pb.ChatDelta{{ToolCalls: []*pb.ToolCallDelta{
newToolCallDelta(ev.Index, "", "", ev.Delta.PartialJSON),
}}},
}) {
return ctx.Err()
}
}
case "message_delta":
// Anthropic sends final usage in message_delta.usage. Emit
// a usage-only Reply so the consumer can record totals.
if ev.Usage != nil {
if !sendReply(ctx, results, &pb.Reply{
Tokens: int32(ev.Usage.OutputTokens),
}) {
return ctx.Err()
}
}
case "message_stop":
return nil
}
}
return scanner.Err()
}

View File

@@ -0,0 +1,334 @@
package main
import (
"encoding/json"
"io"
"math"
"net/http"
"net/http/httptest"
"strings"
"testing"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/gomega"
)
// fakeAnthropicUpstream mirrors fakeOpenAIUpstream but decodes the
// request body as an anthropicRequest so tests can assert on the
// translated wire shape (system field, max_tokens, etc.).
func fakeAnthropicUpstream(t *testing.T, handler func(req anthropicRequest) (status int, body string, contentType string)) (*httptest.Server, *anthropicRequest) {
t.Helper()
var captured anthropicRequest
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
raw, _ := io.ReadAll(r.Body)
_ = json.Unmarshal(raw, &captured)
status, body, ct := handler(captured)
w.Header().Set("Content-Type", ct)
w.WriteHeader(status)
_, _ = io.WriteString(w, body)
}))
return srv, &captured
}
func newAnthropicTranslateCloudProxy(t *testing.T, upstreamURL string) *CloudProxy {
t.Helper()
g := NewWithT(t)
t.Setenv("CLOUD_PROXY_ANTHROPIC_FAKE", "sk-ant-fake")
cp := NewCloudProxy()
err := cp.Load(&pb.ModelOptions{
Model: "claude-local",
Proxy: &pb.ProxyOptions{
UpstreamUrl: upstreamURL,
Mode: modeTranslate,
Provider: providerAnthropic,
ApiKeyEnv: "CLOUD_PROXY_ANTHROPIC_FAKE",
UpstreamModel: "claude-3-5-sonnet-20241022",
},
})
g.Expect(err).NotTo(HaveOccurred())
return cp
}
func TestPredict_Anthropic_BasicMessages(t *testing.T) {
g := NewWithT(t)
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"id":"msg_1","type":"message","role":"assistant","content":[{"type":"text","text":"hi there"}],"model":"claude-3-5-sonnet-20241022","usage":{"input_tokens":5,"output_tokens":2}}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
got, err := cp.Predict(&pb.PredictOptions{
Messages: []*pb.Message{
{Role: "system", Content: "be brief"},
{Role: "user", Content: "hello"},
},
Temperature: 0.5,
TopP: 0.9,
Tokens: 32,
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(got).To(Equal("hi there"))
g.Expect(captured.Model).To(Equal("claude-3-5-sonnet-20241022"))
// System message must be hoisted out of Messages into top-level field.
g.Expect(captured.System).To(Equal("be brief"))
g.Expect(captured.Messages).To(HaveLen(1))
g.Expect(captured.Messages[0].Role).To(Equal("user"))
g.Expect(captured.MaxTokens).To(Equal(int32(32)))
g.Expect(captured.Temperature).NotTo(BeNil())
g.Expect(*captured.Temperature).To(Equal(0.5))
// Anthropic 400s when both temperature and top_p are set; the
// translator must prefer temperature and drop top_p.
g.Expect(captured.TopP).To(BeNil())
g.Expect(captured.Stream).To(BeFalse())
}
// When only top_p is set, it should be forwarded.
func TestPredict_Anthropic_TopPOnly(t *testing.T) {
g := NewWithT(t)
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
_, err := cp.Predict(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "hello"}},
TopP: 0.9,
Tokens: 16,
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(captured.Temperature).To(BeNil())
// PredictOptions.TopP is float32 on the wire; the translator widens
// to float64 so 0.9 round-trips as 0.8999999761581421… — compare
// with a small tolerance rather than exact equality.
g.Expect(captured.TopP).NotTo(BeNil())
g.Expect(math.Abs(*captured.TopP - 0.9)).To(BeNumerically("<=", 1e-6))
}
func TestPredict_Anthropic_DefaultsMaxTokens(t *testing.T) {
g := NewWithT(t)
// Anthropic 400s without max_tokens. The translator must default
// it when the caller doesn't supply Tokens.
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
_, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(captured.MaxTokens).To(Equal(anthropicDefaultMaxTokens))
}
func TestPredict_Anthropic_PromptFallback(t *testing.T) {
g := NewWithT(t)
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
_, err := cp.Predict(&pb.PredictOptions{Prompt: "what time is it?", Tokens: 16})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(captured.Messages).To(HaveLen(1))
g.Expect(captured.Messages[0].Role).To(Equal("user"))
g.Expect(captured.Messages[0].Content).To(Equal("what time is it?"))
}
func TestPredict_Anthropic_ConcatenatesContentBlocks(t *testing.T) {
g := NewWithT(t)
// Anthropic may return multiple text blocks; the translator joins
// them so the Predict() string return is the full assistant message.
srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"content":[{"type":"text","text":"hello "},{"type":"text","text":"world"}]}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
got, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}, Tokens: 16})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(got).To(Equal("hello world"))
}
func TestPredict_Anthropic_UpstreamError(t *testing.T) {
g := NewWithT(t)
srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 401, `{"error":{"type":"authentication_error","message":"bad key"}}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
_, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}, Tokens: 16})
g.Expect(err).To(HaveOccurred())
g.Expect(err.Error()).To(ContainSubstring("401"))
}
func TestPredictStream_Anthropic_StreamsTextDeltas(t *testing.T) {
g := NewWithT(t)
// Real Anthropic SSE has event: lines + data: lines. The translator
// only needs the data: payload; only content_block_delta with
// delta.type=text_delta carries content. message_stop ends.
frames := []string{
"event: message_start\ndata: {\"type\":\"message_start\"}\n\n",
"event: content_block_start\ndata: {\"type\":\"content_block_start\",\"index\":0}\n\n",
"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"text_delta\",\"text\":\"hello\"}}\n\n",
"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"text_delta\",\"text\":\" \"}}\n\n",
"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"text_delta\",\"text\":\"world\"}}\n\n",
"event: content_block_stop\ndata: {\"type\":\"content_block_stop\",\"index\":0}\n\n",
"event: message_stop\ndata: {\"type\":\"message_stop\"}\n\n",
}
body := strings.Join(frames, "")
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, body, "text/event-stream"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
results := make(chan string, 8)
done := make(chan error, 1)
go func() {
done <- cp.PredictStream(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "hi"}},
Tokens: 16,
}, results)
}()
var got []string
for s := range results {
got = append(got, s)
}
err := <-done
g.Expect(err).NotTo(HaveOccurred())
g.Expect(strings.Join(got, "")).To(Equal("hello world"))
g.Expect(captured.Stream).To(BeTrue())
}
func TestBuildAnthropic_TranslatesOpenAITools(t *testing.T) {
g := NewWithT(t)
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
tools := `[{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}]`
_, err := cp.Predict(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "weather in Paris?"}},
Tools: tools,
ToolChoice: `"auto"`,
Tokens: 32,
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(captured.Tools).To(HaveLen(1))
g.Expect(captured.Tools[0].Name).To(Equal("get_weather"))
g.Expect(captured.Tools[0].Description).To(Equal("Get weather"))
// input_schema must be the parameters object verbatim.
g.Expect(string(captured.Tools[0].InputSchema)).To(ContainSubstring(`"city"`))
g.Expect(captured.ToolChoice).NotTo(BeNil())
g.Expect(captured.ToolChoice.Type).To(Equal("auto"))
}
func TestBuildAnthropic_ToolChoice_RequiredMapsToAny(t *testing.T) {
g := NewWithT(t)
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"content":[]}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
_, err := cp.Predict(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "x"}},
Tools: `[{"type":"function","function":{"name":"t","parameters":{"type":"object"}}}]`,
ToolChoice: `"required"`,
Tokens: 16,
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(captured.ToolChoice).NotTo(BeNil())
g.Expect(captured.ToolChoice.Type).To(Equal("any"))
}
func TestBuildAnthropic_ToolChoice_NoneDropsTools(t *testing.T) {
g := NewWithT(t)
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"content":[]}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
_, err := cp.Predict(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "x"}},
Tools: `[{"type":"function","function":{"name":"t","parameters":{"type":"object"}}}]`,
ToolChoice: `"none"`,
Tokens: 16,
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(captured.Tools).To(BeNil())
g.Expect(captured.ToolChoice).To(BeNil())
}
func TestBuildAnthropic_ToolChoice_NamedFunction(t *testing.T) {
g := NewWithT(t)
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"content":[]}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
_, err := cp.Predict(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "x"}},
Tools: `[{"type":"function","function":{"name":"weather","parameters":{"type":"object"}}}]`,
ToolChoice: `{"type":"function","function":{"name":"weather"}}`,
Tokens: 16,
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(captured.ToolChoice).NotTo(BeNil())
g.Expect(captured.ToolChoice.Type).To(Equal("tool"))
g.Expect(captured.ToolChoice.Name).To(Equal("weather"))
}
func TestBuildAnthropic_RoundTripsAssistantToolCalls(t *testing.T) {
g := NewWithT(t)
// LocalAI Assistant's second turn: the LLM previously emitted a
// tool_use, the server executed it, and the conversation now
// includes the assistant turn (with tool_calls) plus a tool-role
// result message. Both must convert to Anthropic block form.
srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
tools := `[{"type":"function","function":{"name":"list_models","parameters":{"type":"object"}}}]`
toolCallsJSON := `[{"id":"call_abc","type":"function","function":{"name":"list_models","arguments":"{}"}}]`
_, err := cp.Predict(&pb.PredictOptions{
Tools: tools,
Messages: []*pb.Message{
{Role: "user", Content: "what models are installed?"},
{Role: "assistant", Content: "", ToolCalls: toolCallsJSON},
{Role: "tool", Content: `{"models":["a","b"]}`, ToolCallId: "call_abc"},
},
Tokens: 64,
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(captured.Messages).To(HaveLen(3))
// 1. user text — bare string
s, ok := captured.Messages[0].Content.(string)
g.Expect(ok).To(BeTrue())
g.Expect(s).To(Equal("what models are installed?"))
// 2. assistant — must be a content-block list with one tool_use
// json.Unmarshal of `any` produces []any not []anthropicContentBlock.
blocks, ok := captured.Messages[1].Content.([]any)
g.Expect(ok).To(BeTrue())
g.Expect(blocks).To(HaveLen(1))
b0, _ := blocks[0].(map[string]any)
g.Expect(b0["type"]).To(Equal("tool_use"))
g.Expect(b0["id"]).To(Equal("call_abc"))
g.Expect(b0["name"]).To(Equal("list_models"))
// 3. tool → user with tool_result block
g.Expect(captured.Messages[2].Role).To(Equal("user"))
resBlocks, _ := captured.Messages[2].Content.([]any)
r0, _ := resBlocks[0].(map[string]any)
g.Expect(r0["type"]).To(Equal("tool_result"))
g.Expect(r0["tool_use_id"]).To(Equal("call_abc"))
g.Expect(r0["content"]).To(Equal(`{"models":["a","b"]}`))
}

View File

@@ -0,0 +1,119 @@
package main
import (
"encoding/json"
"strings"
"testing"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/gomega"
)
// Verify buildOpenAIRequest preserves caller-supplied tools and
// tool_choice as opaque JSON. PredictOptions carries them as strings;
// they must land in the outbound request body unchanged so the
// upstream sees the caller's intent verbatim. A regression here would
// silently disable function calling for translate-mode clients.
func TestBuildOpenAIRequest_ToolsAndToolChoicePassthrough(t *testing.T) {
g := NewWithT(t)
cfg := &proxyConfig{upstreamModel: "gpt-4o"}
toolsJSON := `[{"type":"function","function":{"name":"search","parameters":{"type":"object"}}}]`
choiceJSON := `{"type":"function","function":{"name":"search"}}`
body, err := buildOpenAIRequest(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "find x"}},
Tools: toolsJSON,
ToolChoice: choiceJSON,
}, cfg, false)
g.Expect(err).NotTo(HaveOccurred())
var decoded openAIRequest
err = json.Unmarshal(body, &decoded)
g.Expect(err).NotTo(HaveOccurred())
// Compare the JSON-canonical form so whitespace differences are ignored.
gotTools, _ := json.Marshal(json.RawMessage(decoded.Tools))
wantTools, _ := json.Marshal(json.RawMessage(toolsJSON))
g.Expect(string(gotTools)).To(Equal(string(wantTools)))
gotChoice, _ := json.Marshal(json.RawMessage(decoded.ToolChoice))
wantChoice, _ := json.Marshal(json.RawMessage(choiceJSON))
g.Expect(string(gotChoice)).To(Equal(string(wantChoice)))
}
// Garbage JSON in tools / tool_choice is silently dropped (omitted)
// rather than blowing up the request. Documents the parseRawJSON
// behaviour — operators shouldn't see hard failures from an upstream
// caller's mis-formatted tools field.
func TestBuildOpenAIRequest_InvalidToolsJSONDropped(t *testing.T) {
g := NewWithT(t)
cfg := &proxyConfig{upstreamModel: "gpt-4o"}
body, err := buildOpenAIRequest(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "x"}},
Tools: "this is not json",
ToolChoice: "{also bad",
}, cfg, false)
g.Expect(err).NotTo(HaveOccurred())
g.Expect(string(body)).NotTo(ContainSubstring("this is not json"))
g.Expect(string(body)).NotTo(ContainSubstring("{also bad"))
}
// Anthropic empty content array yields an empty Reply (not an error).
// Mirrors how an upstream tool_use-only response might arrive — the
// content array can legitimately be empty in some edge cases.
func TestPredictRich_Anthropic_EmptyContent(t *testing.T) {
g := NewWithT(t)
srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{"id":"m1","type":"message","role":"assistant","content":[],"usage":{"input_tokens":3,"output_tokens":0}}`, "application/json"
})
defer srv.Close()
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
reply, err := cp.PredictRich(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "x"}},
Tokens: 16,
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(string(reply.GetMessage())).To(Equal(""))
g.Expect(reply.GetChatDeltas()).To(HaveLen(0))
g.Expect(reply.GetPromptTokens()).To(Equal(int32(3)))
}
// A truncated / malformed SSE payload mid-stream should be tolerated:
// the malformed chunk gets skipped (xlog.Debug logged), valid chunks
// before AND after it still reach the channel.
func TestPredictStreamRich_OpenAI_TolerantOfBadChunks(t *testing.T) {
g := NewWithT(t)
body := strings.Join([]string{
`data: {"choices":[{"index":0,"delta":{"content":"hello"}}]}`,
``,
`data: this-is-not-json{{`,
``,
`data: {"choices":[{"index":0,"delta":{"content":" world"}}]}`,
``,
`data: [DONE]`,
``,
}, "\n")
srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
return 200, body, "text/event-stream"
})
defer srv.Close()
cp := newTranslateCloudProxy(t, srv.URL)
results := make(chan *pb.Reply, 8)
done := make(chan error, 1)
go func() {
done <- cp.PredictStreamRich(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "hi"}},
}, results)
close(results)
}()
var assembled strings.Builder
for reply := range results {
assembled.Write(reply.GetMessage())
}
err := <-done
g.Expect(err).NotTo(HaveOccurred())
// The good chunks before and after the malformed one both made it through.
g.Expect(assembled.String()).To(Equal("hello world"))
}

View File

@@ -0,0 +1,320 @@
package main
import (
"bufio"
"bytes"
"context"
"encoding/json"
"errors"
"fmt"
"io"
"net/http"
"strings"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/xlog"
)
// OpenAI Chat Completions wire-format types. Narrowed to the fields
// translate mode needs to preserve through the Reply proto: content,
// role, tool_calls (typed so we can map them to pb.ToolCallDelta),
// and sampling params copied verbatim from PredictOptions.
//
// Provider-specific extensions (logit_bias, function calling beyond
// tool_calls, etc.) are not modelled — passthrough mode covers callers
// that need full upstream fidelity.
type openAIRequest struct {
Model string `json:"model"`
Messages []openAIMessage `json:"messages"`
Stream bool `json:"stream,omitempty"`
Temperature *float64 `json:"temperature,omitempty"`
TopP *float64 `json:"top_p,omitempty"`
MaxTokens *int32 `json:"max_tokens,omitempty"`
Stop []string `json:"stop,omitempty"`
FrequencyPenalty *float64 `json:"frequency_penalty,omitempty"`
PresencePenalty *float64 `json:"presence_penalty,omitempty"`
Tools json.RawMessage `json:"tools,omitempty"`
ToolChoice json.RawMessage `json:"tool_choice,omitempty"`
}
type openAIMessage struct {
Role string `json:"role"`
Content string `json:"content,omitempty"`
Name string `json:"name,omitempty"`
ToolCallID string `json:"tool_call_id,omitempty"`
ToolCalls []openAIToolCall `json:"tool_calls,omitempty"`
}
// openAIToolCall covers both the non-streaming response shape (full
// id+function+arguments) and the streaming-delta shape (sparse fields,
// index assignment). The proto's ToolCallDelta absorbs both — name is
// set on first appearance, arguments arrive incrementally in streaming.
type openAIToolCall struct {
Index int `json:"index,omitempty"`
ID string `json:"id,omitempty"`
Type string `json:"type,omitempty"`
Function openAIFunctionCall `json:"function,omitempty"`
}
type openAIFunctionCall struct {
Name string `json:"name,omitempty"`
Arguments string `json:"arguments,omitempty"`
}
type openAIChoice struct {
Index int `json:"index"`
Message openAIMessage `json:"message"`
FinishReason string `json:"finish_reason"`
}
type openAIResponse struct {
ID string `json:"id"`
Choices []openAIChoice `json:"choices"`
Usage *openAIUsage `json:"usage,omitempty"`
}
type openAIStreamChoice struct {
Index int `json:"index"`
Delta struct {
Content string `json:"content,omitempty"`
Role string `json:"role,omitempty"`
ToolCalls []openAIToolCall `json:"tool_calls,omitempty"`
} `json:"delta"`
FinishReason string `json:"finish_reason,omitempty"`
}
type openAIStreamChunk struct {
Choices []openAIStreamChoice `json:"choices"`
Usage *openAIUsage `json:"usage,omitempty"`
}
type openAIUsage struct {
PromptTokens int `json:"prompt_tokens"`
CompletionTokens int `json:"completion_tokens"`
TotalTokens int `json:"total_tokens"`
}
// buildOpenAIRequest converts pb.PredictOptions into the OpenAI Chat
// Completions request body. Prefers Messages when non-empty; falls
// back to wrapping Prompt as a single user message so plain
// /completions-style calls still work in translate mode.
func buildOpenAIRequest(opts *pb.PredictOptions, cfg *proxyConfig, stream bool) ([]byte, error) {
req := openAIRequest{
Model: modelName(cfg, opts),
Stream: stream,
Stop: opts.GetStopPrompts(),
Tools: parseRawJSON(opts.GetTools()),
ToolChoice: parseRawJSON(opts.GetToolChoice()),
}
if t := opts.GetTemperature(); t != 0 {
v := float64(t)
req.Temperature = &v
}
if t := opts.GetTopP(); t != 0 {
v := float64(t)
req.TopP = &v
}
if n := opts.GetTokens(); n > 0 {
req.MaxTokens = &n
}
if p := opts.GetFrequencyPenalty(); p != 0 {
v := float64(p)
req.FrequencyPenalty = &v
}
if p := opts.GetPresencePenalty(); p != 0 {
v := float64(p)
req.PresencePenalty = &v
}
for _, m := range opts.GetMessages() {
msg := openAIMessage{
Role: m.GetRole(),
Content: m.GetContent(),
Name: m.GetName(),
ToolCallID: m.GetToolCallId(),
}
// Pre-existing tool_calls arrive as a JSON string from the
// upstream caller's previous assistant turn; pass-through as-is.
if tc := m.GetToolCalls(); tc != "" {
_ = json.Unmarshal([]byte(tc), &msg.ToolCalls)
}
req.Messages = append(req.Messages, msg)
}
// Fallback for plain Prompt requests (no Messages array). LocalAI
// templating may have produced a flat prompt; rewrap as a single
// user message so the upstream chat endpoint accepts it.
if len(req.Messages) == 0 && opts.GetPrompt() != "" {
req.Messages = []openAIMessage{{Role: "user", Content: opts.GetPrompt()}}
}
return json.Marshal(req)
}
// modelName picks the upstream model: upstream_model from the proxy
// config wins (operator override), else the local model name captured
// at LoadModel time. Operator sets upstream_model to map LocalAI's
// alias (e.g. "claude-strict") to the upstream's canonical name
// (e.g. "claude-3-5-sonnet-20241022").
func modelName(cfg *proxyConfig, _ *pb.PredictOptions) string {
if cfg.upstreamModel != "" {
return cfg.upstreamModel
}
return cfg.localModel
}
// parseRawJSON parses a JSON string into a RawMessage so it round-trips
// into the upstream body. Returns nil for empty/invalid input so the
// field is omitted (omitempty).
func parseRawJSON(s string) json.RawMessage {
if s == "" {
return nil
}
var probe json.RawMessage
if err := json.Unmarshal([]byte(s), &probe); err != nil {
return nil
}
return probe
}
// doOpenAIRequest builds + sends the upstream request. Returns the
// raw response on success; caller handles status / body.
func (c *CloudProxy) doOpenAIRequest(ctx context.Context, cfg *proxyConfig, body []byte) (*http.Response, error) {
req, err := http.NewRequestWithContext(ctx, http.MethodPost, cfg.upstreamURL, bytes.NewReader(body))
if err != nil {
return nil, fmt.Errorf("cloud-proxy: build request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Accept", "*/*")
if cfg.apiKey != "" {
applyAuthHeader(req, cfg.provider, cfg.apiKey)
}
resp, err := c.client.Do(req)
if err != nil {
return nil, fmt.Errorf("cloud-proxy: upstream request: %w", err)
}
return resp, nil
}
// predictOpenAIRich is the non-streaming translate path. Returns a
// fully-populated *pb.Reply with assistant content, tool calls, and
// token usage. The gRPC server forwards the Reply verbatim.
func (c *CloudProxy) predictOpenAIRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions) (*pb.Reply, error) {
body, err := buildOpenAIRequest(opts, cfg, false)
if err != nil {
return nil, fmt.Errorf("cloud-proxy: marshal request: %w", err)
}
resp, err := c.doOpenAIRequest(ctx, cfg, body)
if err != nil {
return nil, err
}
defer func() { _ = resp.Body.Close() }()
if resp.StatusCode >= 400 {
errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
return nil, fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
}
var parsed openAIResponse
if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
return nil, fmt.Errorf("cloud-proxy: decode response: %w", err)
}
if len(parsed.Choices) == 0 {
return nil, errors.New("cloud-proxy: upstream returned no choices")
}
choice := parsed.Choices[0]
reply := &pb.Reply{
Message: []byte(choice.Message.Content),
}
if parsed.Usage != nil {
reply.PromptTokens = int32(parsed.Usage.PromptTokens)
reply.Tokens = int32(parsed.Usage.CompletionTokens)
}
if len(choice.Message.ToolCalls) > 0 {
// Non-streaming: a single ChatDelta carries the full tool-call
// set. Index/Name/Arguments are populated together; downstream
// consumers don't need to assemble streaming deltas.
delta := &pb.ChatDelta{}
for _, tc := range choice.Message.ToolCalls {
delta.ToolCalls = append(delta.ToolCalls,
newToolCallDelta(tc.Index, tc.ID, tc.Function.Name, tc.Function.Arguments))
}
reply.ChatDeltas = []*pb.ChatDelta{delta}
}
return reply, nil
}
// predictOpenAIStreamRich streams *pb.Reply chunks. Each chunk carries
// either a content delta (Message + ChatDeltas[].Content) or tool-call
// deltas (ChatDeltas[].ToolCalls). The final Reply carries usage tokens
// when the upstream sends them (stream_options.include_usage).
func (c *CloudProxy) predictOpenAIStreamRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions, results chan<- *pb.Reply) error {
body, err := buildOpenAIRequest(opts, cfg, true)
if err != nil {
return fmt.Errorf("cloud-proxy: marshal request: %w", err)
}
resp, err := c.doOpenAIRequest(ctx, cfg, body)
if err != nil {
return err
}
defer func() { _ = resp.Body.Close() }()
if resp.StatusCode >= 400 {
errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
return fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
}
scanner := bufio.NewScanner(resp.Body)
scanner.Buffer(make([]byte, 0, 64*1024), 1<<20)
for scanner.Scan() {
line := scanner.Text()
if !strings.HasPrefix(line, "data:") {
continue
}
payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
if payload == "" || payload == "[DONE]" {
return nil
}
var chunk openAIStreamChunk
if err := json.Unmarshal([]byte(payload), &chunk); err != nil {
xlog.Debug("cloud-proxy: skip malformed SSE chunk", "error", err)
continue
}
// Usage frames may arrive separately from content frames when
// stream_options.include_usage is set; emit a usage-only Reply
// in that case so the consumer sees the totals.
if chunk.Usage != nil && len(chunk.Choices) == 0 {
if !sendReply(ctx, results, &pb.Reply{
PromptTokens: int32(chunk.Usage.PromptTokens),
Tokens: int32(chunk.Usage.CompletionTokens),
}) {
return ctx.Err()
}
continue
}
for _, ch := range chunk.Choices {
reply := &pb.Reply{}
if ch.Delta.Content != "" {
reply.Message = []byte(ch.Delta.Content)
reply.ChatDeltas = []*pb.ChatDelta{{Content: ch.Delta.Content}}
}
if len(ch.Delta.ToolCalls) > 0 {
if len(reply.ChatDeltas) == 0 {
reply.ChatDeltas = []*pb.ChatDelta{{}}
}
for _, tc := range ch.Delta.ToolCalls {
reply.ChatDeltas[0].ToolCalls = append(reply.ChatDeltas[0].ToolCalls,
newToolCallDelta(tc.Index, tc.ID, tc.Function.Name, tc.Function.Arguments))
}
}
if reply.Message == nil && len(reply.ChatDeltas) == 0 {
continue
}
if !sendReply(ctx, results, reply) {
return ctx.Err()
}
}
}
return scanner.Err()
}

View File

@@ -0,0 +1,170 @@
package main
import (
"encoding/json"
"io"
"net/http"
"net/http/httptest"
"strings"
"testing"
. "github.com/onsi/gomega"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)
// fakeOpenAIUpstream returns an httptest.Server that decodes the
// inbound request as an openAIRequest, calls handler with it, and
// writes the handler's reply as the response.
func fakeOpenAIUpstream(t *testing.T, handler func(req openAIRequest) (status int, body string, contentType string)) (*httptest.Server, *openAIRequest) {
t.Helper()
var captured openAIRequest
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
raw, _ := io.ReadAll(r.Body)
_ = json.Unmarshal(raw, &captured)
status, body, ct := handler(captured)
w.Header().Set("Content-Type", ct)
w.WriteHeader(status)
_, _ = io.WriteString(w, body)
}))
return srv, &captured
}
func newTranslateCloudProxy(t *testing.T, upstreamURL string) *CloudProxy {
t.Helper()
g := NewWithT(t)
t.Setenv("CLOUD_PROXY_OPENAI_FAKE", "sk-fake-openai")
cp := NewCloudProxy()
err := cp.Load(&pb.ModelOptions{
Model: "gpt-4o-local",
Proxy: &pb.ProxyOptions{
UpstreamUrl: upstreamURL,
Mode: modeTranslate,
Provider: providerOpenAI,
ApiKeyEnv: "CLOUD_PROXY_OPENAI_FAKE",
UpstreamModel: "gpt-4o",
},
})
g.Expect(err).NotTo(HaveOccurred())
return cp
}
func TestPredict_OpenAI_BasicChat(t *testing.T) {
g := NewWithT(t)
srv, captured := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
return 200, `{"id":"resp-1","choices":[{"index":0,"message":{"role":"assistant","content":"hi there"},"finish_reason":"stop"}],"usage":{"prompt_tokens":5,"completion_tokens":2,"total_tokens":7}}`, "application/json"
})
defer srv.Close()
cp := newTranslateCloudProxy(t, srv.URL)
got, err := cp.Predict(&pb.PredictOptions{
Messages: []*pb.Message{
{Role: "system", Content: "be brief"},
{Role: "user", Content: "hello"},
},
Temperature: 0.5,
TopP: 0.9,
Tokens: 32,
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(got).To(Equal("hi there"))
// Verify the upstream saw a properly-translated request.
g.Expect(captured.Model).To(Equal("gpt-4o"))
g.Expect(captured.Messages).To(HaveLen(2))
g.Expect(captured.Messages[0].Role).To(Equal("system"))
g.Expect(captured.Messages[1].Role).To(Equal("user"))
g.Expect(captured.Temperature).NotTo(BeNil())
g.Expect(*captured.Temperature).To(Equal(0.5))
g.Expect(captured.MaxTokens).NotTo(BeNil())
g.Expect(*captured.MaxTokens).To(Equal(int32(32)))
g.Expect(captured.Stream).To(BeFalse())
}
func TestPredict_OpenAI_PromptFallback(t *testing.T) {
g := NewWithT(t)
// No Messages array — backend should synth a single user message
// from Prompt so non-chat clients still route through translate.
srv, captured := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
return 200, `{"choices":[{"message":{"role":"assistant","content":"ok"}}]}`, "application/json"
})
defer srv.Close()
cp := newTranslateCloudProxy(t, srv.URL)
_, err := cp.Predict(&pb.PredictOptions{Prompt: "what time is it?"})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(captured.Messages).To(HaveLen(1))
g.Expect(captured.Messages[0].Role).To(Equal("user"))
g.Expect(captured.Messages[0].Content).To(Equal("what time is it?"))
}
func TestPredict_OpenAI_UpstreamError(t *testing.T) {
g := NewWithT(t)
srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
return 401, `{"error":{"message":"bad key"}}`, "application/json"
})
defer srv.Close()
cp := newTranslateCloudProxy(t, srv.URL)
_, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}})
g.Expect(err).To(HaveOccurred())
g.Expect(err.Error()).To(ContainSubstring("401"))
}
func TestPredictStream_OpenAI_StreamsContent(t *testing.T) {
g := NewWithT(t)
// Stream three content deltas then [DONE]. Verify the channel
// receives them in order with no missing pieces.
chunks := []string{
`{"choices":[{"index":0,"delta":{"role":"assistant"}}]}`,
`{"choices":[{"index":0,"delta":{"content":"hello"}}]}`,
`{"choices":[{"index":0,"delta":{"content":" "}}]}`,
`{"choices":[{"index":0,"delta":{"content":"world"}}]}`,
`{"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}`,
}
body := ""
for _, c := range chunks {
body += "data: " + c + "\n\n"
}
body += "data: [DONE]\n\n"
srv, captured := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
return 200, body, "text/event-stream"
})
defer srv.Close()
cp := newTranslateCloudProxy(t, srv.URL)
results := make(chan string, 8)
done := make(chan error, 1)
go func() {
done <- cp.PredictStream(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "hi"}},
}, results)
}()
var got []string
for s := range results {
got = append(got, s)
}
err := <-done
g.Expect(err).NotTo(HaveOccurred())
g.Expect(strings.Join(got, "")).To(Equal("hello world"))
g.Expect(captured.Stream).To(BeTrue())
}
func TestPredict_RejectedInPassthroughMode(t *testing.T) {
g := NewWithT(t)
t.Setenv("CLOUD_PROXY_FAKE", "k")
cp := NewCloudProxy()
err := cp.Load(&pb.ModelOptions{
Proxy: &pb.ProxyOptions{
UpstreamUrl: "https://example.com",
Mode: modePassthrough,
ApiKeyEnv: "CLOUD_PROXY_FAKE",
},
})
g.Expect(err).NotTo(HaveOccurred())
_, err = cp.Predict(&pb.PredictOptions{})
g.Expect(err).To(HaveOccurred())
g.Expect(err.Error()).To(ContainSubstring("only valid in translate"))
}

View File

@@ -0,0 +1,436 @@
package main
import (
"context"
"errors"
"fmt"
"io"
"net/http"
"net/url"
"os"
"strings"
"sync/atomic"
"github.com/mudler/xlog"
"github.com/mudler/LocalAI/pkg/grpc/base"
"github.com/mudler/LocalAI/pkg/grpc/grpcerrors"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/httpclient"
)
// Mirror of core/config.Proxy{Mode,Provider}* — backends don't
// import core to keep the boundary clean.
const (
modePassthrough = "passthrough"
modeTranslate = "translate"
providerOpenAI = "openai"
providerAnthropic = "anthropic"
)
// CloudProxy is the LocalAI backend that proxies model traffic to a
// configured upstream HTTP provider. Concurrency: base.SingleThread is
// NOT embedded — forward calls are independent and HTTP transport is
// goroutine-safe, so multiple Forward streams can run in parallel.
// Locking would serialise requests to a chat provider for no benefit.
type CloudProxy struct {
base.Base
cfg atomic.Pointer[proxyConfig]
client *http.Client
}
type proxyConfig struct {
upstreamURL string
mode string
provider string
upstreamModel string
localModel string // ModelOptions.Model — fallback when upstream_model is unset
apiKey string // resolved at Load time
}
func NewCloudProxy() *CloudProxy {
// httpclient.New refuses redirects outright: the proxy talks to a
// single configured upstream API (OpenAI/Anthropic/...) that answers
// directly, so a 3xx means misconfiguration, a hijacked upstream, or
// DNS trickery — never normal operation. Following it would replay the
// request, including the operator's x-api-key (which Go does NOT strip
// on cross-host redirects), to an unvetted host and leak the key
// (GHSA-3mj3-57v2-4636). It also imposes no body deadline, so streaming
// SSE responses that legitimately last minutes are not truncated.
return &CloudProxy{client: httpclient.New()}
}
func (c *CloudProxy) Load(opts *pb.ModelOptions) error {
po := opts.GetProxy()
if po == nil {
return errors.New("cloud-proxy: Load requires ProxyOptions to be set")
}
if po.GetUpstreamUrl() == "" {
return errors.New("cloud-proxy: upstream_url is required")
}
if _, err := url.ParseRequestURI(po.GetUpstreamUrl()); err != nil {
return fmt.Errorf("cloud-proxy: upstream_url %q invalid: %w", po.GetUpstreamUrl(), err)
}
mode := po.GetMode()
if mode == "" {
mode = modePassthrough
}
switch mode {
case modePassthrough:
case modeTranslate:
switch po.GetProvider() {
case providerOpenAI:
// implemented in provider_openai.go
case providerAnthropic:
// implemented in provider_anthropic.go
default:
return fmt.Errorf("cloud-proxy: translate mode requires provider in {%s, %s}, got %q",
providerOpenAI, providerAnthropic, po.GetProvider())
}
default:
return fmt.Errorf("cloud-proxy: unknown mode %q", mode)
}
key, err := resolveAPIKey(po.GetApiKeyEnv(), po.GetApiKeyFile())
if err != nil {
return err
}
c.cfg.Store(&proxyConfig{
upstreamURL: po.GetUpstreamUrl(),
mode: mode,
provider: po.GetProvider(),
upstreamModel: po.GetUpstreamModel(),
localModel: opts.GetModel(),
apiKey: key,
})
xlog.Info("cloud-proxy: ready",
"upstream", po.GetUpstreamUrl(),
"mode", mode,
"provider", po.GetProvider(),
"has_key", key != "")
return nil
}
// resolveAPIKey mirrors config.ProxyConfig.ResolveAPIKey. Duplicated
// (a few lines) rather than importing core/config from a backend
// binary — keeps backends independent of core's package layout.
// Mutual-exclusion is enforced upstream in core/config.Validate.
func resolveAPIKey(envName, filePath string) (string, error) {
if envName != "" {
v := os.Getenv(envName)
if v == "" {
return "", fmt.Errorf("cloud-proxy: api_key_env %q is unset", envName)
}
return v, nil
}
if filePath != "" {
b, err := os.ReadFile(filePath)
if err != nil {
return "", fmt.Errorf("cloud-proxy: read api_key_file %q: %w", filePath, err)
}
return strings.TrimSpace(string(b)), nil
}
return "", nil
}
// PredictRich is the non-streaming translate path. Returns a fully-
// populated *pb.Reply: content, tool-call deltas (ChatDeltas), and
// usage tokens. Implements the optional grpc.AIModelRich interface;
// the gRPC server prefers this path over Predict when present so
// tool calls survive the round-trip. Passthrough mode rejects
// PredictRich — callers must use Forward.
func (c *CloudProxy) PredictRich(opts *pb.PredictOptions) (reply *pb.Reply, err error) {
cfg := c.cfg.Load()
if cfg == nil {
return nil, grpcerrors.ModelNotLoaded("cloud-proxy")
}
if cfg.mode != modeTranslate {
return nil, fmt.Errorf("cloud-proxy: Predict only valid in translate mode (have %s)", cfg.mode)
}
xlog.Info("cloud-proxy: predict", "provider", cfg.provider, "upstream", cfg.upstreamURL, "upstream_model", cfg.upstreamModel)
defer func() {
if err != nil {
xlog.Warn("cloud-proxy: predict failed", "provider", cfg.provider, "error", err)
}
}()
ctx := context.Background()
switch cfg.provider {
case providerOpenAI:
return c.predictOpenAIRich(ctx, cfg, opts)
case providerAnthropic:
return c.predictAnthropicRich(ctx, cfg, opts)
default:
return nil, fmt.Errorf("cloud-proxy: predict not implemented for provider %q", cfg.provider)
}
}
// PredictStreamRich is the rich streaming counterpart of PredictRich.
// Each emitted Reply carries either a content delta, tool-call deltas,
// or usage tokens (the final upstream frame). base.Base.PredictStream
// is bypassed when AIModelRich is implemented, so the channel is
// closed by the gRPC server pump.
func (c *CloudProxy) PredictStreamRich(opts *pb.PredictOptions, results chan<- *pb.Reply) (err error) {
cfg := c.cfg.Load()
if cfg == nil {
return grpcerrors.ModelNotLoaded("cloud-proxy")
}
if cfg.mode != modeTranslate {
return fmt.Errorf("cloud-proxy: PredictStream only valid in translate mode (have %s)", cfg.mode)
}
xlog.Info("cloud-proxy: predict-stream", "provider", cfg.provider, "upstream", cfg.upstreamURL, "upstream_model", cfg.upstreamModel)
defer func() {
if err != nil {
xlog.Warn("cloud-proxy: predict-stream failed", "provider", cfg.provider, "error", err)
}
}()
ctx := context.Background()
switch cfg.provider {
case providerOpenAI:
return c.predictOpenAIStreamRich(ctx, cfg, opts, results)
case providerAnthropic:
return c.predictAnthropicStreamRich(ctx, cfg, opts, results)
default:
return fmt.Errorf("cloud-proxy: predictStream not implemented for provider %q", cfg.provider)
}
}
// Predict is the legacy (string, error) AIModel signature. Used only
// if a caller goes through the non-rich path (it shouldn't, since
// server.go prefers PredictRich). Provided so the AIModel interface
// is satisfied for backends that haven't opted into the rich variant.
func (c *CloudProxy) Predict(opts *pb.PredictOptions) (string, error) {
reply, err := c.PredictRich(opts)
if err != nil {
return "", err
}
return string(reply.GetMessage()), nil
}
// PredictStream is the legacy chan-string streaming path. Adapts the
// rich stream by extracting only content text — tool-call-only chunks
// (no Message bytes) and usage-only chunks are silently dropped, since
// the legacy chan-string contract cannot represent them. Consumers
// that need tool calls must call PredictStreamRich directly.
func (c *CloudProxy) PredictStream(opts *pb.PredictOptions, results chan string) error {
defer close(results)
richCh := make(chan *pb.Reply)
errCh := make(chan error, 1)
go func() {
errCh <- c.PredictStreamRich(opts, richCh)
close(richCh)
}()
for reply := range richCh {
if msg := reply.GetMessage(); len(msg) > 0 {
results <- string(msg)
}
}
return <-errCh
}
// sendReply pushes one Reply onto a stream channel honouring ctx
// cancellation. Returns false on cancel so the caller can exit with
// ctx.Err(). Used by both translate-mode providers.
func sendReply(ctx context.Context, results chan<- *pb.Reply, reply *pb.Reply) bool {
select {
case results <- reply:
return true
case <-ctx.Done():
return false
}
}
// newToolCallDelta is a small constructor for the cross-provider
// tool-call delta shape. Centralised so the int32 cast and the four
// fields stay consistent across the OpenAI / Anthropic translators.
// Empty name/args are valid — Anthropic streaming announces the call
// with id+name then sends arguments incrementally; OpenAI's reverse
// pattern (args without name) also lands here.
func newToolCallDelta(index int, id, name, args string) *pb.ToolCallDelta {
return &pb.ToolCallDelta{
Index: int32(index),
Id: id,
Name: name,
Arguments: args,
}
}
// Forward shovels bytes between a Forward gRPC stream and an upstream
// HTTP request. First request message carries path/method/headers and
// the initial body chunk; subsequent messages append body chunks. The
// first reply carries upstream status + response headers; subsequent
// replies stream body chunks until the upstream connection closes.
// Cancellation of ctx (the gRPC stream context) closes the upstream
// connection.
func (c *CloudProxy) Forward(ctx context.Context, in <-chan *pb.ForwardRequest, out chan<- *pb.ForwardReply) error {
defer close(out)
cfg := c.cfg.Load()
if cfg == nil {
return grpcerrors.ModelNotLoaded("cloud-proxy")
}
if cfg.mode != modePassthrough {
return fmt.Errorf("cloud-proxy: Forward only valid in passthrough mode (have %s)", cfg.mode)
}
first, ok := <-in
if !ok {
return errors.New("cloud-proxy: Forward stream closed before first request")
}
// Honour the per-request path only when the configured upstream_url
// has no path of its own — gallery convention is to put the
// canonical path in upstream_url.
fullURL, err := composeURL(cfg.upstreamURL, first.GetPath())
if err != nil {
return err
}
method := first.GetMethod()
if method == "" {
method = http.MethodPost
}
// Pipe the body in from the gRPC stream so the HTTP request can
// start before the client finishes sending. The pipe-reader is
// closed via CloseWithError on the error paths so the writer
// goroutine doesn't block forever.
pr, pw := io.Pipe()
go func() {
var writeErr error
defer func() { _ = pw.CloseWithError(writeErr) }()
if len(first.GetBodyChunk()) > 0 {
if _, writeErr = pw.Write(first.GetBodyChunk()); writeErr != nil {
return
}
}
for req := range in {
if len(req.GetBodyChunk()) == 0 {
continue
}
if _, writeErr = pw.Write(req.GetBodyChunk()); writeErr != nil {
return
}
}
}()
req, err := http.NewRequestWithContext(ctx, method, fullURL, pr)
if err != nil {
_ = pr.CloseWithError(err) // unblocks the body-pump's pw.Write
return fmt.Errorf("cloud-proxy: build request: %w", err)
}
// Apply caller-supplied headers, then override with the
// authorization header derived from the resolved key. Caller-
// supplied Authorization is always replaced — operators may not
// know the backend's auth scheme, and silently leaking through a
// client Authorization header to a different upstream would
// confuse the upstream and could leak credentials.
for _, h := range first.GetHeaders() {
if h == nil || h.GetName() == "" {
continue
}
// Strip hop-by-hop headers that aren't meaningful to the
// upstream (Host is set by the http client from the URL;
// Content-Length is computed from the body).
if isHopByHopHeader(h.GetName()) {
continue
}
req.Header.Add(h.GetName(), h.GetValue())
}
if cfg.apiKey != "" {
applyAuthHeader(req, cfg.provider, cfg.apiKey)
}
xlog.Info("cloud-proxy: forward", "method", method, "url", fullURL, "provider", cfg.provider)
resp, err := c.client.Do(req)
if err != nil {
xlog.Warn("cloud-proxy: forward upstream failed", "url", fullURL, "error", err)
return fmt.Errorf("cloud-proxy: upstream request failed: %w", err)
}
defer func() { _ = resp.Body.Close() }()
logFn := xlog.Info
if resp.StatusCode >= 400 {
logFn = xlog.Warn
}
logFn("cloud-proxy: forward response", "url", fullURL, "status", resp.StatusCode)
// First reply: status + response headers, no body.
headers := make([]*pb.ForwardHeader, 0, len(resp.Header))
for k, vs := range resp.Header {
for _, v := range vs {
headers = append(headers, &pb.ForwardHeader{Name: k, Value: v})
}
}
out <- &pb.ForwardReply{Status: int32(resp.StatusCode), Headers: headers}
// Subsequent replies: body chunks. Use a fixed 8KB buffer — small
// enough that SSE token frames flush promptly, large enough that
// long chunked-transfer bodies aren't death by a thousand reads.
buf := make([]byte, 8*1024)
for {
n, rerr := resp.Body.Read(buf)
if n > 0 {
chunk := make([]byte, n)
copy(chunk, buf[:n])
out <- &pb.ForwardReply{BodyChunk: chunk}
}
if rerr != nil {
if errors.Is(rerr, io.EOF) {
return nil
}
return fmt.Errorf("cloud-proxy: upstream body read: %w", rerr)
}
}
}
// composeURL combines the configured upstream URL with the per-request
// path. The upstream URL typically already includes the canonical path
// (e.g. https://api.openai.com/v1/chat/completions) so the per-request
// path is ignored in that case. When upstream_url is a bare host
// (https://api.openai.com), the request path is appended.
func composeURL(upstream, reqPath string) (string, error) {
u, err := url.Parse(upstream)
if err != nil {
return "", fmt.Errorf("cloud-proxy: parse upstream_url %q: %w", upstream, err)
}
if u.Path == "" || u.Path == "/" {
u.Path = reqPath
}
return u.String(), nil
}
// applyAuthHeader writes the appropriate authorization header for the
// provider. OpenAI/Anthropic/most providers use Bearer; Anthropic
// historically uses x-api-key + anthropic-version, but accepts Bearer
// too via the OpenAI-compatible path. Default to Bearer when provider
// is empty (passthrough mode where the operator doesn't claim a
// provider).
func applyAuthHeader(req *http.Request, provider, key string) {
switch provider {
case providerAnthropic:
req.Header.Set("x-api-key", key)
if req.Header.Get("anthropic-version") == "" {
req.Header.Set("anthropic-version", "2023-06-01")
}
default:
req.Header.Set("Authorization", "Bearer "+key)
}
}
// isHopByHopHeader returns true for headers that should not be
// forwarded from the client request to the upstream (RFC 7230 §6.1
// hop-by-hop list, plus a few that the http.Client sets itself).
func isHopByHopHeader(name string) bool {
switch strings.ToLower(name) {
case "connection", "proxy-connection", "keep-alive", "transfer-encoding",
"te", "trailer", "upgrade", "host", "content-length":
return true
}
return false
}

View File

@@ -0,0 +1,206 @@
package main
import (
"context"
"errors"
"io"
"net/http"
"net/http/httptest"
"strings"
"testing"
grpc "github.com/mudler/LocalAI/pkg/grpc"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/gomega"
)
// helper: run a CloudProxy in-process via grpc.Provide so tests can
// call Forward through the public Backend interface without listening
// on a real socket.
func newInProcClient(t *testing.T, proxy *CloudProxy) grpc.Backend {
t.Helper()
addr := "test://" + t.Name()
grpc.Provide(addr, proxy)
return grpc.NewClient(addr, true, nil, false)
}
func TestForward_PassthroughEcho(t *testing.T) {
g := NewWithT(t)
// Fake upstream: echoes the request body back, prefixed with a
// canary so the test can assert both that the body reached the
// upstream and the response made it back to the client.
gotBody := make(chan string, 1)
gotAuth := make(chan string, 1)
gotPath := make(chan string, 1)
upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
body, _ := io.ReadAll(r.Body)
gotBody <- string(body)
gotAuth <- r.Header.Get("Authorization")
gotPath <- r.URL.Path
w.Header().Set("X-Echo", "true")
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("echo: " + string(body)))
}))
defer upstream.Close()
t.Setenv("CLOUD_PROXY_FAKE_KEY", "sk-fake")
cp := NewCloudProxy()
err := cp.Load(&pb.ModelOptions{
Proxy: &pb.ProxyOptions{
UpstreamUrl: upstream.URL,
Mode: modePassthrough,
ApiKeyEnv: "CLOUD_PROXY_FAKE_KEY",
},
})
g.Expect(err).NotTo(HaveOccurred())
c := newInProcClient(t, cp)
stream, err := c.Forward(context.Background())
g.Expect(err).NotTo(HaveOccurred())
err = stream.Send(&pb.ForwardRequest{
Path: "/v1/chat/completions",
Method: "POST",
Headers: []*pb.ForwardHeader{{Name: "Content-Type", Value: "application/json"}},
BodyChunk: []byte(`{"prompt":`),
})
g.Expect(err).NotTo(HaveOccurred())
err = stream.Send(&pb.ForwardRequest{BodyChunk: []byte(`"hi"}`)})
g.Expect(err).NotTo(HaveOccurred())
err = stream.CloseSend()
g.Expect(err).NotTo(HaveOccurred())
// First reply: status + headers.
first, err := stream.Recv()
g.Expect(err).NotTo(HaveOccurred())
g.Expect(first.Status).To(Equal(int32(http.StatusOK)))
g.Expect(hasHeader(first.Headers, "X-Echo", "true")).To(BeTrue())
// Subsequent replies: body.
var body []byte
for {
r, err := stream.Recv()
if errors.Is(err, io.EOF) {
break
}
g.Expect(err).NotTo(HaveOccurred())
body = append(body, r.BodyChunk...)
}
g.Expect(string(body)).To(Equal(`echo: {"prompt":"hi"}`))
// Upstream observations.
var gotBodyVal, gotAuthVal, gotPathVal string
g.Eventually(gotBody).Should(Receive(&gotBodyVal), "upstream never saw body")
g.Expect(gotBodyVal).To(Equal(`{"prompt":"hi"}`))
g.Eventually(gotAuth).Should(Receive(&gotAuthVal), "upstream never saw auth header")
g.Expect(gotAuthVal).To(Equal("Bearer sk-fake"))
g.Eventually(gotPath).Should(Receive(&gotPathVal), "upstream never saw path")
g.Expect(gotPathVal).To(Equal("/v1/chat/completions"))
}
func TestForward_AnthropicAuthHeader(t *testing.T) {
g := NewWithT(t)
gotXAPIKey := make(chan string, 1)
gotVersion := make(chan string, 1)
upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
gotXAPIKey <- r.Header.Get("x-api-key")
gotVersion <- r.Header.Get("anthropic-version")
w.WriteHeader(http.StatusOK)
}))
defer upstream.Close()
t.Setenv("CLOUD_PROXY_ANTHROPIC_KEY", "sk-ant-fake")
cp := NewCloudProxy()
err := cp.Load(&pb.ModelOptions{
Proxy: &pb.ProxyOptions{
UpstreamUrl: upstream.URL,
Mode: modePassthrough,
Provider: providerAnthropic,
ApiKeyEnv: "CLOUD_PROXY_ANTHROPIC_KEY",
},
})
g.Expect(err).NotTo(HaveOccurred())
c := newInProcClient(t, cp)
stream, err := c.Forward(context.Background())
g.Expect(err).NotTo(HaveOccurred())
err = stream.Send(&pb.ForwardRequest{Path: "/v1/messages", Method: "POST"})
g.Expect(err).NotTo(HaveOccurred())
_ = stream.CloseSend()
_, _ = stream.Recv() // drain status
for {
if _, err := stream.Recv(); errors.Is(err, io.EOF) || err != nil {
break
}
}
g.Expect(<-gotXAPIKey).To(Equal("sk-ant-fake"))
g.Expect(<-gotVersion).NotTo(BeEmpty())
}
func TestLoad_ValidatesConfig(t *testing.T) {
g := NewWithT(t)
cp := NewCloudProxy()
err := cp.Load(&pb.ModelOptions{})
g.Expect(err).To(HaveOccurred())
g.Expect(err.Error()).To(ContainSubstring("ProxyOptions"))
err = cp.Load(&pb.ModelOptions{Proxy: &pb.ProxyOptions{}})
g.Expect(err).To(HaveOccurred())
g.Expect(err.Error()).To(ContainSubstring("upstream_url"))
err = cp.Load(&pb.ModelOptions{Proxy: &pb.ProxyOptions{
UpstreamUrl: "https://example.com",
Mode: "rewrite",
}})
g.Expect(err).To(HaveOccurred())
g.Expect(err.Error()).To(ContainSubstring("unknown mode"))
// translate + openai should load successfully (Phase 5).
err = cp.Load(&pb.ModelOptions{Proxy: &pb.ProxyOptions{
UpstreamUrl: "https://example.com/v1/chat/completions",
Mode: modeTranslate,
Provider: providerOpenAI,
}})
g.Expect(err).NotTo(HaveOccurred())
// translate + anthropic should load successfully (Phase 6).
err = cp.Load(&pb.ModelOptions{Proxy: &pb.ProxyOptions{
UpstreamUrl: "https://example.com/v1/messages",
Mode: modeTranslate,
Provider: providerAnthropic,
}})
g.Expect(err).NotTo(HaveOccurred())
err = cp.Load(&pb.ModelOptions{Proxy: &pb.ProxyOptions{
UpstreamUrl: "https://example.com",
ApiKeyEnv: "DEFINITELY_UNSET_ENV_VAR_XYZ",
}})
g.Expect(err).To(HaveOccurred())
g.Expect(err.Error()).To(ContainSubstring("unset"))
}
func TestForward_RejectsWithoutLoad(t *testing.T) {
g := NewWithT(t)
cp := NewCloudProxy()
c := newInProcClient(t, cp)
stream, err := c.Forward(context.Background())
g.Expect(err).NotTo(HaveOccurred())
_ = stream.CloseSend()
_, err = stream.Recv()
g.Expect(err).To(HaveOccurred())
g.Expect(err.Error()).To(ContainSubstring("not loaded"))
}
func hasHeader(hs []*pb.ForwardHeader, name, value string) bool {
for _, h := range hs {
if strings.EqualFold(h.GetName(), name) && h.GetValue() == value {
return true
}
}
return false
}

6
backend/go/cloud-proxy/run.sh Executable file
View File

@@ -0,0 +1,6 @@
#!/bin/bash
set -ex
CURDIR=$(dirname "$(realpath $0)")
exec $CURDIR/cloud-proxy "$@"

View File

@@ -0,0 +1,232 @@
package main
import (
"strings"
"testing"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/gomega"
)
// OpenAI: non-streaming tool call response. Verify the response is
// mapped to Reply.ChatDeltas[].ToolCalls with id/name/arguments intact,
// and usage tokens land on Reply.PromptTokens / Reply.Tokens.
func TestPredictRich_OpenAI_ToolCalls(t *testing.T) {
srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
return 200, `{
"id":"resp-1",
"choices":[{
"index":0,
"message":{
"role":"assistant",
"content":"",
"tool_calls":[
{"id":"call_abc","type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"SF\"}"}},
{"id":"call_def","type":"function","function":{"name":"get_time","arguments":"{\"tz\":\"PT\"}"}}
]
},
"finish_reason":"tool_calls"
}],
"usage":{"prompt_tokens":42,"completion_tokens":18,"total_tokens":60}
}`, "application/json"
})
defer srv.Close()
g := NewWithT(t)
cp := newTranslateCloudProxy(t, srv.URL)
reply, err := cp.PredictRich(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "what's the weather?"}},
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(string(reply.GetMessage())).To(Equal(""))
g.Expect(reply.GetPromptTokens()).To(Equal(int32(42)))
g.Expect(reply.GetTokens()).To(Equal(int32(18)))
g.Expect(reply.GetChatDeltas()).To(HaveLen(1))
tcs := reply.GetChatDeltas()[0].GetToolCalls()
g.Expect(tcs).To(HaveLen(2))
g.Expect(tcs[0].GetId()).To(Equal("call_abc"))
g.Expect(tcs[0].GetName()).To(Equal("get_weather"))
g.Expect(tcs[0].GetArguments()).To(ContainSubstring(`"location":"SF"`))
g.Expect(tcs[1].GetId()).To(Equal("call_def"))
g.Expect(tcs[1].GetName()).To(Equal("get_time"))
}
// OpenAI: streaming tool call. Arguments arrive as a sequence of
// delta chunks; the consumer is expected to concatenate by tool index.
// Verify each chunk reaches the channel and the assembled arguments
// match the input.
func TestPredictStreamRich_OpenAI_ToolCallDeltas(t *testing.T) {
chunks := []string{
// Frame 0: announce the tool call (id + name, no args yet).
`{"choices":[{"index":0,"delta":{"role":"assistant","tool_calls":[{"index":0,"id":"call_xyz","type":"function","function":{"name":"search"}}]}}]}`,
// Frames 1-3: arguments arrive in fragments.
`{"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"q\":"}}]}}]}`,
`{"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"clo"}}]}}]}`,
`{"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"uds\"}"}}]}}]}`,
// Stop frame.
`{"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}`,
}
body := ""
for _, c := range chunks {
body += "data: " + c + "\n\n"
}
body += "data: [DONE]\n\n"
srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
return 200, body, "text/event-stream"
})
defer srv.Close()
g := NewWithT(t)
cp := newTranslateCloudProxy(t, srv.URL)
results := make(chan *pb.Reply, 16)
done := make(chan error, 1)
go func() {
done <- cp.PredictStreamRich(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "find something"}},
}, results)
close(results)
}()
var (
toolName string
toolID string
toolIndex int32 = -1
argsBuf strings.Builder
)
for reply := range results {
for _, cd := range reply.GetChatDeltas() {
for _, tc := range cd.GetToolCalls() {
if tc.GetName() != "" {
toolName = tc.GetName()
}
if tc.GetId() != "" {
toolID = tc.GetId()
}
if toolIndex == -1 {
toolIndex = tc.GetIndex()
}
argsBuf.WriteString(tc.GetArguments())
}
}
}
err := <-done
g.Expect(err).NotTo(HaveOccurred())
g.Expect(toolID).To(Equal("call_xyz"))
g.Expect(toolName).To(Equal("search"))
g.Expect(toolIndex).To(Equal(int32(0)))
g.Expect(argsBuf.String()).To(Equal(`{"q":"clouds"}`))
}
// Anthropic: non-streaming tool_use block. The block appears in
// Content[] alongside text blocks; the input field is a structured
// JSON object. Map to ToolCallDelta with arguments as serialised JSON
// so downstream OpenAI-shaped consumers see a familiar format.
func TestPredictRich_Anthropic_ToolUse(t *testing.T) {
srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, `{
"id":"msg_1","type":"message","role":"assistant",
"content":[
{"type":"text","text":"Let me check that."},
{"type":"tool_use","id":"toolu_01","name":"weather","input":{"location":"SF"}}
],
"model":"claude","usage":{"input_tokens":12,"output_tokens":34}
}`, "application/json"
})
defer srv.Close()
g := NewWithT(t)
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
reply, err := cp.PredictRich(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "what's the weather?"}},
Tokens: 64,
})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(string(reply.GetMessage())).To(Equal("Let me check that."))
g.Expect(reply.GetPromptTokens()).To(Equal(int32(12)))
g.Expect(reply.GetTokens()).To(Equal(int32(34)))
g.Expect(reply.GetChatDeltas()).To(HaveLen(1))
g.Expect(reply.GetChatDeltas()[0].GetToolCalls()).To(HaveLen(1))
tc := reply.GetChatDeltas()[0].GetToolCalls()[0]
g.Expect(tc.GetId()).To(Equal("toolu_01"))
g.Expect(tc.GetName()).To(Equal("weather"))
g.Expect(tc.GetArguments()).To(ContainSubstring(`"location":"SF"`))
}
// Anthropic: streaming tool_use. content_block_start announces the
// tool's id + name; input_json_delta events carry argument fragments
// which the consumer accumulates. message_delta carries final usage.
func TestPredictStreamRich_Anthropic_InputJSONDelta(t *testing.T) {
frames := []string{
"event: message_start\ndata: {\"type\":\"message_start\"}\n\n",
// Block 0 is a tool_use; consumer should allocate a slot.
"event: content_block_start\ndata: {\"type\":\"content_block_start\",\"index\":0,\"content_block\":{\"type\":\"tool_use\",\"id\":\"toolu_42\",\"name\":\"lookup\"}}\n\n",
"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"input_json_delta\",\"partial_json\":\"{\\\"q\\\":\"}}\n\n",
"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"input_json_delta\",\"partial_json\":\"\\\"rain\\\"}\"}}\n\n",
"event: content_block_stop\ndata: {\"type\":\"content_block_stop\",\"index\":0}\n\n",
"event: message_delta\ndata: {\"type\":\"message_delta\",\"usage\":{\"output_tokens\":7}}\n\n",
"event: message_stop\ndata: {\"type\":\"message_stop\"}\n\n",
}
body := strings.Join(frames, "")
srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
return 200, body, "text/event-stream"
})
defer srv.Close()
g := NewWithT(t)
cp := newAnthropicTranslateCloudProxy(t, srv.URL)
results := make(chan *pb.Reply, 16)
done := make(chan error, 1)
go func() {
done <- cp.PredictStreamRich(&pb.PredictOptions{
Messages: []*pb.Message{{Role: "user", Content: "rain?"}},
Tokens: 64,
}, results)
close(results)
}()
var (
toolID, toolName string
argsBuf strings.Builder
finalTokens int32
)
for reply := range results {
if reply.GetTokens() > 0 && len(reply.GetChatDeltas()) == 0 {
finalTokens = reply.GetTokens()
continue
}
for _, cd := range reply.GetChatDeltas() {
for _, tc := range cd.GetToolCalls() {
if tc.GetId() != "" {
toolID = tc.GetId()
}
if tc.GetName() != "" {
toolName = tc.GetName()
}
argsBuf.WriteString(tc.GetArguments())
}
}
}
err := <-done
g.Expect(err).NotTo(HaveOccurred())
g.Expect(toolID).To(Equal("toolu_42"))
g.Expect(toolName).To(Equal("lookup"))
g.Expect(argsBuf.String()).To(Equal(`{"q":"rain"}`))
g.Expect(finalTokens).To(Equal(int32(7)))
}
// Sanity: the legacy Predict() (string, error) signature still works
// — it delegates to PredictRich and extracts Message.
func TestPredict_LegacyWrapper_OpenAI(t *testing.T) {
srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
return 200, `{"choices":[{"message":{"role":"assistant","content":"hello"}}]}`, "application/json"
})
defer srv.Close()
g := NewWithT(t)
cp := newTranslateCloudProxy(t, srv.URL)
got, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "hi"}}})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(got).To(Equal("hello"))
}

5
backend/go/crispasr/.gitignore vendored Normal file
View File

@@ -0,0 +1,5 @@
sources
build*
libgocrispasr*.so
crispasr
package

View File

@@ -0,0 +1,30 @@
cmake_minimum_required(VERSION 3.12)
project(gocrispasr LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
add_subdirectory(./sources/CrispASR)
add_library(gocrispasr MODULE cpp/crispasr_shim.cpp)
target_include_directories(gocrispasr PRIVATE
${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/include
${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/ggml/include)
# Link the same backend set as crispasr-cli (examples/cli/CMakeLists.txt) so
# the session API can dispatch to every compiled-in architecture, not just
# whisper. crispasr is the referencer; the backend static libs supply the
# per-architecture symbols; ggml is the math/runtime base.
target_link_libraries(gocrispasr PRIVATE
crispasr
parakeet canary canary_ctc cohere granite_speech granite_nle
voxtral voxtral4b qwen3_asr qwen3_tts orpheus chatterbox indextts
kokoro voxcpm2_tts m2m100 t5_translate wav2vec2-ggml vibevoice
silero-lid pyannote-seg funasr paraformer sensevoice
crisp_audio
ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
target_link_libraries(gocrispasr PRIVATE stdc++fs)
endif()
set_property(TARGET gocrispasr PROPERTY CXX_STANDARD 17)
set_target_properties(gocrispasr PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})

View File

@@ -0,0 +1,132 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# CrispASR version (release tag)
CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
CRISPASR_VERSION?=13d54e110e1538e0f0bc3af0680b9ab246cfb48d
SO_TARGET?=libgocrispasr.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
# Keep the build lean: no tests/examples/server/SDL2/curl/ffmpeg (the FROM scratch
# image cannot satisfy those runtime deps). All ASR/TTS model backends stay enabled.
CMAKE_ARGS+=-DCRISPASR_BUILD_TESTS=OFF -DCRISPASR_BUILD_EXAMPLES=OFF -DCRISPASR_BUILD_SERVER=OFF
CMAKE_ARGS+=-DCRISPASR_SDL2=OFF -DCRISPASR_CURL=OFF -DCRISPASR_FFMPEG=OFF
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DGGML_HIPBLAS=ON
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON
else ifeq ($(OS),Darwin)
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
endif
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
endif
sources/CrispASR:
mkdir -p sources/CrispASR
cd sources/CrispASR && \
git init && \
git remote add origin $(CRISPASR_REPO) && \
git fetch origin && \
git checkout $(CRISPASR_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# CrispASR's src/CMakeLists.txt locates its vendored llama.cpp
# (crispasr-llama-core, used by the chat C-ABI) via ${CMAKE_SOURCE_DIR},
# which assumes CrispASR is the top-level CMake project. We add_subdirectory
# it, so ${CMAKE_SOURCE_DIR} is THIS backend dir and the talk-llama sources
# aren't found. Rewrite to ${PROJECT_SOURCE_DIR} (the crispasr project root),
# which is correct both standalone and as a subproject. Idempotent.
sed -i 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt
# Detect OS
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S),Linux)
VARIANT_TARGETS = libgocrispasr-avx.so libgocrispasr-avx2.so libgocrispasr-avx512.so libgocrispasr-fallback.so
else
VARIANT_TARGETS = libgocrispasr-fallback.so
endif
crispasr: main.go gocrispasr.go $(VARIANT_TARGETS)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o crispasr ./
package: crispasr
bash package.sh
build: package
clean: purge
rm -rf libgocrispasr*.so package sources/CrispASR crispasr
purge:
rm -rf build*
ifeq ($(UNAME_S),Linux)
libgocrispasr-avx.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:avx${RESET})
SO_TARGET=libgocrispasr-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
rm -rfv build*
libgocrispasr-avx2.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:avx2${RESET})
SO_TARGET=libgocrispasr-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom
rm -rfv build*
libgocrispasr-avx512.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:avx512${RESET})
SO_TARGET=libgocrispasr-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom
rm -rfv build*
endif
libgocrispasr-fallback.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:fallback${RESET})
SO_TARGET=libgocrispasr-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
rm -rfv build*
libgocrispasr-custom: CMakeLists.txt cpp/crispasr_shim.cpp cpp/crispasr_shim.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \
cmake --build . --config Release -j$(JOBS) && \
cd .. && \
mv build-$(SO_TARGET)/libgocrispasr.so ./$(SO_TARGET)
test: crispasr
CGO_ENABLED=0 $(GOCMD) test -v ./...
all: crispasr package

View File

@@ -0,0 +1,253 @@
#include "crispasr_shim.h"
#include "ggml-backend.h"
#include "crispasr.h"
#include <atomic>
#include <vector>
// Opaque session types. crispasr.h declares `struct crispasr_session;` but not
// the result type nor the open/transcribe/result accessors — those are
// CA_EXPORT extern "C" symbols in src/crispasr_c_api.cpp, so we forward-declare
// exactly the ones we use. Signatures verified against
// sources/CrispASR/src/crispasr_c_api.cpp.
struct crispasr_session_result;
extern "C" {
crispasr_session *crispasr_session_open(const char *model_path, int n_threads);
crispasr_session *crispasr_session_open_explicit(const char *model_path,
const char *backend_name,
int n_threads);
int crispasr_session_set_codec_path(crispasr_session *s, const char *path);
void crispasr_session_close(crispasr_session *s);
const char *crispasr_session_backend(crispasr_session *s);
int crispasr_session_set_translate(crispasr_session *s, int enable);
crispasr_session_result *crispasr_session_transcribe_lang(
crispasr_session *s, const float *pcm, int n_samples, const char *language);
int crispasr_session_result_n_segments(crispasr_session_result *r);
const char *crispasr_session_result_segment_text(crispasr_session_result *r,
int i);
int64_t crispasr_session_result_segment_t0(crispasr_session_result *r, int i);
int64_t crispasr_session_result_segment_t1(crispasr_session_result *r, int i);
void crispasr_session_result_free(crispasr_session_result *r);
float *crispasr_session_synthesize(crispasr_session *s, const char *text,
int *out_n_samples);
void crispasr_pcm_free(float *pcm);
int crispasr_session_set_speaker_name(crispasr_session *s, const char *name);
int crispasr_session_set_voice(crispasr_session *s, const char *path,
const char *ref_text_or_null);
}
static crispasr_session *g_session = nullptr;
static crispasr_session_result *g_result = nullptr;
static struct whisper_vad_context *vctx;
static std::vector<float> flat_segs;
static std::atomic<int> g_abort{0};
extern "C" void set_abort(int v) {
g_abort.store(v, std::memory_order_relaxed);
}
static void ggml_log_cb(enum ggml_log_level level, const char *log,
void *data) {
const char *level_str;
if (!log) {
return;
}
switch (level) {
case GGML_LOG_LEVEL_DEBUG:
level_str = "DEBUG";
break;
case GGML_LOG_LEVEL_INFO:
level_str = "INFO";
break;
case GGML_LOG_LEVEL_WARN:
level_str = "WARN";
break;
case GGML_LOG_LEVEL_ERROR:
level_str = "ERROR";
break;
default: /* Potential future-proofing */
level_str = "?????";
break;
}
fprintf(stderr, "[%-5s] ", level_str);
fputs(log, stderr);
fflush(stderr);
}
int load_model(const char *const model_path, int threads,
const char *backend_name) {
whisper_log_set(ggml_log_cb, nullptr);
ggml_backend_load_all();
if (backend_name && *backend_name) {
g_session =
crispasr_session_open_explicit(model_path, backend_name, threads);
} else {
g_session = crispasr_session_open(model_path, threads);
}
if (g_session == nullptr) {
fprintf(stderr, "error: failed to open CrispASR session for model\n");
return 1;
}
fprintf(stderr, "info: CrispASR backend selected: %s\n",
crispasr_session_backend(g_session));
return 0;
}
// set_codec_path forwards a companion file (qwen3-tts codec, orpheus SNAC,
// chatterbox s3gen, or mimo-asr tokenizer) to the active session. Returns 0 on
// success or when the active backend needs no companion, negative on failure,
// and -1 when no session is open.
int set_codec_path(const char *path) {
return g_session ? crispasr_session_set_codec_path(g_session, path) : -1;
}
int load_model_vad(const char *const model_path) {
whisper_log_set(ggml_log_cb, nullptr);
ggml_backend_load_all();
struct whisper_vad_context_params vcparams =
whisper_vad_default_context_params();
// XXX: Overridden to false in upstream due to performance?
// vcparams.use_gpu = true;
vctx = whisper_vad_init_from_file_with_params(model_path, vcparams);
if (vctx == nullptr) {
fprintf(stderr, "error: Failed to init model as VAD\n");
return 1;
}
return 0;
}
int vad(float pcmf32[], size_t pcmf32_len, float **segs_out,
size_t *segs_out_len) {
if (!whisper_vad_detect_speech(vctx, pcmf32, pcmf32_len)) {
fprintf(stderr, "error: failed to detect speech\n");
return 1;
}
struct whisper_vad_params params = whisper_vad_default_params();
struct whisper_vad_segments *segs =
whisper_vad_segments_from_probs(vctx, params);
size_t segn = whisper_vad_segments_n_segments(segs);
// fprintf(stderr, "Got segments %zd\n", segn);
flat_segs.clear();
for (int i = 0; i < segn; i++) {
flat_segs.push_back(whisper_vad_segments_get_segment_t0(segs, i));
flat_segs.push_back(whisper_vad_segments_get_segment_t1(segs, i));
}
// fprintf(stderr, "setting out variables: %p=%p -> %p, %p=%zx -> %zx\n",
// segs_out, *segs_out, flat_segs.data(), segs_out_len, *segs_out_len,
// flat_segs.size());
*segs_out = flat_segs.data();
*segs_out_len = flat_segs.size();
// fprintf(stderr, "freeing segs\n");
whisper_vad_free_segments(segs);
// fprintf(stderr, "returning\n");
return 0;
}
// threads, diarize and prompt are accepted for Go-side API parity but unused
// in Phase 1: the thread count is fixed at session open, and diarization and
// the initial prompt are separate CrispASR features not yet wired through the
// session ASR path.
int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
char *prompt) {
(void)threads;
(void)diarize;
(void)prompt;
if (!g_session) {
return 1;
}
// Reset stale abort flag from any prior cancelled call. set_abort remains
// best-effort: the session transcribe call is blocking and exposes no abort
// hook, so a mid-decode abort cannot interrupt it.
g_abort.store(0, std::memory_order_relaxed);
crispasr_session_set_translate(g_session, translate ? 1 : 0);
if (g_result) {
crispasr_session_result_free(g_result);
g_result = nullptr;
}
const char *language = (lang && *lang) ? lang : nullptr;
g_result = crispasr_session_transcribe_lang(g_session, pcmf32, (int)pcmf32_len,
language);
if (!g_result) {
fprintf(stderr, "error: transcription failed\n");
return 1;
}
*segs_out_len = crispasr_session_result_n_segments(g_result);
return 0;
}
const char *get_segment_text(int i) {
if (!g_result) {
return "";
}
return crispasr_session_result_segment_text(g_result, i);
}
int64_t get_segment_t0(int i) {
if (!g_result) {
return 0;
}
return crispasr_session_result_segment_t0(g_result, i);
}
int64_t get_segment_t1(int i) {
if (!g_result) {
return 0;
}
return crispasr_session_result_segment_t1(g_result, i);
}
const char *get_backend(void) {
return g_session ? crispasr_session_backend(g_session) : "";
}
// TTS uses the already-open session (crispasr_session_open auto-detects a TTS
// model). Output is 24 kHz mono float PCM (upstream CrispASR convention),
// malloc'd by the C API; the caller must release it via tts_free.
float *tts_synthesize(const char *text, int *out_n_samples) {
if (out_n_samples) *out_n_samples = 0;
if (!g_session || !text) return nullptr;
return crispasr_session_synthesize(g_session, text, out_n_samples);
}
void tts_free(float *pcm) {
if (pcm) crispasr_pcm_free(pcm);
}
int tts_set_voice(const char *name) {
if (!g_session || !name || !*name) return 0;
return crispasr_session_set_speaker_name(g_session, name);
}
// tts_set_voice_file loads a voice from a file: a .gguf path selects a voice
// pack, a .wav path with a non-empty ref_text performs zero-shot voice cloning
// (the C API returns -2 when ref_text is required but missing). Returns -1 when
// no session is open or path is null.
int tts_set_voice_file(const char *path, const char *ref_text) {
if (!g_session || !path) return -1;
const char *ref = (ref_text && *ref_text) ? ref_text : nullptr;
return crispasr_session_set_voice(g_session, path, ref);
}

View File

@@ -0,0 +1,23 @@
#include <cstddef>
#include <cstdint>
extern "C" {
int load_model(const char *const model_path, int threads,
const char *backend_name);
int set_codec_path(const char *path);
int load_model_vad(const char *const model_path);
int vad(float pcmf32[], size_t pcmf32_size, float **segs_out,
size_t *segs_out_len);
int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
char *prompt);
const char *get_segment_text(int i);
int64_t get_segment_t0(int i);
int64_t get_segment_t1(int i);
const char *get_backend(void);
void set_abort(int v);
float *tts_synthesize(const char *text, int *out_n_samples); // 24kHz mono float, malloc'd; NULL on failure
void tts_free(float *pcm);
int tts_set_voice(const char *name); // best-effort speaker selection; 0 ok
int tts_set_voice_file(const char *path, const char *ref_text); // load voice pack (.gguf) or zero-shot clone (.wav + ref_text)
}

View File

@@ -0,0 +1,497 @@
package main
import (
"context"
"fmt"
"os"
"path/filepath"
"strings"
"sync"
"unsafe"
"github.com/go-audio/audio"
"github.com/go-audio/wav"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/utils"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
)
var (
CppLoadModel func(modelPath string, threads int, backendName string) int
CppSetCodecPath func(path string) int
CppLoadModelVAD func(modelPath string) int
CppVAD func(pcmf32 []float32, pcmf32Size uintptr, segsOut unsafe.Pointer, segsOutLen unsafe.Pointer) int
CppTranscribe func(threads uint32, lang string, translate bool, diarize bool, pcmf32 []float32, pcmf32Len uintptr, segsOutLen unsafe.Pointer, prompt string) int
CppGetSegmentText func(i int) string
CppGetSegmentStart func(i int) int64
CppGetSegmentEnd func(i int) int64
CppGetBackend func() string
CppSetAbort func(v int)
CppTTSSynthesize func(text string, outNSamples unsafe.Pointer) uintptr
CppTTSFree func(ptr uintptr)
CppTTSSetVoice func(name string) int
CppTTSSetVoiceFile func(path string, refText string) int
)
type CrispASR struct {
base.SingleThread
}
// splitOption splits a "prefix:value" model option into its key and value,
// matching the convention used by other backends (see sherpa-onnx). It returns
// ok=false when the option carries no ':' separator.
func splitOption(oo string) (key, value string, ok bool) {
parts := strings.SplitN(oo, ":", 2)
if len(parts) != 2 {
return "", "", false
}
return parts[0], parts[1], true
}
func (w *CrispASR) Load(opts *pb.ModelOptions) error {
vadOnly := false
backendName := ""
codecPath := ""
speakerName := ""
voicePath := ""
voiceRefText := ""
for _, oo := range opts.Options {
if oo == "vad_only" {
vadOnly = true
continue
}
switch key, value, ok := splitOption(oo); {
case ok && key == "backend":
backendName = value
case ok && key == "codec":
codecPath = value
case ok && key == "speaker":
speakerName = value
case ok && key == "voice":
voicePath = value
case ok && key == "voice_text":
voiceRefText = value
default:
fmt.Fprintf(os.Stderr, "Unrecognized option: %v\n", oo)
}
}
if vadOnly {
if ret := CppLoadModelVAD(opts.ModelFile); ret != 0 {
return fmt.Errorf("Failed to load CrispASR VAD model")
}
return nil
}
// Resolve a relative companion path against the model directory so a config
// can reference a sibling codec/tokenizer file by name alone.
if codecPath != "" && !filepath.IsAbs(codecPath) {
codecPath = filepath.Join(filepath.Dir(opts.ModelFile), codecPath)
}
// A voice file (.gguf pack or .wav prompt) is resolved against the model
// directory just like the codec, so a config can reference a sibling file.
if voicePath != "" && !filepath.IsAbs(voicePath) {
voicePath = filepath.Join(filepath.Dir(opts.ModelFile), voicePath)
}
if ret := CppLoadModel(opts.ModelFile, int(opts.Threads), backendName); ret != 0 {
return fmt.Errorf("Failed to load CrispASR transcription model")
}
// Load the companion file (codec/tokenizer/s3gen) after the session is open.
// rc==0 means success or "not applicable" for the active backend; only a
// negative code is fatal.
if codecPath != "" {
if rc := CppSetCodecPath(codecPath); rc < 0 {
return fmt.Errorf("crispasr: failed to load companion file %q (rc=%d)", codecPath, rc)
}
fmt.Fprintf(os.Stderr, "CrispASR companion file loaded: %s\n", codecPath)
}
// Apply the Load-time default voice. A baked speaker (speaker:) is selected
// by name and is best-effort: a backend that can't honor it is logged, not
// fatal. A voice file (voice:) is a hard requirement once configured, so a
// negative rc fails Load.
if speakerName != "" {
if rc := CppTTSSetVoice(speakerName); rc != 0 {
fmt.Fprintf(os.Stderr, "crispasr: speaker %q not applied (rc=%d)\n", speakerName, rc)
}
}
if voicePath != "" {
if rc := CppTTSSetVoiceFile(voicePath, voiceRefText); rc < 0 {
return fmt.Errorf("crispasr: failed to load voice %q (rc=%d)", voicePath, rc)
}
fmt.Fprintf(os.Stderr, "CrispASR voice loaded: %s\n", voicePath)
}
fmt.Fprintf(os.Stderr, "CrispASR backend selected: %s\n", CppGetBackend())
return nil
}
func (w *CrispASR) VAD(req *pb.VADRequest) (pb.VADResponse, error) {
audio := req.Audio
// We expect 0xdeadbeef to be overwritten and if we see it in a stack trace we know it wasn't
segsPtr, segsLen := uintptr(0xdeadbeef), uintptr(0xdeadbeef)
segsPtrPtr, segsLenPtr := unsafe.Pointer(&segsPtr), unsafe.Pointer(&segsLen)
if ret := CppVAD(audio, uintptr(len(audio)), segsPtrPtr, segsLenPtr); ret != 0 {
return pb.VADResponse{}, fmt.Errorf("Failed VAD")
}
// Happens when CPP vector has not had any elements pushed to it
if segsPtr == 0 {
return pb.VADResponse{
Segments: []*pb.VADSegment{},
}, nil
}
// unsafeptr warning is caused by segsPtr being on the stack and therefor being subject to stack copying AFAICT
// however the stack shouldn't have grown between setting segsPtr and now, also the memory pointed to is allocated by C++
segs := unsafe.Slice((*float32)(unsafe.Pointer(segsPtr)), segsLen) //nolint:govet // segsPtr addresses C++-owned heap memory passed back through the cgo-free purego boundary; the uintptr->Pointer round-trip is intentional and the buffer outlives this read.
vadSegments := []*pb.VADSegment{}
for i := range len(segs) >> 1 {
s := segs[2*i] / 100
t := segs[2*i+1] / 100
vadSegments = append(vadSegments, &pb.VADSegment{
Start: s,
End: t,
})
}
return pb.VADResponse{
Segments: vadSegments,
}, nil
}
func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
if err := ctx.Err(); err != nil {
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
dir, err := os.MkdirTemp("", "crispasr")
if err != nil {
return pb.TranscriptResult{}, err
}
defer func() { _ = os.RemoveAll(dir) }()
convertedPath := filepath.Join(dir, "converted.wav")
if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil {
return pb.TranscriptResult{}, err
}
fh, err := os.Open(convertedPath)
if err != nil {
return pb.TranscriptResult{}, err
}
defer func() { _ = fh.Close() }()
d := wav.NewDecoder(fh)
buf, err := d.FullPCMBuffer()
if err != nil {
return pb.TranscriptResult{}, err
}
data := buf.AsFloat32Buffer().Data
var duration float32
if buf.Format != nil && buf.Format.SampleRate > 0 {
duration = float32(len(data)) / float32(buf.Format.SampleRate)
}
segsLen := uintptr(0xdeadbeef)
segsLenPtr := unsafe.Pointer(&segsLen)
// Watcher: flips the C-side abort flag when ctx is cancelled. The
// goroutine is joined synchronously (close(done) signals it to exit,
// wg.Wait() blocks until it has) so a late CppSetAbort(1) cannot fire
// after the function returns and corrupt the next transcription call.
done := make(chan struct{})
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
select {
case <-ctx.Done():
CppSetAbort(1)
case <-done:
}
}()
defer func() {
close(done)
wg.Wait()
}()
ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt)
if ret == 2 {
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
if ret != 0 {
return pb.TranscriptResult{}, fmt.Errorf("Failed Transcribe")
}
segments := []*pb.TranscriptSegment{}
text := ""
for i := range int(segsLen) {
// segment start/end conversion factor taken from https://github.com/ggml-org/whisper.cpp/blob/master/examples/cli/cli.cpp#L895
s := CppGetSegmentStart(i) * (10000000)
t := CppGetSegmentEnd(i) * (10000000)
// The session result can emit bytes that aren't valid UTF-8 (e.g. a
// multibyte codepoint split across token boundaries); protobuf string
// fields reject those at marshal time. Scrub before the value escapes
// cgo. The session result is segment+word based and exposes no token
// IDs, so Tokens is left empty.
txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "<22>")
segment := &pb.TranscriptSegment{
Id: int32(i),
Text: txt,
Start: s, End: t,
}
segments = append(segments, segment)
text += " " + strings.TrimSpace(txt)
}
return pb.TranscriptResult{
Segments: segments,
Text: strings.TrimSpace(text),
Language: opts.Language,
Duration: duration,
}, nil
}
// AudioTranscriptionStream runs the session transcribe to completion and then
// emits one delta per non-empty segment, followed by a final TranscriptResult.
// Progressive/real-time streaming isn't available via the session API (there
// is no per-decode callback), so deltas are emitted per-segment after the
// blocking decode returns rather than as segments are produced. The offline
// AudioTranscription is unchanged; both paths share the session and the
// SingleThread concurrency model.
func (w *CrispASR) AudioTranscriptionStream(ctx context.Context, opts *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
defer close(results)
if err := ctx.Err(); err != nil {
return status.Error(codes.Canceled, "transcription cancelled")
}
dir, err := os.MkdirTemp("", "crispasr")
if err != nil {
return err
}
defer func() { _ = os.RemoveAll(dir) }()
convertedPath := filepath.Join(dir, "converted.wav")
if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil {
return err
}
fh, err := os.Open(convertedPath)
if err != nil {
return err
}
defer func() { _ = fh.Close() }()
d := wav.NewDecoder(fh)
buf, err := d.FullPCMBuffer()
if err != nil {
return err
}
data := buf.AsFloat32Buffer().Data
var duration float32
if buf.Format != nil && buf.Format.SampleRate > 0 {
duration = float32(len(data)) / float32(buf.Format.SampleRate)
}
// Same abort-watcher pattern as AudioTranscription. Joined synchronously
// so a late CppSetAbort(1) cannot fire after this function returns.
// Best-effort only: the session transcribe is blocking with no abort hook.
done := make(chan struct{})
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
select {
case <-ctx.Done():
CppSetAbort(1)
case <-done:
}
}()
defer func() {
close(done)
wg.Wait()
}()
segsLen := uintptr(0xdeadbeef)
segsLenPtr := unsafe.Pointer(&segsLen)
ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt)
if ret == 2 {
return status.Error(codes.Canceled, "transcription cancelled")
}
if ret != 0 {
return fmt.Errorf("Failed Transcribe")
}
// Walk the segments once: emit a delta per non-empty segment and build the
// final TranscriptResult.Segments alongside. The first delta has no leading
// space and subsequent ones are prefixed with a single space, so
// concat(deltas) == final.Text exactly, matching the e2e contract.
segments := []*pb.TranscriptSegment{}
var assembled strings.Builder
for i := range int(segsLen) {
s := CppGetSegmentStart(i) * 10000000
t := CppGetSegmentEnd(i) * 10000000
txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "<22>")
segments = append(segments, &pb.TranscriptSegment{
Id: int32(i),
Text: txt,
Start: s, End: t,
})
trimmed := strings.TrimSpace(txt)
if trimmed == "" {
continue
}
var delta string
if assembled.Len() == 0 {
delta = trimmed
} else {
delta = " " + trimmed
}
results <- &pb.TranscriptStreamResponse{Delta: delta}
assembled.WriteString(delta)
}
final := &pb.TranscriptResult{
Segments: segments,
Text: assembled.String(),
Language: opts.Language,
Duration: duration,
}
results <- &pb.TranscriptStreamResponse{FinalResult: final}
return nil
}
// synthesize returns 24 kHz mono float32 PCM for text via the open session.
func (w *CrispASR) synthesize(text string) ([]float32, error) {
if text == "" {
return nil, fmt.Errorf("crispasr: TTS requires non-empty text")
}
var n int32
ptr := CppTTSSynthesize(text, unsafe.Pointer(&n))
if ptr == 0 || n <= 0 {
return nil, fmt.Errorf("crispasr: synthesis failed (the loaded model may not be a supported TTS backend, or needs extra config e.g. orpheus SNAC codec)")
}
defer CppTTSFree(ptr)
src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
out := make([]float32, int(n)) // copy out of C memory before free
copy(out, src)
return out, nil
}
// setVoice applies a per-call speaker/voice override (best effort). CrispASR
// returns a negative code when the active backend can't honor the name; we log
// it rather than fail, so an unknown voice falls back to the default speaker.
func setVoice(voice string) {
v := strings.TrimSpace(voice)
if v == "" {
return
}
if rc := CppTTSSetVoice(v); rc != 0 {
fmt.Fprintf(os.Stderr, "crispasr: voice %q not applied by the active TTS backend (rc=%d); using default\n", v, rc)
}
}
func (w *CrispASR) TTS(req *pb.TTSRequest) error {
if req.Dst == "" {
return fmt.Errorf("crispasr: TTS requires a destination path")
}
setVoice(req.Voice)
pcm, err := w.synthesize(req.Text)
if err != nil {
return err
}
return writeWAV24k(req.Dst, pcm)
}
// TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
// (native streaming) synth, so we synthesize the whole utterance, encode it to
// a 24 kHz WAV, and emit the encoded bytes as a single chunk. The gRPC server
// wrapper (pkg/grpc/server.go:TTSStream) ranges over the channel until it is
// closed, so this method owns the close - mirrors vibevoice-cpp's TTSStream.
func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
defer close(results)
if req.Text == "" {
return fmt.Errorf("crispasr: TTSStream requires text")
}
setVoice(req.Voice)
pcm, err := w.synthesize(req.Text)
if err != nil {
return err
}
tmp, err := os.CreateTemp("", "crispasr-tts-stream-*.wav")
if err != nil {
return fmt.Errorf("crispasr: tempfile: %w", err)
}
dst := tmp.Name()
if err := tmp.Close(); err != nil {
return fmt.Errorf("crispasr: close tempfile: %w", err)
}
defer func() { _ = os.Remove(dst) }()
if err := writeWAV24k(dst, pcm); err != nil {
return err
}
encoded, err := os.ReadFile(dst)
if err != nil {
return fmt.Errorf("crispasr: read tempfile: %w", err)
}
results <- encoded
return nil
}
// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst.
func writeWAV24k(dst string, pcm []float32) error {
f, err := os.Create(dst)
if err != nil {
return fmt.Errorf("crispasr: create %q: %w", dst, err)
}
enc := wav.NewEncoder(f, 24000, 16, 1, 1)
ints := make([]int, len(pcm))
for i, s := range pcm {
if s > 1 {
s = 1
} else if s < -1 {
s = -1
}
ints[i] = int(s * 32767)
}
buf := &audio.IntBuffer{
Format: &audio.Format{NumChannels: 1, SampleRate: 24000},
Data: ints,
SourceBitDepth: 16,
}
if err := enc.Write(buf); err != nil {
_ = enc.Close()
_ = f.Close()
return fmt.Errorf("crispasr: encode WAV: %w", err)
}
if err := enc.Close(); err != nil {
_ = f.Close()
return fmt.Errorf("crispasr: finalize WAV: %w", err)
}
if err := f.Close(); err != nil {
return fmt.Errorf("crispasr: close %q: %w", dst, err)
}
return nil
}

View File

@@ -0,0 +1,193 @@
package main
import (
"context"
"os"
"path/filepath"
"strings"
"sync"
"testing"
"github.com/ebitengine/purego"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
)
func TestCrispASR(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "CrispASR Backend Suite")
}
var (
libLoadOnce sync.Once
libLoadErr error
)
// ensureLibLoaded mirrors main.go's bootstrap so a Go test can drive the
// bridge without spinning up the gRPC server. Skips the current spec when the
// shared library isn't present (e.g. running before `make backends/whisper`).
func ensureLibLoaded() {
libLoadOnce.Do(func() {
libName := os.Getenv("CRISPASR_LIBRARY")
if libName == "" {
libName = "./libgocrispasr-fallback.so"
}
if _, err := os.Stat(libName); err != nil {
libLoadErr = err
return
}
gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
libLoadErr = err
return
}
purego.RegisterLibFunc(&CppLoadModel, gosd, "load_model")
purego.RegisterLibFunc(&CppSetCodecPath, gosd, "set_codec_path")
purego.RegisterLibFunc(&CppTranscribe, gosd, "transcribe")
purego.RegisterLibFunc(&CppGetSegmentText, gosd, "get_segment_text")
purego.RegisterLibFunc(&CppGetSegmentStart, gosd, "get_segment_t0")
purego.RegisterLibFunc(&CppGetSegmentEnd, gosd, "get_segment_t1")
purego.RegisterLibFunc(&CppGetBackend, gosd, "get_backend")
purego.RegisterLibFunc(&CppSetAbort, gosd, "set_abort")
purego.RegisterLibFunc(&CppTTSSynthesize, gosd, "tts_synthesize")
purego.RegisterLibFunc(&CppTTSFree, gosd, "tts_free")
purego.RegisterLibFunc(&CppTTSSetVoice, gosd, "tts_set_voice")
purego.RegisterLibFunc(&CppTTSSetVoiceFile, gosd, "tts_set_voice_file")
})
if libLoadErr != nil {
Skip("whisper library not loadable: " + libLoadErr.Error())
}
}
// fixturesOrSkip returns the model + audio paths or skips the spec if either
// env var is unset. The test never runs in default CI — it requires a real
// whisper model and a long audio file (~3 minutes) on disk.
func fixturesOrSkip() (string, string) {
modelPath := os.Getenv("CRISPASR_MODEL_PATH")
audioPath := os.Getenv("CRISPASR_AUDIO_PATH")
if modelPath == "" || audioPath == "" {
Skip("set CRISPASR_MODEL_PATH and CRISPASR_AUDIO_PATH to run this spec")
}
return modelPath, audioPath
}
// ttsModelOrSkip returns the TTS model path or skips the spec when the env var
// is unset. Like the transcription fixtures, this never runs in default CI — it
// needs a real TTS model (e.g. a vibevoice GGUF) on disk.
func ttsModelOrSkip() string {
modelPath := os.Getenv("CRISPASR_TTS_MODEL_PATH")
if modelPath == "" {
Skip("set CRISPASR_TTS_MODEL_PATH to run this spec")
}
return modelPath
}
var _ = Describe("CrispASR", func() {
Context("AudioTranscription cancellation", func() {
It("returns codes.Canceled on a pre-cancelled context and still succeeds afterwards", func() {
modelPath, audioPath := fixturesOrSkip()
ensureLibLoaded()
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
// The session transcribe is blocking and exposes no abort hook, so
// a mid-decode cancel can't interrupt it. The contract we can rely
// on is the pre-call ctx.Err() check: a context cancelled before
// the call must yield codes.Canceled without starting a decode.
ctx, cancel := context.WithCancel(context.Background())
cancel()
_, err := w.AudioTranscription(ctx, &pb.TranscriptRequest{
Dst: audioPath,
Threads: 4,
Language: "en",
})
Expect(err).To(HaveOccurred(), "expected pre-cancelled context to fail")
st, ok := status.FromError(err)
Expect(ok).To(BeTrue(), "expected gRPC status error, got %v", err)
Expect(st.Code()).To(Equal(codes.Canceled), "expected codes.Canceled, got %v", err)
// Subsequent transcription must succeed — proves g_abort reset.
res, err := w.AudioTranscription(context.Background(), &pb.TranscriptRequest{
Dst: audioPath,
Threads: 4,
Language: "en",
})
Expect(err).ToNot(HaveOccurred(), "post-cancel transcription failed")
Expect(res.Text).ToNot(BeEmpty(), "post-cancel transcription returned empty text")
})
})
Context("AudioTranscriptionStream", func() {
It("emits multiple deltas progressively for a multi-segment clip", func() {
modelPath, audioPath := fixturesOrSkip()
ensureLibLoaded()
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
results := make(chan *pb.TranscriptStreamResponse, 64)
done := make(chan error, 1)
go func() {
done <- w.AudioTranscriptionStream(context.Background(), &pb.TranscriptRequest{
Dst: audioPath,
Threads: 4,
Language: "en",
Stream: true,
}, results)
}()
var deltas []string
var assembled strings.Builder
var finalText string
var finalSegmentCount int
for chunk := range results {
if d := chunk.GetDelta(); d != "" {
deltas = append(deltas, d)
assembled.WriteString(d)
}
if final := chunk.GetFinalResult(); final != nil {
finalText = final.GetText()
finalSegmentCount = len(final.GetSegments())
}
}
Expect(<-done).ToNot(HaveOccurred())
// One delta per non-empty segment is emitted after the blocking
// decode returns (the session API has no per-decode callback), so a
// multi-segment clip MUST produce >=2 delta events, and
// concat(deltas) MUST equal final.Text exactly.
Expect(len(deltas)).To(BeNumerically(">=", 2),
"expected multiple deltas from a multi-segment clip, got %d (assembled=%q)",
len(deltas), assembled.String())
Expect(finalSegmentCount).To(BeNumerically(">=", 2),
"expected final to carry multiple segments")
Expect(assembled.String()).To(Equal(finalText),
"concat(deltas) must equal final.Text")
})
})
Context("TTS", func() {
It("synthesizes a non-empty WAV", func() {
ttsModel := ttsModelOrSkip()
ensureLibLoaded()
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{ModelFile: ttsModel})).To(Succeed())
dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR.", Dst: dst})).To(Succeed())
info, err := os.Stat(dst)
Expect(err).ToNot(HaveOccurred(), "synthesized WAV should exist at %q", dst)
// A real 24 kHz mono WAV is a 44-byte header plus samples; anything
// this small would mean an empty/failed synth.
Expect(info.Size()).To(BeNumerically(">", 1024),
"expected a non-trivial WAV, got %d bytes", info.Size())
})
})
})

View File

@@ -0,0 +1,58 @@
package main
// Note: this is started internally by LocalAI and a server is allocated for each model
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
libName := os.Getenv("CRISPASR_LIBRARY")
if libName == "" {
libName = "./libgocrispasr-fallback.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CppLoadModel, "load_model"},
{&CppSetCodecPath, "set_codec_path"},
{&CppLoadModelVAD, "load_model_vad"},
{&CppVAD, "vad"},
{&CppTranscribe, "transcribe"},
{&CppGetSegmentText, "get_segment_text"},
{&CppGetSegmentStart, "get_segment_t0"},
{&CppGetSegmentEnd, "get_segment_t1"},
{&CppGetBackend, "get_backend"},
{&CppSetAbort, "set_abort"},
{&CppTTSSynthesize, "tts_synthesize"},
{&CppTTSFree, "tts_free"},
{&CppTTSSetVoice, "tts_set_voice"},
{&CppTTSSetVoiceFile, "tts_set_voice_file"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &CrispASR{}); err != nil {
panic(err)
}
}

65
backend/go/crispasr/package.sh Executable file
View File

@@ -0,0 +1,65 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
# This script is used in the final stage of the Dockerfile
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/crispasr $CURDIR/package/
cp -fv $CURDIR/libgocrispasr-*.so $CURDIR/package/
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
# The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

52
backend/go/crispasr/run.sh Executable file
View File

@@ -0,0 +1,52 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
if [ "$(uname)" != "Darwin" ]; then
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
fi
LIBRARY="$CURDIR/libgocrispasr-fallback.so"
if [ "$(uname)" != "Darwin" ]; then
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/libgocrispasr-avx.so ]; then
LIBRARY="$CURDIR/libgocrispasr-avx.so"
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/libgocrispasr-avx2.so ]; then
LIBRARY="$CURDIR/libgocrispasr-avx2.so"
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/libgocrispasr-avx512.so ]; then
LIBRARY="$CURDIR/libgocrispasr-avx512.so"
fi
fi
fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export CRISPASR_LIBRARY=$LIBRARY
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using library: $LIBRARY"
exec $CURDIR/lib/ld.so $CURDIR/crispasr "$@"
fi
echo "Using library: $LIBRARY"
exec $CURDIR/crispasr "$@"

View File

@@ -8,6 +8,6 @@ import (
func assert(cond bool, msg string) {
if !cond {
xlog.Fatal().Stack().Msg(msg)
xlog.Fatal(msg)
}
}

View File

@@ -1,7 +1,22 @@
package main
// This is a wrapper to statisfy the GRPC service interface
// It is meant to be used by the main executable that is the server for the specific backend type (falcon, gpt3, etc)
// LocalAI's in-process vector store, exposed as a gRPC backend. Keep
// the implementation here — NOT in a pkg/ library imported by the main
// LocalAI process. The whole point of the gRPC surface is that vector
// storage is a backend like any other (local-store, qdrant, pinecone,
// ...) and can be swapped without changing the routing/recognition
// code that consumes it.
//
// Storage is a sorted parallel-slice (keys [][]float32, values
// [][]byte). Set/Delete preserve the sort so Get can binary-search.
// Find scans linearly and uses a heap to keep the top-K — fine for
// the tens-to-thousands range. The "normalized fast path" (Find when
// every stored key has unit magnitude AND the query is normalized)
// skips the per-item magnitude calculation.
//
// Concurrency: base.SingleThread serialises gRPC calls so the
// non-thread-safe slice/heap manipulation here is sound.
import (
"container/heap"
"fmt"
@@ -10,30 +25,27 @@ import (
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/xlog"
"github.com/mudler/LocalAI/pkg/store"
)
type Store struct {
base.SingleThread
// The sorted keys
keys [][]float32
// The sorted values
keys [][]float32
values [][]byte
// If for every K it holds that ||k||^2 = 1, then we can use the normalized distance functions
// TODO: Should we normalize incoming keys if they are not instead?
// keysAreNormalized stays true until any non-unit-magnitude key
// is added; once false, the magnitude-aware fallback path is
// used by Find. Re-evaluated only at Set time, never again on
// its own — a deletion of the offending key does NOT flip it
// back to true (the bookkeeping cost would dominate the gain).
keysAreNormalized bool
// The first key decides the length of the keys
keyLen int
}
// TODO: Only used for sorting using Go's builtin implementation. The interfaces are columnar because
// that's theoretically best for memory layout and cache locality, but this isn't optimized yet.
type Pair struct {
Key []float32
Value []byte
// keyLen is the dimension of every stored key. -1 means "no
// keys yet, dimension is open". Dimension mismatch on Set is
// rejected so cosine similarity (which requires equal-length
// vectors) doesn't silently mis-match.
keyLen int
}
func NewStore() *Store {
@@ -45,334 +57,278 @@ func NewStore() *Store {
}
}
func compareSlices(k1, k2 []float32) int {
assert(len(k1) == len(k2), fmt.Sprintf("compareSlices: len(k1) = %d, len(k2) = %d", len(k1), len(k2)))
return slices.Compare(k1, k2)
}
func hasKey(unsortedSlice [][]float32, target []float32) bool {
return slices.ContainsFunc(unsortedSlice, func(k []float32) bool {
return compareSlices(k, target) == 0
})
}
func findInSortedSlice(sortedSlice [][]float32, target []float32) (int, bool) {
return slices.BinarySearchFunc(sortedSlice, target, func(k, t []float32) int {
return compareSlices(k, t)
})
}
func isSortedPairs(kvs []Pair) bool {
for i := 1; i < len(kvs); i++ {
if compareSlices(kvs[i-1].Key, kvs[i].Key) > 0 {
return false
}
}
return true
}
func isSortedKeys(keys [][]float32) bool {
for i := 1; i < len(keys); i++ {
if compareSlices(keys[i-1], keys[i]) > 0 {
return false
}
}
return true
}
func sortIntoKeySlicese(keys []*pb.StoresKey) [][]float32 {
ks := make([][]float32, len(keys))
for i, k := range keys {
ks[i] = k.Floats
}
slices.SortFunc(ks, compareSlices)
assert(len(ks) == len(keys), fmt.Sprintf("len(ks) = %d, len(keys) = %d", len(ks), len(keys)))
assert(isSortedKeys(ks), "keys are not sorted")
return ks
}
// Load is a no-op — local-store has no on-disk artefact. opts.Model is
// just a namespace identifier; isolation is already handled upstream
// (ModelLoader spawns a fresh local-store process per (backend,
// model) tuple, so each namespace is its own Store{} instance).
func (s *Store) Load(opts *pb.ModelOptions) error {
// local-store is an in-memory vector store with no on-disk artefact to
// load — opts.Model is just a namespace identifier. The old `!= ""` guard
// rejected any non-empty model name with "not implemented", which broke
// callers that pass a namespace to isolate embedding spaces (face vs.
// voice biometrics both go through local-store but need distinct stores
// so ArcFace 512-D and ECAPA-TDNN 192-D don't collide). Namespace
// isolation is already handled upstream: ModelLoader spawns a fresh
// local-store process per (backend, model) tuple, so each namespace is
// its own Store{} instance. Nothing to do here beyond accepting the load.
_ = opts
return nil
}
// Sort the incoming kvs and merge them with the existing sorted kvs
func (s *Store) StoresSet(opts *pb.StoresSetOptions) error {
if len(opts.Keys) == 0 {
return fmt.Errorf("no keys to add")
keys := store.UnwrapKeys(opts.Keys)
values := store.UnwrapValues(opts.Values)
if len(keys) == 0 {
return fmt.Errorf("local-store: Set: no keys to add")
}
if len(opts.Keys) != len(opts.Values) {
return fmt.Errorf("len(keys) = %d, len(values) = %d", len(opts.Keys), len(opts.Values))
if len(keys) != len(values) {
return fmt.Errorf("local-store: Set: len(keys) = %d, len(values) = %d", len(keys), len(values))
}
if s.keyLen == -1 {
s.keyLen = len(opts.Keys[0].Floats)
} else {
if len(opts.Keys[0].Floats) != s.keyLen {
return fmt.Errorf("Try to add key with length %d when existing length is %d", len(opts.Keys[0].Floats), s.keyLen)
}
s.keyLen = len(keys[0])
} else if len(keys[0]) != s.keyLen {
return fmt.Errorf("local-store: Set: key length %d does not match existing %d", len(keys[0]), s.keyLen)
}
kvs := make([]Pair, len(opts.Keys))
for i, k := range opts.Keys {
if s.keysAreNormalized && !isNormalized(k.Floats) {
kvs := make([]incomingPair, len(keys))
for i, k := range keys {
if len(k) != s.keyLen {
return fmt.Errorf("local-store: Set: key %d length %d does not match existing %d", i, len(k), s.keyLen)
}
if s.keysAreNormalized && !isNormalized(k) {
s.keysAreNormalized = false
var sample []float32
if len(s.keys) > 5 {
sample = k.Floats[:5]
} else {
sample = k.Floats
}
xlog.Debug("Key is not normalized", "sample", sample)
}
kvs[i] = Pair{
Key: k.Floats,
Value: opts.Values[i].Bytes,
}
kvs[i] = incomingPair{key: k, value: values[i]}
}
slices.SortFunc(kvs, func(a, b Pair) int {
return compareSlices(a.Key, b.Key)
})
assert(len(kvs) == len(opts.Keys), fmt.Sprintf("len(kvs) = %d, len(opts.Keys) = %d", len(kvs), len(opts.Keys)))
assert(isSortedPairs(kvs), "keys are not sorted")
l := len(kvs) + len(s.keys)
merge_ks := make([][]float32, 0, l)
merge_vs := make([][]byte, 0, l)
i, j := 0, 0
for {
if i+j >= l {
break
}
if i >= len(kvs) {
merge_ks = append(merge_ks, s.keys[j])
merge_vs = append(merge_vs, s.values[j])
j++
continue
}
if j >= len(s.keys) {
merge_ks = append(merge_ks, kvs[i].Key)
merge_vs = append(merge_vs, kvs[i].Value)
i++
continue
}
c := compareSlices(kvs[i].Key, s.keys[j])
if c < 0 {
merge_ks = append(merge_ks, kvs[i].Key)
merge_vs = append(merge_vs, kvs[i].Value)
i++
} else if c > 0 {
merge_ks = append(merge_ks, s.keys[j])
merge_vs = append(merge_vs, s.values[j])
j++
} else {
merge_ks = append(merge_ks, kvs[i].Key)
merge_vs = append(merge_vs, kvs[i].Value)
i++
j++
}
}
assert(len(merge_ks) == l, fmt.Sprintf("len(merge_ks) = %d, l = %d", len(merge_ks), l))
assert(isSortedKeys(merge_ks), "merge keys are not sorted")
s.keys = merge_ks
s.values = merge_vs
slices.SortFunc(kvs, func(a, b incomingPair) int { return slices.Compare(a.key, b.key) })
merged := mergeSortedPairs(s.keys, s.values, kvs)
s.keys = merged.keys
s.values = merged.values
assert(slices.IsSortedFunc(s.keys, slices.Compare[[]float32]), "Set: s.keys not sorted post-merge")
assert(len(s.keys) == len(s.values), "Set: keys/values length skew")
return nil
}
func (s *Store) StoresDelete(opts *pb.StoresDeleteOptions) error {
if len(opts.Keys) == 0 {
return fmt.Errorf("no keys to delete")
keys := store.UnwrapKeys(opts.Keys)
if len(keys) == 0 {
return fmt.Errorf("local-store: Delete: no keys to delete")
}
if len(opts.Keys) == 0 {
return fmt.Errorf("no keys to add")
}
if s.keyLen == -1 {
s.keyLen = len(opts.Keys[0].Floats)
} else {
if len(opts.Keys[0].Floats) != s.keyLen {
return fmt.Errorf("Trying to delete key with length %d when existing length is %d", len(opts.Keys[0].Floats), s.keyLen)
}
}
ks := sortIntoKeySlicese(opts.Keys)
l := len(s.keys) - len(ks)
merge_ks := make([][]float32, 0, l)
merge_vs := make([][]byte, 0, l)
tail_ks := s.keys
tail_vs := s.values
for _, k := range ks {
j, found := findInSortedSlice(tail_ks, k)
if found {
merge_ks = append(merge_ks, tail_ks[:j]...)
merge_vs = append(merge_vs, tail_vs[:j]...)
tail_ks = tail_ks[j+1:]
tail_vs = tail_vs[j+1:]
} else {
assert(!hasKey(s.keys, k), fmt.Sprintf("Key exists, but was not found: t=%d, %v", len(tail_ks), k))
}
xlog.Debug("Delete", "found", found, "tailLen", len(tail_ks), "j", j, "mergeKeysLen", len(merge_ks), "mergeValuesLen", len(merge_vs))
}
merge_ks = append(merge_ks, tail_ks...)
merge_vs = append(merge_vs, tail_vs...)
assert(len(merge_ks) <= len(s.keys), fmt.Sprintf("len(merge_ks) = %d, len(s.keys) = %d", len(merge_ks), len(s.keys)))
s.keys = merge_ks
s.values = merge_vs
assert(len(s.keys) >= l, fmt.Sprintf("len(s.keys) = %d, l = %d", len(s.keys), l))
assert(isSortedKeys(s.keys), "keys are not sorted")
assert(func() bool {
for _, k := range ks {
if _, found := findInSortedSlice(s.keys, k); found {
return false
if s.keyLen != -1 {
for i, k := range keys {
if len(k) != s.keyLen {
return fmt.Errorf("local-store: Delete: key %d length %d does not match existing %d", i, len(k), s.keyLen)
}
}
return true
}(), "Keys to delete still present")
if len(s.keys) != l {
xlog.Debug("Delete: Some keys not found", "keysLen", len(s.keys), "expectedLen", l)
}
sortedKeys := append([][]float32(nil), keys...)
slices.SortFunc(sortedKeys, slices.Compare[[]float32])
mergedK := make([][]float32, 0, len(s.keys))
mergedV := make([][]byte, 0, len(s.keys))
tailK := s.keys
tailV := s.values
for _, k := range sortedKeys {
j, ok := slices.BinarySearchFunc(tailK, k, slices.Compare[[]float32])
if ok {
mergedK = append(mergedK, tailK[:j]...)
mergedV = append(mergedV, tailV[:j]...)
tailK = tailK[j+1:]
tailV = tailV[j+1:]
}
}
mergedK = append(mergedK, tailK...)
mergedV = append(mergedV, tailV...)
s.keys = mergedK
s.values = mergedV
assert(slices.IsSortedFunc(s.keys, slices.Compare[[]float32]), "Delete: s.keys not sorted post-merge")
assert(len(s.keys) == len(s.values), "Delete: keys/values length skew")
return nil
}
// StoresGet fetches values for the given keys. Missing keys are
// omitted from the result rather than reported as an error — callers
// compare returned-key length against requested-key length to detect
// them. Returned slices are aligned.
func (s *Store) StoresGet(opts *pb.StoresGetOptions) (pb.StoresGetResult, error) {
pbKeys := make([]*pb.StoresKey, 0, len(opts.Keys))
pbValues := make([]*pb.StoresValue, 0, len(opts.Keys))
ks := sortIntoKeySlicese(opts.Keys)
keys := store.UnwrapKeys(opts.Keys)
if len(s.keys) == 0 {
xlog.Debug("Get: No keys in store")
return pb.StoresGetResult{}, nil
}
if s.keyLen == -1 {
s.keyLen = len(opts.Keys[0].Floats)
} else {
if len(opts.Keys[0].Floats) != s.keyLen {
return pb.StoresGetResult{}, fmt.Errorf("Try to get a key with length %d when existing length is %d", len(opts.Keys[0].Floats), s.keyLen)
if s.keyLen != -1 {
for i, k := range keys {
if len(k) != s.keyLen {
return pb.StoresGetResult{}, fmt.Errorf("local-store: Get: key %d length %d does not match existing %d", i, len(k), s.keyLen)
}
}
}
sortedKeys := append([][]float32(nil), keys...)
slices.SortFunc(sortedKeys, slices.Compare[[]float32])
tail_k := s.keys
tail_v := s.values
for i, k := range ks {
j, found := findInSortedSlice(tail_k, k)
if found {
pbKeys = append(pbKeys, &pb.StoresKey{
Floats: k,
})
pbValues = append(pbValues, &pb.StoresValue{
Bytes: tail_v[j],
})
tail_k = tail_k[j+1:]
tail_v = tail_v[j+1:]
} else {
assert(!hasKey(s.keys, k), fmt.Sprintf("Key exists, but was not found: i=%d, %v", i, k))
var foundKeys [][]float32
var foundValues [][]byte
tailK := s.keys
tailV := s.values
for _, k := range sortedKeys {
j, ok := slices.BinarySearchFunc(tailK, k, slices.Compare[[]float32])
if !ok {
continue
}
foundKeys = append(foundKeys, tailK[j])
foundValues = append(foundValues, tailV[j])
tailK = tailK[j+1:]
tailV = tailV[j+1:]
}
if len(pbKeys) != len(opts.Keys) {
xlog.Debug("Get: Some keys not found", "pbKeysLen", len(pbKeys), "optsKeysLen", len(opts.Keys), "storeKeysLen", len(s.keys))
}
return pb.StoresGetResult{
Keys: pbKeys,
Values: pbValues,
Keys: store.WrapKeys(foundKeys),
Values: store.WrapValues(foundValues),
}, nil
}
// StoresFind returns the topK nearest stored entries by cosine
// similarity, ordered most-similar first. An empty store returns
// empty slices and no error.
func (s *Store) StoresFind(opts *pb.StoresFindOptions) (pb.StoresFindResult, error) {
query := opts.Key.Floats
topK := int(opts.TopK)
if topK < 1 {
return pb.StoresFindResult{}, fmt.Errorf("local-store: Find: topK = %d, must be >= 1", topK)
}
if len(s.keys) == 0 {
return pb.StoresFindResult{}, nil
}
if len(query) != s.keyLen {
return pb.StoresFindResult{}, fmt.Errorf("local-store: Find: query length %d does not match existing %d", len(query), s.keyLen)
}
var keys [][]float32
var values [][]byte
var sims []float32
if s.keysAreNormalized && isNormalized(query) {
keys, values, sims = s.findNormalized(query, topK)
} else {
keys, values, sims = s.findFallback(query, topK)
}
return pb.StoresFindResult{
Keys: store.WrapKeys(keys),
Values: store.WrapValues(values),
Similarities: sims,
}, nil
}
func (s *Store) findNormalized(query []float32, topK int) (keys [][]float32, values [][]byte, similarities []float32) {
assert(s.keysAreNormalized, "findNormalized: s.keysAreNormalized is false")
assert(isNormalized(query), "findNormalized: query is not unit-length")
pq := make(priorityQueue, 0, topK)
heap.Init(&pq)
for i, k := range s.keys {
var dot float32
for j := range k {
dot += query[j] * k[j]
}
assert(dot >= -1.01 && dot <= 1.01, fmt.Sprintf("findNormalized: dot %f out of [-1, 1] — keysAreNormalized invariant violated", dot))
heap.Push(&pq, &priorityItem{similarity: dot, key: k, value: s.values[i]})
if pq.Len() > topK {
heap.Pop(&pq)
}
}
return drainPQ(&pq)
}
func (s *Store) findFallback(query []float32, topK int) (keys [][]float32, values [][]byte, similarities []float32) {
var qmag float64
for _, v := range query {
qmag += float64(v) * float64(v)
}
qmag = math.Sqrt(qmag)
pq := make(priorityQueue, 0, topK)
heap.Init(&pq)
for i, k := range s.keys {
var dot, kmag float64
for j := range k {
dot += float64(query[j]) * float64(k[j])
kmag += float64(k[j]) * float64(k[j])
}
denom := qmag * math.Sqrt(kmag)
var sim float32
if denom > 0 {
sim = float32(dot / denom)
}
heap.Push(&pq, &priorityItem{similarity: sim, key: k, value: s.values[i]})
if pq.Len() > topK {
heap.Pop(&pq)
}
}
return drainPQ(&pq)
}
func isNormalized(k []float32) bool {
var sum float64
for _, v := range k {
v64 := float64(v)
sum += v64 * v64
sum += float64(v) * float64(v)
}
s := math.Sqrt(sum)
return s >= 0.99 && s <= 1.01
mag := math.Sqrt(sum)
return mag >= 0.99 && mag <= 1.01
}
// TODO: This we could replace with handwritten SIMD code
func normalizedCosineSimilarity(k1, k2 []float32) float32 {
assert(len(k1) == len(k2), fmt.Sprintf("normalizedCosineSimilarity: len(k1) = %d, len(k2) = %d", len(k1), len(k2)))
type incomingPair struct {
key []float32
value []byte
}
var dot float32
for i := range len(k1) {
dot += k1[i] * k2[i]
type pairs struct {
keys [][]float32
values [][]byte
}
// mergeSortedPairs merges (existing, incoming) into a fresh sorted
// slice. Equal keys take the incoming value — Set is upsert.
func mergeSortedPairs(existingK [][]float32, existingV [][]byte, incoming []incomingPair) pairs {
assert(slices.IsSortedFunc(existingK, slices.Compare[[]float32]), "mergeSortedPairs: existing not sorted")
assert(slices.IsSortedFunc(incoming, func(a, b incomingPair) int { return slices.Compare(a.key, b.key) }), "mergeSortedPairs: incoming not sorted")
l := len(existingK) + len(incoming)
mk := make([][]float32, 0, l)
mv := make([][]byte, 0, l)
i, j := 0, 0
for i < len(incoming) || j < len(existingK) {
switch {
case j >= len(existingK):
mk = append(mk, incoming[i].key)
mv = append(mv, incoming[i].value)
i++
case i >= len(incoming):
mk = append(mk, existingK[j])
mv = append(mv, existingV[j])
j++
default:
c := slices.Compare(incoming[i].key, existingK[j])
switch {
case c < 0:
mk = append(mk, incoming[i].key)
mv = append(mv, incoming[i].value)
i++
case c > 0:
mk = append(mk, existingK[j])
mv = append(mv, existingV[j])
j++
default:
mk = append(mk, incoming[i].key)
mv = append(mv, incoming[i].value)
i++
j++
}
}
}
assert(dot >= -1.01 && dot <= 1.01, fmt.Sprintf("dot = %f", dot))
// 2.0 * (1.0 - dot) would be the Euclidean distance
return dot
return pairs{keys: mk, values: mv}
}
type PriorityItem struct {
Similarity float32
Key []float32
Value []byte
type priorityItem struct {
similarity float32
key []float32
value []byte
}
type PriorityQueue []*PriorityItem
type priorityQueue []*priorityItem
func (pq PriorityQueue) Len() int { return len(pq) }
func (pq PriorityQueue) Less(i, j int) bool {
// Inverted because the most similar should be at the top
return pq[i].Similarity < pq[j].Similarity
}
func (pq PriorityQueue) Swap(i, j int) {
pq[i], pq[j] = pq[j], pq[i]
}
func (pq *PriorityQueue) Push(x any) {
item := x.(*PriorityItem)
*pq = append(*pq, item)
}
func (pq *PriorityQueue) Pop() any {
func (pq priorityQueue) Len() int { return len(pq) }
func (pq priorityQueue) Less(i, j int) bool { return pq[i].similarity < pq[j].similarity }
func (pq priorityQueue) Swap(i, j int) { pq[i], pq[j] = pq[j], pq[i] }
func (pq *priorityQueue) Push(x any) { *pq = append(*pq, x.(*priorityItem)) }
func (pq *priorityQueue) Pop() any {
old := *pq
n := len(old)
item := old[n-1]
@@ -380,142 +336,16 @@ func (pq *PriorityQueue) Pop() any {
return item
}
func (s *Store) StoresFindNormalized(opts *pb.StoresFindOptions) (pb.StoresFindResult, error) {
tk := opts.Key.Floats
top_ks := make(PriorityQueue, 0, int(opts.TopK))
heap.Init(&top_ks)
for i, k := range s.keys {
sim := normalizedCosineSimilarity(tk, k)
heap.Push(&top_ks, &PriorityItem{
Similarity: sim,
Key: k,
Value: s.values[i],
})
if top_ks.Len() > int(opts.TopK) {
heap.Pop(&top_ks)
}
}
similarities := make([]float32, top_ks.Len())
pbKeys := make([]*pb.StoresKey, top_ks.Len())
pbValues := make([]*pb.StoresValue, top_ks.Len())
for i := top_ks.Len() - 1; i >= 0; i-- {
item := heap.Pop(&top_ks).(*PriorityItem)
similarities[i] = item.Similarity
pbKeys[i] = &pb.StoresKey{
Floats: item.Key,
}
pbValues[i] = &pb.StoresValue{
Bytes: item.Value,
}
}
return pb.StoresFindResult{
Keys: pbKeys,
Values: pbValues,
Similarities: similarities,
}, nil
}
func cosineSimilarity(k1, k2 []float32, mag1 float64) float32 {
assert(len(k1) == len(k2), fmt.Sprintf("cosineSimilarity: len(k1) = %d, len(k2) = %d", len(k1), len(k2)))
var dot, mag2 float64
for i := range len(k1) {
dot += float64(k1[i] * k2[i])
mag2 += float64(k2[i] * k2[i])
}
sim := float32(dot / (mag1 * math.Sqrt(mag2)))
assert(sim >= -1.01 && sim <= 1.01, fmt.Sprintf("sim = %f", sim))
return sim
}
func (s *Store) StoresFindFallback(opts *pb.StoresFindOptions) (pb.StoresFindResult, error) {
tk := opts.Key.Floats
top_ks := make(PriorityQueue, 0, int(opts.TopK))
heap.Init(&top_ks)
var mag1 float64
for _, v := range tk {
mag1 += float64(v * v)
}
mag1 = math.Sqrt(mag1)
for i, k := range s.keys {
dist := cosineSimilarity(tk, k, mag1)
heap.Push(&top_ks, &PriorityItem{
Similarity: dist,
Key: k,
Value: s.values[i],
})
if top_ks.Len() > int(opts.TopK) {
heap.Pop(&top_ks)
}
}
similarities := make([]float32, top_ks.Len())
pbKeys := make([]*pb.StoresKey, top_ks.Len())
pbValues := make([]*pb.StoresValue, top_ks.Len())
for i := top_ks.Len() - 1; i >= 0; i-- {
item := heap.Pop(&top_ks).(*PriorityItem)
similarities[i] = item.Similarity
pbKeys[i] = &pb.StoresKey{
Floats: item.Key,
}
pbValues[i] = &pb.StoresValue{
Bytes: item.Value,
}
}
return pb.StoresFindResult{
Keys: pbKeys,
Values: pbValues,
Similarities: similarities,
}, nil
}
func (s *Store) StoresFind(opts *pb.StoresFindOptions) (pb.StoresFindResult, error) {
tk := opts.Key.Floats
if len(tk) != s.keyLen {
return pb.StoresFindResult{}, fmt.Errorf("Try to find key with length %d when existing length is %d", len(tk), s.keyLen)
}
if opts.TopK < 1 {
return pb.StoresFindResult{}, fmt.Errorf("opts.TopK = %d, must be >= 1", opts.TopK)
}
if s.keyLen == -1 {
s.keyLen = len(opts.Key.Floats)
} else {
if len(opts.Key.Floats) != s.keyLen {
return pb.StoresFindResult{}, fmt.Errorf("Try to add key with length %d when existing length is %d", len(opts.Key.Floats), s.keyLen)
}
}
if s.keysAreNormalized && isNormalized(tk) {
return s.StoresFindNormalized(opts)
} else {
if s.keysAreNormalized {
var sample []float32
if len(s.keys) > 5 {
sample = tk[:5]
} else {
sample = tk
}
xlog.Debug("Trying to compare non-normalized key with normalized keys", "sample", sample)
}
return s.StoresFindFallback(opts)
func drainPQ(pq *priorityQueue) (keys [][]float32, values [][]byte, similarities []float32) {
n := pq.Len()
keys = make([][]float32, n)
values = make([][]byte, n)
similarities = make([]float32, n)
for i := n - 1; i >= 0; i-- {
item := heap.Pop(pq).(*priorityItem)
keys[i] = item.key
values[i] = item.value
similarities[i] = item.similarity
}
return keys, values, similarities
}

View File

@@ -0,0 +1,13 @@
package main
import (
"testing"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestLocalStore(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "local-store test suite")
}

View File

@@ -0,0 +1,284 @@
package main
// Regression suite for the local-store gRPC backend. Exercises the
// Stores{Set,Get,Find,Delete} surface — the only public contract.
// Callers (face/voice recognition, the routing KNN classifier) reach
// this code via grpc.Backend, so testing at the wire-shaped boundary
// matches the production import shape.
import (
"math"
"math/rand/v2"
"testing"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("StoresSet", func() {
It("rejects empty input", func() {
Expect(NewStore().StoresSet(&pb.StoresSetOptions{})).NotTo(Succeed(), "Set with no keys should fail")
})
It("rejects key/value length mismatch", func() {
err := NewStore().StoresSet(&pb.StoresSetOptions{
Keys: wrapKeys([][]float32{{1, 0, 0}}),
Values: wrapValues([][]byte{[]byte("a"), []byte("b")}),
})
Expect(err).To(HaveOccurred(), "len(keys) != len(values) should fail")
})
It("rejects dimension mismatch on later add", func() {
s := NewStore()
mustSet(s, [][]float32{{1, 0, 0}}, [][]byte{[]byte("3d")})
err := s.StoresSet(&pb.StoresSetOptions{
Keys: wrapKeys([][]float32{{1, 0}}),
Values: wrapValues([][]byte{[]byte("2d")}),
})
Expect(err).To(HaveOccurred(), "dimension mismatch on later Set should fail")
})
It("rejects dimension mismatch within batch", func() {
err := NewStore().StoresSet(&pb.StoresSetOptions{
Keys: wrapKeys([][]float32{{1, 0, 0}, {1, 0}}),
Values: wrapValues([][]byte{[]byte("3d"), []byte("2d")}),
})
Expect(err).To(HaveOccurred(), "mixed-dimension within one batch should fail")
})
It("merges sorted and updates existing key", func() {
s := NewStore()
mustSet(s, [][]float32{{0.3, 0, 0}, {0.1, 0, 0}}, [][]byte{[]byte("c"), []byte("a")})
mustSet(s, [][]float32{{0.2, 0, 0}, {0.1, 0, 0}}, [][]byte{[]byte("b"), []byte("a-updated")})
Expect(s.keys).To(HaveLen(3))
got := singleGet(s, []float32{0.1, 0, 0})
Expect(string(got)).To(Equal("a-updated"))
})
})
var _ = Describe("StoresGet", func() {
It("round-trips multi-key", func() {
s := NewStore()
mustSet(s,
[][]float32{{0.1, 0.2, 0.3}, {0.4, 0.5, 0.6}, {0.7, 0.8, 0.9}},
[][]byte{[]byte("a"), []byte("b"), []byte("c")},
)
res, err := s.StoresGet(&pb.StoresGetOptions{
Keys: wrapKeys([][]float32{{0.7, 0.8, 0.9}, {0.1, 0.2, 0.3}}),
})
Expect(err).NotTo(HaveOccurred())
Expect(res.Keys).To(HaveLen(2))
})
It("omits missing keys rather than erroring", func() {
s := NewStore()
mustSet(s, [][]float32{{0.1, 0, 0}}, [][]byte{[]byte("a")})
res, err := s.StoresGet(&pb.StoresGetOptions{
Keys: wrapKeys([][]float32{{0.1, 0, 0}, {0.9, 0, 0}}),
})
Expect(err).NotTo(HaveOccurred())
Expect(res.Keys).To(HaveLen(1))
})
})
var _ = Describe("StoresDelete", func() {
It("removes and preserves sort", func() {
s := NewStore()
mustSet(s,
[][]float32{{0.1, 0, 0}, {0.2, 0, 0}, {0.3, 0, 0}, {0.4, 0, 0}},
[][]byte{[]byte("a"), []byte("b"), []byte("c"), []byte("d")},
)
Expect(s.StoresDelete(&pb.StoresDeleteOptions{
Keys: wrapKeys([][]float32{{0.2, 0, 0}, {0.4, 0, 0}}),
})).To(Succeed())
Expect(s.keys).To(HaveLen(2))
})
It("tolerates missing keys", func() {
s := NewStore()
mustSet(s, [][]float32{{0.1, 0, 0}}, [][]byte{[]byte("a")})
Expect(s.StoresDelete(&pb.StoresDeleteOptions{
Keys: wrapKeys([][]float32{{0.9, 0, 0}}),
})).To(Succeed(), "delete of missing key should succeed")
Expect(s.keys).To(HaveLen(1))
})
})
var _ = Describe("StoresFind", func() {
It("returns normalized top-K", func() {
s := NewStore()
mustSet(s,
[][]float32{
normalizeVec([]float32{1, 0, 0}),
normalizeVec([]float32{0, 1, 0}),
normalizeVec([]float32{0, 0, 1}),
},
[][]byte{[]byte("x"), []byte("y"), []byte("z")},
)
res, err := s.StoresFind(&pb.StoresFindOptions{
Key: &pb.StoresKey{Floats: normalizeVec([]float32{0.9, 0.1, 0})},
TopK: 2,
})
Expect(err).NotTo(HaveOccurred())
Expect(res.Keys).To(HaveLen(2))
Expect(res.Similarities[0]).To(BeNumerically(">=", res.Similarities[1]), "results not sorted desc by similarity")
Expect(string(res.Values[0].Bytes)).To(Equal("x"))
})
It("falls back for non-normalized keys", func() {
s := NewStore()
mustSet(s, [][]float32{{2, 0, 0}, {0, 3, 0}}, [][]byte{[]byte("x"), []byte("y")})
Expect(s.keysAreNormalized).To(BeFalse(), "store should report non-normalized after Set with magnitude > 1")
res, err := s.StoresFind(&pb.StoresFindOptions{
Key: &pb.StoresKey{Floats: []float32{4, 0, 0}},
TopK: 1,
})
Expect(err).NotTo(HaveOccurred())
Expect(string(res.Values[0].Bytes)).To(Equal("x"))
Expect(res.Similarities[0]).To(BeNumerically(">=", float32(0.99)))
Expect(res.Similarities[0]).To(BeNumerically("<=", float32(1.01)))
})
It("rejects zero topK", func() {
s := NewStore()
mustSet(s, [][]float32{{1, 0, 0}}, [][]byte{[]byte("x")})
_, err := s.StoresFind(&pb.StoresFindOptions{
Key: &pb.StoresKey{Floats: []float32{1, 0, 0}},
TopK: 0,
})
Expect(err).To(HaveOccurred(), "Find with topK=0 should fail")
})
It("rejects dimension mismatch", func() {
s := NewStore()
mustSet(s, [][]float32{{1, 0, 0}}, [][]byte{[]byte("x")})
_, err := s.StoresFind(&pb.StoresFindOptions{
Key: &pb.StoresKey{Floats: []float32{1, 0}},
TopK: 1,
})
Expect(err).To(HaveOccurred(), "Find with mismatched dimension should fail")
})
It("returns empty result on empty store", func() {
res, err := NewStore().StoresFind(&pb.StoresFindOptions{
Key: &pb.StoresKey{Floats: []float32{1, 0, 0}},
TopK: 5,
})
Expect(err).NotTo(HaveOccurred(), "Find on empty store should succeed")
Expect(res.Keys).To(BeEmpty())
})
It("handles topK larger than store", func() {
s := NewStore()
mustSet(s,
[][]float32{normalizeVec([]float32{1, 0, 0}), normalizeVec([]float32{0, 1, 0})},
[][]byte{[]byte("x"), []byte("y")},
)
res, err := s.StoresFind(&pb.StoresFindOptions{
Key: &pb.StoresKey{Floats: normalizeVec([]float32{1, 0, 0})},
TopK: 10,
})
Expect(err).NotTo(HaveOccurred())
Expect(res.Keys).To(HaveLen(2))
})
})
var _ = Describe("StoresLoad", func() {
It("is a no-op", func() {
Expect(NewStore().Load(&pb.ModelOptions{Model: "any-namespace"})).To(Succeed())
})
})
func BenchmarkStoresFindNormalized(b *testing.B) {
const dim = 768
for _, n := range []int{8, 32, 128, 512} {
b.Run(fmtN(n), func(b *testing.B) {
s := buildStore(b, n, dim)
query := normalizeVec(randVec(dim, 42))
req := &pb.StoresFindOptions{Key: &pb.StoresKey{Floats: query}, TopK: 1}
b.ResetTimer()
for i := 0; i < b.N; i++ {
if _, err := s.StoresFind(req); err != nil {
b.Fatal(err)
}
}
})
}
}
// --- test helpers ---
func mustSet(s *Store, keys [][]float32, values [][]byte) {
ExpectWithOffset(1, s.StoresSet(&pb.StoresSetOptions{Keys: wrapKeys(keys), Values: wrapValues(values)})).To(Succeed())
}
func singleGet(s *Store, key []float32) []byte {
res, err := s.StoresGet(&pb.StoresGetOptions{Keys: wrapKeys([][]float32{key})})
ExpectWithOffset(1, err).NotTo(HaveOccurred())
if len(res.Values) == 0 {
return nil
}
return res.Values[0].Bytes
}
func wrapKeys(in [][]float32) []*pb.StoresKey {
out := make([]*pb.StoresKey, len(in))
for i, k := range in {
out[i] = &pb.StoresKey{Floats: k}
}
return out
}
func wrapValues(in [][]byte) []*pb.StoresValue {
out := make([]*pb.StoresValue, len(in))
for i, v := range in {
out[i] = &pb.StoresValue{Bytes: v}
}
return out
}
func buildStore(tb testing.TB, n, dim int) *Store {
tb.Helper()
s := NewStore()
keys := make([][]float32, n)
values := make([][]byte, n)
for i := 0; i < n; i++ {
keys[i] = normalizeVec(randVec(dim, int64(i)+1))
values[i] = []byte{byte(i)}
}
if err := s.StoresSet(&pb.StoresSetOptions{Keys: wrapKeys(keys), Values: wrapValues(values)}); err != nil {
tb.Fatal(err)
}
return s
}
func randVec(dim int, seed int64) []float32 {
r := rand.New(rand.NewPCG(uint64(seed), 0xabcdef))
v := make([]float32, dim)
for i := range v {
v[i] = float32(r.NormFloat64())
}
return v
}
func normalizeVec(v []float32) []float32 {
var sum float64
for _, x := range v {
sum += float64(x) * float64(x)
}
mag := math.Sqrt(sum)
if mag == 0 {
return v
}
out := make([]float32, len(v))
for i, x := range v {
out[i] = float32(float64(x) / mag)
}
return out
}
func fmtN(n int) string {
return map[int]string{8: "n=8", 32: "n=32", 128: "n=128", 512: "n=512"}[n]
}

View File

@@ -9,7 +9,7 @@ JOBS?=$(shell nproc --ignore=1)
# LocalVQE upstream version pin. Bump to a specific commit when picking up
# a new release; `main` works for development but is not reproducible.
LOCALVQE_REPO?=https://github.com/localai-org/LocalVQE
LOCALVQE_VERSION?=72bfb4c6
LOCALVQE_VERSION?=b0f0378a450e87c871b85689554801601ca56d98
# LocalVQE handles CPU feature selection internally (it ships the multiple
# libggml-cpu-*.so variants and its loader picks the best one at runtime
@@ -27,7 +27,8 @@ endif
# LocalVQE upstream supports CPU + Vulkan only. Other BUILD_TYPE values
# fall through to the default CPU build — Vulkan is already as fast as the
# specialised GPU paths would be on this 1.3 M-parameter model.
# specialised GPU paths would be on these small (1.3 M4.8 M parameter)
# models.
ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON -DLOCALVQE_VULKAN=ON
else ifeq ($(OS),Darwin)

View File

@@ -3,7 +3,6 @@ package main
import (
"encoding/binary"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
@@ -11,6 +10,7 @@ import (
"strings"
"unsafe"
"github.com/go-audio/wav"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/xlog"
@@ -46,24 +46,24 @@ const (
// through the options builder (CppOptionsNew + setters + CppNewWithOptions)
// — the bare localvqe_new path doesn't expose backend / device selection.
var (
CppOptionsNew func() uintptr
CppOptionsFree func(opts uintptr)
CppOptionsSetModelPath func(opts uintptr, modelPath string) int32
CppOptionsSetBackend func(opts uintptr, backend string) int32
CppOptionsSetDevice func(opts uintptr, device int32) int32
CppNewWithOptions func(opts uintptr) uintptr
CppFree func(ctx uintptr)
CppProcessF32 func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
CppProcessS16 func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
CppProcessFrameF32 func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
CppProcessFrameS16 func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
CppReset func(ctx uintptr)
CppLastError func(ctx uintptr) string
CppSampleRate func(ctx uintptr) int32
CppHopLength func(ctx uintptr) int32
CppFFTSize func(ctx uintptr) int32
CppSetNoiseGate func(ctx uintptr, enabled int32, thresholdDBFS float32) int32
CppGetNoiseGate func(ctx uintptr, enabledOut, thresholdDBFSOut uintptr) int32
CppOptionsNew func() uintptr
CppOptionsFree func(opts uintptr)
CppOptionsSetModelPath func(opts uintptr, modelPath string) int32
CppOptionsSetBackend func(opts uintptr, backend string) int32
CppOptionsSetDevice func(opts uintptr, device int32) int32
CppNewWithOptions func(opts uintptr) uintptr
CppFree func(ctx uintptr)
CppProcessF32 func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
CppProcessS16 func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
CppProcessFrameF32 func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
CppProcessFrameS16 func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
CppReset func(ctx uintptr)
CppLastError func(ctx uintptr) string
CppSampleRate func(ctx uintptr) int32
CppHopLength func(ctx uintptr) int32
CppFFTSize func(ctx uintptr) int32
CppSetNoiseGate func(ctx uintptr, enabled int32, thresholdDBFS float32) int32
CppGetNoiseGate func(ctx uintptr, enabledOut, thresholdDBFSOut uintptr) int32
)
// LocalVQE speaks gRPC against LocalVQE's flat C ABI. The streaming
@@ -490,11 +490,14 @@ func (v *LocalVQE) applyStreamConfig(cfg *pb.AudioTransformStreamConfig) error {
// ---- WAV I/O ----------------------------------------------------------
//
// Minimal mono PCM WAV reader/writer. Only handles the subset LocalVQE
// cares about (mono, 16-bit signed, no extensible chunks). For broader
// audio support the HTTP layer's `audio.NormalizeAudioFile` already
// converts arbitrary input to a canonical WAV before we see it; this
// reader just decodes the canonical shape.
// Reader/writer for the mono 16-bit PCM shape LocalVQE works with. Decoding
// goes through the shared go-audio/wav decoder (as the whisper and parakeet
// backends do) so RIFF chunk walking is handled robustly — an 18/40-byte
// extensible `fmt ` chunk, or JUNK/bext/LIST metadata before or after `data`
// (e.g. ffmpeg's trailing "Lavf" tag), is skipped rather than spliced into
// the PCM stream as an audible click. The HTTP layer normalises arbitrary
// input to WAV before we see it, but that WAV is ffmpeg output and is not
// guaranteed to be the canonical 44-byte layout.
func readMonoWAVf32(path string) ([]float32, int, error) {
f, err := os.Open(path)
@@ -502,35 +505,26 @@ func readMonoWAVf32(path string) ([]float32, int, error) {
return nil, 0, err
}
defer func() { _ = f.Close() }()
header := make([]byte, 44)
if _, err := io.ReadFull(f, header); err != nil {
return nil, 0, err
buf, err := wav.NewDecoder(f).FullPCMBuffer()
if err != nil {
return nil, 0, fmt.Errorf("decode WAV: %w", err)
}
if string(header[0:4]) != "RIFF" || string(header[8:12]) != "WAVE" {
if buf == nil || buf.Format == nil {
return nil, 0, fmt.Errorf("not a WAV file")
}
channels := binary.LittleEndian.Uint16(header[22:24])
sampleRate := binary.LittleEndian.Uint32(header[24:28])
bitsPerSample := binary.LittleEndian.Uint16(header[34:36])
if channels != 1 {
return nil, 0, fmt.Errorf("only mono WAV supported (got %d channels)", channels)
if buf.Format.NumChannels != 1 {
return nil, 0, fmt.Errorf("only mono WAV supported (got %d channels)", buf.Format.NumChannels)
}
if bitsPerSample != 16 {
return nil, 0, fmt.Errorf("only 16-bit PCM supported (got %d bits)", bitsPerSample)
if buf.SourceBitDepth != 16 {
return nil, 0, fmt.Errorf("only 16-bit PCM supported (got %d bits)", buf.SourceBitDepth)
}
rest, err := io.ReadAll(f)
if err != nil {
return nil, 0, err
if len(buf.Data) == 0 {
return nil, 0, fmt.Errorf("WAV has no audio data")
}
n := len(rest) / 2
out := make([]float32, n)
for i := 0; i < n; i++ {
s := int16(binary.LittleEndian.Uint16(rest[i*2 : i*2+2]))
out[i] = float32(s) / 32768.0
}
return out, int(sampleRate), nil
// AsFloat32Buffer normalises by 2^(bitDepth-1) == /32768 for 16-bit,
// matching the model's expected [-1, 1) input range.
return buf.AsFloat32Buffer().Data, buf.Format.SampleRate, nil
}
func writeMonoWAVf32(path string, samples []float32, sampleRate int) error {
@@ -546,13 +540,13 @@ func writeMonoWAVf32(path string, samples []float32, sampleRate int) error {
binary.LittleEndian.PutUint32(header[4:8], 36+dataLen)
copy(header[8:12], []byte("WAVE"))
copy(header[12:16], []byte("fmt "))
binary.LittleEndian.PutUint32(header[16:20], 16) // fmt chunk size
binary.LittleEndian.PutUint16(header[20:22], 1) // PCM
binary.LittleEndian.PutUint16(header[22:24], 1) // mono
binary.LittleEndian.PutUint32(header[16:20], 16) // fmt chunk size
binary.LittleEndian.PutUint16(header[20:22], 1) // PCM
binary.LittleEndian.PutUint16(header[22:24], 1) // mono
binary.LittleEndian.PutUint32(header[24:28], uint32(sampleRate))
binary.LittleEndian.PutUint32(header[28:32], uint32(sampleRate*2)) // byte rate
binary.LittleEndian.PutUint16(header[32:34], 2) // block align
binary.LittleEndian.PutUint16(header[34:36], 16) // bits per sample
binary.LittleEndian.PutUint16(header[32:34], 2) // block align
binary.LittleEndian.PutUint16(header[34:36], 16) // bits per sample
copy(header[36:40], []byte("data"))
binary.LittleEndian.PutUint32(header[40:44], dataLen)
if _, err := f.Write(header); err != nil {

View File

@@ -1,7 +1,9 @@
package main
import (
"encoding/binary"
"os"
"path/filepath"
"testing"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
@@ -92,6 +94,147 @@ var _ = Describe("LocalVQE-cpp", func() {
})
})
Context("readMonoWAVf32 chunk parsing", func() {
// chunk builds a word-aligned RIFF sub-chunk (id + size + body + pad).
chunk := func(id string, body []byte) []byte {
out := append([]byte(id), 0, 0, 0, 0)
binary.LittleEndian.PutUint32(out[4:8], uint32(len(body)))
out = append(out, body...)
if len(body)&1 == 1 {
out = append(out, 0) // pad byte for odd-sized chunks
}
return out
}
// fmtBody returns a PCM `fmt ` chunk body. extra bytes simulate the
// 18/40-byte extensible form (cbSize + extension).
fmtBody := func(channels, bits uint16, rate uint32, extra int) []byte {
b := make([]byte, 16+extra)
binary.LittleEndian.PutUint16(b[0:2], 1) // PCM
binary.LittleEndian.PutUint16(b[2:4], channels)
binary.LittleEndian.PutUint32(b[4:8], rate)
binary.LittleEndian.PutUint32(b[8:12], rate*uint32(channels)*uint32(bits)/8)
binary.LittleEndian.PutUint16(b[12:14], channels*bits/8)
binary.LittleEndian.PutUint16(b[14:16], bits)
if extra >= 2 {
binary.LittleEndian.PutUint16(b[16:18], uint16(extra-2)) // cbSize
}
return b
}
// pcm encodes int16 samples little-endian.
pcm := func(samples ...int16) []byte {
b := make([]byte, len(samples)*2)
for i, s := range samples {
binary.LittleEndian.PutUint16(b[i*2:i*2+2], uint16(s))
}
return b
}
riff := func(chunks ...[]byte) []byte {
body := []byte("WAVE")
for _, c := range chunks {
body = append(body, c...)
}
out := append([]byte("RIFF"), 0, 0, 0, 0)
binary.LittleEndian.PutUint32(out[4:8], uint32(len(body)))
return append(out, body...)
}
writeWAV := func(b []byte) string {
p := filepath.Join(GinkgoT().TempDir(), "in.wav")
Expect(os.WriteFile(p, b, 0o600)).To(Succeed())
return p
}
// A canonical sample run with distinct values so any off-by-one /
// misalignment shows up as wrong numbers, not just wrong length.
samples := []int16{1000, -2000, 3000, -4000, 5000, -6000}
expectSamples := func(got []float32) {
Expect(got).To(HaveLen(len(samples)))
for i, s := range samples {
Expect(got[i]).To(BeNumerically("~", float32(s)/32768.0, 1e-6))
}
}
It("reads a canonical 44-byte WAV", func() {
p := writeWAV(riff(chunk("fmt ", fmtBody(1, 16, 16000, 0)), chunk("data", pcm(samples...))))
out, sr, err := readMonoWAVf32(p)
Expect(err).ToNot(HaveOccurred())
Expect(sr).To(Equal(16000))
expectSamples(out)
})
It("ignores a LIST/JUNK chunk placed before data (no leading-impulse splice)", func() {
p := writeWAV(riff(
chunk("fmt ", fmtBody(1, 16, 16000, 0)),
chunk("JUNK", []byte("padding-bytes-here!")), // odd length → exercises pad
chunk("LIST", []byte("INFOISFTLavf60.0")),
chunk("data", pcm(samples...)),
))
out, sr, err := readMonoWAVf32(p)
Expect(err).ToNot(HaveOccurred())
Expect(sr).To(Equal(16000))
expectSamples(out) // not corrupted by the preceding chunks
})
It("honours the data chunk size and drops a trailing metadata chunk", func() {
p := writeWAV(riff(
chunk("fmt ", fmtBody(1, 16, 16000, 0)),
chunk("data", pcm(samples...)),
chunk("LIST", []byte("INFOISFTLavf60.16.100")), // ffmpeg trailer tag
))
out, _, err := readMonoWAVf32(p)
Expect(err).ToNot(HaveOccurred())
expectSamples(out) // trailing LIST bytes not decoded as PCM
})
It("handles the 18-byte extensible fmt chunk", func() {
p := writeWAV(riff(chunk("fmt ", fmtBody(1, 16, 16000, 2)), chunk("data", pcm(samples...))))
out, sr, err := readMonoWAVf32(p)
Expect(err).ToNot(HaveOccurred())
Expect(sr).To(Equal(16000))
expectSamples(out)
})
It("rejects non-mono input", func() {
p := writeWAV(riff(chunk("fmt ", fmtBody(2, 16, 16000, 0)), chunk("data", pcm(samples...))))
_, _, err := readMonoWAVf32(p)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("mono"))
})
It("rejects non-16-bit input", func() {
p := writeWAV(riff(chunk("fmt ", fmtBody(1, 8, 16000, 0)), chunk("data", pcm(samples...))))
_, _, err := readMonoWAVf32(p)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("16-bit"))
})
It("rejects a non-WAV file", func() {
p := writeWAV([]byte("not a riff file at all"))
_, _, err := readMonoWAVf32(p)
Expect(err).To(HaveOccurred())
})
It("errors when the data chunk is missing", func() {
// fmt but no data: the decoder must fail rather than return an
// empty (or garbage) sample slice. The exact message is the
// decoder's, so just assert it errors.
p := writeWAV(riff(chunk("fmt ", fmtBody(1, 16, 16000, 0))))
_, _, err := readMonoWAVf32(p)
Expect(err).To(HaveOccurred())
})
It("round-trips through writeMonoWAVf32", func() {
p := filepath.Join(GinkgoT().TempDir(), "rt.wav")
in := []float32{0.1, -0.2, 0.3, -0.4}
Expect(writeMonoWAVf32(p, in, 16000)).To(Succeed())
out, sr, err := readMonoWAVf32(p)
Expect(err).ToNot(HaveOccurred())
Expect(sr).To(Equal(16000))
Expect(out).To(HaveLen(len(in)))
for i := range in {
Expect(out[i]).To(BeNumerically("~", in[i], 1e-4))
}
})
})
Context("model-gated integration (LOCALVQE_MODEL_PATH)", func() {
It("load + sample rate + hop + fft", func() {
path := modelPathOrSkip()

11
backend/go/parakeet-cpp/.gitignore vendored Normal file
View File

@@ -0,0 +1,11 @@
.cache/
sources/
build/
package/
parakeet-cpp-grpc
# build artifacts staged in-tree by the Makefile (cp from sources/) or
# symlinked for local dev; the real sources live in parakeet.cpp upstream.
*.so
*.so.*
parakeet_capi.h
compile_commands.json

View File

@@ -0,0 +1,93 @@
# parakeet-cpp backend Makefile.
#
# Upstream pin lives below as PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
# (.github/bump_deps.sh) can find and update it - matches the
# whisper.cpp / ds4 / vibevoice-cpp convention.
#
# Local dev shortcut: if you already have an out-of-tree parakeet.cpp
# build, you can symlink the .so + header into this directory and skip
# the clone/cmake steps entirely, e.g.:
#
# ln -sf /path/to/parakeet.cpp/build-shared/libparakeet.so .
# ln -sf /path/to/parakeet.cpp/include/parakeet_capi.h .
# go build -o parakeet-cpp-grpc .
#
# That's what the L0 smoke test uses. The default target below does the
# proper clone-at-pin + cmake build so CI doesn't need a side-checkout.
PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
BUILD_TYPE?=
NATIVE?=false
# Build ggml statically into libparakeet.so (PIC) so the shared lib is
# self-contained: dlopen needs no libggml*.so alongside it, only system libs
# (libstdc++/libgomp/libc) that the runtime image already provides.
CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DPARAKEET_SHARED=ON -DPARAKEET_BUILD_CLI=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
# parakeet.cpp gates its GGML backends behind PARAKEET_GGML_* options and does
# set(GGML_CUDA ${PARAKEET_GGML_CUDA} CACHE BOOL "" FORCE), so a bare -DGGML_CUDA=ON
# is overwritten back to OFF and the build silently falls back to CPU. Forward the
# PARAKEET_GGML_* options instead. (openblas is not gated, so -DGGML_BLAS passes through.)
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DPARAKEET_GGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DPARAKEET_GGML_HIP=ON
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DPARAKEET_GGML_VULKAN=ON
endif
.PHONY: parakeet-cpp-grpc package build clean purge test all
all: parakeet-cpp-grpc
# Clone the upstream parakeet.cpp source at the pinned commit. Directory
# acts as the target so make only re-clones when missing. After a
# PARAKEET_VERSION bump, run 'make purge && make' to refetch.
sources/parakeet.cpp:
mkdir -p sources/parakeet.cpp
cd sources/parakeet.cpp && \
git init -q && \
git remote add origin $(PARAKEET_REPO) && \
git fetch --depth 1 origin $(PARAKEET_VERSION) && \
git checkout FETCH_HEAD && \
git submodule update --init --recursive --depth 1 --single-branch
# Build the shared lib + header out-of-tree, then stage them next to the
# Go sources so purego.Dlopen("libparakeet.so") and the cgo-less build
# both pick them up.
libparakeet.so: sources/parakeet.cpp
cmake -B sources/parakeet.cpp/build-shared -S sources/parakeet.cpp $(CMAKE_ARGS)
cmake --build sources/parakeet.cpp/build-shared --config Release -j$(JOBS)
cp -fv sources/parakeet.cpp/build-shared/libparakeet.so* ./ 2>/dev/null || true
cp -fv sources/parakeet.cpp/include/parakeet_capi.h ./
parakeet-cpp-grpc: libparakeet.so main.go goparakeetcpp.go
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o parakeet-cpp-grpc .
package: parakeet-cpp-grpc
bash package.sh
build: package
# Test target. Smoke test is gated on PARAKEET_BACKEND_TEST_MODEL +
# PARAKEET_BACKEND_TEST_WAV; without them the spec auto-skips.
test:
LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1
clean: purge
rm -rf libparakeet.so* parakeet_capi.h package parakeet-cpp-grpc
purge:
rm -rf sources/parakeet.cpp

View File

@@ -0,0 +1,79 @@
package main
import "time"
// batchRequest is one in-flight unary transcription waiting to be batched.
// In production pcm/decoder are set; tag is an opaque marker used by tests.
type batchRequest struct {
pcm []float32
decoder int32
tag string
reply chan batchReply
}
// batchReply carries one per-item JSON object string (an element of the C-API's
// JSON array) or an error back to the waiting handler goroutine.
type batchReply struct {
json string
err error
}
// batcher coalesces concurrent batchRequests into batched runBatch calls. A
// single run() goroutine is the sole caller of runBatch, so runBatch (which in
// production calls the thread-unsafe C engine) is never entered concurrently.
type batcher struct {
submit chan *batchRequest
maxSize int
maxWait time.Duration
runBatch func(reqs []*batchRequest) // must deliver a reply to every req
}
func newBatcher(maxSize int, maxWait time.Duration, runBatch func([]*batchRequest)) *batcher {
if maxSize < 1 {
maxSize = 1
}
return &batcher{
submit: make(chan *batchRequest),
maxSize: maxSize,
maxWait: maxWait,
runBatch: runBatch,
}
}
// run is the dispatcher loop: accumulate submitted requests until either maxSize
// is reached or maxWait elapses since the first queued request, then dispatch.
// Exits when stop is closed (draining any partially-filled batch first).
func (b *batcher) run(stop <-chan struct{}) {
for {
var first *batchRequest
select {
case first = <-b.submit:
case <-stop:
return
}
batch := []*batchRequest{first}
// maxSize==1 disables batching: dispatch immediately (passthrough).
if b.maxSize == 1 {
b.runBatch(batch)
continue
}
timer := time.NewTimer(b.maxWait)
fill:
for len(batch) < b.maxSize {
select {
case r := <-b.submit:
batch = append(batch, r)
case <-timer.C:
break fill
case <-stop:
timer.Stop()
b.runBatch(batch)
return
}
}
timer.Stop()
b.runBatch(batch)
}
}

View File

@@ -0,0 +1,108 @@
package main
import (
"sync"
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("batcher", func() {
echoReply := func(reqs []*batchRequest) {
for _, r := range reqs {
r.reply <- batchReply{json: r.tag}
}
}
It("coalesces concurrent submits into batches", func() {
var mu sync.Mutex
var sizes []int
run := func(reqs []*batchRequest) {
mu.Lock()
sizes = append(sizes, len(reqs))
mu.Unlock()
echoReply(reqs)
}
b := newBatcher(4, 50*time.Millisecond, run)
stop := make(chan struct{})
go b.run(stop)
defer close(stop)
const N = 4
var wg sync.WaitGroup
got := make([]string, N)
for i := 0; i < N; i++ {
wg.Add(1)
go func(i int) {
defer wg.Done()
rep := make(chan batchReply, 1)
b.submit <- &batchRequest{tag: string(rune('a' + i)), reply: rep}
got[i] = (<-rep).json
}(i)
}
wg.Wait()
mu.Lock()
defer mu.Unlock()
total, maxBatch := 0, 0
for _, s := range sizes {
total += s
if s > maxBatch {
maxBatch = s
}
}
Expect(total).To(Equal(N))
Expect(maxBatch).To(BeNumerically(">=", 2), "expected at least one batch to coalesce >1 request")
})
It("dispatches when max size is reached", func() {
dispatched := make(chan int, 8)
run := func(reqs []*batchRequest) {
dispatched <- len(reqs)
echoReply(reqs)
}
b := newBatcher(2, time.Hour, run) // huge window: only size can trigger
stop := make(chan struct{})
go b.run(stop)
defer close(stop)
for i := 0; i < 2; i++ {
rep := make(chan batchReply, 1)
b.submit <- &batchRequest{tag: "x", reply: rep}
go func(rep chan batchReply) { <-rep }(rep)
}
Eventually(dispatched, "2s").Should(Receive(Equal(2)))
})
It("dispatches when the wait window elapses", func() {
dispatched := make(chan int, 8)
run := func(reqs []*batchRequest) {
dispatched <- len(reqs)
echoReply(reqs)
}
b := newBatcher(8, 20*time.Millisecond, run) // size unreachable; window fires
stop := make(chan struct{})
go b.run(stop)
defer close(stop)
rep := make(chan batchReply, 1)
b.submit <- &batchRequest{tag: "x", reply: rep}
go func() { <-rep }()
Eventually(dispatched, "2s").Should(Receive(Equal(1)))
})
It("bypasses batching when max size is 1", func() {
dispatched := make(chan int, 8)
run := func(reqs []*batchRequest) {
dispatched <- len(reqs)
echoReply(reqs)
}
b := newBatcher(1, time.Hour, run) // size 1 => immediate dispatch
stop := make(chan struct{})
go b.run(stop)
defer close(stop)
rep := make(chan batchReply, 1)
b.submit <- &batchRequest{tag: "x", reply: rep}
go func() { <-rep }()
Eventually(dispatched, "2s").Should(Receive(Equal(1)))
})
})

View File

@@ -0,0 +1,556 @@
package main
import (
"context"
"encoding/json"
"errors"
"fmt"
"os"
"path/filepath"
"strconv"
"strings"
"sync"
"time"
"unsafe"
"github.com/go-audio/wav"
"github.com/mudler/LocalAI/pkg/grpc/base"
"github.com/mudler/LocalAI/pkg/grpc/grpcerrors"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/utils"
"github.com/mudler/xlog"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
)
// purego-bound entry points from libparakeet.so. Names match
// parakeet_capi.h exactly so a `nm libparakeet.so | grep parakeet_capi`
// is enough to spot drift.
//
// Functions that return char* are declared as uintptr so we can call
// parakeet_capi_free_string on the same pointer after copying, the
// C-API contract is "caller owns and must free the returned buffer".
var (
CppAbiVersion func() int32
CppLoad func(ggufPath string) uintptr
CppFree func(ctx uintptr)
CppTranscribePath func(ctx uintptr, wavPath string, decoder int32) uintptr
CppTranscribePathJSON func(ctx uintptr, wavPath string, decoder int32) uintptr
CppFreeString func(s uintptr)
CppLastError func(ctx uintptr) string
// Batched JSON transcription: takes a concatenated float buffer of clips
// plus their per-clip sample counts (sum(nSamples)==len(samplesConcat))
// and returns a malloc'd char* JSON ARRAY of per-clip {"text","words",
// "tokens"} objects (uintptr, freed via CppFreeString). purego passes the
// Go slices as the base pointer of their backing array (kept alive for the
// call), matching the CppStreamFeed pcm []float32 binding pattern; the C
// side reads them as const float*/const int*.
CppTranscribePcmBatchJSON func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32) uintptr
// Cache-aware streaming (RNN-T) entry points. stream_begin returns 0 for
// non-streaming models. feed/finalize return a malloc'd char* (uintptr,
// freed via CppFreeString); feed writes 1 to *eouOut on an <EOU>/<EOB>.
CppStreamBegin func(ctx uintptr) uintptr
CppStreamFeed func(s uintptr, pcm []float32, nSamples int32, eouOut unsafe.Pointer) uintptr
CppStreamFinalize func(s uintptr) uintptr
CppStreamFree func(s uintptr)
)
// streamChunkSamples is how much 16 kHz mono PCM we hand to stream_feed per
// call (1 s). The session buffers internally and decodes once a full
// cache-aware encoder chunk is available, so this only bounds how often we
// poll for newly-finalized text, not the model's actual chunk size.
const streamChunkSamples = 16000
// transcriptJSON mirrors the document returned by
// parakeet_capi_transcribe_path_json (see parakeet_capi.h):
//
// {"text":"...",
// "words":[{"w":"...","start":0.480,"end":0.640,"conf":0.9100}, ...],
// "tokens":[{"id":123,"t":0.480,"conf":0.9100}, ...]}
//
// "start"/"end"/"t" are seconds; "conf" is confidence in (0,1].
type transcriptJSON struct {
Text string `json:"text"`
Words []transcriptWord `json:"words"`
Tokens []transcriptToken `json:"tokens"`
}
type transcriptWord struct {
W string `json:"w"`
Start float64 `json:"start"`
End float64 `json:"end"`
Conf float64 `json:"conf"`
}
type transcriptToken struct {
ID int32 `json:"id"`
T float64 `json:"t"`
Conf float64 `json:"conf"`
}
// ParakeetCpp owns a single loaded parakeet_ctx. The C engine is a
// thread-unsafe singleton (mirrors whisper.cpp / vibevoice.cpp). Rather than
// serialize every call through base.SingleThread, we route unary
// transcription through an in-process batcher (its sole dispatcher goroutine
// is the only caller of the engine on that path) and guard the shared engine
// with engineMu so a streaming session and a batched-unary dispatch never
// touch it concurrently.
type ParakeetCpp struct {
base.Base
ctxPtr uintptr
engineMu sync.Mutex // sole guard of the one C engine (dispatcher + streaming)
bat *batcher
batStop chan struct{}
}
// Load is the LocalAI gRPC entry point for LoadModel: it calls
// parakeet_capi_load with the GGUF path and stashes the resulting
// opaque context pointer for AudioTranscription.
func (p *ParakeetCpp) Load(opts *pb.ModelOptions) error {
if opts.ModelFile == "" {
return errors.New("parakeet-cpp: ModelFile is required")
}
ctx := CppLoad(opts.ModelFile)
if ctx == 0 {
// No ctx to ask for last_error (the C-API's last-error buffer
// lives on the ctx that was never returned). Surface the path
// so the operator at least knows which load failed.
return fmt.Errorf("parakeet-cpp: parakeet_capi_load failed for %q", opts.ModelFile)
}
p.ctxPtr = ctx
// Dynamic batching knobs (model YAML options:, key:value form). Batching is
// OFF by default (batch_max_size:1): each request runs on its own. On GPU,
// raising batch_max_size coalesces concurrent requests into one batched
// engine call and improves throughput under load; leave it at 1 on CPU and
// for low-concurrency setups, where batching only adds latency.
maxSize := optInt(opts, "batch_max_size", 1)
maxWaitMs := optInt(opts, "batch_max_wait_ms", 15)
if maxWaitMs < 0 {
maxWaitMs = 0
}
if CppTranscribePcmBatchJSON != nil {
p.batStop = make(chan struct{})
p.bat = newBatcher(maxSize, time.Duration(maxWaitMs)*time.Millisecond, p.runBatch)
go p.bat.run(p.batStop) // dispatcher runs until Free closes batStop
if maxSize > 1 {
xlog.Info("parakeet-cpp: dynamic batching enabled",
"batch_max_size", maxSize, "batch_max_wait_ms", maxWaitMs)
} else {
xlog.Info("parakeet-cpp: dynamic batching off (batch_max_size=1); " +
"set batch_max_size>1 to coalesce concurrent requests on GPU")
}
} else {
xlog.Info("parakeet-cpp: batched C-API not present in libparakeet.so; " +
"batching disabled, using per-request transcription")
}
return nil
}
// optInt reads an integer model option (key:value form) from ModelOptions,
// returning def when absent or unparseable. The options array carries the
// model YAML's options: entries (see core/config; siblings such as
// acestep-cpp parse the same key:value form via strings.Cut on ":").
func optInt(opts *pb.ModelOptions, key string, def int) int {
for _, o := range opts.GetOptions() {
k, v, ok := strings.Cut(o, ":")
if ok && strings.TrimSpace(k) == key {
if n, err := strconv.Atoi(strings.TrimSpace(v)); err == nil {
return n
}
}
}
return def
}
// runBatch is the dispatcher's batch handler and the ONLY caller of the C
// engine on the unary path. It concatenates the batch PCM, calls the batched
// JSON C-API under engineMu, splits the JSON array, and replies to each request.
func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
// Observability: the actual coalesced batch size per engine call. Debug-level
// so it stays silent in normal operation but lets operators confirm/tune batching.
xlog.Debug("parakeet-cpp: dispatching batch", "size", len(reqs))
nSamples := make([]int32, len(reqs))
total := 0
for i, r := range reqs {
nSamples[i] = int32(len(r.pcm))
total += len(r.pcm)
}
concat := make([]float32, 0, total)
for _, r := range reqs {
concat = append(concat, r.pcm...)
}
var dec int32
if len(reqs) > 0 {
dec = reqs[0].decoder
}
p.engineMu.Lock()
cstr := CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
p.engineMu.Unlock()
if cstr == 0 {
err := fmt.Errorf("parakeet-cpp: batch transcribe failed: %s", CppLastError(p.ctxPtr))
for _, r := range reqs {
r.reply <- batchReply{err: err}
}
return
}
raw := goStringFromCPtr(cstr)
CppFreeString(cstr)
var docs []json.RawMessage
if err := json.Unmarshal([]byte(raw), &docs); err != nil || len(docs) != len(reqs) {
e := fmt.Errorf("parakeet-cpp: batch json: got %d results for %d reqs (%v)", len(docs), len(reqs), err)
for _, r := range reqs {
r.reply <- batchReply{err: e}
}
return
}
for i, r := range reqs {
r.reply <- batchReply{json: string(docs[i])}
}
}
// AudioTranscription decodes the wav at opts.Dst to 16 kHz mono PCM and
// submits it to the in-process batcher, which coalesces concurrent requests
// into a single batched engine call (parakeet_capi_transcribe_pcm_batch_json)
// with the default decoder (decoder=0, which selects the right head per
// architecture: transducer for tdt/rnnt/hybrid, CTC for ctc) and shapes the
// per-word timestamps into a LocalAI TranscriptResult.
//
// Parakeet emits word- and token-level timestamps but no native segment
// boundaries, so we synthesise a single whole-clip segment spanning the first
// word start to the last word end. Word-level timings are attached only when
// the caller opts in via timestamp_granularities=["word"] (matching the
// OpenAI API, whose default is segment-level); token ids always populate
// Segment.Tokens.
//
// translate/diarize/prompt/temperature/language/threads are not applicable to
// parakeet and are ignored; streaming is handled by AudioTranscriptionStream
// (L2).
func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
if p.ctxPtr == 0 {
return pb.TranscriptResult{}, grpcerrors.ModelNotLoaded("parakeet-cpp")
}
if opts.Dst == "" {
return pb.TranscriptResult{}, errors.New("parakeet-cpp: TranscriptRequest.dst (audio path) is required")
}
// Fallback when the batched C-API is unavailable: transcribe from a file
// path (original behavior, no batching). The C library's audio loader only
// understands 16 kHz mono WAV/PCM, so convert the input first - otherwise
// any non-WAV upload (MP3, etc.) fails with "failed to load audio". This
// mirrors what every other audio backend (whisper, crispasr) does via
// utils.AudioToWav before handing the file to the engine.
if p.bat == nil {
converted, cleanup, err := convertToWavMono16k(opts.Dst)
if err != nil {
return pb.TranscriptResult{}, err
}
defer cleanup()
cstr := CppTranscribePathJSON(p.ctxPtr, converted, 0)
if cstr == 0 {
return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: transcribe_path_json failed: %s", CppLastError(p.ctxPtr))
}
raw := goStringFromCPtr(cstr)
CppFreeString(cstr)
var doc transcriptJSON
if err := json.Unmarshal([]byte(raw), &doc); err != nil {
return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
}
return transcriptResultFromDoc(doc, opts), nil
}
// Batched path: decode to PCM, submit to the batcher, wait for this request's
// JSON element. The dispatcher is the sole engine caller on this path; both
// sends honour ctx cancellation.
pcm, _, err := decodeWavMono16k(opts.Dst)
if err != nil {
return pb.TranscriptResult{}, err
}
rep := make(chan batchReply, 1)
select {
case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, reply: rep}:
case <-ctx.Done():
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
var res batchReply
select {
case res = <-rep:
case <-ctx.Done():
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
if res.err != nil {
return pb.TranscriptResult{}, res.err
}
var doc transcriptJSON
if err := json.Unmarshal([]byte(res.json), &doc); err != nil {
return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
}
return transcriptResultFromDoc(doc, opts), nil
}
// transcriptResultFromDoc maps a decoded transcriptJSON to a TranscriptResult,
// synthesising a single whole-clip segment and attaching word timings only when
// the caller requested word granularity. Shared by the batched and direct paths.
func transcriptResultFromDoc(doc transcriptJSON, opts *pb.TranscriptRequest) pb.TranscriptResult {
text := strings.TrimSpace(doc.Text)
words := make([]*pb.TranscriptWord, 0, len(doc.Words))
for _, w := range doc.Words {
words = append(words, &pb.TranscriptWord{Start: secondsToNanos(w.Start), End: secondsToNanos(w.End), Text: w.W})
}
tokens := make([]int32, 0, len(doc.Tokens))
for _, t := range doc.Tokens {
tokens = append(tokens, t.ID)
}
var segStart, segEnd int64
if len(words) > 0 {
segStart = words[0].Start
segEnd = words[len(words)-1].End
}
seg := &pb.TranscriptSegment{Id: 0, Start: segStart, End: segEnd, Text: text, Tokens: tokens}
if wordsRequested(opts.TimestampGranularities) {
seg.Words = words
}
return pb.TranscriptResult{Text: text, Segments: []*pb.TranscriptSegment{seg}}
}
// wordsRequested reports whether the caller asked for word-level timestamps.
// The OpenAI transcription API gates word timings behind
// timestamp_granularities[] containing "word" and defaults to segment-level
// otherwise; we follow that contract.
func wordsRequested(granularities []string) bool {
for _, g := range granularities {
if strings.EqualFold(strings.TrimSpace(g), "word") {
return true
}
}
return false
}
// secondsToNanos converts the C-API's fractional-second timestamps into the
// int64 nanoseconds LocalAI carries on TranscriptSegment/TranscriptWord, the
// same nanosecond convention the whisper backend uses.
func secondsToNanos(sec float64) int64 {
return int64(sec * 1e9)
}
// AudioTranscriptionStream drives the cache-aware streaming RNN-T over the
// audio at opts.Dst: it decodes the file to 16 kHz mono PCM, feeds it in
// chunks to parakeet_capi_stream_feed, and emits each newly-finalized text
// run as a TranscriptStreamResponse delta. <EOU>/<EOB> events close the
// current segment; a closing FinalResult carries the full transcript and the
// per-utterance segments.
//
// stream_begin returns 0 for models that are not cache-aware streaming models
// (only e.g. nvidia/parakeet_realtime_eou_120m-v1 qualifies). For those we fall
// back to a single offline transcription emitted as one delta plus a closing
// FinalResult, matching LocalAI's non-streaming streaming contract (and the
// whisper backend), so the streaming endpoint works for every model.
func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
defer close(results)
if p.ctxPtr == 0 {
return grpcerrors.ModelNotLoaded("parakeet-cpp")
}
if opts.Dst == "" {
return errors.New("parakeet-cpp: TranscriptRequest.dst (audio path) is required")
}
if err := ctx.Err(); err != nil {
return status.Error(codes.Canceled, "transcription cancelled")
}
stream := CppStreamBegin(p.ctxPtr)
if stream == 0 {
// Not a cache-aware streaming model: run a normal offline
// transcription and emit it as one delta + a closing final result.
res, err := p.AudioTranscription(ctx, opts)
if err != nil {
return err
}
if t := strings.TrimSpace(res.Text); t != "" {
results <- &pb.TranscriptStreamResponse{Delta: t}
}
results <- &pb.TranscriptStreamResponse{FinalResult: &res}
return nil
}
defer CppStreamFree(stream)
// The C engine is a single shared context: a streaming session and a batched
// unary dispatch must never touch it at once, so hold engineMu for the whole
// stream. This lock is intentionally taken AFTER the non-streaming fallback
// above returns: that fallback goes through AudioTranscription -> the batcher
// -> runBatch, which itself acquires engineMu, so locking here first would
// deadlock. Do not hoist this lock above the fallback.
p.engineMu.Lock()
defer p.engineMu.Unlock()
data, duration, err := decodeWavMono16k(opts.Dst)
if err != nil {
return err
}
var (
full strings.Builder
segText strings.Builder
segments []*pb.TranscriptSegment
segID int32
)
flushSegment := func() {
t := strings.TrimSpace(segText.String())
segText.Reset()
if t == "" {
return
}
segments = append(segments, &pb.TranscriptSegment{Id: segID, Text: t})
segID++
}
// emitDelta consumes the malloc'd char* returned by feed/finalize: frees
// it, accumulates the text, and sends a delta when non-empty. A 0 return
// is an error (vs the "" empty-but-non-NULL no-new-text case).
emitDelta := func(ret uintptr) error {
if ret == 0 {
msg := CppLastError(p.ctxPtr)
if msg == "" {
msg = "unknown error"
}
return fmt.Errorf("parakeet-cpp: stream feed/finalize failed: %s", msg)
}
delta := goStringFromCPtr(ret)
CppFreeString(ret)
if delta == "" {
return nil
}
full.WriteString(delta)
segText.WriteString(delta)
results <- &pb.TranscriptStreamResponse{Delta: delta}
return nil
}
for off := 0; off < len(data); off += streamChunkSamples {
if err := ctx.Err(); err != nil {
return status.Error(codes.Canceled, "transcription cancelled")
}
end := min(off+streamChunkSamples, len(data))
chunk := data[off:end]
var eou int32
ret := CppStreamFeed(stream, chunk, int32(len(chunk)), unsafe.Pointer(&eou))
if err := emitDelta(ret); err != nil {
return err
}
if eou != 0 {
flushSegment()
}
}
// Flush the streaming tail (final encoder chunk).
if err := emitDelta(CppStreamFinalize(stream)); err != nil {
return err
}
flushSegment()
text := strings.TrimSpace(full.String())
if len(segments) == 0 && text != "" {
segments = append(segments, &pb.TranscriptSegment{Id: 0, Text: text})
}
results <- &pb.TranscriptStreamResponse{
FinalResult: &pb.TranscriptResult{
Text: text,
Segments: segments,
Duration: duration,
},
}
return nil
}
// decodeWavMono16k converts any input audio to 16 kHz mono PCM and returns the
// float samples plus the clip duration in seconds. Mirrors the whisper
// backend: utils.AudioToWav (ffmpeg) normalises rate/channels, go-audio
// decodes the PCM.
// convertToWavMono16k converts an arbitrary audio file to a 16 kHz mono WAV in
// a fresh temp dir and returns the path together with a cleanup func the caller
// must defer. WAV inputs already at 16 kHz/mono/16-bit are passed through by
// utils.AudioToWav (hardlink/copy), everything else is transcoded via ffmpeg.
// Used by the direct (non-batched) transcription path, which hands a file path
// to the C library's WAV-only audio loader.
func convertToWavMono16k(path string) (string, func(), error) {
dir, err := os.MkdirTemp("", "parakeet")
if err != nil {
return "", func() {}, err
}
cleanup := func() { _ = os.RemoveAll(dir) }
converted := filepath.Join(dir, "converted.wav")
if err := utils.AudioToWav(path, converted); err != nil {
cleanup()
return "", func() {}, err
}
return converted, cleanup, nil
}
func decodeWavMono16k(path string) ([]float32, float32, error) {
converted, cleanup, err := convertToWavMono16k(path)
if err != nil {
return nil, 0, err
}
defer cleanup()
fh, err := os.Open(converted)
if err != nil {
return nil, 0, err
}
defer func() { _ = fh.Close() }()
buf, err := wav.NewDecoder(fh).FullPCMBuffer()
if err != nil {
return nil, 0, err
}
data := buf.AsFloat32Buffer().Data
var duration float32
if buf.Format != nil && buf.Format.SampleRate > 0 {
duration = float32(len(data)) / float32(buf.Format.SampleRate)
}
return data, duration, nil
}
// Free releases the underlying parakeet_ctx. Called by LocalAI when the
// model is unloaded.
func (p *ParakeetCpp) Free() error {
// Stop the dispatcher before releasing the engine so no in-flight runBatch
// can touch a freed ctx (close leak / use-after-free on reload).
if p.batStop != nil {
close(p.batStop)
p.batStop = nil
}
if p.ctxPtr != 0 {
CppFree(p.ctxPtr)
p.ctxPtr = 0
}
return nil
}
// goStringFromCPtr copies a NUL-terminated C string into Go memory.
// cptr is the raw pointer returned by purego from the C-API (a malloc'd
// buffer the caller owns); callers must free it via CppFreeString after
// the copy lands.
//
// The uintptr->unsafe.Pointer conversion below trips go vet's unsafeptr
// check, which can't distinguish a C-owned heap pointer from Go-managed
// memory. It is safe here: the pointer addresses a malloc'd C buffer the
// Go GC neither tracks nor moves, and we dereference it immediately to
// copy the bytes out, the same pattern (and the same tolerated warning)
// as the whisper backend's unsafe.Slice over segsPtr.
func goStringFromCPtr(cptr uintptr) string {
if cptr == 0 {
return ""
}
p := unsafe.Pointer(cptr) //nolint:govet // C-owned malloc'd buffer, not Go-GC memory (see doc above)
n := 0
for *(*byte)(unsafe.Add(p, n)) != 0 {
n++
}
return string(unsafe.Slice((*byte)(p), n))
}

View File

@@ -0,0 +1,221 @@
package main
import (
"context"
"os"
"path/filepath"
"strings"
"sync"
"testing"
"github.com/ebitengine/purego"
"github.com/go-audio/audio"
"github.com/go-audio/wav"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestParakeetCpp(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "parakeet-cpp Backend Suite")
}
var (
libLoadOnce sync.Once
libLoadErr error
)
// ensureLibLoaded mirrors main.go's bootstrap so a Go test can drive
// the C-API bridge without spinning up the gRPC server. Skips the
// current spec when libparakeet.so isn't loadable from cwd
// ($LD_LIBRARY_PATH or a symlink in ./).
func ensureLibLoaded() {
libLoadOnce.Do(func() {
libName := os.Getenv("PARAKEET_LIBRARY")
if libName == "" {
libName = "libparakeet.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
libLoadErr = err
return
}
purego.RegisterLibFunc(&CppAbiVersion, lib, "parakeet_capi_abi_version")
purego.RegisterLibFunc(&CppLoad, lib, "parakeet_capi_load")
purego.RegisterLibFunc(&CppFree, lib, "parakeet_capi_free")
purego.RegisterLibFunc(&CppTranscribePath, lib, "parakeet_capi_transcribe_path")
purego.RegisterLibFunc(&CppTranscribePathJSON, lib, "parakeet_capi_transcribe_path_json")
if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppTranscribePcmBatchJSON, lib, "parakeet_capi_transcribe_pcm_batch_json")
}
purego.RegisterLibFunc(&CppStreamBegin, lib, "parakeet_capi_stream_begin")
purego.RegisterLibFunc(&CppStreamFeed, lib, "parakeet_capi_stream_feed")
purego.RegisterLibFunc(&CppStreamFinalize, lib, "parakeet_capi_stream_finalize")
purego.RegisterLibFunc(&CppStreamFree, lib, "parakeet_capi_stream_free")
purego.RegisterLibFunc(&CppFreeString, lib, "parakeet_capi_free_string")
purego.RegisterLibFunc(&CppLastError, lib, "parakeet_capi_last_error")
})
if libLoadErr != nil {
Skip("libparakeet.so not loadable: " + libLoadErr.Error())
}
}
// fixturesOrSkip returns the model + audio paths or skips the spec if
// either env var is unset. The smoke test never runs in default CI; it
// needs a real parakeet GGUF and a 16 kHz mono WAV on disk.
func fixturesOrSkip() (string, string) {
modelPath := os.Getenv("PARAKEET_BACKEND_TEST_MODEL")
audioPath := os.Getenv("PARAKEET_BACKEND_TEST_WAV")
if modelPath == "" || audioPath == "" {
Skip("set PARAKEET_BACKEND_TEST_MODEL and PARAKEET_BACKEND_TEST_WAV to run this spec")
}
return modelPath, audioPath
}
// writeMono16kWav writes `samples` frames of 16 kHz mono 16-bit silence to
// path. The result is already in AudioToWav's target format, so the conversion
// helper copies it through without invoking ffmpeg.
func writeMono16kWav(path string, samples int) {
GinkgoHelper()
f, err := os.Create(path)
Expect(err).ToNot(HaveOccurred())
enc := wav.NewEncoder(f, 16000, 16, 1, 1)
buf := &audio.IntBuffer{
Format: &audio.Format{NumChannels: 1, SampleRate: 16000},
SourceBitDepth: 16,
Data: make([]int, samples),
}
Expect(enc.Write(buf)).To(Succeed())
Expect(enc.Close()).To(Succeed())
Expect(f.Close()).To(Succeed())
}
var _ = Describe("ParakeetCpp", func() {
Context("AudioTranscription", func() {
It("transcribes a WAV via the parakeet C-API", func() {
modelPath, audioPath := fixturesOrSkip()
ensureLibLoaded()
p := &ParakeetCpp{}
Expect(p.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
defer func() { _ = p.Free() }()
res, err := p.AudioTranscription(context.Background(), &pb.TranscriptRequest{
Dst: audioPath,
})
Expect(err).ToNot(HaveOccurred())
Expect(strings.TrimSpace(res.Text)).ToNot(BeEmpty(),
"expected non-empty transcript for %s", audioPath)
Expect(res.Segments).To(HaveLen(1),
"synthesises a single whole-clip segment")
Expect(res.Segments[0].Text).To(Equal(res.Text),
"single segment text must equal the top-level text")
// Default (no granularities) is segment-level: no per-word timings.
Expect(res.Segments[0].Words).To(BeEmpty(),
"word timings are opt-in via timestamp_granularities")
})
It("emits word-level timestamps when granularity=word", func() {
modelPath, audioPath := fixturesOrSkip()
ensureLibLoaded()
p := &ParakeetCpp{}
Expect(p.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
defer func() { _ = p.Free() }()
res, err := p.AudioTranscription(context.Background(), &pb.TranscriptRequest{
Dst: audioPath,
TimestampGranularities: []string{"word"},
})
Expect(err).ToNot(HaveOccurred())
Expect(res.Segments).To(HaveLen(1))
seg := res.Segments[0]
Expect(seg.Words).ToNot(BeEmpty(),
"expected per-word timestamps with granularity=word")
// Monotonic, non-negative timings spanning the segment.
Expect(seg.Words[0].Start).To(BeNumerically(">=", int64(0)))
Expect(seg.End).To(BeNumerically(">=", seg.Start))
Expect(seg.Words[len(seg.Words)-1].End).To(Equal(seg.End),
"segment end tracks the last word")
})
})
Context("convertToWavMono16k", func() {
// The non-batched transcription path hands a file path to the C
// library's WAV-only audio loader, so it must convert first.
// utils.AudioToWav passes an already-16kHz/mono/16-bit WAV through
// without ffmpeg, which lets us exercise the helper (and the
// regression: the direct path used to skip conversion entirely)
// without a model, the C library, or ffmpeg.
It("returns a decodable 16kHz mono WAV copy and cleans it up", func() {
dir := GinkgoT().TempDir()
src := filepath.Join(dir, "input.wav")
writeMono16kWav(src, 16000) // 1s of silence at 16 kHz
converted, cleanup, err := convertToWavMono16k(src)
Expect(err).ToNot(HaveOccurred())
// It must produce a fresh temp file, not return the original path.
Expect(converted).ToNot(Equal(src))
Expect(converted).To(BeAnExistingFile())
pcm, _, err := decodeWavMono16k(converted)
Expect(err).ToNot(HaveOccurred())
Expect(pcm).To(HaveLen(16000), "round-trips the sample count")
cleanup()
Expect(converted).ToNot(BeAnExistingFile(), "cleanup removes the temp dir")
})
It("errors on a non-existent input rather than passing the path through", func() {
_, _, err := convertToWavMono16k(filepath.Join(GinkgoT().TempDir(), "missing.mp3"))
Expect(err).To(HaveOccurred())
})
})
Context("AudioTranscriptionStream", func() {
It("streams deltas and a closing FinalResult from a cache-aware model", func() {
// Streaming needs a cache-aware streaming model (e.g.
// realtime_eou); the offline test model would fail stream_begin.
modelPath := os.Getenv("PARAKEET_BACKEND_TEST_STREAM_MODEL")
audioPath := os.Getenv("PARAKEET_BACKEND_TEST_WAV")
if modelPath == "" || audioPath == "" {
Skip("set PARAKEET_BACKEND_TEST_STREAM_MODEL (cache-aware streaming model) and PARAKEET_BACKEND_TEST_WAV")
}
ensureLibLoaded()
p := &ParakeetCpp{}
Expect(p.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
defer func() { _ = p.Free() }()
results := make(chan *pb.TranscriptStreamResponse, 64)
errCh := make(chan error, 1)
go func() {
errCh <- p.AudioTranscriptionStream(context.Background(),
&pb.TranscriptRequest{Dst: audioPath}, results)
}()
var deltas []string
var final *pb.TranscriptResult
for r := range results {
if r.Delta != "" {
deltas = append(deltas, r.Delta)
}
if r.FinalResult != nil {
final = r.FinalResult
}
}
Expect(<-errCh).ToNot(HaveOccurred())
Expect(final).ToNot(BeNil(), "expected a closing FinalResult")
Expect(strings.TrimSpace(final.Text)).ToNot(BeEmpty(),
"expected a non-empty streamed transcript")
Expect(final.Segments).ToNot(BeEmpty(),
"FinalResult always carries at least one segment")
// The concatenated deltas reconstruct the final transcript.
Expect(strings.TrimSpace(strings.Join(deltas, ""))).To(Equal(strings.TrimSpace(final.Text)),
"deltas should reconstruct the final text")
})
})
})

View File

@@ -0,0 +1,75 @@
package main
// Started internally by LocalAI - one gRPC server per loaded model.
//
// Loads libparakeet.so via purego and registers the flat C-API entry
// points declared in parakeet_capi.h. The library name can be overridden
// with PARAKEET_LIBRARY (mirrors the WHISPER_LIBRARY / VIBEVOICECPP_LIBRARY
// convention in the sibling backends); the default looks for the .so next
// to this binary.
import (
"flag"
"fmt"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
libName := os.Getenv("PARAKEET_LIBRARY")
if libName == "" {
libName = "libparakeet.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(fmt.Errorf("parakeet-cpp: dlopen %q: %w", libName, err))
}
// Bound 1:1 to parakeet_capi.h. The C-API returns malloc'd char*
// buffers from transcribe_*; we register those as uintptr so we get
// the raw pointer back and can call parakeet_capi_free_string on it
// (purego's string return would copy and forget the original pointer,
// leaking it on every call).
libFuncs := []LibFuncs{
{&CppAbiVersion, "parakeet_capi_abi_version"},
{&CppLoad, "parakeet_capi_load"},
{&CppFree, "parakeet_capi_free"},
{&CppTranscribePath, "parakeet_capi_transcribe_path"},
{&CppTranscribePathJSON, "parakeet_capi_transcribe_path_json"},
{&CppStreamBegin, "parakeet_capi_stream_begin"},
{&CppStreamFeed, "parakeet_capi_stream_feed"},
{&CppStreamFinalize, "parakeet_capi_stream_finalize"},
{&CppStreamFree, "parakeet_capi_stream_free"},
{&CppFreeString, "parakeet_capi_free_string"},
{&CppLastError, "parakeet_capi_last_error"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
// The batched-JSON entry point exists only in newer libparakeet.so (ABI >= 2).
// Probe with Dlsym and register only if present, so the backend still loads
// against an older library (it falls back to per-request transcription).
if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppTranscribePcmBatchJSON, lib, "parakeet_capi_transcribe_pcm_batch_json")
}
fmt.Fprintf(os.Stderr, "[parakeet-cpp] ABI=%d\n", CppAbiVersion())
flag.Parse()
if err := grpc.StartServer(*addr, &ParakeetCpp{}); err != nil {
panic(err)
}
}

View File

@@ -0,0 +1,23 @@
#!/bin/bash
#
# L0 packaging stub: copy the binary, run.sh and libparakeet.so* into
# package/. The full ldd walk (libc, libstdc++, libgomp, GPU runtimes,
# arch detection) lands in L3, mirroring backend/go/whisper/package.sh.
set -e
CURDIR=$(dirname "$(realpath "$0")")
mkdir -p "$CURDIR/package/lib"
cp -avf "$CURDIR/parakeet-cpp-grpc" "$CURDIR/package/"
cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
# libparakeet.so + any soname symlinks (libparakeet.so.X, libparakeet.so.X.Y).
cp -avf "$CURDIR"/libparakeet.so* "$CURDIR/package/lib/" 2>/dev/null || {
echo "ERROR: libparakeet.so not found in $CURDIR, run 'make' first" >&2
exit 1
}
echo "L0 package layout (full ldd walk lands in L3):"
ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"

16
backend/go/parakeet-cpp/run.sh Executable file
View File

@@ -0,0 +1,16 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath "$0")")
export LD_LIBRARY_PATH="$CURDIR/lib:$CURDIR:${LD_LIBRARY_PATH:-}"
# If a self-contained ld.so was packaged, route through it so the
# packaged libc / libstdc++ are used instead of the host's (matches the
# whisper backend's runtime layout).
if [ -f "$CURDIR/lib/ld.so" ]; then
echo "Using lib/ld.so"
exec "$CURDIR/lib/ld.so" "$CURDIR/parakeet-cpp-grpc" "$@"
fi
exec "$CURDIR/parakeet-cpp-grpc" "$@"

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# qwen3-tts.cpp version
QWEN3TTS_REPO?=https://github.com/predict-woo/qwen3-tts.cpp
QWEN3TTS_CPP_VERSION?=7a762e2ad4bacc6fdda81d81bf10a09ffb546f29
QWEN3TTS_CPP_VERSION?=136e5d36c17083da0321fd96512dc7b263f94a44
SO_TARGET?=libgoqwen3ttscpp.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -4,6 +4,7 @@ import (
"fmt"
"os"
"path/filepath"
"strings"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
@@ -21,6 +22,43 @@ type Qwen3TtsCpp struct {
threads int
}
// languageNameAliases maps common full language names to the canonical
// two-letter code understood by the C++ language_to_id table.
var languageNameAliases = map[string]string{
"english": "en",
"russian": "ru",
"chinese": "zh",
"japanese": "ja",
"korean": "ko",
"german": "de",
"french": "fr",
"spanish": "es",
"italian": "it",
"portuguese": "pt",
}
// normalizeLanguage coerces a caller-supplied language into the canonical code
// the model expects. It lowercases, trims, strips any region/locale suffix
// (en-US, en_US, ja.JP -> en/ja), and resolves common full names (english -> en).
// An empty input stays empty so the C++ side applies its English default; an
// unrecognized value is returned normalized so C++ can log it and default.
func normalizeLanguage(lang string) string {
lang = strings.ToLower(strings.TrimSpace(lang))
if lang == "" {
return ""
}
// Strip region/locale suffix: keep the segment before the first separator.
if i := strings.IndexAny(lang, "-_."); i >= 0 {
lang = lang[:i]
}
if code, ok := languageNameAliases[lang]; ok {
return code
}
return lang
}
func (q *Qwen3TtsCpp) Load(opts *pb.ModelOptions) error {
// ModelFile is the model directory path (containing GGUF files)
modelDir := opts.ModelFile
@@ -54,7 +92,7 @@ func (q *Qwen3TtsCpp) TTS(req *pb.TTSRequest) error {
dst := req.Dst
language := ""
if req.Language != nil {
language = *req.Language
language = normalizeLanguage(*req.Language)
}
// Synthesis parameters with sensible defaults

View File

@@ -0,0 +1,53 @@
package main
import (
"testing"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestLanguageNormalization(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "qwen3-tts-cpp language normalization")
}
var _ = Describe("normalizeLanguage", func() {
DescribeTable("maps caller input to the canonical model language code",
func(input, expected string) {
Expect(normalizeLanguage(input)).To(Equal(expected))
},
// Canonical codes pass through unchanged
Entry("canonical en", "en", "en"),
Entry("canonical zh", "zh", "zh"),
Entry("canonical pt", "pt", "pt"),
// Case-insensitive
Entry("uppercase", "EN", "en"),
Entry("mixed case", "Ja", "ja"),
// Surrounding whitespace
Entry("trims whitespace", " en ", "en"),
// Region/locale stripping
Entry("BCP-47 region", "en-US", "en"),
Entry("underscore region", "en_US", "en"),
Entry("dotted locale", "ja.JP", "ja"),
Entry("region + case", "ZH-CN", "zh"),
// Full-name aliases
Entry("english name", "english", "en"),
Entry("chinese name cased", "Chinese", "zh"),
Entry("japanese name", "japanese", "ja"),
Entry("russian name", "russian", "ru"),
Entry("portuguese name", "portuguese", "pt"),
// Empty stays empty (C++ applies the English default)
Entry("empty", "", ""),
Entry("whitespace only", " ", ""),
// Unknown values pass through normalized so C++ can log + default
Entry("unknown code", "klingon", "klingon"),
Entry("unknown with region", "xx-YY", "xx"),
)
})

7
backend/go/rfdetr-cpp/.gitignore vendored Normal file
View File

@@ -0,0 +1,7 @@
sources/
build*/
package/
librfdetrcpp*.so
rfdetr-cpp
test-models/
test-data/

View File

@@ -0,0 +1,79 @@
cmake_minimum_required(VERSION 3.18)
project(librfdetrcpp LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Static-link ggml + rfdetr so the resulting .so has no runtime dependency on
# extra ggml/rfdetr shared libraries — only on libc/libstdc++/libgomp, which
# the LocalAI package step bundles into the docker image.
set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build static libraries" FORCE)
# rfdetr.cpp build switches: skip CLI/tests, keep static lib.
set(RFDETR_BUILD_CLI OFF CACHE BOOL "Disable rfdetr CLI" FORCE)
set(RFDETR_BUILD_TESTS OFF CACHE BOOL "Disable rfdetr tests" FORCE)
set(RFDETR_SHARED OFF CACHE BOOL "Build rfdetr as static lib" FORCE)
# rt-detr.cpp's top-level CMakeLists invokes
# `bash ${CMAKE_SOURCE_DIR}/scripts/apply_ggml_patches.sh` to apply its
# in-tree ggml patches before descending into the submodule. When we
# `add_subdirectory` it from a parent project, `CMAKE_SOURCE_DIR` points
# at *our* directory, not theirs, so the script path resolves wrong.
#
# Run the patches script ourselves up front (it's idempotent — re-running
# is a no-op once patches are applied) so the rt-detr.cpp configure step
# is essentially a no-op for the patch hook.
set(RFDETR_CPP_SRC ${CMAKE_CURRENT_SOURCE_DIR}/sources/rt-detr.cpp)
if(EXISTS ${RFDETR_CPP_SRC}/scripts/apply_ggml_patches.sh)
execute_process(
COMMAND bash ${RFDETR_CPP_SRC}/scripts/apply_ggml_patches.sh
RESULT_VARIABLE _rfdetr_patch_result
OUTPUT_VARIABLE _rfdetr_patch_output
ERROR_VARIABLE _rfdetr_patch_error
OUTPUT_STRIP_TRAILING_WHITESPACE
ERROR_STRIP_TRAILING_WHITESPACE)
if(NOT _rfdetr_patch_result EQUAL 0)
message(FATAL_ERROR
"Failed to apply ggml patches (exit ${_rfdetr_patch_result}):\n"
"stdout:\n${_rfdetr_patch_output}\n"
"stderr:\n${_rfdetr_patch_error}")
endif()
message(STATUS "${_rfdetr_patch_output}")
endif()
# Stage a shim 'scripts/apply_ggml_patches.sh' under our source dir so that
# rt-detr.cpp's CMakeLists — which calls
# bash ${CMAKE_SOURCE_DIR}/scripts/apply_ggml_patches.sh
# — finds an idempotent no-op there. The real patches have already been
# applied above; this just satisfies the path lookup.
file(MAKE_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/scripts)
file(WRITE ${CMAKE_CURRENT_SOURCE_DIR}/scripts/apply_ggml_patches.sh
"#!/usr/bin/env bash
# Shim - patches were already applied by the parent CMakeLists.
exit 0
")
execute_process(COMMAND chmod +x ${CMAKE_CURRENT_SOURCE_DIR}/scripts/apply_ggml_patches.sh)
add_subdirectory(./sources/rt-detr.cpp)
# rfdetr.cpp's C-API symbols already live inside librfdetr (src/rfdetr_capi.cpp
# is compiled into the lib). We re-export them via a MODULE library that
# whole-archive-links rfdetr so the symbols are visible at dlopen time.
add_library(rfdetrcpp MODULE
sources/rt-detr.cpp/src/rfdetr_capi.cpp)
target_include_directories(rfdetrcpp PRIVATE
sources/rt-detr.cpp/include
sources/rt-detr.cpp/src
sources/rt-detr.cpp/third_party/stb
)
target_link_libraries(rfdetrcpp PRIVATE rfdetr ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
target_link_libraries(rfdetrcpp PRIVATE stdc++fs)
endif()
set_property(TARGET rfdetrcpp PROPERTY CXX_STANDARD 17)
set_target_properties(rfdetrcpp PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})

View File

@@ -0,0 +1,135 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# rt-detr.cpp (GitHub redirects the historical mudler/rt-detr.cpp to the new
# mudler/rf-detr.cpp slug). Pin to a specific commit if you need a stable
# build; leaving this on `master` always picks up the latest C-API surface
# (incl. the per-detection accessor functions used by gorfdetrcpp.go).
RFDETR_REPO?=https://github.com/mudler/rf-detr.cpp.git
RFDETR_VERSION?=65c0ffcc9a9bc9dae38252f63d0417c9845a6cf7
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
# Forward LocalAI's BUILD_TYPE to the matching ggml backend switch.
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON -DRFDETR_GGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON
else ifeq ($(BUILD_TYPE),hipblas)
ROCM_HOME ?= /opt/rocm
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DRFDETR_GGML_HIPBLAS=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON -DRFDETR_GGML_VULKAN=ON
else ifeq ($(OS),Darwin)
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
CMAKE_ARGS+=-DRFDETR_GGML_METAL=ON
endif
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
endif
sources/rt-detr.cpp:
mkdir -p sources && \
git clone --recursive $(RFDETR_REPO) sources/rt-detr.cpp && \
cd sources/rt-detr.cpp && \
git checkout $(RFDETR_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# Detect OS
UNAME_S := $(shell uname -s)
# Only build CPU variants on Linux
ifeq ($(UNAME_S),Linux)
VARIANT_TARGETS = librfdetrcpp-avx.so librfdetrcpp-avx2.so librfdetrcpp-avx512.so librfdetrcpp-fallback.so
else
# On non-Linux (e.g., Darwin), build only fallback variant
VARIANT_TARGETS = librfdetrcpp-fallback.so
endif
rfdetr-cpp: main.go gorfdetrcpp.go $(VARIANT_TARGETS)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o rfdetr-cpp ./
package: rfdetr-cpp
bash package.sh
build: package
clean: purge
rm -rf librfdetrcpp*.so rfdetr-cpp package sources
purge:
rm -rf build*
# Build all variants (Linux only)
ifeq ($(UNAME_S),Linux)
librfdetrcpp-avx.so: sources/rt-detr.cpp
rm -rfv build-$@
$(info ${GREEN}I rfdetr-cpp build info:avx${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) librfdetrcpp-custom
rm -rfv build-$@
librfdetrcpp-avx2.so: sources/rt-detr.cpp
rm -rfv build-$@
$(info ${GREEN}I rfdetr-cpp build info:avx2${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) librfdetrcpp-custom
rm -rfv build-$@
librfdetrcpp-avx512.so: sources/rt-detr.cpp
rm -rfv build-$@
$(info ${GREEN}I rfdetr-cpp build info:avx512${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) librfdetrcpp-custom
rm -rfv build-$@
endif
# Build fallback variant (all platforms)
librfdetrcpp-fallback.so: sources/rt-detr.cpp
rm -rfv build-$@
$(info ${GREEN}I rfdetr-cpp build info:fallback${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) librfdetrcpp-custom
rm -rfv build-$@
librfdetrcpp-custom: CMakeLists.txt
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \
cmake --build . --config Release -j$(JOBS) && \
cd .. && \
mv build-$(SO_TARGET)/librfdetrcpp.so ./$(SO_TARGET)
all: rfdetr-cpp package
# `test` is invoked by the top-level Makefile's `test-extra` target. It builds
# the backend binary + the fallback shared library (needed for dlopen at
# runtime), then runs test.sh which downloads the test models + COCO image
# and exercises the gRPC Load/Detect wire path via the Go smoke test in
# main_test.go for both the detection and segmentation models.
test: rfdetr-cpp librfdetrcpp-fallback.so
bash test.sh

View File

@@ -0,0 +1,195 @@
package main
// gorfdetrcpp.go - gRPC handlers (Load, Detect) for the rfdetr-cpp backend.
//
// Embeds base.SingleThread to default unimplemented RPCs to "not supported"
// while we only implement object detection.
import (
"encoding/base64"
"fmt"
"os"
"path/filepath"
"strconv"
"unsafe"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)
// Default upper bound on detections returned per image. RF-DETR's decoder
// queries are limited to a few hundred; 300 is a safe ceiling.
const defaultTopK = 300
// rfdetr_handle_t is a uintptr-typed opaque handle (see include/rfdetr_capi.h).
var (
// rfdetr_capi_load(const char* model_path, int n_threads, rfdetr_handle_t* out_handle) -> int
CapiLoad func(modelPath string, nThreads int32, outHandle *uintptr) int32
// rfdetr_capi_unload(rfdetr_handle_t handle) -> int
CapiUnload func(handle uintptr) int32
// rfdetr_capi_detect_path(handle, image_path, threshold, top_k, out_json) -> int
CapiDetectPath func(handle uintptr, imagePath string, threshold float32, topK uint32, outJSON *uintptr) int32
// rfdetr_capi_detect_buffer(handle, bytes, len, threshold, top_k, out_json) -> int
CapiDetectBuffer func(handle uintptr, bytes uintptr, length uintptr, threshold float32, topK uint32, outJSON *uintptr) int32
// rfdetr_capi_free_string(char* s)
CapiFreeString func(s uintptr)
// rfdetr_capi_get_n_detections(handle) -> int
CapiGetNDetections func(handle uintptr) int32
// rfdetr_capi_get_detection_class_id(handle, i) -> int
CapiGetDetectionClassID func(handle uintptr, i int32) int32
// rfdetr_capi_get_detection_box(handle, i, out_xyxy[4]) -> int (0 on success)
CapiGetDetectionBox func(handle uintptr, i int32, outXYXY uintptr) int32
// rfdetr_capi_get_detection_score(handle, i) -> float
CapiGetDetectionScore func(handle uintptr, i int32) float32
// rfdetr_capi_get_detection_class_name(handle, i, buf, buf_size) -> int (needed/written; two-call sizing)
CapiGetDetectionClassName func(handle uintptr, i int32, buf uintptr, bufSize int32) int32
// rfdetr_capi_get_detection_mask_png(handle, i, buf, buf_size) -> int (needed/written; 0 means no mask)
CapiGetDetectionMaskPNG func(handle uintptr, i int32, buf uintptr, bufSize int32) int32
)
type RFDetrCpp struct {
base.SingleThread
handle uintptr
}
// Load loads the GGUF model at opts.ModelFile (joined with opts.ModelPath if relative)
// and stores the handle for later Detect calls.
func (r *RFDetrCpp) Load(opts *pb.ModelOptions) error {
modelFile := opts.ModelFile
if modelFile == "" {
modelFile = opts.Model
}
if modelFile == "" {
return fmt.Errorf("rfdetr-cpp: ModelFile is empty")
}
var modelPath string
if filepath.IsAbs(modelFile) {
modelPath = modelFile
} else {
modelPath = filepath.Join(opts.ModelPath, modelFile)
}
if _, err := os.Stat(modelPath); err != nil {
return fmt.Errorf("rfdetr-cpp: model file not found: %s: %w", modelPath, err)
}
threads := opts.Threads
if threads <= 0 {
threads = 4
}
// Release previous model if any (re-Load).
if r.handle != 0 {
CapiUnload(r.handle)
r.handle = 0
}
var h uintptr
rc := CapiLoad(modelPath, threads, &h)
if rc != 0 || h == 0 {
return fmt.Errorf("rfdetr-cpp: rfdetr_capi_load failed with rc=%d for %s", rc, modelPath)
}
r.handle = h
return nil
}
// Detect runs object detection on the base64-encoded image in opts.Src at
// opts.Threshold, returning one pb.Detection per result. Seg models also
// populate Detection.Mask with PNG-encoded mask bytes.
func (r *RFDetrCpp) Detect(opts *pb.DetectOptions) (pb.DetectResponse, error) {
if r.handle == 0 {
return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: model not loaded")
}
// Decode base64 image and write to temp file.
imgData, err := base64.StdEncoding.DecodeString(opts.Src)
if err != nil {
return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: failed to decode base64 image: %w", err)
}
tmpFile, err := os.CreateTemp("", "rfdetr-*.img")
if err != nil {
return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: failed to create temp file: %w", err)
}
defer func() { _ = os.Remove(tmpFile.Name()) }()
if _, err := tmpFile.Write(imgData); err != nil {
_ = tmpFile.Close()
return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: failed to write temp file: %w", err)
}
if err := tmpFile.Close(); err != nil {
return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: failed to close temp file: %w", err)
}
threshold := opts.Threshold
if threshold <= 0 {
threshold = 0.5
}
// JSON output from detect_path is unused: we read structured detections via
// the accessor functions. Still must free the returned string.
var jsonPtr uintptr
rc := CapiDetectPath(r.handle, tmpFile.Name(), threshold, uint32(defaultTopK), &jsonPtr)
if jsonPtr != 0 {
CapiFreeString(jsonPtr)
}
if rc != 0 {
return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: detect failed with rc=%d", rc)
}
n := CapiGetNDetections(r.handle)
if n < 0 {
return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: invalid n_detections=%d", n)
}
detections := make([]*pb.Detection, 0, n)
for i := int32(0); i < n; i++ {
var bbox [4]float32 // x1, y1, x2, y2
if rc := CapiGetDetectionBox(r.handle, i, uintptr(unsafe.Pointer(&bbox[0]))); rc != 0 {
continue
}
cid := CapiGetDetectionClassID(r.handle, i)
score := CapiGetDetectionScore(r.handle, i)
// Two-call sizing for class_name.
var className string
nameSize := CapiGetDetectionClassName(r.handle, i, 0, 0)
if nameSize > 1 {
buf := make([]byte, nameSize)
written := CapiGetDetectionClassName(r.handle, i, uintptr(unsafe.Pointer(&buf[0])), nameSize)
// `written` is the same number (needed bytes including NUL); strip NUL.
if written > 0 && int(written) <= len(buf) {
className = string(buf[:written-1])
} else {
className = string(buf[:len(buf)-1])
}
}
if className == "" {
className = strconv.Itoa(int(cid))
}
// Two-call sizing for mask PNG (returns 0 when no mask).
var mask []byte
maskSize := CapiGetDetectionMaskPNG(r.handle, i, 0, 0)
if maskSize > 0 {
maskBuf := make([]byte, maskSize)
CapiGetDetectionMaskPNG(r.handle, i, uintptr(unsafe.Pointer(&maskBuf[0])), maskSize)
mask = maskBuf
}
detections = append(detections, &pb.Detection{
X: bbox[0],
Y: bbox[1],
Width: bbox[2] - bbox[0],
Height: bbox[3] - bbox[1],
Confidence: score,
ClassName: className,
Mask: mask,
})
}
return pb.DetectResponse{
Detections: detections,
}, nil
}

View File

@@ -0,0 +1,61 @@
package main
// main.go - entry point for the rfdetr-cpp gRPC backend.
//
// Dlopens librfdetrcpp-<variant>.so via purego at the path in
// RFDETR_LIBRARY (set by run.sh based on /proc/cpuinfo), registers the
// rfdetr_capi_* C ABI symbols, then starts the gRPC server.
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
// Get library name from environment variable, default to fallback
libName := os.Getenv("RFDETR_LIBRARY")
if libName == "" {
libName = "./librfdetrcpp-fallback.so"
}
rfdetrLib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CapiLoad, "rfdetr_capi_load"},
{&CapiUnload, "rfdetr_capi_unload"},
{&CapiDetectPath, "rfdetr_capi_detect_path"},
{&CapiDetectBuffer, "rfdetr_capi_detect_buffer"},
{&CapiFreeString, "rfdetr_capi_free_string"},
{&CapiGetNDetections, "rfdetr_capi_get_n_detections"},
{&CapiGetDetectionClassID, "rfdetr_capi_get_detection_class_id"},
{&CapiGetDetectionBox, "rfdetr_capi_get_detection_box"},
{&CapiGetDetectionScore, "rfdetr_capi_get_detection_score"},
{&CapiGetDetectionClassName, "rfdetr_capi_get_detection_class_name"},
{&CapiGetDetectionMaskPNG, "rfdetr_capi_get_detection_mask_png"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, rfdetrLib, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &RFDetrCpp{}); err != nil {
panic(err)
}
}

View File

@@ -0,0 +1,220 @@
package main
// main_test.go - end-to-end smoke test for the rfdetr-cpp gRPC backend.
//
// Spawns the compiled rfdetr-cpp binary on a free local port, dials it via
// gRPC, and exercises LoadModel + Detect against the test fixtures
// downloaded by test.sh. Two scenarios:
//
// 1. detection — loads rfdetr-nano-q8_0.gguf and asserts at least one
// detection comes back with a non-empty class name and a bounding box
// of non-zero size.
// 2. segmentation — loads rfdetr-seg-nano-q8_0.gguf and additionally
// asserts that at least one detection carries a PNG-encoded mask blob
// (verified by PNG magic bytes).
//
// Both specs Skip cleanly if their fixtures are missing so the test target
// stays usable on a fresh checkout where models haven't been downloaded.
import (
"context"
"encoding/base64"
"fmt"
"net"
"os"
"os/exec"
"path/filepath"
"testing"
"time"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
)
func TestRFDetrCpp(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "rfdetr-cpp backend smoke suite")
}
// freePort grabs an ephemeral TCP port and immediately releases it so the
// spawned backend can bind to it. There is a tiny TOCTOU window here but in
// practice it's adequate for a smoke test on a quiet runner.
func freePort() int {
l, err := net.Listen("tcp", "127.0.0.1:0")
Expect(err).ToNot(HaveOccurred(), "freePort listen")
port := l.Addr().(*net.TCPAddr).Port
Expect(l.Close()).To(Succeed())
return port
}
// startBackend spawns the rfdetr-cpp binary on the given port and waits
// until it accepts TCP connections (up to 10s). The returned cleanup func
// kills the process and reaps it.
func startBackend(port int) func() {
binary, err := filepath.Abs("./rfdetr-cpp")
Expect(err).ToNot(HaveOccurred())
if _, err := os.Stat(binary); err != nil {
Skip(fmt.Sprintf("backend binary not built: %s (run `make rfdetr-cpp` first)", binary))
}
libPath, err := filepath.Abs("./librfdetrcpp-fallback.so")
Expect(err).ToNot(HaveOccurred())
if _, err := os.Stat(libPath); err != nil {
Skip(fmt.Sprintf("fallback library not built: %s (run `make librfdetrcpp-fallback.so` first)", libPath))
}
addr := fmt.Sprintf("127.0.0.1:%d", port)
cmd := exec.Command(binary, "--addr", addr)
cmd.Env = append(os.Environ(), "RFDETR_LIBRARY="+libPath)
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
Expect(cmd.Start()).To(Succeed())
cleanup := func() {
if cmd.Process != nil {
_ = cmd.Process.Kill()
_, _ = cmd.Process.Wait()
}
}
deadline := time.Now().Add(10 * time.Second)
for time.Now().Before(deadline) {
c, err := net.DialTimeout("tcp", addr, 200*time.Millisecond)
if err == nil {
_ = c.Close()
return cleanup
}
time.Sleep(200 * time.Millisecond)
}
cleanup()
Fail(fmt.Sprintf("backend did not become ready on %s within 10s", addr))
return func() {}
}
// loadTestImage reads the COCO test image downloaded by test.sh and returns
// its base64-encoded content (the wire format accepted by the Detect RPC).
func loadTestImage() string {
imgPath, err := filepath.Abs("test-data/test.jpg")
Expect(err).ToNot(HaveOccurred())
imgBytes, err := os.ReadFile(imgPath)
if err != nil {
Skip(fmt.Sprintf("test image not present: %s (run test.sh first)", imgPath))
}
return base64.StdEncoding.EncodeToString(imgBytes)
}
// dialBackend opens a gRPC client connection to the spawned backend.
func dialBackend(port int) (pb.BackendClient, func()) {
addr := fmt.Sprintf("127.0.0.1:%d", port)
conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
Expect(err).ToNot(HaveOccurred())
return pb.NewBackendClient(conn), func() { _ = conn.Close() }
}
// modelPathOrSkip resolves a model file under ./test-models/ and Skip()s
// the current spec if it's missing.
func modelPathOrSkip(name string) string {
modelDir, err := filepath.Abs("test-models")
Expect(err).ToNot(HaveOccurred())
modelPath := filepath.Join(modelDir, name)
if _, err := os.Stat(modelPath); err != nil {
Skip(fmt.Sprintf("model not present: %s (run test.sh first)", modelPath))
}
return modelPath
}
var _ = Describe("rfdetr-cpp backend", func() {
It("runs object detection against a known-good COCO image", func() {
modelPath := modelPathOrSkip("rfdetr-nano-q8_0.gguf")
imgB64 := loadTestImage()
port := freePort()
cleanup := startBackend(port)
defer cleanup()
client, closeConn := dialBackend(port)
defer closeConn()
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
Model: "rfdetr-nano-q8_0.gguf",
ModelFile: modelPath,
Threads: 2,
})
Expect(err).ToNot(HaveOccurred(), "LoadModel")
Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
detResp, err := client.Detect(ctx, &pb.DetectOptions{
Src: imgB64,
Threshold: 0.5,
})
Expect(err).ToNot(HaveOccurred(), "Detect")
Expect(detResp.GetDetections()).ToNot(BeEmpty(), "no detections returned on a known-good COCO image")
_, _ = fmt.Fprintf(GinkgoWriter, "detection OK: %d detections\n", len(detResp.GetDetections()))
for i, d := range detResp.GetDetections() {
Expect(d.GetClassName()).ToNot(BeEmpty(), "detection %d has empty class_name", i)
Expect(d.GetConfidence()).To(BeNumerically(">=", float32(0.5)),
"detection %d below threshold", i)
Expect(d.GetWidth()).To(BeNumerically(">", float32(0)),
"detection %d has non-positive width", i)
Expect(d.GetHeight()).To(BeNumerically(">", float32(0)),
"detection %d has non-positive height", i)
}
})
It("runs segmentation and returns PNG-encoded masks", func() {
modelPath := modelPathOrSkip("rfdetr-seg-nano-q8_0.gguf")
imgB64 := loadTestImage()
port := freePort()
cleanup := startBackend(port)
defer cleanup()
client, closeConn := dialBackend(port)
defer closeConn()
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
Model: "rfdetr-seg-nano-q8_0.gguf",
ModelFile: modelPath,
Threads: 2,
})
Expect(err).ToNot(HaveOccurred(), "LoadModel")
Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
detResp, err := client.Detect(ctx, &pb.DetectOptions{
Src: imgB64,
Threshold: 0.5,
})
Expect(err).ToNot(HaveOccurred(), "Detect")
Expect(detResp.GetDetections()).ToNot(BeEmpty(), "no detections returned from segmentation model")
haveMask := false
for i, d := range detResp.GetDetections() {
m := d.GetMask()
if len(m) == 0 {
continue
}
haveMask = true
// Verify PNG magic: 89 50 4E 47 ("\x89PNG").
Expect(len(m)).To(BeNumerically(">=", 4), "detection %d mask too short", i)
Expect([]byte{m[0], m[1], m[2], m[3]}).To(Equal([]byte{0x89, 'P', 'N', 'G'}),
"detection %d mask is not a PNG", i)
}
Expect(haveMask).To(BeTrue(),
"segmentation model returned %d detections but none carried a mask",
len(detResp.GetDetections()))
_, _ = fmt.Fprintf(GinkgoWriter, "segmentation OK: %d detections, at least one with PNG mask\n",
len(detResp.GetDetections()))
})
})

View File

@@ -0,0 +1,59 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/librfdetrcpp-*.so $CURDIR/package/
cp -avf $CURDIR/rfdetr-cpp $CURDIR/package/
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

52
backend/go/rfdetr-cpp/run.sh Executable file
View File

@@ -0,0 +1,52 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
if [ "$(uname)" != "Darwin" ]; then
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
fi
LIBRARY="$CURDIR/librfdetrcpp-fallback.so"
if [ "$(uname)" != "Darwin" ]; then
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/librfdetrcpp-avx.so ]; then
LIBRARY="$CURDIR/librfdetrcpp-avx.so"
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/librfdetrcpp-avx2.so ]; then
LIBRARY="$CURDIR/librfdetrcpp-avx2.so"
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/librfdetrcpp-avx512.so ]; then
LIBRARY="$CURDIR/librfdetrcpp-avx512.so"
fi
fi
fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export RFDETR_LIBRARY=$LIBRARY
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using library: $LIBRARY"
exec $CURDIR/lib/ld.so $CURDIR/rfdetr-cpp "$@"
fi
echo "Using library: $LIBRARY"
exec $CURDIR/rfdetr-cpp "$@"

Some files were not shown because too many files have changed in this diff Show More