Compare commits

104 Commits

Author SHA1 Message Date
dependabot[bot]
b15627c864 chore(deps): bump the pip group across 1 directory with 2 updates
Bumps the pip group with 2 updates in the /backend/python/coqui directory: [transformers](https://github.com/huggingface/transformers) and torch.


Updates `transformers` from 4.48.3 to 5.0.0rc3
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.48.3...v5.0.0rc3)

Updates `torch` from 2.4.1 to 2.7.1+cpu

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.0.0rc3
  dependency-type: direct:production
  dependency-group: pip
- dependency-name: torch
  dependency-version: 2.7.1+cpu
  dependency-type: direct:production
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-06-05 23:31:21 +00:00
Copilot
352b7ec604 Harden gallery-agent Hugging Face fetches against transient rate limiting (#10187)
* Initial plan

* fix: retry HuggingFace trending fetch on transient rate limits

* fix: handle body close/write errors in huggingface retry paths

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-06-05 23:43:06 +02:00
LocalAI [bot]
ba706422fb chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.22.1 (#10188)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-05 23:42:50 +02:00
LocalAI [bot]
e837921c2c feat: forward reasoning_effort to the backend so jinja models honor it (#10184)
* feat: forward reasoning_effort to the backend so jinja models honor it

reasoning_effort was only mapped to the binary enable_thinking toggle and
passed to Go-side templates; it was never sent to the backend. As a result,
jinja-templated models whose chat template keys on reasoning_effort (gpt-oss
Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and
kept emitting <think>.

Forward the effective reasoning_effort to the backend as a chat_template_kwarg
(mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions
metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort
and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort
(request value overrides config default, none->disable / level->enable, an
operator's reasoning.disable wins). request.go now uses that helper.
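
The precedence rules above can be sketched roughly as follows. This is an illustration only, not the actual ModelConfig.ApplyReasoningEffort: the function name, the three-state thinkingMode type, and the choice to leave the effort string untouched when the operator disables reasoning are all assumptions.

```go
package main

// thinkingMode is a hypothetical three-state flag for this sketch.
type thinkingMode int

const (
	thinkingUnset thinkingMode = iota // neither request nor config set an effort
	thinkingOff
	thinkingOn
)

// resolveReasoningEffort applies the precedence described in the commit
// message: the request value overrides the config default, "none" disables
// thinking, any other level enables it, and an operator-level disable wins.
func resolveReasoningEffort(requestEffort, configDefault string, operatorDisabled bool) (string, thinkingMode) {
	effort := configDefault
	if requestEffort != "" {
		effort = requestEffort // request overrides config default
	}
	switch {
	case operatorDisabled:
		return effort, thinkingOff // operator's disable always wins
	case effort == "":
		return effort, thinkingUnset
	case effort == "none":
		return effort, thinkingOff
	default:
		return effort, thinkingOn
	}
}
```

For example, a request value of "high" with a config default of "low" resolves to "high" with thinking enabled, unless the operator has disabled reasoning.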

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): set the pipeline LLM's reasoning_effort

Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime
model is built (per-session copy, overrides the LLM's own reasoning_effort),
and surface the resolved effort on the template input so Go-templated models
get it too. jinja models receive it via the backend metadata. This lets a
realtime pipeline disable thinking on models that only honor reasoning_effort
(e.g. LFM2.5), which enable_thinking can't.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 13:45:43 +00:00
Richard Palethorpe
73385713ca feat(distributed): enforce registration token for worker file transfer (#10183)
The worker HTTP file-transfer server is authenticated by the registration
token via checkBearerToken, which fails open on an empty token: every
/v1/files, /v1/files-list and /v1/backend-logs request is then served
unauthenticated, granting read/write to the worker's models/staging/data
directories. The fail-open was also silent (the only auth log sat on the
unreachable reject branch), and the worker process never runs
DistributedConfig.Validate(), so the existing frontend warning did not
cover the component that exposes the server.

Mirror the NatsRequireAuth pattern: keep anonymous as the default but make
it loud and opt-in enforceable.

- Log a prominent warning when the file-transfer server starts tokenless.
- Add LOCALAI_REGISTRATION_REQUIRE_AUTH: DistributedConfig.Validate() errors
  on an empty token (frontend) and the worker refuses to start (fail-fast,
  before registration), so production can fail closed. Also satisfies the
  F-003 suggestion to fail Validate() on distributed + empty token.
- Add LOCALAI_DISTRIBUTED_REQUIRE_AUTH umbrella switch implying both
  RegistrationRequireAuth and NatsRequireAuth — one production knob locking
  down the registration/file-transfer layer and the NATS bus together; the
  granular flags remain available as single-layer overrides. Wired into the
  frontend, supervisor worker, and agent worker (vLLM worker has neither a
  NATS connection nor a file-transfer server, so it is left untouched).
- Document in distributed-mode.md (warning callout + flag tables).
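
A minimal sketch of the two behaviours described above, assuming a plain "Bearer <token>" header scheme. The names validateToken and authorized are hypothetical and are not the project's checkBearerToken or DistributedConfig.Validate; the constant-time comparison is a choice made for this sketch.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"strings"
)

var errTokenRequired = errors.New("registration token is empty but auth is required")

// validateToken is the fail-fast startup check: with requireAuth set, an
// empty token is a configuration error rather than a silent fail-open.
func validateToken(token string, requireAuth bool) error {
	if token == "" && requireAuth {
		return errTokenRequired
	}
	return nil
}

// authorized reports whether an Authorization header value is acceptable.
// An empty configured token keeps the anonymous default; callers are
// expected to have run validateToken (and logged a warning) at startup.
func authorized(header, token string) bool {
	if token == "" {
		return true // anonymous mode, only reachable when auth is not required
	}
	got, ok := strings.CutPrefix(header, "Bearer ")
	if !ok {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}
```

Keeping the startup check separate from the per-request check means the anonymous default stays available while production deployments can refuse to start without a token.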

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-05 14:34:28 +02:00
LocalAI [bot]
a4e671779a chore: ⬆️ Update ggml-org/whisper.cpp to 99613cb720b65036237d44b52f753b51f75c2797 (#10178)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-05 09:04:25 +02:00
LocalAI [bot]
7051b2e0a1 chore: ⬆️ Update ggml-org/llama.cpp to 7c158fbb4aec1bdc9c81d6ca0e785139f4826fae (#10179)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-05 09:04:10 +02:00
LocalAI [bot]
469737101a chore: ⬆️ Update ikawrakow/ik_llama.cpp to 1520eda980564241434b791ce2bbbd128c4be9ea (#10180)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-05 09:03:08 +02:00
LocalAI [bot]
858257eaf0 fix(distributed): self-heal stale 'model not loaded' routing (#10181)
* fix(distributed): self-heal stale 'model not loaded' routing

In distributed mode the registry can list a model as loaded on a node
while the worker has evicted it (autonomous LRU eviction, an out-of-band
unload, etc.) yet the backend process survives. The router's cached-node
check only verifies the process is alive (probeHealth), so it routes there
and inference fails with "<backend>: model not loaded" — and stays broken
until the controller restarts and rebuilds its registry.

InFlightTrackingClient now reconciles this: when a tracked inference call
returns a model-not-loaded error, it drops the stale replica row
(RemoveNodeModel) so the next request reloads the model on a healthy node
instead of routing back to the evicted one. The original error is returned
unchanged; only the registry is corrected.
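
The reconcile step can be pictured with a small in-memory stand-in for the registry. Everything here is hypothetical (the registry type, reconcile, isModelNotLoaded); the only detail taken from the text above is the "<backend>: model not loaded" error shape, matched here by message.

```go
package main

import (
	"errors"
	"strings"
)

// registry is a hypothetical replica table: nodeID -> model -> loaded.
type registry map[string]map[string]bool

// removeNodeModel drops the replica row for model on node.
func (r registry) removeNodeModel(node, model string) {
	delete(r[node], model)
}

// errEvicted is an example of the error a backend returns after eviction.
var errEvicted = errors.New("parakeet-cpp: model not loaded")

// isModelNotLoaded matches on the error message, as a sketch only.
func isModelNotLoaded(err error) bool {
	return err != nil && strings.Contains(err.Error(), "model not loaded")
}

// reconcile corrects the registry when an inference call reports that the
// model is gone, and always returns the original error unchanged.
func reconcile(r registry, node, model string, err error) error {
	if isModelNotLoaded(err) {
		r.removeNodeModel(node, model)
	}
	return err
}
```

Because the stale row is removed rather than the request retried, the failing call still surfaces its error and only the next request benefits from the reload.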

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed): typed model-not-loaded error via gRPC status code

Replace the controller-side error-string match with a shared, code-aware
helper. Go error types don't survive the gRPC boundary, so the signal is
carried as a status code (FailedPrecondition):

- pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor +
  IsModelNotLoaded(err) checker (status-code first, message fallback for
  backends not yet migrated).
- InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded.
- Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy,
  rfdetr-cpp) to the typed constructor.

Acting on a false positive is harmless (the model is just reloaded).

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 09:01:36 +02:00
Adira
ef80a0e825 fix(config): add face/speaker recognition constants and register insightface + speaker-recognition (#10110)
FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION already existed as
ModelConfigUsecase bitmask flags, and GuessUsecases already gate-checks
both backends by name — but BackendCapabilities had no entries for
either, so the UI could not classify them.

Also missing were the Method* constants for the five proto-defined RPCs
these backends implement (FaceVerify, FaceAnalyze, VoiceVerify,
VoiceEmbed, VoiceAnalyze) and the corresponding Usecase* strings
and UsecaseInfoMap entries needed to wire them into the rest of the
capability system.

Changes:
- Add MethodFaceVerify, MethodFaceAnalyze, MethodVoiceVerify,
  MethodVoiceEmbed, MethodVoiceAnalyze GRPCMethod constants
- Add UsecaseFaceRecognition ("face_recognition") and
  UsecaseSpeakerRecognition ("speaker_recognition") Usecase constants
- Add UsecaseInfoMap entries for both new usecases, referencing the
  existing FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION flags
- Register insightface: Embedding + Detect + FaceVerify + FaceAnalyze
- Register speaker-recognition: VoiceVerify + VoiceEmbed + VoiceAnalyze

Follows up on #10107 which left these two out because they needed new
constants first.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
2026-06-04 21:48:01 +02:00
LocalAI [bot]
92726f7631 fix(distributed): stage directory-based models to remote nodes (#10175)
Distributed file-staging treated every model path field (ModelFile, etc.)
as a single regular file: it os.Open'd the path and streamed its fd as the
HTTP PUT body. For directory-based models — e.g. qwen3-tts-cpp, whose
weights and tokenizer ggufs live under one directory referenced by
parameters.model — opening the directory succeeds but reading its fd
returns EISDIR, so routing the model to a remote NATS worker failed with
"read /models/<model>: is a directory". Single-file models were unaffected,
so only multi-file pipelines (e.g. the realtime TTS stage) broke.

stageModelFiles now detects a directory path field and stages each
contained file individually (via the new stageDirectory helper), preserving
structure with the existing StagingKeyMapper and rewriting the field to the
remote directory (deriving ModelPath as before). countStageableFiles makes
the progress total count a directory's files so the staging tracker stays
accurate.

Assisted-by: Claude:claude-opus-4-8 go vet

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 18:05:38 +02:00
LocalAI [bot]
994063ba9a feat(qwen3-tts-cpp): normalize request language for flexible matching (#10174)
The qwen3-tts.cpp backend honored the request `language` field only via exact lowercase two-letter codes in the C++ language_to_id table, silently defaulting to English for anything else (en-US, EN, english, ...).

Add normalizeLanguage() in the Go handler: lowercase + trim, strip the region/locale suffix (en-US, pt_BR, zh-Hans -> en/pt/zh), and resolve common English full names (english -> en). The canonical codes match the existing C++ table, so no C++ change is needed. Covered by a pure-Go Ginkgo spec. Also document the language field and accepted forms under the Qwen3-TTS docs.
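
The normalization steps described above could look something like this. It is a sketch, not the handler's code: the fullNames table is an assumption beyond the `english -> en` mapping mentioned in the message, and unknown values are simply returned in their stripped form.

```go
package main

import "strings"

// fullNames maps English language names to two-letter codes. Only
// "english" is confirmed by the commit message; the rest are illustrative.
var fullNames = map[string]string{
	"english":    "en",
	"portuguese": "pt",
	"chinese":    "zh",
}

// normalizeLanguage lowercases and trims the input, strips any region or
// script suffix after '-' or '_', and resolves known full names.
func normalizeLanguage(lang string) string {
	l := strings.ToLower(strings.TrimSpace(lang))
	if i := strings.IndexAny(l, "-_"); i >= 0 {
		l = l[:i] // en-us -> en, pt_br -> pt, zh-hans -> zh
	}
	if code, ok := fullNames[l]; ok {
		return code
	}
	return l
}
```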

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 17:26:31 +02:00
LocalAI [bot]
c1a55cf72d chore: ⬆️ Update mudler/parakeet.cpp to b11fe5bca78ad8b342dd559a43d76df3984bb447 (#10167)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 12:07:09 +02:00
LocalAI [bot]
96758841d8 chore: ⬆️ Update predict-woo/qwen3-tts.cpp to 136e5d36c17083da0321fd96512dc7b263f94a44 (#10165)
⬆️ Update predict-woo/qwen3-tts.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 12:06:55 +02:00
LocalAI [bot]
7a59260621 chore: ⬆️ Update CrispStrobe/CrispASR to 13d54e110e1538e0f0bc3af0680b9ab246cfb48d (#10145)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 12:06:32 +02:00
LocalAI [bot]
27e63b9a78 feat(tts): support per-request instructions and params (#10172)
The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it
was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest
nor the gRPC TTSRequest proto carried it, so backends could only read such a
value from static YAML options (identical for every request). This blocked
per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config
to a single designed voice.

Plumb a generic per-request instruction string end to end, plus an optional
backend-specific params map:

- proto: add `optional string instructions` and `map<string,string> params`
  to TTSRequest.
- schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI
  extension) to schema.TTSRequest.
- core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper
  that attaches instructions only when non-empty (so backends can fall back to
  YAML when unset); forward them from the /v1/audio/speech handler.
- qwen-tts: prefer the per-request instruction over the YAML `instruct` option
  (used by both mode detection and generation) and merge per-request params.
- chatterbox: merge per-request params (coerced to float/int/bool) over YAML
  options into generate() kwargs.

Fully backward compatible: empty instructions fall back to the YAML option and
backends that don't support style/voice instructions ignore the field.

Closes #10164


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 11:45:02 +02:00
LocalAI [bot]
55c0911c23 chore: ⬆️ Update leejet/stable-diffusion.cpp to 1f9ee88e09c258053fa59d5e05e23dfb10fa0b13 (#10166)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 09:34:34 +02:00
LocalAI [bot]
f6cb6ab6d9 chore: ⬆️ Update ggml-org/llama.cpp to 94a220cd6745e6e3f8de62870b66fd5b9bc92700 (#10168)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 09:34:13 +02:00
LocalAI [bot]
9f11b09c6a chore(model-gallery): ⬆️ update checksum (#10169)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 00:32:15 +02:00
LocalAI [bot]
a5c4f822f0 chore: ⬆️ Update antirez/ds4 to 477c0e82e2699b35a65fd0a1ed6fe66b41087dfe (#10142)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 19:45:23 +02:00
LocalAI [bot]
fb36c262fe chore(model gallery): 🤖 add 1 new models via gallery agent (#10163)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 19:44:51 +02:00
LocalAI [bot]
0e4e8980e6 chore: ⬆️ Update ggml-org/llama.cpp to 5c394fdc8b564eff6faacc50a139529d875f0e36 (#10143)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 19:44:21 +02:00
Richard Palethorpe
3a932a9803 feat(distributed): Add NATS JWT authentication and TLS/mTLS options (#10159)
* feat(distributed): NATS JWT auth, TLS/mTLS options, and e2e coverage

Mint per-node NATS user JWTs at registration when LOCALAI_NATS_ACCOUNT_SEED
is set, and connect workers with scoped credentials from the register response.
Add optional LOCALAI_NATS_TLS_CA/CERT/KEY for private CA and mTLS alongside
tls:// URLs, plus test-e2e-distributed and NatsJWT container e2e specs.

Document JWT setup (nats-auth-setup.sh) and TLS env vars in distributed-mode.

Assisted-by: Grok:grok grok-build
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(distributed): correct NATS JWT scoping and harden client auth

The JWT-auth path added in 46467cc7 had several gaps that fail silently
under LOCALAI_NATS_REQUIRE_AUTH:

- Agent-worker minted JWTs did not allow the subjects the agent worker
  actually subscribes to (jobs.mcp-ci.new and nodes.<id>.backend.stop),
  so MCP-CI jobs and backend-stop session cleanup were silently dropped.
  Scope the agent permission set to those subjects.
- NATS subscription permission violations were swallowed (Subscribe
  returned a live-but-dead subscription). Confirm subscriptions with a
  server round-trip so a denial surfaces synchronously, and log async
  permission errors.
- The backend worker connected anonymously when given a JWT without its
  paired seed; reject the unpaired credential instead.
- The documented service-user permissions in nats-auth-setup.sh omitted
  prefixcache.>, which the frontend publishes and subscribes; add it.

Also: add a credential-provider hook to the messaging client (consumed by
the follow-up credential-lifecycle change), drop the always-nil error from
NatsMessagingOptions, run go mod tidy (jwt/v2 and nkeys are now direct),
and gofmt the feature's files.

Tests: an agent-JWT e2e spec that connects to the enforcing NATS server
and exercises every subscription the agent worker makes, plus permission
allow-list coverage unit tests.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(distributed): acquire and auto-refresh worker NATS credentials

Workers fetched NATS credentials once at startup, which broke two cases
under JWT auth: a worker that registered while still pending admin
approval never received a minted JWT (it connected unauthenticated and
gave up), and a long-running worker's 24h JWT expired with no way to renew
it.

Introduce workerregistry.NATSCredentialManager, built on idempotent
re-registration (the frontend preserves the node row and mints a fresh JWT
each call):

- Acquire re-registers through admin approval until the node is approved
  and credentials are minted (or returns the first success when auth is
  not required, preserving anonymous-NATS behavior).
- RefreshLoop re-registers before the JWT expires (~75% of its lifetime),
  updating the credentials served to the connection.
- Both are bounded (default 100 attempts / consecutive failures) and
  return an error on exhaustion, so an unapprovable or unrenewable worker
  exits non-zero and surfaces the problem instead of hanging or drifting
  toward an expired credential.

The messaging client gains WithUserJWTProvider, fetching credentials on
each (re)connect so the connection transparently adopts a refreshed JWT
when the server expires the old one. RegisterFull exposes the approval
status and full response; Register delegates to it.

Both the backend worker and the agent worker are wired to this: explicit
env credentials are used as-is, minted credentials are acquired-with-wait
and refreshed, and a permanent refresh failure shuts the worker down so it
restarts and re-acquires.

Tests cover Acquire (wait-through-pending, bounded give-up, context
cancel), RefreshLoop (refresh-before-expiry, bounded failure, no-expiry
exit) and jwtExpiry decoding. Docs updated in distributed-mode.md.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-03 19:43:56 +02:00
LocalAI [bot]
9d10418593 fix(parakeet-cpp): convert audio before the non-batched transcribe path (#10161)
The direct (non-batched) transcription path handed the original upload
path straight to the C library via parakeet_capi_transcribe_path_json.
That loader only understands 16 kHz mono WAV/PCM, so any other format
(MP3, etc.) failed with "parakeet: failed to load audio: <file>".

Only the batched path converted the input (via decodeWavMono16k ->
utils.AudioToWav). Every other audio backend (whisper, crispasr)
converts unconditionally with utils.AudioToWav before handing the file
to its engine; the parakeet-cpp fallback was the lone exception.

Extract a convertToWavMono16k helper (reused by decodeWavMono16k) that
produces a 16 kHz mono WAV in a temp dir, and run the non-batched path
through it before calling the C loader. WAV inputs already in the target
format are passed through without ffmpeg.

Add specs covering the helper (decodable copy + cleanup, and an error on
a missing input) that need neither the model, the C library, nor ffmpeg.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-03 15:06:57 +02:00
dependabot[bot]
5470051d4d chore(deps): bump grpcio from 1.80.0 to 1.81.0 in /backend/python/transformers (#10158)
chore(deps): bump grpcio in /backend/python/transformers

Bumps [grpcio](https://github.com/grpc/grpc) from 1.80.0 to 1.81.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.80.0...v1.81.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.81.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:38:43 +02:00
LocalAI [bot]
68c5eeebc3 chore: ⬆️ Update ggml-org/whisper.cpp to 610e664ba7cfe3af46125ed1b5a1184fccb51bcd (#10140)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 10:38:28 +02:00
dependabot[bot]
1531fabe23 chore(deps): bump securego/gosec from 2.22.9 to 2.27.1 (#10147)
Bumps [securego/gosec](https://github.com/securego/gosec) from 2.22.9 to 2.27.1.
- [Release notes](https://github.com/securego/gosec/releases)
- [Commits](https://github.com/securego/gosec/compare/v2.22.9...v2.27.1)

---
updated-dependencies:
- dependency-name: securego/gosec
  dependency-version: 2.27.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:38:07 +02:00
LocalAI [bot]
b7673d5b76 chore: ⬆️ Update leejet/stable-diffusion.cpp to 2d40a8b2adcdf8b5b0ca0535f3bb7801b6ba13e5 (#10144)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 10:37:51 +02:00
dependabot[bot]
b64bdaf406 chore(deps): bump github.com/google/go-containerregistry from 0.21.5 to 0.21.6 (#10149)
chore(deps): bump github.com/google/go-containerregistry

Bumps [github.com/google/go-containerregistry](https://github.com/google/go-containerregistry) from 0.21.5 to 0.21.6.
- [Release notes](https://github.com/google/go-containerregistry/releases)
- [Commits](https://github.com/google/go-containerregistry/compare/v0.21.5...v0.21.6)

---
updated-dependencies:
- dependency-name: github.com/google/go-containerregistry
  dependency-version: 0.21.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:37:33 +02:00
dependabot[bot]
eebf08ff1d chore(deps): bump grpcio from 1.80.0 to 1.81.0 in /backend/python/vllm (#10157)
Bumps [grpcio](https://github.com/grpc/grpc) from 1.80.0 to 1.81.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.80.0...v1.81.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.81.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:37:16 +02:00
dependabot[bot]
42e51894c3 chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.65.0 to 0.66.0 (#10151)
chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus

Bumps [go.opentelemetry.io/otel/exporters/prometheus](https://github.com/open-telemetry/opentelemetry-go) from 0.65.0 to 0.66.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/open-telemetry/opentelemetry-go/compare/exporters/prometheus/v0.65.0...metric/x/v0.66.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/exporters/prometheus
  dependency-version: 0.66.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 09:14:42 +02:00
LocalAI [bot]
d9ae6481fb chore: ⬆️ Update mudler/parakeet.cpp to 9edf17c3ada66e0f881dcff155492867db7ac4cf (#10141)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 08:49:47 +02:00
dependabot[bot]
f1c495a748 chore(deps): bump github.com/mudler/edgevpn from 0.32.2 to 0.34.0 (#10153)
Bumps [github.com/mudler/edgevpn](https://github.com/mudler/edgevpn) from 0.32.2 to 0.34.0.
- [Release notes](https://github.com/mudler/edgevpn/releases)
- [Commits](https://github.com/mudler/edgevpn/compare/v0.32.2...v0.34.0)

---
updated-dependencies:
- dependency-name: github.com/mudler/edgevpn
  dependency-version: 0.34.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 08:34:16 +02:00
LocalAI [bot]
415b561947 docs: fix distributed-mode diagram (workers use NATS, not PostgreSQL) (#10138)
docs: fix distributed-mode diagram - workers coordinate via NATS, not PostgreSQL

The architecture diagram drew the worker-bound arrows from the PostgreSQL area of the control plane, implying workers connect to PostgreSQL. They do not: PostgreSQL is the frontends' shared state, while workers coordinate over NATS (backend.install events) and receive LoadModel over gRPC from a frontend. Re-route the worker arrows to originate from the NATS chip.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 22:05:33 +02:00
Ettore Di Giacinto
e6a0d4c375 Remove diagram from distributed mode documentation
Removed ASCII diagram of distributed mode architecture from the documentation.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-06-02 18:48:12 +02:00
LocalAI [bot]
7e59a5c7c5 docs: architecture & feature diagrams (blueprint style) (#10137)
* docs: add 'how LocalAI works' architecture diagram

Add a blueprint-style architecture diagram: clients -> small core (API,
router, WebUI, agents) -> gRPC -> backend processes pulled on demand as
OCI images. Place it on the overview page and replace the stale external
architecture image on the reference page.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: add blueprint diagrams across feature, distributed & getting-started docs

Add 24 architecture/flow/comparison diagrams (PNG + HTML source) under
docs/static/images/diagrams/, wired into their docs pages, from an
impact-vs-effort audit of the docs. Broaden the API surface on the
overview architecture diagram (OpenAI, Anthropic, ElevenLabs, Ollama,
and LocalAI's own API) and move the gRPC boundary label clear of the arrows.

Pages: distributed mode (architecture, scheduling, ds4 layer-split),
distributed inferencing, MLX, realtime, quantization, MCP, agents,
mitm & cloud proxy, middleware, reverse-proxy TLS, VRAM, voice & face
recognition, reranker, function calling, fine-tuning (recipe + jobs),
diarization, audio transform, quickstart, model resolution.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: add composable-core diagram to README hero

Commit the composable-core card (small core + on-demand backend tiles)
alongside the other diagrams and reference it from the README hero via a
repo-relative path, so it renders on GitHub.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: fix composable-core connectors/badge and federated-vs-worker layout

- composable-core: thicken the plug-in connectors so they read clearly, and
  widen the SEPARATE IMAGE badge so its text no longer overflows the box.
- federated-vs-worker: shorten the WHOLE/SPLIT REQUEST pills to fit, and
  replace the tangled node-to-node activation arrows with a clean fan-out
  (request split across all sharded nodes), mirroring the federated panel.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 18:43:22 +02:00
LocalAI [bot]
aea954a482 docs: position LocalAI as a composable engine, not a bundle (#10136)
Reframe the README hero and docs (homepage, overview, FAQ) around the
composable architecture: a small core, with backends built as dedicated
gRPC services around best-in-class engines, shipped as separate OCI
images and pulled on demand. Lead from strength: drop the "36+ backends"
kitchen-sink framing and the "All-in-One Complete AI Stack" / "single
binary that gives you everything" lines that read as a monolith.

- README: small-core differentiator; composable + open/extensible bullets
- _index.md: composable tagline; install only what you use
- overview.md: core vs on-demand backends; gRPC/OCI mechanics as benefits;
  bring-your-own model and backend
- faq.md: "Do I need to install all the backends?" and
  "Can I bring my own model or backend?"

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 17:34:43 +02:00
Ettore Di Giacinto
595e448714 docs(llama.cpp): note tensor split now works with quantized KV cache (#10135)
The split_mode: tensor description claimed tensor parallelism requires
KV-cache quantization to be disabled. ggml-org/llama.cpp#23792 lifts that
restriction by extending the meta backend to preserve shape information
through KV-cache flatten/reshape, so cache_type_k/cache_type_v
quantization can be combined with -sm tensor on builds that include it.

Documentation only: no backend code, grpc-server.cpp comment, or
llama.cpp pin changes.


Assisted-by: Claude Code:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 15:52:23 +02:00
LocalAI [bot]
860f9d63ad feat(parakeet-cpp): dynamic batching for concurrent transcription requests (#10112)
* feat(parakeet-cpp): dynamic-batching scheduler (queue + dispatcher)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): dynamic batching for AudioTranscription via batched JSON C-API

Drop SingleThread; route unary transcription through the in-process batcher
which coalesces concurrent requests into one batched engine call. Streaming
stays mutually exclusive via engineMu. Adds batch_max_size / batch_max_wait_ms
options (size=1 disables; recommended on CPU).
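
A minimal sketch of the coalescing idea, written in Python rather than the backend's Go and not taken from the actual batcher. The function name `collect_batch` is hypothetical; only the two option names come from this commit. It waits for one request, then keeps collecting until either the size limit or the wait limit is reached:

```python
import queue
import time


def collect_batch(requests, batch_max_size, batch_max_wait_ms):
    """Block for one request, then coalesce more until the size or wait limit is hit."""
    batch = [requests.get()]  # wait indefinitely for the first request
    deadline = time.monotonic() + batch_max_wait_ms / 1000.0
    while len(batch) < batch_max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # wait window elapsed with no further requests
    return batch
```

With `batch_max_size` set to 1 the loop body never runs, so each request is dispatched on its own, which matches the "size=1 disables" behaviour described above.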

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): tear down dispatcher in Free; log batch config; preallocate; clarify stream lock

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): Ginkgo batcher tests; optional batch C-API binding with per-request fallback

The batched JSON C-API symbol exists only in newer libparakeet.so (ABI >= 2);
probe it with Dlsym and register optionally so the backend still loads against
an older library, falling back to per-request transcription. Rewrites the
batcher unit tests as Ginkgo/Gomega specs (forbidigo bans t.Fatal in tests).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): debug-log coalesced batch size in runBatch

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): default batch_max_size to 1 (batching opt-in)

Dynamic batching now defaults off (batch_max_size:1, one request at a
time). Raise batch_max_size to opt in: it is a large throughput win on
GPU under concurrent load, but on CPU and low-concurrency setups it only
adds latency, so off is the safer default. The startup log now states
whether batching is on or off, and the audio-to-text docs are updated to
match.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(parakeet-cpp): bump parakeet.cpp to 8a7c482 (batched decode + B=1 fast-path)

parakeet.cpp PR #1 merged the batched encoder/decode and the B=1 encoder
fast-path to master. Point PARAKEET_VERSION at that commit so the backend
builds the batched C-API (parakeet_capi_transcribe_pcm_batch_json) that the
dynamic batcher calls; the prior pin (30a3075) predated it, so only the
per-request fallback path was exercised. Verified the shared lib builds with
the backend's CMake flags and exports the batch symbol.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 14:49:02 +02:00
LocalAI [bot]
a5a0b3dc4e chore: ⬆️ Update CrispStrobe/CrispASR to 05e60432bcb5bc2113f8c395a41e86497c11504a (#10115)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 14:48:47 +02:00
番茄摔成番茄酱
94eca04c60 fix(nemo): pin texterrors to 1.1.6 for GLIBCXX compatibility (#10134)
Pin texterrors==1.1.6 before nemo_toolkit[asr] in requirements-cublas13.txt.

The texterrors package (a NeMo transitive dependency) contains a compiled
C++ extension (texterrors_align.so) that may be built from source during
OCI image creation. When built on systems with GCC 14+ (e.g. Ubuntu 24.04),
the resulting binary requires GLIBCXX_3.4.32, which is not available in
the default LocalAI container (Ubuntu 22.04, GLIBCXX up to 3.4.30).

Pinning to 1.1.6 (the latest release) ensures:
- Reproducible builds across environments
- pip resolves the pre-built manylinux2014 wheel (needs only GLIBCXX_3.4.11)
  instead of potentially building from source with a newer toolchain

Fixes #10056

Signed-off-by: 番茄摔成番茄酱 <fqscfqj@outlook.com>
2026-06-02 14:48:27 +02:00
LocalAI [bot]
35bd485d6a chore: ⬆️ Update ggml-org/llama.cpp to 5dcb71166686799f0d873eab7386234302d05ecf (#10128)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 09:06:35 +02:00
LocalAI [bot]
1fe96f8d9a chore: ⬆️ Update mudler/parakeet.cpp to 8a7c48209d7882a7ce79a6b306270e4703194543 (#10129)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 09:06:19 +02:00
LocalAI [bot]
c508e9d7c6 chore: ⬆️ Update leejet/stable-diffusion.cpp to 7948df8ac1070f5f6881b8d34675821893eb97d6 (#10127)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 09:06:03 +02:00
LocalAI [bot]
55e754fd05 chore: ⬆️ Update ggml-org/whisper.cpp to 23ee03506a91ac3d3f0071b40e66a430eebdfa1d (#10130)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 01:43:03 +02:00
LocalAI [bot]
a17753f7d1 chore(model-gallery): ⬆️ update checksum (#10131)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 23:39:47 +02:00
Zhao73
c61838dba6 docs: fix documentation typos (#10125)
Correct clear spelling mistakes in documentation without changing behavior.

Confidence: high
Scope-risk: narrow
Tested: git diff --check; uvx codespell on changed files
Not-tested: Full docs build not run; text-only changes
Assisted-by: Codex:gpt-5 codespell
2026-06-01 14:31:08 +02:00
LocalAI [bot]
7013e13f05 chore: ⬆️ Update ggml-org/llama.cpp to 399739d5c5978351f39e3454bfbfbab4f369088f (#10119)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 14:24:51 +02:00
Richard Palethorpe
5a0013defe test(react-ui): add page render-smoke specs, reset the coverage gate (#10122)
The UI coverage gate was tightened to 0.1pp against a fast-local
measurement (39.86% baseline); CI's slower runners measure ~0.9pp lower,
so tests-ui-e2e failed there. UI e2e coverage is diffusely
non-deterministic and tracks machine speed — a 0.1pp band can't hold
across environments.

Rather than loosen the gate, raise the floor under it: a render-smoke
spec mounts each lazy page (navigate + assert the header renders),
covering a dozen previously-untested pages and lifting coverage from
~39% to ~42.7% locally. Restore the tolerance to 0.8pp and set the
baseline conservatively (40.0), below the slow-CI floor, so the ratchet
holds without flapping.
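
As a rough illustration of the arithmetic only (the real gate lives in the repository's scripts and may differ; `coverage_gate_passes` is a hypothetical name), a tolerance-band check can be sketched as:

```python
def coverage_gate_passes(measured, baseline, tolerance):
    """Pass when coverage has not dropped more than `tolerance` points below baseline."""
    return measured >= baseline - tolerance
```

Using the figures from this message, a CI run about 0.9pp below the old 39.86 baseline fails a 0.1pp band, whereas a run about 0.9pp below the new ~42.7 local figure still clears a 40.0 baseline with 0.8pp tolerance.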

Document the coverage policy — install the git hooks and don't bypass
them (no --no-verify, no hand-lowering the baseline or widening the
tolerance); raise coverage by adding tests instead; set the UI baseline
below the slow-CI floor — in AGENTS.md, CONTRIBUTING.md and
.agents/building-and-testing.md.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-01 14:24:36 +02:00
LocalAI [bot]
c01ed631d6 refactor(routing): extract replica picker into pkg/clusterrouting (#10123)
Move ReplicaCandidate and PickBestReplica out of core/services/nodes (which depends on gorm) into a new dependency-light leaf package pkg/clusterrouting, so the p2p federation server can later share the same replica-selection policy without pulling in a database driver.

core/services/nodes keeps a type alias and a thin delegator, so every existing reference (the LoadedReplicaStats interface method, the ReplicaCandidate row conversion in registry.go, and the SQL policy-mirror test) compiles and behaves unchanged. This is a pure, behavior-preserving refactor: the full nodes suite, including the policy-mirror spec that pins the SQL ORDER BY to PickBestReplica, stays green.

Assisted-by: Claude Code:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-01 09:38:55 +02:00
LocalAI [bot]
d47464cb06 docs: ⬆️ update docs version mudler/LocalAI (#10114)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 08:16:29 +02:00
LocalAI [bot]
63f176346e chore: ⬆️ Update leejet/stable-diffusion.cpp to be65ac7511b30379b003626c15224798929e33d4 (#10118)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 00:43:50 +02:00
LocalAI [bot]
af94d08729 chore: ⬆️ Update ggml-org/whisper.cpp to fe69461618ffc50ba8afa65c25cc6c6e34d4537f (#10117)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 00:43:34 +02:00
LocalAI [bot]
6795d38f50 chore: ⬆️ Update mudler/parakeet.cpp to cb45f68068081af01e7092e91b038ee353eb56be (#10116)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 23:57:15 +02:00
Richard Palethorpe
718223f33b feat(localvqe/audio): v1.3 release and add spectrograms to audio transform UI (#10113)
* chore(localvqe): update backend to v1.3, add v1.2/v1.3 gallery models

Bump the LocalVQE backend pin 72bfb4c6 -> b0f0378a, which adds the v1.2
(1.3 M) and v1.3 (4.8 M) GGUF SHA-256s to the upstream released-models
allowlist (and the arch_version=3 loader) so both load without
LOCALVQE_ALLOW_UNHASHED.

Add gallery entries for localvqe-v1.2-1.3m and localvqe-v1.3-4.8m
(SHA-256 verified against the downloaded weights) and update the
audio-transform docs to make v1.3 the current default while noting the
compact v1.1/v1.2 alternatives.

Assisted-by: Claude:claude-opus-4-8 Claude-Code
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(flake): add ffmpeg-headless to the dev shell

pkg/utils/ffmpeg_test.go shells out to the `ffmpeg` CLI, and the
pre-commit gate runs those tests via `make test-coverage`. Without
ffmpeg in the dev shell the gate fails with "executable file not found
in $PATH". The headless build provides the CLI without GUI/X deps.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(localvqe): parse WAV by walking RIFF sub-chunks

Walk the RIFF chunk list instead of assuming the canonical 44-byte
header layout. Real inputs (browser-recorded clips, ffmpeg output with
an 18/40-byte extensible `fmt ` chunk or trailing LIST/INFO metadata)
would otherwise splice header/metadata bytes into the PCM stream as an
audible impulse. Honour the `data` chunk size and validate that both
`fmt ` and `data` chunks are present.
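
The chunk walk can be sketched as follows. This is an illustrative Python version under the standard RIFF layout (4-byte id, little-endian 4-byte size, payload padded to an even length), not the backend's actual parser; `find_wav_chunks` is a hypothetical name:

```python
import struct


def find_wav_chunks(data):
    """Return the raw `fmt ` and `data` chunk payloads from a RIFF/WAVE byte string."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunks = {}
    pos = 12  # first sub-chunk starts after "RIFF" + size + "WAVE"
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        body = data[pos + 8:pos + 8 + size]  # honour the declared chunk size
        if chunk_id in (b"fmt ", b"data") and chunk_id not in chunks:
            chunks[chunk_id] = body
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    if b"fmt " not in chunks or b"data" not in chunks:
        raise ValueError("missing fmt or data chunk")
    return chunks[b"fmt "], chunks[b"data"]
```

Because the PCM payload is sliced by the `data` chunk's own size, trailing LIST/INFO metadata never reaches the sample stream.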

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(security-headers): allow blob: in connect-src for waveform fetch

The waveform renderer XHRs/fetches a freshly-created blob: object URL
(e.g. an uploaded or enhanced clip before it has a server URL). XHR/fetch
of blob: is governed by connect-src, not media-src, so it was blocked by
the CSP. Add blob: to connect-src.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(react-ui): add input/output spectrogram view to AudioTransform

The transform page only showed time-domain amplitude waveforms, so you
could see how loud a clip was but not which frequencies the model
touched. Add a time x frequency spectrogram heatmap and render the input
and output spectrograms side by side, so it's visible which bands the
enhancement attenuates (bright input bands that go dark in the output).

Computed client-side via a Hann-windowed STFT over both clips (a small
dependency-free radix-2 FFT), defaulting to the LocalVQE 512/256 frame
geometry. This shows the net input->output spectral change; the model's
internal gain mask is not exposed by the backend.
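
For readers unfamiliar with the technique, here is a small Python sketch of a Hann-windowed STFT built on a recursive radix-2 FFT. It is not the code in src/utils/fft.js or useSpectrogram.js; the function names are hypothetical, and the dB values are left unnormalised for brevity. Only the 512/256 frame geometry is taken from this commit:

```python
import cmath
import math


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]  # twiddle factor times odd term
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def spectrogram_db(samples, frame=512, hop=256):
    """Hann-windowed STFT: one list of dB magnitudes (frame // 2 + 1 bins) per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame) for i in range(frame)]
    frames = []
    for start in range(0, len(samples) - frame + 1, hop):
        spectrum = fft_radix2([samples[start + i] * window[i] for i in range(frame)])
        # Keep only the non-negative frequencies; the small offset avoids log10(0).
        frames.append([20 * math.log10(abs(c) + 1e-12) for c in spectrum[: frame // 2 + 1]])
    return frames
```

Running this over the input and output clips and comparing the two grids bin by bin gives the net spectral change the panel is meant to visualise.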

- src/utils/fft.js            radix-2 FFT
- src/hooks/useSpectrogram.js decode + STFT -> normalised dB magnitude grid
- src/components/audio/Spectrogram.jsx  canvas heatmap (magma colormap)
- AudioTransform.jsx          dual-spectrogram panel + CSS
- e2e spec + UI coverage baseline bump (38.29 -> 39.0; measured ~39.4-40.2)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(react-ui): make UI coverage deterministic, tighten the gate

UI e2e line coverage swung ~1pp run-to-run (39.1% <-> 40.2%), which forced
a loose 0.8pp tolerance on the monotonic gate — a band wide enough to let
a real ~300-line regression through silently. The swing was a bug, not
inherent jitter: the 'Create Agent navigates' spec ended on the URL
assertion, so AgentCreate.jsx's ~400 lines were collected only when its
render happened to beat the coverage teardown.

Wait for the page to actually render (assert its heading) so those lines
are covered every run. With the race gone, repeated runs land within
~0.013pp of each other, so:

- tighten UI_COVERAGE_TOLERANCE 0.8 -> 0.1 (noise floor, not a drift band)
- set the baseline to the real, reliably-achieved value (39.0 -> 39.86)

Localised by running the V8-coverage suite repeatedly and diffing per-file
line coverage; AgentCreate.jsx was the sole ~1pp flipper.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-31 23:56:46 +02:00
LocalAI [bot]
39e050d9e2 fix(parakeet-cpp): cublas/hipblas/vulkan builds were silently CPU-only (#10120)
fix(parakeet-cpp): forward PARAKEET_GGML_* so cublas/hipblas/vulkan builds aren't silently CPU-only

parakeet.cpp gates its GGML backends behind PARAKEET_GGML_CUDA/HIP/VULKAN and
does set(GGML_CUDA ${PARAKEET_GGML_CUDA} CACHE BOOL "" FORCE), which overwrites
a bare -DGGML_CUDA=ON back to OFF. So the backend's BUILD_TYPE=cublas (and hipblas,
vulkan) produced a CPU-only libparakeet.so. Forward the PARAKEET_GGML_* options
instead. Verified on a GB10 (CUDA 13): the lib now links libcudart/libcublas and
registers the CUDA backend, vs a CPU-only lib before.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 23:56:07 +02:00
LocalAI [bot]
c222161291 feat(distributed): resumable file uploads via HTTP Content-Range (#10109)
Large model GGUFs (multi-GB) transferred between master and worker over
flaky / bandwidth-throttled paths (e.g. libp2p relays with byte caps) used
to restart from byte 0 on every transport error. This change adds standard
HTTP Range/resume semantics to the worker's PUT /v1/files/<key> endpoint
and teaches the master-side HTTPFileStager to consult the worker for the
last accepted offset and resume from there.

Server side (file_transfer_server.go):
- PUT now honors Content-Range: bytes <start>-<end>/<total>. The handler
  validates that <start> matches the current on-disk size; mismatches
  return 416 with the actual size in X-File-Size.
- Mid-upload chunks return 308 Permanent Redirect ("Resume Incomplete")
  with the new size, so the client can keep going.
- An optional X-Content-SHA256 request header binds an upload to a target
  hash; cross-attempt drift returns 409. On the final chunk the server
  re-computes SHA-256 and returns 400 if it doesn't match.
- HEAD now advertises Accept-Ranges: bytes and Content-Length, and exposes
  X-Target-SHA256 for in-progress files (so clients can resume only when
  the partial bytes belong to the file they want to upload).
- Legacy PUTs with no Content-Range keep the original truncate-create
  semantics — zero behavior change on the happy path.
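
The offset check described above can be sketched as a pure function. This is a simplified Python illustration, not the handler in file_transfer_server.go: hashing is left out, `put_chunk_status` is a hypothetical name, and the 400 for a malformed header and the 200 on the final chunk are assumptions, since the message does not state those codes:

```python
import re

CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+)$")


def put_chunk_status(content_range, size_on_disk):
    """Return (HTTP status, new size on disk) for one PUT chunk; hash checks are omitted."""
    match = CONTENT_RANGE.match(content_range)
    if match is None:
        return 400, size_on_disk  # assumption: malformed header is rejected
    start, end, total = (int(g) for g in match.groups())
    if start > end or end >= total:
        return 400, size_on_disk  # assumption: inconsistent range is rejected
    if start != size_on_disk:
        return 416, size_on_disk  # caller reports size_on_disk in X-File-Size
    new_size = end + 1
    if new_size < total:
        return 308, new_size  # more chunks expected
    return 200, new_size  # assumption: final chunk succeeds once SHA-256 is verified
```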

Client side (file_stager_http.go):
- Pre-PUT HEAD probe reads X-File-Size + X-Target-SHA256 to determine the
  resume offset.
- doUpload seeks to that offset and sends Content-Range + X-Content-SHA256.
- Retry loop switches from fixed 3 attempts / 5s-10s-20s backoff to an
  outer time budget with exponential backoff (1s -> 30s cap), so a 5GB
  upload over a flaky link can outlast many short disconnects.
- 308 and 416 responses are treated as transient: the next iteration
  re-HEADs to learn the correct offset.
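
A rough Python sketch of the retry schedule, assuming the delay doubles each attempt (the message only says "exponential" with a 1s start and 30s cap) and leaving the budget as a caller-supplied value. `backoff_delays` is a hypothetical name, not a function in file_stager_http.go:

```python
def backoff_delays(budget_s, first_s=1.0, cap_s=30.0):
    """Yield doubling retry delays capped at cap_s until their sum would exceed budget_s."""
    delay, spent = first_s, 0.0
    while spent + delay <= budget_s:
        yield delay
        spent += delay
        delay = min(delay * 2, cap_s)  # assumption: factor of 2 between attempts
```

Bounding the loop by elapsed time rather than attempt count is what lets a long transfer survive many brief disconnects: once the cap is reached, each further retry costs a fixed 30s of budget.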

Tests:
- Two-chunk Content-Range round-trip produces the correct file + sidecar.
- 416 on a Content-Range/file-size mismatch.
- 409 on X-Content-SHA256 drift between chunks.
- 400 on final-hash mismatch.
- HEAD on a partial upload exposes X-Target-SHA256 (not a misleading
  hash-of-partial-bytes via X-Content-SHA256).
- Pre-existing finished file with a different hash is transparently
  overwritten when a new PUT starts at byte 0.
- End-to-end resume: EnsureRemote against a worker that already holds a
  partial file transfers only the remainder.
- Mid-stream connection drop on attempt #1 is recovered by attempt #2
  resuming from the partial offset.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 11:02:20 +00:00
LocalAI [bot]
aa80d4681b chore: ⬆️ Update ggml-org/llama.cpp to d6588daa800058dfa54f1d7ea695b1a810c8ae18 (#10093)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): skip begin-of-stream null partial in PredictStream

Upstream llama.cpp (ggml-org/llama.cpp#23884), pulled in by this bump,
now emits an initial "begin" partial whose to_json() returns null. It
exists only to signal the HTTP layer to flush 200 status headers before
any token is produced.

gRPC has no such concept, and PredictStream had no guard: the null result
was fed straight into build_reply_from_json, which threw an uncaught
exception. That surfaced as a generic "Unexpected error in RPC handling"
and the task was cancelled the instant it launched, breaking the
PredictStream e2e spec.

Skip null results in both the first-result handling and the streaming
loop, mirroring upstream's own `if (first_result_json == nullptr)` guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 10:26:03 +00:00
LocalAI [bot]
0d57957ebb feat(worker): add LOCALAI_PREFETCH_MODELS for boot-time gallery prefetch (#10108)
In LocalAI distributed mode the master streams a model GGUF to a
worker on first inference. On bandwidth-constrained cluster networks
(libp2p circuit-v2 relays under NAT, double-NAT residential, slow
overlays) that transfer can be slow or unreliable — meanwhile each
worker's outbound internet is usually fine.

LOCALAI_PREFETCH_MODELS lets the operator name gallery model IDs to
download at worker boot, BEFORE the worker subscribes to backend.install
events. Reuses gallery.InstallModelFromGallery so the on-disk /models
layout matches what the master would have pushed, and the master can
still push files on demand if the gallery is unreachable at boot
(prefetch is non-fatal on every error path).

The installer is wrapped in a function-value indirection so tests can
swap a fake without touching the real gallery; production never
reassigns the binding.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 12:22:45 +02:00
LocalAI [bot]
76fe0bb929 feat(crispasr): add CrispASR backend — multi-architecture ASR + TTS (#10099)
* feat(crispasr): backend source files (Go gRPC server, C-ABI shim, build files)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* polish(crispasr): brand error strings + fix stale shim comment

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): register backend in root Makefile

Mirror the whisper Go backend registration for the new crispasr
backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks,
BACKEND_CRISPASR definition, docker-build target generation, and the
docker-build-backends aggregate target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): add backend build matrix entries

Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64,
CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T
arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add crispasr backend gallery entries

Add the crispasr meta anchor and its full set of image gallery entries
(cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64,
L4T cuda13 arm64, plus -development variants), mirroring the whisper
backend gallery block.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow

Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in
backend/go/crispasr/Makefile.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): don't wire fixture-gated test into test-extra

Mirror the whisper Go backend: its AudioTranscription test is gated on
model/audio fixtures and skips in CI, so building crispasr (the heaviest
ggml compile in the tree) inside the unit-test lane adds a long compile
for zero coverage. The backend image build in backend-matrix.yml remains
the authoritative compile check.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): add darwin metal build entry (mirror whisper)

The metal-crispasr gallery entries and capabilities.metal mapping
reference -metal-darwin-arm64-crispasr, which is only produced by an
includeDarwin entry. Mirror whisper's darwin metal entry so the tag
actually gets built.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): place hipblas matrix entry next to whisper twin

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): register crispasr as pref-only ASR backend + test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): port whisper behavioral suite (cancellation + streaming)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): fix skip message env var names to CRISPASR_*

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): switch shim to crispasr_session_* multi-architecture API

The shim used whisper_full(), which in CrispASR is the whisper-only path:
libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture
transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR,
Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI,
which auto-detects the architecture from the GGUF and dispatches to the
matching backend.

Rewrite the C shim around crispasr_session_open / _transcribe_lang /
_result_* and add get_backend() so the selected backend is logged.
load_model now takes a threads param (session_open binds n_threads at
open). The session result is segment+word based with no token IDs and no
per-decode callback, so drop n_tokens / get_token_id /
get_segment_speaker_turn_next / set_new_segment_callback. set_abort is
kept for API parity but is best-effort: the session transcribe is blocking
with no abort hook.

Update the purego bindings and gocrispasr.go to match: tokens are left
empty, speaker-turn handling is removed, and AudioTranscriptionStream
emits one delta per non-empty segment after the blocking decode returns
(no progressive streaming via the session API), preserving the
concat(deltas) == final.Text invariant.

crispasr_session_set_translate is exported by libcrispasr but not declared
in crispasr.h, so it is forward-declared in the shim alongside the
open/transcribe/result functions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): link full CrispASR backend set for multi-arch support

The shim's crispasr_session_* dispatch calls into the per-architecture
backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer,
sensevoice, ...), which CrispASR builds as static archives. Linking only
crispasr + ggml dead-stripped every backend object from the final module
(nm backend-symbol count: 0), leaving a whisper-only .so.

Link the same backend set as crispasr-cli so the static archives are
pulled in. After this the module carries the backend symbols (nm count
407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to
every compiled-in architecture.

Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to
${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR
locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when
CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend
dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone
and as a subproject; the sed is idempotent.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): adapt suite to session API (blocking, no decode callback)

Register the new symbol set (drop the removed token/speaker/callback funcs,
add get_backend; load_model now takes 2 args). The session transcribe is
blocking with no abort hook, so a mid-decode cancel can't interrupt it:
change the cancellation spec to cancel the context before the call and
assert codes.Canceled from the pre-call ctx.Err() check, dropping the
<5s mid-decode timing assertion. The streaming spec still holds with
per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add CrispASR ASR model entries (-crispasr)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(gallery): keep only session-auto-detectable CrispASR ASR models

The crispasr backend loads models via crispasr_session_open, which
auto-detects the backend from the GGUF general.architecture using
crispasr_detect_backend_from_gguf. Architectures not in that detect
map cannot be opened, so those gallery entries fail to load.

Removed entries whose architecture is not wired into CrispASR
v0.6.11's session auto-detect router (they can be re-added when
upstream maps them):

- Not in the detect map: data2vec, firered-asr, funasr,
  fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr,
  moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b},
  paraformer, sensevoice.
- Pending verification (filename-heuristic routed, not arch-detected):
  parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the
  fastconformer-ctc backend by a filename heuristic in the model
  registry, which implies general.architecture is not a mapped string.

Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py
writes general.architecture="parakeet" unconditionally and encodes the
rnnt/ctc distinction in metadata fields, so they session-auto-detect.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): TTS synthesis via crispasr_session_synthesize (24kHz)

Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse
the already-open g_session (crispasr_session_open auto-detects a TTS
model) and dispatch to the upstream synthesis call, which returns
malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we
do not set, so it returns NULL here and surfaces as an error Go-side.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): implement TTS/TTSStream gRPC methods

Bind the new shim functions via purego and implement TTS, TTSStream and
a writeWAV24k helper. synthesize copies the C-owned PCM out before
freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via
go-audio/wav. CrispASR has no progressive synth, so TTSStream
synthesizes fully, encodes to WAV, and emits the bytes as a single
chunk; it owns the results-channel close (the gRPC server wrapper ranges
until close), mirroring vibevoice-cpp's TTSStream.
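
To illustrate the float-to-WAV step, here is a small Python sketch using the standard `wave` module. It is not the Go `writeWAV24k` helper (which uses go-audio/wav); the name `write_wav_24k` and the clamp-then-scale conversion are assumptions made for the example:

```python
import struct
import wave


def write_wav_24k(path, pcm_float):
    """Write mono float samples in [-1, 1] as a 24 kHz 16-bit PCM WAV file."""
    # Clamp out-of-range samples, then scale to signed 16-bit integers.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in pcm_float]
    with wave.open(path, "wb") as out:
        out.setnchannels(1)  # mono
        out.setsampwidth(2)  # 16-bit samples
        out.setframerate(24000)
        out.writeframes(struct.pack("<%dh" % len(ints), *ints))
```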

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): log when a TTS voice override is not honored

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add CrispASR vibevoice-tts model entry

Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox,
and orpheus require companion codec/s3gen/SNAC paths (set_codec_path /
set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2
aren't in the session auto-detect map. Those are follow-ups.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): gated TTS synthesis spec

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(crispasr): satisfy golangci-lint (errcheck defers + unsafeptr nolint)

The crispasr Go file is entirely new, so new-from-merge-base lints every
line (unlike the grandfathered whisper backend it was forked from):
- handle os.RemoveAll / fh.Close return values in AudioTranscription
- annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): backend: and codec: model options (explicit arch + companion files)

Add two model-config options to the CrispASR backend via opts.Options:

- backend:<name> selects an explicit CrispASR backend (bypassing
  auto-detect) by routing load_model through
  crispasr_session_open_explicit, unlocking architectures the
  detector won't pick on its own (qwen3, cohere, granite, voxtral,
  moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.).
- codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC,
  chatterbox s3gen, or mimo-asr tokenizer) via the universal
  crispasr_session_set_codec_path setter after the session opens. A
  relative path resolves against the model directory. rc==0 means
  success or not-applicable; only a negative rc is fatal.

The C shim load_model gains a backend_name argument and a new
set_codec_path entry point; the Go bridge parses the prefix:value
options and registers the new symbol. The vad_only path is unchanged.
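
A minimal Python sketch of the prefix:value parsing and the relative-path rule described above. It is illustrative only: the Go bridge's real parsing is not shown in this message, `parse_options` is a hypothetical name, and silently skipping options without a colon is an assumption:

```python
import os


def parse_options(options, model_dir):
    """Split `prefix:value` options; resolve a relative codec path against model_dir."""
    parsed = {}
    for opt in options:
        key, sep, value = opt.partition(":")  # split on the first colon only
        if not sep:
            continue  # assumption: options without a prefix are ignored here
        parsed[key] = value
    codec = parsed.get("codec")
    if codec and not os.path.isabs(codec):
        parsed["codec"] = os.path.join(model_dir, codec)
    return parsed
```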

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): expand CrispASR models via backend:/codec: options (explicit arch + companions)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(gallery): use virtual.yaml base for crispasr models

The crispasr entries are just backend + model + a couple options, fully
expressed inline via overrides:/files: in gallery/index.yaml. Point each
url: at the shared gallery/virtual.yaml (the established 'virtual' model
trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(gallery): drop voice-requiring TTS entries (keep vibevoice-tts)

Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the
current shim: the codec: companion loads fine, but these engines additionally
need a voice pack / voice prompt / reference clip (qwen3-tts base errors
'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that
the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is
'qwen3tts', unmapped by the detector — would need backend:qwen3-tts.) Removed
to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice,
e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): speaker: and voice: TTS options (baked speakers + voice packs/prompts)

speaker:<name> -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts
CustomVoice, orpheus). voice:<path>(+voice_text:<ref>) -> crispasr_session_set_voice
(voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the
default voice; req.Voice still overrides the speaker per request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts-customvoice, orpheus)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 12:11:03 +02:00
Adira
baa11133f1 fix(config): register parakeet-cpp as a transcript backend (#9718) (#10106)
parakeet-cpp was added in #10084 but not registered in
BackendCapabilities, so GuessUsecases only allowed "whisper" for
FLAG_TRANSCRIPT and the UI could not classify parakeet-cpp models as
speech-to-text. The result was that parakeet models appeared only in
the LLM selector in the speech-to-speech pipeline, making them
unusable for transcription through the UI.

Closes #9718

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 11:15:15 +02:00
Adira
1bdd3338a6 fix(config): register 5 backends missing from BackendCapabilities (#10107)
Cross-referencing backend/ directories against BackendCapabilities found
five backends that exist and work but have no entry in the map, so
GuessUsecases falls back to heuristics that mis-classify them (e.g.
a TTS backend appears as an LLM in the UI).

Added entries, each modelled on the corresponding Python twin or the
nearest equivalent already in the map:

  sglang        — LLM (Predict/PredictStream/TokenizeString, vision)
  vibevoice-cpp — ASR + TTS/TTSStream (mirrors vibevoice Python)
  sherpa-onnx   — ASR + TTS/TTSStream + VAD (multi-model toolkit)
  qwen3-tts-cpp — TTS (mirrors qwen-tts Python)
  rfdetr-cpp    — object detection (mirrors rfdetr Python)

Found by diffing `ls backend/{go,python}/` against the keys in
BackendCapabilities. Remaining gaps (insightface, speaker-recognition,
sam3-cpp) use custom gRPC methods not yet in the Method* constants —
left for a follow-up.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 11:14:52 +02:00
LocalAI [bot]
e08492a2c3 chore: ⬆️ Update leejet/stable-diffusion.cpp to d2797b86670622b6538123b4aeb5fbb6be2653c5 (#10094)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 00:42:13 +02:00
LocalAI [bot]
d5d8fe909d docs: ⬆️ update docs version mudler/LocalAI (#10091)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 00:11:41 +02:00
LocalAI [bot]
8a82753277 chore: ⬆️ Update antirez/ds4 to ba00a8a88c4c5810a3d1fed6b7b8fa2b44b82fdc (#10095)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 00:10:33 +02:00
LocalAI [bot]
51ca109067 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3f40e73c367ad9f0c1b1819f28c7348c26aa340d (#10097)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 00:10:16 +02:00
LocalAI [bot]
07f6c15a37 feat(ds4): layer-split distributed inference (#10098)
* feat(ds4): add standalone ds4-worker distributed worker binary

Add worker_main.c, a minimal standalone worker that owns a slice of the
model's transformer layers and serves activations over ds4's own TCP
transport via ds4_dist_run(). It links the same engine objects the
backend already builds (including ds4_distributed.o) and has NO
gRPC/protobuf dependency, so it builds even on hosts lacking protobuf/grpc
dev headers. Launched by `local-ai worker ds4-distributed`.

Wire the ds4-worker CMake target (mirrors grpc-server's object/GPU/native
handling) and have the Makefile copy + clean the binary alongside
grpc-server. Ignore the built ds4-worker artifact.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(ds4): package ds4-worker alongside grpc-server

Copy the standalone ds4-worker binary into the backend package (Linux
package.sh) and the Darwin OCI tar (ds4-darwin.sh: both the explicit copy
and the otool dylib-bundling loop) so distributed workers ship with the
backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(ds4): tighten ds4-worker integer arg validation to match upstream

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(ds4): wire grpc-server as distributed coordinator

Add distributed COORDINATOR support to the ds4 backend's gRPC server.
Distributed inference is an engine backend: when LoadModel receives
'ds4_role:coordinator', the process populates ds4_engine_options.distributed
(role, layer slice, listen host/port) before ds4_engine_open, then the normal
ds4_session_* generation path runs transparently once the worker route covers
all layers.

- New LoadModel options: ds4_role, ds4_layers (START:END or START:output),
  ds4_listen (host:port), ds4_route_timeout.
- parse_layers_spec() maps the layer spec onto ds4_distributed_layers.
- wait_route_ready() blocks generation until
  ds4_session_distributed_route_ready() reports full coverage (or timeout),
  gating both Predict and PredictStream; returns UNAVAILABLE on timeout/error.
- No ds4_role => g_distributed stays false and wait_route_ready is a no-op,
  so single-node behavior is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(ds4): don't block Status during route wait; validate coordinator opts

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(cli): add ds4-distributed worker exec helper

Add the ds4WorkerArgs helper plus findDS4Backend/DS4Distributed.Run that
resolve the ds4 backend via the gallery and exec the packaged ds4-worker
binary. Unlike worker_llamacpp.go, ds4 bundles its own dynamic loader
(lib/ld.so) for glibc compatibility, so when present we exec ds4-worker
through that loader with LD_LIBRARY_PATH=<backend>/lib, mirroring
backend/cpp/ds4/run.sh; otherwise we exec it directly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(cli): register the ds4-distributed worker subcommand

Wire DS4Distributed into the Worker kong command tree so
`local-ai worker ds4-distributed` is available.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(ds4): document layer-split distributed inference

Add a ds4 section to the distributed-mode feature docs (coordinator
model YAML, manual worker command, layer-range semantics, the
'GGUF on every machine' requirement, coordinator-listens dial
direction vs llama.cpp) and a terse Distributed mode section to the
ds4 backend agent guide.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(ds4): opt-in hardware-gated distributed e2e spec

Add a self-contained, opt-in Ginkgo spec to the backend e2e suite that
spins a ds4 coordinator (via the packaged run.sh, loaded with
ds4_role/ds4_layers/ds4_listen options) plus a ds4-worker process for
the upper layers, then uses Eventually to assert a short successful
Predict once the layer route forms, before tearing the worker down.

Gated by BACKEND_TEST_DS4_DISTRIBUTED=1 (plus the existing
BACKEND_BINARY + BACKEND_TEST_MODEL_FILE and optional layer/listen/accel
knobs); compiles and skips cleanly with no env, hardware, or model.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(ds4): pass coordinator ctx to worker; lowercase error string

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(ds4): note distributed transport is plaintext/unauthenticated

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* style(ds4): replace em dashes in distributed docs/agent/test per repo convention

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(ds4): link ds4-worker with the C++ driver for CUDA/Metal builds

The ds4-worker target is built from worker_main.c (C), so CMake linked it
with the C driver. The nvcc-built ds4_cuda.o (and Obj-C++ ds4_metal.o)
reference the C++ runtime, so the CUDA/Metal builds failed with undefined
libstdc++ symbols (std::__throw_length_error). The CPU build passed because
ds4_cpu.o is pure C. Force LINKER_LANGUAGE CXX so libstdc++ is linked.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 00:09:55 +02:00
LocalAI [bot]
a44bdb29d4 feat: prefix-cache-aware routing for distributed mode (#10071)
* feat(radixtree): generic prefix tree skeleton with longest-match

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): Insert with path recency refresh and entry cap

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): recency-weighted per-value Weight

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): Remove all entries for a value

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(radixtree): race-free concurrency smoke test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard

Address review findings on the generic prefix tree:

- Extract a shared pruneWalk helper parameterized by a shouldClear
  predicate and use it from Evict, Remove, and the MaxEntries path.
  Previously evictOldestLocked cleared a victim's value but never
  removed the now value-less node or its childless ancestors, so
  internal nodes accumulated under sustained churn at the cap. The
  MaxEntries path now prunes the victim and its empty ancestors.
- DRY: pruneWalk replaces the duplicated logic in the former
  pruneLocked and Remove's inner closure.
- Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take
  the read lock (RLock) while Insert, Evict and Remove keep the write
  lock. Confirmed race-clean under go test -race.
- Document the strict greater-than TTL boundary on Options.TTL and
  expired: age exactly equal to TTL is still live.
- Guard Insert against an empty key (no-op): the root never holds a
  value.

Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation,
the no-growth-past-cap invariant, the TTL boundary, and empty-key
behavior for both Insert and LongestMatch.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): RoutePolicy enum with parse/resolve

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): Config with defaults and validation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): deterministic xxhash prefix-chain extractor

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): pure filter-then-score replica selection

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): Provider interface and radix-tree-backed Index

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(prefixcache): gofmt policy enum comment alignment

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): head-first prefix chunking and hoist Weight out of sort

Address code-quality review findings in the prefixcache package.

Correctness: ExtractChain now chunks from absolute offset 0 with fixed
[0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth
head blocks. The previous tail-keeping logic shifted the byte offset by a
non-window amount once a conversation grew past MaxDepth*WindowBytes,
changing every hash each turn and silently breaking cross-turn
longest-prefix matching. The reusable KV/prefix cache lives at the head
of the prompt, so anchoring at offset 0 makes the chain a true
prefix-chain: P and P+suffix share their full leading overlap. Add a
regression spec proving cross-turn stability past the cap.
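
The head-anchored chunking described above can be sketched as follows. This is an illustration under stated assumptions, not the ExtractChain implementation: the extractChain name and signature are hypothetical, FNV-1a from the standard library stands in for xxhash, only full windows are hashed, and the per-model salt is omitted.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// extractChain hashes fixed windows [0,W), [W,2W), ... starting at offset 0
// and keeps at most maxDepth head blocks. Each hash folds in the previous
// one, so entry i identifies the whole prefix up to and including block i.
// FNV-1a is an assumption made to keep the sketch stdlib-only.
func extractChain(prompt []byte, window, maxDepth int) []uint64 {
	if window <= 0 || maxDepth <= 0 {
		return nil
	}
	nBlocks := len(prompt) / window // assumption: a trailing partial window is ignored
	if nBlocks > maxDepth {
		nBlocks = maxDepth // cap to the FIRST maxDepth blocks, never the last
	}
	chain := make([]uint64, 0, nBlocks)
	var prev uint64
	var buf [8]byte
	for i := 0; i < nBlocks; i++ {
		h := fnv.New64a()
		binary.LittleEndian.PutUint64(buf[:], prev)
		h.Write(buf[:])
		h.Write(prompt[i*window : (i+1)*window])
		prev = h.Sum64()
		chain = append(chain, prev)
	}
	return chain
}

func main() {
	p := []byte("system prompt. user turn one. ")
	a := extractChain(p, 8, 4)
	b := extractChain(append(p, "user turn two."...), 8, 4)
	fmt.Println(len(a), len(b), a[0] == b[0])
}
```

Because block boundaries never move, appending a turn can only add entries to the end of the chain (until the cap is reached); it cannot change the entries already there.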

Performance: Index.Decide precomputes each candidate's Weight once
(decorate-sort-undecorate) instead of calling the O(tree size) Weight
inside the O(n log n) sort comparator. Behavior is unchanged.
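
A small sketch of the decorate-sort-undecorate pattern named above. The orderByWeight helper is hypothetical, and the ascending sort direction is an assumption; the commit does not state which way Decide orders candidates.

```go
package main

import (
	"fmt"
	"sort"
)

// orderByWeight computes each candidate's weight exactly once (decorate),
// sorts on the cached values, then strips them again (undecorate).
// Ascending order is an assumption made for this illustration.
func orderByWeight(candidates []string, weight func(string) float64) []string {
	type scored struct {
		id string
		w  float64
	}
	decorated := make([]scored, len(candidates))
	for i, id := range candidates {
		decorated[i] = scored{id: id, w: weight(id)} // the only weight calls
	}
	sort.SliceStable(decorated, func(a, b int) bool {
		return decorated[a].w < decorated[b].w
	})
	out := make([]string, len(decorated))
	for i, s := range decorated {
		out[i] = s.id
	}
	return out
}

func main() {
	w := map[string]float64{"node-a": 3, "node-b": 1, "node-c": 2}
	order := orderByWeight([]string{"node-a", "node-b", "node-c"}, func(id string) float64 {
		return w[id]
	})
	fmt.Println(order)
}
```

With n candidates the expensive function runs n times rather than once per comparison.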

Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual
byte loop, clearing the modernize rangeint finding.

Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's
documented concurrency safety under go test -race.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(messaging): prefixcache observe/invalidate subjects and payloads

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): NATS sync publish/apply for observe and invalidate

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributedhdr): ctx carrier for prefix-hash chain

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributedhdr): PrefixChainHook indirection for backend-side chain build

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): stash prompt prefix chain on ctx before distributed routing

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend): mirror modelID fallback for prefix-chain salt parity

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): scheduling config columns for prefix-cache routing

Add RoutePolicy and per-model balance/prefix-match override columns to
ModelSchedulingConfig and include them in the SetModelScheduling upsert
DoUpdates list so updates are not dropped on conflict.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): optional route preference in FindAndLockNodeWithModel

Add a RoutePreference type and a new pref parameter so the atomic
pick+lock+increment can be biased toward a preferred node without
weakening atomicity. A nil preference reproduces the previous ORDER BY
behavior exactly. Update the ModelRouter interface, both router.go call
sites (pass nil for now; Phase 5 builds the real preference), the test
doubles, and the distributed e2e caller.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): make Sync satisfy Provider with Evict

Sync.Observe now returns whether the local index treated the assignment as
new or extended, and Sync gains an Evict method that delegates to the wrapped
index. Together these let SmartRouter hold a single prefixcache.Provider that
broadcasts via NATS. Adds a compile-time Provider assertion and an
Evict-delegates behavioral test.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route

Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil
keeps routing byte-for-byte identical to the round-robin floor). On each
request Route now calls buildPreference: it reads the prompt prefix chain from
ctx (distributedhdr.PrefixChain), resolves the per-model policy/thresholds over
the global config, loads each candidate replica's in-flight count via a new
registry read, LoadedReplicaStats (deduped to one entry per node using the MIN
in-flight across that node's replicas), asks the provider to Decide, and runs
prefixcache.Select. The chosen node is passed as the RoutePreference to
FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick,
cold scheduleAndLoad), and the served node is recorded via Observe only when
the resolved policy is prefix_cache so round-robin models never pollute the
tree.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): invalidate prefix-cache entries on unload and stale removal

UnloadModel and both staleness fall-through paths in Route (after a failed
gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model,
nodeID), guarded by a nil-provider check so the round-robin floor is
unchanged. At runtime the provider is the *prefixcache.Sync, so invalidations
also broadcast to peer frontends. Adds a test that a previously hot prefix no
longer Decides to a node after UnloadModel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): rolling forced-disturb pressure counter

Add a concurrency-safe per-model rolling counter that tracks how many
times a request had a usable hot prefix match but the load guard forced
it off the warm node. Entries outside the window are dropped lazily on
Count so the backing slice stays bounded.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): autoscale on prefix-cache forced-disturb pressure

Wire the rolling forced-disturb counter into the SmartRouter and the
ReplicaReconciler.

Router: in buildPreference, after Decide + Select, record a forced-disturb
when a usable hot prefix match existed (d.HotNodeID != "" and
d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or
nothing) because the load guard ruled the warm node out. This is the
scale-worthy signal: the cache-warm replica is saturated. It deliberately
does not fire for all-unique workloads (no hot match), avoiding
false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil
keeps the path a no-op.

Reconciler: read the same Pressure instance in reconcileModel as an extra
scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel
guards and the UnsatisfiableUntil cooldown that gates the whole method.
Pressure never overrides MaxReplicas and never force-evicts; a no-capacity
model does not spin. Window and threshold come from prefixcache.Config
(PressureWindow default 1m, PressureScaleThreshold default 1) and are
configurable via ReplicaReconcilerOptions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow

Record now prunes entries older than the rolling window (the same prune
Count does), via a shared pruneLocked helper, so a model that takes
forced-disturb records but is never Counted (e.g. one with zero loaded
replicas the reconciler skips) no longer grows its backing slice
unbounded.

Also removes the dead pressureWindow struct field and the
ReplicaReconcilerOptions.PressureWindow option from the reconciler: they
were stored but never read (the window lives inside the *prefixcache.Pressure
instance). The scale block now reads pressure.Count once into a local.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(api): prefix-cache fields in scheduling endpoint DTO with validation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): prefix-cache routing controls in node scheduling form

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): wire prefix-cache index, NATS sync, and config

Activates prefix-cache-aware routing in distributed mode. Builds the
prefixcache Index + NATS-backed Sync + Pressure counter, installs the
distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix
chain per request, subscribes to prefixcache.observe/prefixcache.invalidate
to apply peers' events to the local index (no re-broadcast), threads
PrefixProvider/PrefixConfig/Pressure into the SmartRouter and
Pressure/PressureThreshold into the ReplicaReconciler, and runs a
background eviction ticker (every TTL/2) bound to the app context.

Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE)
opts out and leaves the provider/pressure nil so routing stays round-robin.
--distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m)
controls entry idle-timeout and eviction cadence.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(nodes): round-robin-floor invariant for prefix-cache routing

Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never
picked even with a perfect prefix match (round-robin floor holds), while a
balanced hot node within the load slack is reused.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(prefixcache): clear branch lint findings and em dashes

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): validate prefix-cache config at startup wiring

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(radixtree): single-walk WeightsFor for batch value weights

Add Tree.WeightsFor(values, now) which computes the recency-weighted
weight for many values in a single O(N + len(values)) tree traversal,
versus calling Weight once per value (O(len(values) * N)). Consumers
that score K candidates against the tree under the read lock no longer
pay K full walks.

Extract the per-entry contribution math into an unexported helper shared
by both Weight and WeightsFor so the metric stays identical (DRY).
Weight's public behavior is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): add ModelConfig.ModelID() single source of truth

The c.Name fallback to c.Model was duplicated in core/backend/options.go
(feeding model.WithModelID) and hand-copied into core/backend/llm.go (the
prefix-chain salt). These MUST agree or the prefix-cache salt diverges
silently from the id the model loader tracks. Consolidate both into a new
config.ModelConfig.ModelID() helper and call it from both sites. Behavior
is identical.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(prefixcache): reuse one xxhash.Digest in ExtractChain

ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth
per call) and grew the chain slice without preallocation. Reuse a single
Digest via Reset() before each block and preallocate the chain to
min(nBlocks, MaxDepth).

xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical
value to a fresh New()+Write. Output hashes are unchanged, preserving the
cross-process determinism that peers rely on over NATS. Verified by capturing
ExtractChain output for the existing test inputs before and after the
refactor: identical. Existing extractor tests pass unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk

Index.Decide called radixtree.LongestMatch over the whole tree, so the
deepest match could be a node that is offline, unloaded, or simply not in
the passed candidate set. Honoring that as HotNodeID produced a false
forced-disturb signal upstream (buildPreference records pressure when
chosen != HotNodeID), making it look like a warm replica was
load-saturated when it was actually absent.

Build the candidate set once and only set HotNodeID/MatchRatio when the
matched node is an actual candidate; otherwise fall back to cold
placement. A future refinement could ask the tree for the longest match
restricted to the candidate nodes (shallower-but-valid) instead of
dropping it.

Also replace the per-candidate tree.Weight call in the cold-order sort
with a single tree.WeightsFor walk, turning O(K*N) under the read lock
into O(N + K).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): remove Select's unreachable deterministic fallback

buildPreference always passes ColdOrder as a permutation of the full
candidate set, so the cold-order loop hits every eligible candidate. The
trailing best/bestIF scan was dead. Replace it with a plain `return ""`
and document that ColdOrder is guaranteed to cover all candidates, so ""
means none were eligible.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(nodes): fetch model scheduling config once per Route

GetModelScheduling was read three times per request - in
resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling -
three DB round-trips for one row that is immutable for the life of the
request, and not a consistent snapshot. Fetch it once near the top of
Route and thread the *ModelSchedulingConfig (may be nil) into all three
helpers. scheduleNewModel keeps its own fetch since it runs outside the
Route snapshot. Behavior is identical for nil sched.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(autoscale): add Pressure.Reset to consume forced-disturb signal

Pressure.Count is non-draining (it prunes only by age), so a single burst
of forced-disturbs stays counted for the full rolling window and keeps
Count >= threshold on every reconciler tick. The reconciler will use
Reset to clear a model's events after acting on the signal so a fresh
scale-up requires fresh forced-disturbs to accumulate, rather than one burst
driving the model toward MaxReplicas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(autoscale): at most one scale-up per reconcile tick, consume pressure

Two autoscale bugs:

1. Over-scaling: the pressure scale-up block read Pressure.Count but never
   consumed it. With a non-draining counter a single forced-disturb burst
   kept Count >= threshold across the whole window, firing scaleUp on every
   tick and pushing the model toward MaxReplicas off one transient burst.
   After a successful pressure-triggered scale-up the reconciler now calls
   Pressure.Reset to consume the signal.

2. Double scale-up in one tick: the all-replicas-busy block and the pressure
   block could both fire in the same reconcileModel pass, each calling
   scaleUp(+1) against the same `current` read once at the top, so a model
   that was both busy and over threshold scaled +2 and could overshoot
   MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1)
   per tick: the pressure block is skipped if the busy block already scaled,
   and scale-down is skipped in any tick that scaled up.

MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are
unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation

Add SetReplicaRemovedHook to NodeRegistry and fire it from both
RemoveNodeModel and RemoveAllNodeModelReplicas after a successful
delete. This is the single chokepoint every replica-removal path funnels
through (router eviction, reconciler scale-down, probe reaper,
health-monitor node-down reap, RemoteUnloaderAdapter), so the
prefix-cache index can be invalidated by construction rather than wiring
each call site individually.

The hook is stored in an atomic.Pointer so the startup wiring (setter)
and the request/reconcile-time fire are race-free; it is nil-safe when
unset. GORM Delete reports no error for a no-op delete, so the hook also
fires when nothing was removed; the consumer's Invalidate(model, node)
is idempotent so this is harmless.
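
A sketch of a nil-safe hook held in an atomic.Pointer, as described above. The registry type and the hook's function signature are assumptions for illustration; only the setter and fire names echo the commit text.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// replicaRemovedHook is an assumed signature for the callback.
type replicaRemovedHook func(model, nodeID string)

// registry is a stand-in for the real node registry.
type registry struct {
	hook atomic.Pointer[replicaRemovedHook]
}

// SetReplicaRemovedHook installs (or, with nil, clears) the hook atomically.
func (r *registry) SetReplicaRemovedHook(h replicaRemovedHook) {
	if h == nil {
		r.hook.Store(nil)
		return
	}
	r.hook.Store(&h)
}

// fireReplicaRemoved calls the hook if one is set and is a no-op otherwise.
func (r *registry) fireReplicaRemoved(model, nodeID string) {
	if h := r.hook.Load(); h != nil {
		(*h)(model, nodeID)
	}
}

func main() {
	var r registry
	r.fireReplicaRemoved("m", "node-1") // no hook set: nothing happens
	r.SetReplicaRemovedHook(func(model, nodeID string) {
		fmt.Println("invalidate", model, nodeID)
	})
	r.fireReplicaRemoved("m", "node-1")
}
```

Loading the pointer once per fire means a concurrent setter cannot swap the callback between the nil check and the call.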

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): invalidate prefix-cache on any replica removal via registry hook

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): single source of truth for threshold bounds

Extract ValidateThresholds into prefixcache/config.go so the per-model
override validation (nodes.go endpoint) and Config.Validate share one
implementation of the numeric bounds (min_prefix_match in [0,1],
balance_abs_threshold >= 0, balance_rel_threshold either 0 or >= 1) instead
of hard-coding them in two places. The route_policy allow-list stays
explicit (not ParsePolicy, which maps typos to Default).
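
The numeric bounds listed above can be sketched as one shared check. The function name, parameter types, and error wording here are assumptions; only the three bounds come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// validateThresholds applies the three bounds in one place so that every
// caller rejects the same inputs. All parameters are assumed to be float64.
func validateThresholds(minPrefixMatch, balanceAbs, balanceRel float64) error {
	if minPrefixMatch < 0 || minPrefixMatch > 1 {
		return fmt.Errorf("min_prefix_match must be in [0,1], got %v", minPrefixMatch)
	}
	if balanceAbs < 0 {
		return errors.New("balance_abs_threshold must be >= 0")
	}
	if balanceRel != 0 && balanceRel < 1 {
		return errors.New("balance_rel_threshold must be 0 or >= 1")
	}
	return nil
}

func main() {
	fmt.Println(validateThresholds(0.5, 2, 1.5))
	fmt.Println(validateThresholds(1.2, 0, 0))
}
```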

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): preserve prefix-cache settings on partial scheduling update

A scheduling POST that omitted route_policy/thresholds (e.g. a
min_replicas-only update) full-replaced every column and silently reset
the model's previously-configured prefix-cache settings to empty/zero.

Make the four prefix-cache request fields pointers so omitted is
distinguishable from explicit zero, and merge PATCH-style in
SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves
the existing config value (zero default when none). Non-prefix fields
keep their full-replace PUT semantics. Validation now runs on the
resolved values via prefixcache.ValidateThresholds.
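
A sketch of the pointer-field merge described above, showing two of the prefix-cache fields. The struct names, JSON tags, and the applyUpdate and orExisting helpers are hypothetical; the real endpoint DTO may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// schedulingConfig is a stand-in for the stored per-model config.
type schedulingConfig struct {
	MinReplicas    int
	RoutePolicy    string
	MinPrefixMatch float64
}

// schedulingRequest uses pointers for the prefix-cache fields so that an
// omitted field (nil) is distinguishable from an explicit zero value.
type schedulingRequest struct {
	MinReplicas    int      `json:"min_replicas"`
	RoutePolicy    *string  `json:"route_policy"`
	MinPrefixMatch *float64 `json:"min_prefix_match"`
}

// orExisting returns the provided value when present, else the stored one.
func orExisting[T any](provided *T, existing T) T {
	if provided != nil {
		return *provided
	}
	return existing
}

// applyUpdate full-replaces MinReplicas and merges the prefix-cache fields.
func applyUpdate(existing schedulingConfig, body []byte) (schedulingConfig, error) {
	var req schedulingRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return existing, err
	}
	return schedulingConfig{
		MinReplicas:    req.MinReplicas,
		RoutePolicy:    orExisting(req.RoutePolicy, existing.RoutePolicy),
		MinPrefixMatch: orExisting(req.MinPrefixMatch, existing.MinPrefixMatch),
	}, nil
}

func main() {
	cur := schedulingConfig{MinReplicas: 1, RoutePolicy: "prefix_cache", MinPrefixMatch: 0.4}
	next, err := applyUpdate(cur, []byte(`{"min_replicas": 2}`))
	fmt.Println(next, err)
}
```

In this sketch a JSON null decodes to a nil pointer and is therefore treated the same as an omitted field.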

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts

A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica
removal of every model, including round-robin models that never used the
prefix cache. Index.Invalidate previously called tree(model), which lazily
created and permanently retained an empty radix tree for any model that ever
lost a replica, growing the trees map without bound. Sync.Invalidate also
published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op
removals across the cluster.

Index.Invalidate now looks the tree up read-only via existingTree and returns
without allocating when none exists. The Provider interface is unchanged;
Sync gates the broadcast through an optional invalidateExisting(bool) capability
type-asserted from the wrapped Index, falling back to the prior always-broadcast
behavior for other Provider implementations.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort

WeightsFor already returns a map keyed by every requested candidate, so the
separate candidates set built to validate the hot match was redundant: a node
is a candidate iff it is a key in the weights map. Drop the extra map and gate
the hot-match check on weights membership. Also skip the sort when there is at
most one candidate, since the input order is already the cold order. Behavior
is unchanged.

Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins
would need lazy cross-file changes and is out of scope here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns

Bulk node-scoped node_models deletes (Register re-register cleanup,
MarkOffline, MarkDraining, Deregister) removed rows directly without
firing the replica-removed hook, so the prefix-cache index kept
pointing at nodes whose models were gone. Capture the DISTINCT model
names before each bulk delete and fire fireReplicaRemoved once per
model after a successful delete, restoring the single-chokepoint
invariant for all removal paths. The pre-query is skipped when no hook
is set so the no-hook path stays cheap.

Also narrow LoadedReplicaStats to SELECT only node_id and in_flight
(the only fields the router consumer reads), dropping the JOIN-side
available_vram fetch and unused columns while keeping the
[]ReplicaCandidate return type unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(reconciler): consume autoscale signals only on a real scale-up

scaleUp was fire-and-forget (void) yet its callers unconditionally
consumed the pressure signal (Pressure.Reset) and the MinReplicas
hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp
added nothing (ScheduleAndLoadModel errored, or no node could be
loaded) the saturated warm replica got no new replica AND its
accumulated forced-disturb history was wiped, forcing the signal to
re-accumulate over a full PressureWindow before the next attempt.

Make scaleUp return whether at least one replica was actually
scheduled, and gate the side effects on it:

- pressure block (2b): set scaledUp and call Pressure.Reset only on
  success; on failure preserve the signal so the next tick retries off
  the same accumulated pressure.
- busy-burst block (2): set scaledUp from the return value so a failed
  attempt does not suppress the pressure path or scale-down.
- MinReplicas block: call ClearUnsatisfiable only on success so a
  failed attempt does not reset the unsatisfiable counter.

All existing invariants (MaxReplicas, capacity gating,
UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are
preserved.
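
A minimal sketch of the gating, assuming a simplified reconciler: the signals struct, onPressure, onBelowMin and the schedule callback are hypothetical stand-ins for Pressure.Reset, ClearUnsatisfiable and ScheduleAndLoadModel.

```go
package main

import (
	"errors"
	"fmt"
)

// signals counts how often each autoscale signal was consumed.
type signals struct {
	pressureResets      int
	unsatisfiableClears int
}

// scaleUp tries to schedule up to n replicas and reports whether at least
// one was actually scheduled.
func scaleUp(n int, schedule func() error) bool {
	scheduled := 0
	for i := 0; i < n; i++ {
		if err := schedule(); err != nil {
			continue // a failed attempt adds nothing
		}
		scheduled++
	}
	return scheduled > 0
}

// onPressure mirrors the pressure block: the signal is consumed only when
// the scale-up really happened.
func onPressure(s *signals, schedule func() error) bool {
	if scaleUp(1, schedule) {
		s.pressureResets++
		return true
	}
	return false // signal preserved; the next tick retries
}

// onBelowMin mirrors the MinReplicas block.
func onBelowMin(s *signals, missing int, schedule func() error) bool {
	if scaleUp(missing, schedule) {
		s.unsatisfiableClears++
		return true
	}
	return false // unsatisfiable counter keeps accumulating
}

func failing() error    { return errors.New("no node could load the model") }
func succeeding() error { return nil }

func main() {
	s := &signals{}
	fmt.Println(onPressure(s, failing), s.pressureResets)
	fmt.Println(onPressure(s, succeeding), s.pressureResets)
}
```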

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(nodes): drop router's redundant prefix-cache Invalidate calls

The NodeRegistry removal chokepoint (RemoveNodeModel /
RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which
invalidates the prefix-cache index. The router was also calling
prefixProvider.Invalidate explicitly right after each registry removal
on the two stale-replica health-probe fall-throughs in Route and in
UnloadModel, so every router-side eviction invalidated twice (double
tree-prune + double NATS broadcast).

Remove the three redundant explicit Invalidate calls and their empty
nil-guards. Each removed call sat immediately after a registry removal
that fires the hook, so invalidation is preserved via the chokepoint.
Decide/Observe usage is untouched.

Re-point the unit test (fake registry fires no hook) to assert the
removal chokepoint is exercised on unload instead of the router's
direct invalidation.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): make capture+delete atomic in bulk node_models removal paths

MarkOffline, MarkDraining, and the Register re-register cleanup ran the
nodeModelNames SELECT and the bulk node_models DELETE as two separate
statements on r.db with no transaction. A row written by a SetNodeModel call
landing between the two was deleted, but its replica-removed hook never fired, leaving
the prefix-cache index pointing at a removed replica until TTL or
candidacy self-heal.

Wrap the capture and the delete in a single db.Transaction in each path
(mirroring how Deregister already does it). The captured model names are
collected into a slice declared outside the closure; the
replica-removed hook fires for each only after the transaction commits,
so a rollback never invalidates the index for a removal that did not
persist. The set of fired hooks now equals exactly the set of
node_models rows actually deleted, with no interleaving gap.

The status flip in MarkOffline/MarkDraining (setStatus) is a separate,
pre-existing operation and routing already filters non-healthy nodes, so
it stays outside the transaction; return contracts are unchanged.
Deregister was already correct and is untouched. The cheap-path skip
(no hook -> skip the SELECT) is preserved.

Adds a spec asserting MarkOffline fires hooks for exactly the rows it
deletes and leaves no node_models row behind (consistent snapshot).
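
The capture-then-delete-then-fire ordering can be sketched with a toy in-memory store. Everything here (store, transaction, removeNodeModels, the failDelete switch) is hypothetical and only models the shape of the fix; the real code uses db.Transaction against node_models.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// store is a toy stand-in for the database: node ID -> set of model names.
type store struct {
	mu   sync.Mutex
	rows map[string]map[string]bool
}

// transaction runs fn on a private copy and commits it only if fn returns
// nil; an error discards the copy, which models a rollback.
func (s *store) transaction(fn func(tx map[string]map[string]bool) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	tx := make(map[string]map[string]bool, len(s.rows))
	for node, models := range s.rows {
		cp := make(map[string]bool, len(models))
		for m := range models {
			cp[m] = true
		}
		tx[node] = cp
	}
	if err := fn(tx); err != nil {
		return err
	}
	s.rows = tx
	return nil
}

// removeNodeModels captures and deletes inside one transaction, then fires
// the hook once per captured model, only after the commit succeeded.
func removeNodeModels(s *store, nodeID string, hook func(model string), failDelete bool) error {
	var removed []string // declared outside the closure, read after commit
	err := s.transaction(func(tx map[string]map[string]bool) error {
		if hook != nil { // cheap path: no hook, no capture
			for m := range tx[nodeID] {
				removed = append(removed, m)
			}
		}
		if failDelete {
			return errors.New("simulated delete failure")
		}
		delete(tx, nodeID)
		return nil
	})
	if err != nil {
		return err // rolled back: no hook fires
	}
	sort.Strings(removed) // deterministic order for the example
	for _, m := range removed {
		hook(m)
	}
	return nil
}

func main() {
	s := &store{rows: map[string]map[string]bool{"node-1": {"a": true, "b": true}}}
	hook := func(m string) { fmt.Println("replica removed:", m) }
	fmt.Println("rollback:", removeNodeModels(s, "node-1", hook, true))
	fmt.Println("commit:", removeNodeModels(s, "node-1", hook, false))
}
```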

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(nodes): debug logging for prefix-cache routing decisions and observations

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(radixtree): match shared prefixes by valuing every node on insert

Insert recorded the value (node id) only on the final node of the key
chain, leaving every intermediate prefix node valueless. LongestMatch
returns the deepest node that hasValue, so two chains that share a
leading block but diverge in the tail never matched: only exact-repeat
queries hit. That broke the core prefix-cache routing use cases (shared
system prompt, multi-turn extension, volatile tail), all of which rely
on prefix matching rather than exact-repeat.

Set value/hasValue/lastSeen at every node along the chain so each
prefix-block node remembers the node id that served that prefix
(SGLang/vLLM-style). The deepest match wins, and the last writer owns a
shared prefix node (a recency heuristic: the most recent chain through a
block is the one most likely still warm). size now counts valued nodes,
which is the intended meaning.

Updated radixtree tests to the new semantics: deepest-prefix test uses
non-overlapping chains, a new test asserts last-writer-owns-shared-node,
Evict/Remove/MaxEntries expectations recomputed for per-prefix-node
counting, and a shared-prefix LongestMatch red test added. Added a
prefixcache Decide test proving a prefix-only query routes to the warm
node. No prefixcache .go logic changed.
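
A minimal sketch of the new semantics, written as a plain (uncompressed) trie over prefix blocks rather than the real radixtree package. The tree, node and method names are assumptions, and lastSeen and size bookkeeping are left out.

```go
package main

import "fmt"

type node struct {
	children map[string]*node
	value    string
	hasValue bool
}

type tree struct{ root *node }

func newTree() *tree { return &tree{root: &node{children: map[string]*node{}}} }

// Insert records value at every node along the chain, so each prefix block
// remembers who last served it (last writer owns a shared prefix node).
func (t *tree) Insert(chain []string, value string) {
	n := t.root
	for _, block := range chain {
		child, ok := n.children[block]
		if !ok {
			child = &node{children: map[string]*node{}}
			n.children[block] = child
		}
		child.value, child.hasValue = value, true
		n = child
	}
}

// LongestMatch returns the value of the deepest valued node on the chain,
// together with the number of blocks matched.
func (t *tree) LongestMatch(chain []string) (value string, depth int, ok bool) {
	n := t.root
	for i, block := range chain {
		child, found := n.children[block]
		if !found {
			break
		}
		if child.hasValue {
			value, depth, ok = child.value, i+1, true
		}
		n = child
	}
	return value, depth, ok
}

func main() {
	t := newTree()
	t.Insert([]string{"sys", "turn1", "turn2"}, "node-X")
	// A chain that shares only the leading block still finds node-X.
	fmt.Println(t.LongestMatch([]string{"sys", "other"}))
}
```

With the pre-fix behavior (value only on the final node), the query in main would have found nothing, because no intermediate node carried a value.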

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): lock in prefix-cache routing behavior end to end

Add a DB-backed e2e spec that drives SmartRouter against a real
NodeRegistry (Postgres testcontainer) and the real prefixcache.Index
radix-tree provider, using a fake gRPC backend factory so no real
inference runs. Covers the five behaviors validated by hand:

1. Cold miss + observe: an unseen prefix chain cold-places and is recorded.
2. Hot-match affinity: the same chain returns to its warm node X.
3. Shared-prefix match: a divergent chain sharing X's leading prefix
   still routes to X (the radix-tree regression we fixed).
4. Negative control: an unrelated chain is a cold miss, not a false
   hot match on X.
5. Failover + invalidation: removing X's replica fires the registry
   chokepoint hook to invalidate the prefix entry, and the chain fails
   over to surviving node Y and re-homes there.

Removes the need for manual docker-compose re-runs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): make prefix-cache affinity replica-granular

Track prefix-cache affinity per loaded replica (a backend process with its
own KV cache) instead of per node, so multiple replicas of the same model on
one node each keep distinct affinity and a hot prefix routes back to the exact
replica that served it.

- radixtree: add RemoveFunc(pred) and reimplement Remove on top of it.
- prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/
  PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to
  drop every replica of a node; Invalidate drops one replica. Select returns
  (ReplicaKey, bool) and gains a deterministic least-in-flight eligible
  fallback (tiebreak NodeID then Replica).
- messaging: carry Replica on PrefixCacheObserveEvent and
  PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node).
- Sync delegates + broadcasts with replica; InvalidateNode broadcasts
  Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode.
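
The deterministic fallback can be sketched like this. ReplicaKey{NodeID, Replica} comes from the change above; the Candidate fields (InFlight, Eligible) and the leastInFlight helper are assumptions for illustration, not the real Select signature.

```go
package main

import "fmt"

// ReplicaKey identifies one loaded replica on one node.
type ReplicaKey struct {
	NodeID  string
	Replica int
}

// Candidate is a hypothetical view of a replica considered for routing.
type Candidate struct {
	Key      ReplicaKey
	InFlight int
	Eligible bool
}

// less orders candidates by in-flight count, then NodeID, then Replica, so
// the choice does not depend on input order.
func less(a, b Candidate) bool {
	if a.InFlight != b.InFlight {
		return a.InFlight < b.InFlight
	}
	if a.Key.NodeID != b.Key.NodeID {
		return a.Key.NodeID < b.Key.NodeID
	}
	return a.Key.Replica < b.Key.Replica
}

// leastInFlight picks the eligible candidate with the fewest in-flight
// requests; the bool is false when nothing is eligible.
func leastInFlight(cands []Candidate) (ReplicaKey, bool) {
	var best Candidate
	found := false
	for _, c := range cands {
		if !c.Eligible {
			continue
		}
		if !found || less(c, best) {
			best, found = c, true
		}
	}
	return best.Key, found
}

func main() {
	cands := []Candidate{
		{Key: ReplicaKey{"n2", 0}, InFlight: 1, Eligible: true},
		{Key: ReplicaKey{"n1", 1}, InFlight: 1, Eligible: true},
		{Key: ReplicaKey{"n1", 0}, InFlight: 1, Eligible: true},
		{Key: ReplicaKey{"n0", 0}, InFlight: 0, Eligible: false},
	}
	fmt.Println(leastInFlight(cands))
}
```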

This is part 1 of 2; the registry/router/wiring consumers are updated
separately.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): make prefix-cache routing replica-granular

Wire the SmartRouter, NodeRegistry, and distributed startup to the
replica-keyed prefixcache API. Affinity is now tracked per replica
(each replica is a separate process with its own KV cache), so a prefix
served by (node,0) no longer leaks onto the same-node sibling (node,1).

- RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks
  the EXACT (node_id, replica_index) row, falling through to the default
  ORDER BY when that replica is not loaded.
- SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires
  the specific replica, RemoveAllNodeModelReplicas and the four bulk
  node-scoped deletes fire replica<0 (all replicas of the node).
- buildPreference builds one Candidate per loaded replica and locks the
  exact replica the policy chose; observePrefix records the served
  ReplicaKey at every call site.
- distributed.go routes the hook to InvalidateNode (replica<0) or
  Invalidate(key).
- Tests updated to the replica-keyed API plus new coverage: a hot prefix
  on (node,0) prefers replica 0 over the same-node sibling (router unit +
  e2e), and FindAndLock locks the exact preferred replica.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): derive prefix chain from messages for tokenizer-template models

Prefix-cache-aware routing built its prompt-prefix chain from the rendered
prompt string `s` in ModelInference. For models with
TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the
backend tokenizes the structured messages itself - so `s` is empty, the chain
is empty, and routing silently falls back to round-robin. That covers the bulk
of modern chat models (qwen3, llama3, ...), so the feature effectively never
engaged for them.

Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable
head-first serialization of the conversation (role + content per turn). Two
requests sharing a leading system prompt and early turns share a leading byte
prefix, which ExtractChain maps to a shared chain prefix - landing both on the
same cache-warm replica. The rendered `s` is still preferred when present
(higher fidelity for non-template models).
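
A sketch of what a prefix-stable serialization could look like. The exact byte format of messagesPrefixSource is not stated here, so the newline-separated role/content layout, the Message type and the commonPrefixLen helper are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical minimal chat turn.
type Message struct{ Role, Content string }

// messagesPrefixSource serializes turns head-first, one after another, so
// conversations sharing their early turns share a leading byte prefix.
func messagesPrefixSource(msgs []Message) string {
	var b strings.Builder
	for _, m := range msgs {
		b.WriteString(m.Role)
		b.WriteByte('\n')
		b.WriteString(m.Content)
		b.WriteByte('\n')
	}
	return b.String()
}

// commonPrefixLen returns the number of leading bytes a and b share.
func commonPrefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	shared := []Message{{"system", "You are helpful."}, {"user", "Hi"}}
	a := append(append([]Message{}, shared...), Message{"user", "Summarize A"})
	b := append(append([]Message{}, shared...), Message{"user", "Translate B"})
	sa, sb := messagesPrefixSource(a), messagesPrefixSource(b)
	fmt.Println(commonPrefixLen(sa, sb) >= len(messagesPrefixSource(shared)))
}
```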

Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision"
logs despite per-request Route calls, traced to the empty-chain guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document prefix-cache routing roadmap

Add a routing-and-caching roadmap section to the distributed-mode guide,
linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced
from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and
NVIDIA Dynamo.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-30 23:24:22 +02:00
LocalAI [bot]
aee4611ab2 chore: ⬆️ Update mudler/parakeet.cpp to 30a307553f1965ceb38a1a922069a71e7dd67bf3 (#10092)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 22:48:09 +02:00
LocalAI [bot]
486467623c chore: ⬆️ Update antirez/ds4 to e16ead1e29c81a67bbb64e5b001117679cf9ce6e (#10076)
* ⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(ds4): link new ds4_distributed.o into grpc-server build

Upstream ds4 e16ead1e split distributed inference into a new translation
unit (ds4_distributed.c/.h). ds4.c and ds4_cpu.o now reference its
ds4_dist_* symbols, so the grpc-server link fails with undefined
references unless that object is built and linked.

Add ds4_distributed.o to both the upstream object build (Makefile) and
the grpc-server link set (CMakeLists.txt) for every GPU mode. It is a
single GPU-agnostic object, so it is built/linked unconditionally.

Verified: the six undefined ds4_dist_session_* references in ds4_cpu.o
are all defined by the newly built ds4_distributed.o (nm cross-check).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-30 22:08:30 +02:00
LocalAI [bot]
4912c9b73a feat(parakeet-cpp): add NVIDIA NeMo Parakeet ASR backend (parakeet.cpp) (#10084)
* feat(parakeet-cpp): L0 backend scaffold, LoadModel + AudioTranscription (text)

Add a Go gRPC backend that bridges LocalAI to parakeet.cpp via the flat
C-API (parakeet_capi.h), loaded with purego (cgo-less, mirrors the
whisper / vibevoice-cpp backends).

L0 scope:
- main.go: dlopen libparakeet.so (override via PARAKEET_LIBRARY), register
  the C-API entry points, start the gRPC server.
- goparakeetcpp.go: Load (parakeet_capi_load), AudioTranscription
  (parakeet_capi_transcribe_path, decoder=0 = per-arch default head),
  Free, serialized through base.SingleThread since the C engine is a
  thread-unsafe singleton. char* returns are bound as uintptr so the
  malloc'd buffer is freed via parakeet_capi_free_string after copy.
- AudioTranscriptionStream returns a clear "not implemented in L0" error
  (and closes the channel so the server doesn't hang); streaming is wired in L2.
- Makefile: clone-at-pin + cmake (PARAKEET_VERSION for bump_deps.sh),
  with a local-symlink dev shortcut; run.sh / package.sh mirror whisper.
- Test auto-skips without PARAKEET_BACKEND_TEST_MODEL/_WAV fixtures.

Builds clean (CGO_ENABLED=0), gofmt clean, test passes. The single
unsafeptr vet note in goStringFromCPtr is documented and matches the
whisper backend's tolerated pattern.

Word/segment timestamps (L1) and cache-aware streaming (L2) follow.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): L1 word/segment timestamps via transcribe_path_json

AudioTranscription now calls parakeet_capi_transcribe_path_json and shapes
the per-word / per-token timestamps into the TranscriptResult:

- Bind parakeet_capi_transcribe_path_json (purego, char* as uintptr like
  the other returns) and register it in main.go + the test loader.
- Parse the JSON document ({"text","words":[{w,start,end,conf}],
  "tokens":[{id,t,conf}]}) into typed structs.
- Synthesise a single whole-clip segment (parakeet emits no native segment
  boundaries) spanning the first word start to the last word end; token ids
  populate Segment.Tokens.
- Attach word-level timings only when timestamp_granularities=["word"],
  matching the OpenAI API (segment-level default). secondsToNanos mirrors
  the whisper backend's nanosecond convention.
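
A minimal sketch of the shaping step, assuming the JSON fields are numbers in seconds and using local stand-in types (document, segment, buildSegment) rather than the backend's real TranscriptResult. The rounding in secondsToNanos is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

type word struct {
	W     string  `json:"w"`
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Conf  float64 `json:"conf"`
}

type token struct {
	ID   int     `json:"id"`
	T    float64 `json:"t"`
	Conf float64 `json:"conf"`
}

type document struct {
	Text   string  `json:"text"`
	Words  []word  `json:"words"`
	Tokens []token `json:"tokens"`
}

// segment is a stand-in for the single synthesised whole-clip segment.
type segment struct {
	Text       string
	Start, End int64 // nanoseconds
	Tokens     []int
	Words      []word // filled only for word granularity
}

func secondsToNanos(s float64) int64 { return int64(math.Round(s * 1e9)) }

// buildSegment parses the JSON document and spans the segment from the
// first word start to the last word end.
func buildSegment(raw string, granularities []string) (segment, error) {
	var doc document
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		return segment{}, err
	}
	seg := segment{Text: doc.Text}
	if n := len(doc.Words); n > 0 {
		seg.Start = secondsToNanos(doc.Words[0].Start)
		seg.End = secondsToNanos(doc.Words[n-1].End)
	}
	for _, t := range doc.Tokens {
		seg.Tokens = append(seg.Tokens, t.ID)
	}
	for _, g := range granularities {
		if g == "word" {
			seg.Words = doc.Words
			break
		}
	}
	return seg, nil
}

const sample = `{"text":"hello world","words":[{"w":"hello","start":0.5,"end":0.9,"conf":0.98},{"w":"world","start":1.0,"end":1.25,"conf":0.95}],"tokens":[{"id":12,"t":0.5,"conf":0.98},{"id":34,"t":1.0,"conf":0.95}]}`

func main() {
	seg, err := buildSegment(sample, []string{"word"})
	fmt.Println(seg.Start, seg.End, len(seg.Words), err)
}
```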

Verified end-to-end against tdt_ctc-110m (f16): both the default and
word-granularity specs pass; builds clean, gofmt clean, vet shows only the
one documented unsafeptr note shared with the whisper backend.

Cache-aware streaming (L2) follows.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): L2 cache-aware streaming with EOU segmentation

Wire AudioTranscriptionStream to the streaming RNN-T C-API:

- Bind parakeet_capi_stream_{begin,feed,finalize,free}; feed takes 16 kHz
  mono float PCM ([]float32 via purego) and writes *eou_out on <EOU>/<EOB>.
- Decode opts.Dst to 16 kHz mono PCM (utils.AudioToWav + go-audio, same as
  the whisper backend), feed it in 1 s chunks, and emit each newly-finalized
  text run as a TranscriptStreamResponse delta.
- <EOU>/<EOB> events close the current segment; a closing FinalResult carries
  the full transcript plus the per-utterance segments (with a whole-clip
  fallback segment when no EOU fired).
- stream_begin returns 0 for non-streaming models, surfaced as a clear
  error instead of an empty stream. Honours context cancellation between
  chunks. Frees every malloc'd delta and the session.

Verified end-to-end against realtime_eou_120m-v1 (f16): the streamed
transcript matches the offline 110m reference word-for-word, deltas
reconstruct the final text, and the spec passes alongside the offline
specs. Builds clean, gofmt clean, vet shows only the shared documented
unsafeptr note.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): L3 register backend in build/CI/gallery (whisper parity)

Wire the new Go gRPC parakeet-cpp backend (parakeet.cpp ggml port of NVIDIA
NeMo Parakeet ASR) into LocalAI's build/CI/gallery surfaces, matching the
existing ggml whisper Go backend 1:1.

- .github/backend-matrix.yml: add 11 linux entries + 1 darwin entry mirroring
  every whisper build (cpu amd64/arm64, intel sycl f32/f16, vulkan amd64/arm64,
  nvidia cuda-12, nvidia cuda-13, nvidia-l4t-arm64, nvidia-l4t-cuda-13-arm64,
  rocm hipblas, metal-darwin-arm64), all on ./backend/Dockerfile.golang with
  backend: "parakeet-cpp" and -*-parakeet-cpp tag-suffixes.
- scripts/changed-backends.js: explicit inferBackendPath branch resolving
  parakeet-cpp to backend/go/parakeet-cpp/ before the generic golang branch.
- .github/workflows/bump_deps.yaml: track the PARAKEET_VERSION pin in
  backend/go/parakeet-cpp/Makefile (repo mudler/parakeet.cpp, branch master).
- backend/index.yaml: add &parakeetcpp meta + latest/development image entries
  for every matrix tag-suffix.
- Makefile: add backends/parakeet-cpp to .NOTPARALLEL, BACKEND_PARAKEET_CPP
  definition, docker-build target eval, and test-extra-backend-parakeet-cpp-
  transcription target (mirrors test-extra-backend-whisper-transcription).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): L4 gallery importer for parakeet GGUFs

Add ParakeetCppImporter so parakeet.cpp GGUFs auto-detect on /import-model
and route to the parakeet-cpp backend (it also surfaces in /backends/known,
which drives the import dropdown).

- Match is narrow: a .gguf whose name carries a parakeet architecture token
  (<arch>-<size>-<quant>.gguf, e.g. tdt_ctc-110m-f16.gguf, rnnt-0.6b-q4_k.gguf,
  realtime_eou_120m-v1-q8_0.gguf), a direct URL to one, or
  preferences.backend="parakeet-cpp". It deliberately does NOT claim arbitrary
  llama-style GGUFs, nor the upstream nvidia/parakeet-* NeMo repos (.nemo, not
  runnable here).
- Registered in the ASR batch BEFORE LlamaCPPImporter so its GGUFs aren't
  swallowed by the generic .gguf importer.
- Import nests files under parakeet-cpp/models/<name>/, defaults to the
  smallest quant (q4_k, near-lossless on parakeet) with a size-ladder
  fallback, and honours preferences.quantizations / name / description.

Tested with synthetic HF details (no network): metadata, positive matches
(HF repo, direct URL, preference), narrowness negatives (llama GGUF, NeMo
repo), and import (default quant, override, direct URL). All 9 specs pass;
build/vet/gofmt clean.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(parakeet-cpp): document the parakeet-cpp transcription backend

Add parakeet-cpp to the audio-to-text backend list and a dedicated usage
section: direct GGUF import (auto-detects to the backend), model YAML,
word-level timestamps via timestamp_granularities[]=word, and cache-aware
streaming with the realtime_eou model. Points at the mudler/parakeet-cpp-gguf
collection repo.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(parakeet-cpp): wire transcription gRPC e2e test into test-extra

The L3 commit added the test-extra-backend-parakeet-cpp-transcription
Makefile target but never invoked it in CI. Mirror the whisper job:

- Add a parakeet-cpp output to detect-changes (emitted by
  changed-backends.js from the matrix entry).
- Add tests-parakeet-cpp-grpc-transcription, gated on the parakeet-cpp
  path filter / run-all, building the backend image and running the
  transcription e2e against tdt_ctc-110m + the JFK clip.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(parakeet-cpp): drop em dashes from comments and docs

Replace em dashes with plain punctuation in the backend comments, the
importer, package.sh, and the audio-to-text docs section (and use "and"
instead of the multiplication sign). No behaviour change.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add parakeet-cpp f16 models to the model gallery

Add the 10 NVIDIA Parakeet models (f16, the recommended quality/speed
default) as gallery entries that install on the parakeet-cpp backend from
mudler/parakeet-cpp-gguf: tdt_ctc-110m/1.1b, tdt-0.6b-v2/v3, tdt-1.1b,
ctc-0.6b/1.1b, rnnt-0.6b/1.1b, and the cache-aware streaming
realtime_eou_120m-v1. Each pins the file sha256 and routes transcript
usecases to the backend.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): satisfy govet lint + bump PARAKEET_VERSION

- goparakeetcpp.go: //nolint:govet on the C-owned-pointer unsafe.Pointer
  conversion (golangci-lint reports new-only issues, so unlike the whisper
  backend's identical line this one is flagged).
- Makefile: bump PARAKEET_VERSION to the current parakeet.cpp master commit
  (the previous pin's commit no longer exists after upstream history was
  squashed), so the backend image clone/build resolves again.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): pin PARAKEET_VERSION to a tag-stable commit

The previous SHA pin was orphaned when parakeet.cpp's single-commit master
was amended/force-pushed, so the backend image clone (git fetch <sha>) failed
across every build variant. Repoint to 845c29e, which upstream now keeps
permanently fetchable via the `localai-backend-pin` tag, so future upstream
amends no longer break the backend build.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): init the ggml submodule in the backend image clone

The backend Dockerfile clones parakeet.cpp at PARAKEET_VERSION with a shallow
fetch + checkout but never initialised submodules, so third_party/ggml was
empty and the parakeet.cpp cmake build failed at
`add_subdirectory(third_party/ggml)` (CMakeLists.txt:53) on every build
variant. Add `git submodule update --init --recursive --depth 1
--single-branch` after checkout, mirroring the whisper backend. Verified
locally: clone + submodule + cmake configure now succeeds.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): statically link ggml into libparakeet.so

The shared libparakeet.so linked ggml's shared libs (libggml*.so), but the
package only ships libparakeet.so, so at runtime dlopen failed with
"libggml.so.0: cannot open shared object file" (the e2e transcription test
panicked on load). Build ggml static + PIC (BUILD_SHARED_LIBS=OFF,
CMAKE_POSITION_INDEPENDENT_CODE=ON) so libparakeet.so embeds ggml and depends
only on system libs already present in the runtime image. Verified locally:
ldd shows no libggml dependency.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): non-streaming fallback in AudioTranscriptionStream

The e2e streaming test ran AudioTranscriptionStream against tdt_ctc-110m
(not a cache-aware streaming model), so stream_begin returned 0 and the call
errored. Per LocalAI's streaming contract (and the whisper backend), a
non-streaming model should fall back to a single offline transcription
emitted as one delta plus a closing FinalResult. Do that instead of erroring,
so the streaming endpoint works for every parakeet model. Verified locally:
the streaming spec passes against the non-streaming 110m model via fallback.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-30 14:46:10 +02:00
Richard Palethorpe
12d1f3a697 security(http): refuse redirects on outbound clients via hardened pkg/httpclient (#10087)
LocalAI's outbound HTTP clients used Go's default redirect policy, which
follows up to 10 redirects. On a cross-host redirect Go forwards custom
request headers — including credential headers such as Anthropic's
x-api-key — to the redirect target (Go strips Authorization, Cookie and
WWW-Authenticate cross-host, but NOT arbitrary custom headers). An
attacker able to elicit a redirect from an upstream (a hijacked or
spoofed upstream, DNS trickery, or a malicious upstream_url) then
harvests the operator's provider API key.

This was first reported against the cloud-proxy / MITM PII path
(GHSA-3mj3-57v2-4636); the same class affects every other outbound
client. Rather than patch each call site, add pkg/httpclient as the one
sanctioned constructor for outbound HTTP and route everything through it.

pkg/httpclient:
  - New(...)             refuses redirects, TLS 1.2 floor, no body
                         deadline (streaming/SSE safe)
  - NewWithTimeout(d)    simple request/response calls
  - WithFollowRedirects  opt-in following that still strips credential
                         headers on any cross-host hop; different
                         scheme/host/port == different origin, guarding
                         the curl CVE-2022-27774 port-confusion class
  - WithTransport(rt)    keep a custom transport (IP-pin, HTTP/2, a
                         credential-injecting RoundTripper)
  - HardenedTransport()  base transport with the TLS floor + bounded setup
  - Harden(c)            apply the policy to a library-supplied *http.Client
  - NoRedirect           the CheckRedirect policy; wraps ErrRedirectBlocked
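
The two redirect policies can be sketched with the standard library's CheckRedirect hook. This is an illustration only, not the pkg/httpclient source: followStripping, sameOrigin, the credentialHeaders list and the probe helper are assumptions, and only the names NoRedirect and ErrRedirectBlocked are taken from the list above.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"sync/atomic"
)

var ErrRedirectBlocked = errors.New("redirect blocked")

// NoRedirect refuses every redirect.
func NoRedirect(req *http.Request, via []*http.Request) error {
	return fmt.Errorf("%w: refusing redirect to %s", ErrRedirectBlocked, req.URL.Host)
}

func effectivePort(u *url.URL) string {
	if p := u.Port(); p != "" {
		return p
	}
	switch u.Scheme {
	case "https":
		return "443"
	case "http":
		return "80"
	}
	return ""
}

// sameOrigin treats a different scheme, host or port as a different origin.
func sameOrigin(a, b *url.URL) bool {
	return a.Scheme == b.Scheme &&
		strings.EqualFold(a.Hostname(), b.Hostname()) &&
		effectivePort(a) == effectivePort(b)
}

func sameOriginStr(a, b string) bool {
	ua, errA := url.Parse(a)
	ub, errB := url.Parse(b)
	return errA == nil && errB == nil && sameOrigin(ua, ub)
}

var credentialHeaders = []string{"Authorization", "Cookie", "X-Api-Key"}

// followStripping follows redirects but drops credential headers on any hop
// whose target is not the origin of the initial request.
func followStripping(req *http.Request, via []*http.Request) error {
	if len(via) >= 10 {
		return errors.New("stopped after 10 redirects")
	}
	if !sameOrigin(req.URL, via[0].URL) {
		for _, h := range credentialHeaders {
			req.Header.Del(h)
		}
	}
	return nil
}

// probe sends a request carrying X-Api-Key to a local server that redirects
// to a second local server on another port.
func probe(policy func(*http.Request, []*http.Request) error) (blocked, reached, leaked bool) {
	var hits, leaks int32
	target := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt32(&hits, 1)
		if r.Header.Get("X-Api-Key") != "" {
			atomic.AddInt32(&leaks, 1)
		}
	}))
	defer target.Close()
	origin := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, target.URL, http.StatusFound)
	}))
	defer origin.Close()

	req, err := http.NewRequest(http.MethodGet, origin.URL, nil)
	if err != nil {
		return false, false, false
	}
	req.Header.Set("X-Api-Key", "secret")
	resp, err := (&http.Client{CheckRedirect: policy}).Do(req)
	if resp != nil {
		resp.Body.Close()
	}
	return errors.Is(err, ErrRedirectBlocked), atomic.LoadInt32(&hits) > 0, atomic.LoadInt32(&leaks) > 0
}

func main() {
	fmt.Println(probe(nil)) // Go's default policy, for comparison
	fmt.Println(probe(NoRedirect))
	fmt.Println(probe(followStripping))
}
```

The comparison is made against the first request in via rather than the previous hop, because the standard client re-copies the initial request's headers on every hop.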

Lint: a forbidigo rule flags http.DefaultClient and http.Get/Post/
PostForm/Head, pointing at pkg/httpclient (.golangci.yml,
.agents/coding-style.md). forbidigo cannot match the &http.Client{}
composite literal without also flagging legitimate *http.Client type
references, so that form is enforced by review.

Migrates every non-test outbound call site across core/, pkg/, cmd/, and
the Go backend (backend/go/cloud-proxy). Credential-bearing and
internal-RPC clients refuse redirects; download / CDN / registry clients
use WithFollowRedirects so they keep working while stripping secrets
cross-host. The only credential-bearing client that follows redirects is
the gated-download path (pkg/downloader/uri.go), which strips the token
on the cross-host hop to the CDN. Hardening this closes, in passing:
  - MCP remote-server bearer token leaking via a redirect (the
    RoundTripper re-injected Authorization on every hop)
  - agent multimedia/webhook clients leaking user-supplied auth headers
  - cors_proxy following redirects, bypassing its SSRF IP-pin
  - downloader's authorized read path leaking the token cross-host

Fixes: GHSA-3mj3-57v2-4636 (cloud-proxy leaks operator provider API key
(x-api-key) to attacker host on cross-host redirect)
Reported-by: tonghuaroot
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-30 12:04:10 +02:00
LocalAI [bot]
a7cad704b9 chore: ⬆️ Update ggml-org/llama.cpp to 22d66b567eef11cf2e9832f04db64ee0323a0fd0 (#10080)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 08:34:00 +02:00
LocalAI [bot]
7e4df67556 chore: ⬆️ Update ggml-org/whisper.cpp to f24588a272ae8e23280d9c220536437164e6ed28 (#10078)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 01:09:52 +02:00
LocalAI [bot]
5b24b4dacc chore: ⬆️ Update mudler/rf-detr.cpp to 65c0ffcc9a9bc9dae38252f63d0417c9845a6cf7 (#10075)
⬆️ Update mudler/rf-detr.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 00:55:41 +02:00
LocalAI [bot]
52fdb46892 docs: ⬆️ update docs version mudler/LocalAI (#10074)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 00:24:34 +02:00
LocalAI [bot]
b389f0fe5f chore(model-gallery): ⬆️ update checksum (#10081)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 00:11:57 +02:00
LocalAI [bot]
74281be340 chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.22.0 (#10079)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 00:11:41 +02:00
LocalAI [bot]
cacf2f7a2c chore: ⬆️ Update ikawrakow/ik_llama.cpp to 8960c5ba5ee9db30ba838304373aa4dbec9f7cbd (#10077)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-30 00:11:27 +02:00
LocalAI [bot]
4a2cc64d07 feat(reasoning): honor per-request reasoning_effort on chat completions (#10082)
The OpenAI `reasoning_effort` field only reached the prompt template; it
never toggled the backend's thinking. Map it onto
ReasoningConfig.DisableReasoning (which becomes the enable_thinking gRPC
metadata) in the request merge, so reasoning_effort="none" disables
reasoning per request: the use case from #10072 (run a single Qwen3-style
model and turn reasoning off for low-latency tasks while keeping it on
for others).

Effort levels (minimal/low/medium/high) enable thinking unless the model
config explicitly disabled it (reasoning.disable: true wins and is never
re-enabled by a request); "none" always disables.
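
The precedence rules can be sketched as a small pure function. The name disableReasoning and the plain bool for the model config are assumptions; the real merge operates on ReasoningConfig.DisableReasoning.

```go
package main

import "fmt"

// disableReasoning returns the effective per-request DisableReasoning value.
// configDisabled is true when the model config sets reasoning.disable: true.
func disableReasoning(configDisabled bool, effort string) bool {
	switch effort {
	case "none":
		return true // always disables
	case "minimal", "low", "medium", "high":
		return configDisabled // enables thinking unless the config disabled it
	default:
		return configDisabled // absent or unknown effort: keep the config
	}
}

func main() {
	for _, effort := range []string{"", "none", "low", "high"} {
		fmt.Printf("effort=%q disable=%v\n", effort, disableReasoning(false, effort))
	}
}
```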

Closes #10072


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-29 22:09:07 +00:00
Richard Palethorpe
4647770316 fix(model): track intentional stops, stop misreading clean shutdowns as crashes (#10060)
Two separate issues made graceful backend shutdown look ungraceful in the
logs, even though the processes were being terminated correctly
(go-processmanager defaults to process-group SIGTERM + 15s grace + SIGKILL):

1. "failed to read PID" — startProcess registers a per-process graceful-
   termination handler that calls Stop(), but StopAllGRPC (registered
   earlier, via app.Shutdown) already stopped and released store-tracked
   backends first. The second Stop() then failed reading the removed
   pidfile. Guard the handler with IsAlive() so it skips already-stopped
   processes; it still covers backends StopAllGRPC doesn't track (worker-
   supervised ones).

2. "Backend process exited unexpectedly" exitCode=-1 — the exit watcher
   treated only exit codes 0/143 as clean. But a child killed by our own
   SIGTERM/SIGKILL is reported by Go as exitCode -1 (signal termination),
   not the shell's 128+signal convention, so every intentional stop logged
   a false crash warning. The exit code can't distinguish an intended stop
   from a signal-induced crash.

Track intent directly instead: a stoppingProcs sync.Map (keyed by the
*process.Process pointer) is marked wherever LocalAI calls Stop() on
purpose, and the exit watcher uses it to pick the log level — Info
"stopped" when intentional, Warn "exited unexpectedly" otherwise (still
catching real crashes). The raw exit code is reported as a field but no
longer interpreted.
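
A minimal sketch of the intent-tracking idea, assuming a stand-in `proc` type in place of `*process.Process`. The helper names, and the choice to clear the mark once the exit is observed, are assumptions rather than the commit's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// proc stands in for *process.Process; it is a hypothetical type.
type proc struct{ name string }

// stoppingProcs records processes we are stopping on purpose,
// keyed by pointer identity.
var stoppingProcs sync.Map

// markStopping is called right before an intentional Stop().
func markStopping(p *proc) { stoppingProcs.Store(p, struct{}{}) }

// exitLevel picks the log level for an observed exit. The raw exit code
// is deliberately not consulted. Clearing the mark here is an assumption.
func exitLevel(p *proc) (level, msg string) {
	if _, intentional := stoppingProcs.LoadAndDelete(p); intentional {
		return "info", "stopped"
	}
	return "warn", "exited unexpectedly"
}

func main() {
	a, b := &proc{name: "a"}, &proc{name: "b"}
	markStopping(a)
	for _, p := range []*proc{a, b} {
		level, msg := exitLevel(p)
		fmt.Println(p.name, level, msg)
	}
}
```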

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-29 18:54:27 +02:00
LocalAI [bot]
3c9b9529c0 chore(model gallery): 🤖 add 1 new models via gallery agent (#10061)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 16:39:14 +02:00
TLoE419
fc2bd0986c test(utils): cover path verification, sanitization, and unique naming (#9978)
pkg/utils/path.go provides the security primitives for download paths
(VerifyPath, InTrustedRoot) and the file-naming helpers used by every
import flow (SanitizeFileName, GenerateUniqueFileName). None of them had
test coverage, so a future regression in the traversal check or in the
".." stripping inside SanitizeFileName would land unnoticed.

The new specs pin the lexical contract for each helper:

- VerifyPath accepts strict descendants and inner traversal that stays
  inside the base, rejects "..", compound traversal, and the base path
  itself. An explicit spec documents that the check is purely lexical
  (filepath.Clean, not EvalSymlinks) so any future caller that needs
  symlink-aware defence knows to EvalSymlinks first.
- InTrustedRoot rejects the trusted root and sibling directories,
  accepts deeply nested descendants.
- SanitizeFileName covers the leading-directory and absolute-prefix
  paths plus the embedded ".." case ("foo..bar" -> "foobar") that the
  Clean+Base layer alone would leave intact.
- GenerateUniqueFileName covers the no-collision, single-collision,
  walk-the-counter, and empty-extension cases using GinkgoT().TempDir()
  so the suite stays hermetic.
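
As an illustration of what "purely lexical" means here, a containment check can be written with `filepath.Clean`/`filepath.Rel` alone. `lexicallyInside` is a hypothetical helper, not the real `VerifyPath`, and the base path is made up for the example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// lexicallyInside reports whether path, joined onto base, stays strictly
// below base. It never touches the filesystem, so symlinks are NOT
// resolved; callers needing that would EvalSymlinks first.
func lexicallyInside(path, base string) bool {
	base = filepath.Clean(base)
	full := filepath.Join(base, path) // Join also cleans the result
	rel, err := filepath.Rel(base, full)
	if err != nil || rel == "." {
		return false // unrelated paths, or the base itself
	}
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
	base := "/srv/models" // hypothetical base directory
	for _, p := range []string{"a/b.gguf", "a/../b.gguf", "..", "a/../../x", "."} {
		fmt.Printf("%-12q inside=%v\n", p, lexicallyInside(p, base))
	}
}
```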

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: TLoE419 <tloemizuchizu@gmail.com>
2026-05-29 10:40:08 +00:00
Ching
a473a32678 test(react-ui): cover models gallery empty-state reset flow (#10019)
Exercise the filtered empty-state path in the models gallery and verify
that the clear-filters action restores the list and resets the filter
selection.

Assisted-by: Codex:gpt-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-05-29 10:39:33 +00:00
LocalAI [bot]
3e220373b0 fix(functions): validate auto-detected XML tool-call names — robust glm-4.5/Hermes guard (#9722, supersedes #9940) (#10059)
fix(functions): validate auto-detected XML tool-call names (#9722)

The XML tool-call auto-detector tries every preset, including glm-4.5 whose
tool block is <tool_call>name...</tool_call>. When a Hermes/NousResearch model
emits <tool_call>{"name":"bash","arguments":{...}}</tool_call>, glm-4.5
mis-claims the block and returns the entire JSON object (or leading prose, or a
JSON array) as the function NAME. The misparse then wins over the JSON parser,
so streaming clients receive a tool call whose name is a JSON blob.

Guard the auto-detect paths in ParseXMLIterative: a returned tool name must look
like a real function name ([A-Za-z0-9_.-]+). Results that don't are dropped so
auto-detection falls through to the next format and ultimately to JSON parsing,
which handles Hermes correctly. An explicitly forced format (format != nil) is
left untouched and trusted verbatim.

This supersedes PR #9940, which dropped only names with a leading "{". That
narrower check misses leading prose ("Sure: {...}"), JSON arrays ("[{...}]")
and brace-less garbage ("name: bash, ..."); the name-shape check rejects all of
them while still accepting legitimate glm-4.5 calls. The fix applies to both the
streaming worker and the non-streaming ParseFunctionCall path, which both call
ParseXMLIterative with auto-detection.
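
The name-shape check can be sketched with the character class quoted above. `looksLikeToolName` is a hypothetical name for illustration; only the pattern comes from the description.

```go
package main

import (
	"fmt"
	"regexp"
)

// toolNameRe anchors the character class quoted in the commit message.
var toolNameRe = regexp.MustCompile(`^[A-Za-z0-9_.-]+$`)

// looksLikeToolName is a hypothetical helper: it accepts only strings
// that could plausibly be a function name.
func looksLikeToolName(name string) bool {
	return toolNameRe.MatchString(name)
}

func main() {
	candidates := []string{
		"bash",
		`{"name":"bash","arguments":{}}`, // whole JSON object mis-claimed as a name
		"Sure: {...}",                    // leading prose
		"[{...}]",                        // JSON array
		"name: bash, ...",                // brace-less garbage
	}
	for _, c := range candidates {
		fmt.Printf("%-35q valid=%v\n", c, looksLikeToolName(c))
	}
}
```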


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-29 12:03:33 +02:00
Richard Palethorpe
fbcd886a47 fix(application): stop backend processes synchronously on shutdown (#10058)
application.New wires a fire-and-forget goroutine that runs
StopAllGRPC + distributed.Shutdown when the app context is cancelled.
Callers (tests, CLI signal handler) cancel the context and then exit
immediately, so the test binary / process can terminate before that
goroutine kills the spawned backend children. go-processmanager sets no
Pdeathsig, so the orphans are reparented to init and survive — leaving
dozens of stray mock-backend processes after an e2e run.

Add Application.Shutdown(), which runs the same cleanup synchronously on
the caller's stack and is idempotent via sync.Once. The context-cancel
goroutine, the CLI signal handler, and the test suites all call it, so
cleanup is deterministic and the duplicated teardown logic collapses to
one place. The async goroutine remains as a safety net for callers that
forget; sync.Once dedupes the double call.
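
A minimal sketch of the idempotent-shutdown pattern, assuming a stand-in `app` type; the counter replaces the real StopAllGRPC + distributed.Shutdown work and exists only to make the single execution observable.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// app is a hypothetical stand-in for the application type.
type app struct {
	shutdownOnce sync.Once
	cleanups     int // counts how often the teardown body actually ran
}

// Shutdown runs the teardown synchronously on the caller's stack.
// sync.Once makes repeated or concurrent calls harmless.
func (a *app) Shutdown() {
	a.shutdownOnce.Do(func() {
		a.cleanups++ // stand-in for stopping backends
	})
}

func main() {
	a := &app{}
	ctx, cancel := context.WithCancel(context.Background())

	// Safety net: also shut down when the context is cancelled.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-ctx.Done()
		a.Shutdown()
	}()

	cancel()
	a.Shutdown() // explicit, synchronous call from the "caller"
	wg.Wait()
	fmt.Println("teardown ran", a.cleanups, "time(s)")
}
```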

Wire e2e_suite_test and the two mock-backend Contexts in app_test to
call Shutdown in their AfterSuite/AfterEach.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-29 11:40:43 +02:00
泊舟
e1a782b70f fix(openai): stop streaming tool-call double-emission when autoparser is active (#10055)
Streaming /v1/chat/completions could emit the same logical tool call at
multiple `index` values. In processStreamWithTools the Go-side iterative
parser (ParseXMLIterative / ParseJSONIterative) runs on every token and
emits tool-call deltas, while the C++ chat-template autoparser delivers
its own tool calls via ChatDeltas that are flushed at end-of-stream by
ToolCallsFromChatDeltas -> buildDeferredToolCallChunks. With both paths
active the same call is emitted twice at different indices, so OpenAI
clients that accumulate tool calls by `index` dispatch the tool N times.

Skip the Go-side iterative parser once the autoparser is producing tool
calls (hasChatDeltaToolCalls). The deferred flush stays guarded by
lastEmittedCount, so the race where the Go parser emitted before the flag
flipped also remains single-emission. Backends without an autoparser
(e.g. vLLM) keep hasChatDeltaToolCalls=false and are unaffected.

Refs #9722

Signed-off-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com>
Co-authored-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-29 11:39:09 +02:00
LocalAI [bot]
73cfedc023 fix: tool-call JSON leaks into content with stream+tools on tokenizer-template models (#10052) (#10057)
* fix(grammars): honor properties_order entry at index 0

The JSON-schema-to-GBNF property sort used `aOrder != 0 && bOrder != 0` as
its "is this key ordered?" guard. That treats index 0 — the first key listed
in properties_order — as unset, so `properties_order: name,arguments` fell
back to alphabetical ordering and still emitted "arguments" before "name".

Use presence in the order map instead: listed keys sort by their index and
ahead of unlisted keys, which keep a stable alphabetical order. This makes
the documented `properties_order: name,arguments` actually produce
name-first tool-call JSON. Relates to #10052.
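
The presence-based comparison can be sketched as follows. `sortProperties` is a hypothetical name, and this is an illustration of the ordering rule rather than the actual grammar code.

```go
package main

import (
	"fmt"
	"sort"
)

// sortProperties orders keys so that those listed in order come first,
// by index, and the rest follow alphabetically. Presence in the map, not
// a non-zero index, decides whether a key is "ordered", so index 0 works.
func sortProperties(keys []string, order map[string]int) {
	sort.SliceStable(keys, func(i, j int) bool {
		ai, aListed := order[keys[i]]
		bi, bListed := order[keys[j]]
		switch {
		case aListed && bListed:
			return ai < bi
		case aListed:
			return true
		case bListed:
			return false
		default:
			return keys[i] < keys[j]
		}
	})
}

func main() {
	// Equivalent of `properties_order: name,arguments`.
	order := map[string]int{"name": 0, "arguments": 1}
	keys := []string{"zeta", "arguments", "alpha", "name"}
	sortProperties(keys, order)
	fmt.Println(keys)
}
```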

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(functions): defer tool grammar to the backend when the tokenizer template owns templating (#10052)

When use_tokenizer_template delegates templating to the backend (llama.cpp),
the backend also owns tool-call grammar generation and parsing. LocalAI was
still generating its own GBNF grammar and sending it down. With a grammar
present, llama.cpp does not hand the tools to its template, so its native
peg/json tool parser never engages: it streams the grammar-constrained
tool-call JSON back as plain content instead of emitting tool_calls. In
streaming mode the JSON object leaked into the content field, and the
Go-side incremental detector never gated content because the
LocalAI-generated grammar emitted "arguments" before "name".

The GGUF auto-import path already couples use_tokenizer_template with
grammar.disable, but that block is skipped when a template is already
configured, so gallery and hand-written configs (e.g. qwen3) that set the
tokenizer template directly never got the paired grammar.disable.

- SetDefaults now enforces the coupling for every config: when
  use_tokenizer_template is set, grammar generation is disabled and tools
  flow to the backend's native (name-first) pipeline. This also fixes
  already-installed models without editing each config.
- Set function.grammar.disable in the shared gallery/qwen3.yaml, which is
  the base config referenced by every qwen3 gallery entry.
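
The coupling enforced by SetDefaults amounts to a one-line rule. The struct and field names below are hypothetical stand-ins, not LocalAI's real config types.

```go
package main

import "fmt"

// modelConfig is a hypothetical, flattened stand-in for the model config.
type modelConfig struct {
	UseTokenizerTemplate bool
	GrammarDisable       bool
}

// setDefaults enforces the pairing: when the backend owns templating,
// LocalAI-side grammar generation is switched off so tools reach the
// backend's native pipeline.
func (c *modelConfig) setDefaults() {
	if c.UseTokenizerTemplate {
		c.GrammarDisable = true
	}
}

func main() {
	cfg := modelConfig{UseTokenizerTemplate: true}
	cfg.setDefaults()
	fmt.Println("grammar disabled:", cfg.GrammarDisable)
}
```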

Verified end to end against qwen3-4b with stream:true + tools: content no
longer carries the tool-call JSON, reasoning is classified separately, and
tool calls stream as proper name-first tool_calls deltas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-29 10:12:53 +02:00
LocalAI [bot]
b982c977d5 chore: ⬆️ Update ggml-org/whisper.cpp to c932729a304f7d9eb5354afa38624cfa86a780cf (#10051)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:42:06 +02:00
LocalAI [bot]
532ca1b3a2 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 6eff055a0cc0e427a6849cfcb5de531b4b82d667 (#10050)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:41:50 +02:00
LocalAI [bot]
00ad55b590 chore: ⬆️ Update ggml-org/llama.cpp to 751ebd17a58a8a513994509214373bb9e6a3d66c (#10049)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:41:35 +02:00
LocalAI [bot]
4c58fd302f chore: ⬆️ Update leejet/stable-diffusion.cpp to 0e4ee04488159b81d95a9ffcd983a077fd5dcb77 (#10048)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:41:18 +02:00
LocalAI [bot]
66582e7035 chore: ⬆️ Update antirez/ds4 to 22393e770ea8eb7501d8718d6f66c6374004e03f (#10047)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:41:02 +02:00
LocalAI [bot]
1d13949588 docs: ⬆️ update docs version mudler/LocalAI (#10046)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:40:47 +02:00
LocalAI [bot]
c8ad67bbca chore: ⬆️ Update mudler/rf-detr.cpp to ecf64d7f7f20d73ebd906a983f398ed287256320 (#10035)
⬆️ Update mudler/rf-detr.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 08:39:47 +02:00
LocalAI [bot]
1c92b00918 fix(turboquant): guard upstream-only grpc-server fields for fork (#10043)
fix(turboquant): guard upstream-only grpc-server fields for fork build

backend/cpp/llama-cpp/grpc-server.cpp is reused by the turboquant build,
which compiles against an older llama.cpp fork (TheTom/llama-cpp-turboquant).
Two recent changes added references to upstream-only struct fields outside the
existing LOCALAI_LEGACY_LLAMA_CPP_SPEC guards:

  - common_params::checkpoint_min_step (default + option handler), added with
    the ggml-org/llama.cpp 35c9b1f3 bump (#9998)
  - the common_params_speculative::draft tensor_buft_overrides sentinel
    termination (#9919), which sat after the guard's #endif

The fork has neither field, so grpc-server.cpp failed to compile for every
turboquant flavor. Wrap the three references in #ifndef
LOCALAI_LEGACY_LLAMA_CPP_SPEC, matching the existing fork-compat guards, so the
stock llama-cpp build is unchanged and the fork build skips them. Update
patch-grpc-server.sh's doc comment to record what the macro now gates out.

Verified by a local fallback-flavor turboquant build: grpc-server.cpp compiles
against the fork and the backend image builds.


Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-28 17:37:54 +02:00
Richard Palethorpe
b81a6d01b3 perf(react-ui): code-split bundle, speed up coverage suite (#10042)
* Curate the highlight.js build to ~29 languages (lib/core + the
  common set) instead of the full ~190-grammar default: -787 KB raw /
  -230 KB gz on the base bundle.
* Code-split every route via React.lazy with a per-layout <Suspense>
  in App.jsx so the sidebar stays mounted on navigation. Initial entry
  chunk drops from 3194 KB raw / 887 KB gz to 397 KB / 122 KB (-87%).
  Warm chunks on sidebar hover/focus/touch via a preload registry so
  the click finds the chunk already in flight or cached.
* Migrate Playwright coverage from istanbul (build-time counters) to
  native Chromium V8 coverage, with per-worker accumulation +
  conversion. Suite drops from 71s to 30s at 20 workers (~58%) at the
  non-instrumented floor.
* Keep the coverage gate bundling-invariant: the coverage build inlines
  dynamic imports so every shipped source file lands in the denominator
  (otherwise untested page chunks silently drop out and inflate the
  percentage). Production builds stay code-split.
* Add UI_TEST_WORKERS=N Makefile knob; tighten coverage tolerance to
  0.8pp now that jitter sits near istanbul's ~0.5pp again.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-28 13:43:15 +02:00
Tai An
0fd666ee6e fix(openresponses): populate Content and accept bare {role,content} items (#10039) (#10040)
* fix(openresponses): populate Content and accept bare {role,content} items (#10039)

Fixes mudler/LocalAI#10039 — `/v1/responses` silently returned empty
output on any model whose YAML doesn't include a Go-side
`template.chat_message` block.

Three cooperating bugs:

* `convertORInputToMessages` populated only `StringContent` for string
  input and for the `input.Instructions` system message, leaving the
  `Content` (any) field nil.
* `TemplateMessages` gated all fallback content-rendering branches on
  `Content != nil && StringContent != ""` — but every branch in that
  function consumes `StringContent`, not `Content`. The `&&` silently
  dropped messages that had StringContent set and Content nil, producing
  an empty prompt that the 5× empty-retry guard then turned into a
  200 OK with `output: []`.
* The array-input branch of `convertORInputToMessages` dispatched on
  `itemMap["type"]` with no default, dropping bare `{role, content}`
  items emitted by the OpenAI Python SDK helper
  `client.responses.create(input=[{...}])`.

Fix:

* Set both `Content` and `StringContent` in the two openresponses
  message-construction sites that only set one.
* Treat a bare `{role, content}` item (no `type`) as
  `type: "message"` for OpenAI-SDK compatibility.
* Gate `TemplateMessages` fallback rendering on `StringContent != ""`,
  which is what every downstream branch in that function actually
  reads.
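
The effect of the gate change can be shown on a StringContent-only message. The `message` type and both gate functions are hypothetical reductions of the logic described above.

```go
package main

import "fmt"

// message is a hypothetical reduction of the chat message type.
type message struct {
	Content       any
	StringContent string
}

// oldGate mirrors the pre-fix condition: it also required Content,
// even though the fallback branches only read StringContent.
func oldGate(m message) bool { return m.Content != nil && m.StringContent != "" }

// newGate checks only the field the fallback rendering consumes.
func newGate(m message) bool { return m.StringContent != "" }

func main() {
	m := message{StringContent: "hello"} // Content left nil, as before the fix
	fmt.Println("old gate renders:", oldGate(m))
	fmt.Println("new gate renders:", newGate(m))
}
```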

Regression test added to `evaluator_test.go` covering the fallback
path (no `ChatMessage` template) with a StringContent-only message,
both with and without a role mapping.

* test(openresponses): guard Content population and ToProto path (#10039)

Add regression tests for the two seams the original fix touched but
left uncovered:

* convertORInputToMessages must populate both Content and StringContent
  for plain string input and for bare {role, content} array items (the
  OpenAI SDK shape that omits the type discriminator). Both are
  functional reds against the pre-fix code.
* Messages.ToProto reads Content, not StringContent — this is the path
  UseTokenizerTemplate backends (imported GGUFs) take. The cases pin
  that contract so a future regression on the producer side is caught.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-28 07:21:48 +00:00
LocalAI [bot]
7763fb23a3 chore: ⬆️ Update antirez/ds4 to 072bc0feb187be5f374c08b16d0045e1ad7bc9bc (#10036)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 08:41:03 +02:00
LocalAI [bot]
324277ccfd chore: ⬆️ Update ggml-org/whisper.cpp to 6dcdd6536456158667747f724d6bd3a2ceaa8d88 (#10032)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 00:25:20 +02:00
LocalAI [bot]
10d02e6c59 chore: ⬆️ Update leejet/stable-diffusion.cpp to 29ab511fc75f89fbab148665eab1a8e10a139a72 (#10033)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 00:24:59 +02:00
LocalAI [bot]
05ae06c17b chore: ⬆️ Update ggml-org/llama.cpp to aa50b2c2ae91326d5aad956ceeb015d1d48e626b (#10034)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 00:23:40 +02:00
LocalAI [bot]
2671e0c6f7 chore(model-gallery): ⬆️ update checksum (#10038)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 00:22:19 +02:00
LocalAI [bot]
81b6b94f0b chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3bf7e836c2c5a895e8d12d3eb7e398ae7ab2f9ce (#10037)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-28 00:21:45 +02:00
341 changed files with 21914 additions and 1072 deletions


@@ -38,9 +38,12 @@ The React UI (`core/http/react-ui/`) has **no component/unit tests** — its onl
- **Browser:** the flake dev shell ships `chromium` and exports `PLAYWRIGHT_CHROMIUM_PATH`; `playwright.config.js` uses it via `launchOptions.executablePath`, and the Makefile skips `playwright install` when it's set. This avoids Playwright's downloaded browser, which can't resolve system libs (`libglib-2.0`, …) on NixOS. In CI (no `PLAYWRIGHT_CHROMIUM_PATH`) the Makefile falls back to `playwright install --with-deps chromium`.
- The app is a React SPA, so coverage accumulates across in-app navigation within a test; a full `page.goto`/reload resets it.
- `.nycrc.json` uses `all: true`, so **every `src/**` file is in the report**, including 0%-coverage ones — that's how you spot features with no test at all (sort the HTML report or `coverage-summary.json` by line% ascending).
- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` (default **1.0pp**) below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. **Why a tolerance (unlike the strict Go gate):** UI e2e line coverage is *non-deterministic* — async/debounced paths (e.g. the VRAM estimate's 500ms debounce) make identical specs vary ~0.5pp run-to-run, so a zero-tolerance gate would flake. Keep the tolerance just above the observed jitter. Run in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.
- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. Runs in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.
- **Why it has a tolerance (unlike the strict Go gate):** UI e2e coverage is *non-deterministic*. Specs that assert on state and end while async/lazy render work is still in flight collect those lines only when the render beats the coverage teardown — so the total drifts with machine speed/load (a fast local box reads higher than a slow CI runner), diffusely across many specs. The tolerance absorbs that drift, so set the baseline *below* the slow-CI floor, never to a fast-local `make test-ui-coverage-baseline` number, or CI flaps.
- **Raising coverage is cheap:** a *render-smoke* spec (navigate to a route, assert its header renders) mounts a lazy page and runs its full render + initial effects, capturing most of its lines in a few lines of test — see `e2e/page-render-smoke.spec.js`. Auth is disabled in the test server (`isAdmin=true`), so `RequireAdmin`/`RequireFeature` routes render without a mock. The most *deterministic* win is removing a race: make a spec `await` a rendered element before ending (see `e2e/agents.spec.js` → AgentCreate) so its lines count every run.
Rules:
- The gate is **strict — there is no tolerance**. Any decrease fails, regardless of how many lines a PR adds or deletes. `covermode=atomic` makes line coverage deterministic, so there's no run-to-run jitter to excuse.
- When a change legitimately **raises** coverage, run `make test-coverage-baseline` and **commit** the updated `coverage-baseline.txt` so the ratchet moves up. Never lower the baseline by hand.
- If you can't get coverage back to baseline, the fix is to **add tests**, not to edit the baseline.
Rules (both gates):
- **Install the hooks:** `make install-hooks` once per clone so lint + coverage run pre-commit. Don't lean on CI for what the hook catches.
- **Don't work around the gate:** never `git commit --no-verify`, and never hand-lower a baseline or widen a tolerance to turn a red gate green. The ratchet only moves up.
- If a change drops coverage, **add tests** (sort `coverage-summary.json` by line% ascending to find untested code) rather than editing the baseline. When coverage legitimately rises, commit the regenerated baseline (`make test-coverage-baseline` / `test-ui-coverage-baseline`).
- The Go gate is **strict — no tolerance**; `covermode=atomic` keeps it deterministic. The UI gate keeps a small tolerance only because its e2e coverage isn't.


@@ -50,6 +50,17 @@ Do not mix styles within a package. If you are extending tests in a package that
This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).
## Outbound HTTP
All outbound HTTP must go through `github.com/mudler/LocalAI/pkg/httpclient` rather than the standard library's default client. Use `httpclient.New(...)` (no body deadline — safe for streaming/SSE) or `httpclient.NewWithTimeout(d, ...)` (simple request/response). Both **refuse redirects by default** and set a TLS 1.2 floor.
The reason is GHSA-3mj3-57v2-4636: the standard library's default client follows redirects, and on a *cross-host* redirect Go forwards custom credential headers (e.g. Anthropic's `x-api-key`) to the redirect target, leaking the secret. `httpclient` fails closed instead.
- Need to follow redirects (download CDNs, registry blobs, GitHub asset URLs)? Pass `httpclient.WithFollowRedirects()` — it still strips credential headers on any cross-host hop.
- Have a custom transport (IP-pinned dialer, HTTP/2 tuning, a credential-injecting `RoundTripper`)? Pass `httpclient.WithTransport(rt)`, basing the transport on `httpclient.HardenedTransport()` to keep the TLS floor. Handed a `*http.Client` by a library? `httpclient.Harden(c)` applies the policy in place.
This is enforced by `forbidigo` (see `.golangci.yml`): `http.DefaultClient` and `http.Get`/`Post`/`PostForm`/`Head` are flagged. The `&http.Client{}` composite literal can't be matched precisely by forbidigo without also flagging legitimate `*http.Client` type references, so that form is caught by review — don't construct raw clients.
## Documentation
The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.


@@ -68,6 +68,34 @@ go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
CI does not load the model; the suite is opt-in via env vars.
## Distributed mode
ds4 supports **layer-split** distributed inference (a model too big for one host,
split by transformer layer; the GGUF must be present on every machine, each loads
only its slice). Topology is **inverted** vs llama.cpp: the coordinator listens,
workers dial in.
- **`ds4-worker` binary**: built and packaged next to `grpc-server` (`package.sh`
copies it into `package/`). Links the same engine objects plus `ds4_distributed.o`;
**no gRPC/protobuf dependency** (speaks ds4's own TCP transport), so it builds
even where `grpc-server` can't. Runs the worker serving loop (`ds4_dist_run`).
- **Coordinator wiring**: the ds4 `grpc-server` acts as coordinator when `LoadModel`
`ModelOptions.Options` (from model-YAML `options:`) carry:
- `ds4_role:coordinator` (enables distributed mode; absent → single-node, back-compat)
- `ds4_layers:0:19` (coordinator's own slice, inclusive; `N:output` includes the head)
- `ds4_listen:0.0.0.0:1234` (address workers dial into)
- `ds4_route_timeout:60` (optional; seconds Predict/PredictStream wait for the route
to form before returning gRPC `UNAVAILABLE`; default 60)
- **Worker CLI**: `local-ai worker ds4-distributed -- <ds4-worker args>` resolves the
ds4 backend and execs the packaged `ds4-worker` (raw passthrough), e.g.
`--role worker --model /models/ds4flash.gguf --layers 20:output --coordinator <host> 1234`.
Opt-in e2e in `tests/e2e-backends/backend_test.go`, gated by
`BACKEND_TEST_DS4_DISTRIBUTED=1` (plus `BACKEND_TEST_DS4_WORKER_BINARY`,
`BACKEND_TEST_DS4_WORKER_LAYERS`, `BACKEND_TEST_DS4_COORDINATOR_LAYERS`,
`BACKEND_TEST_DS4_LISTEN`). Design spec:
`docs/superpowers/specs/2026-05-30-ds4-distributed-inference-design.md`.
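
To make the `key:value` shape of these options concrete, here is a hedged Go sketch of how such strings could be split. The real coordinator parses them inside the ds4 `grpc-server`, so `parseOptions` and `routeTimeoutSeconds` are purely illustrative names; only the option keys and the default of 60 come from the text above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseOptions splits each option on its FIRST colon only, so values
// such as "0:19" or "0.0.0.0:1234" keep their own colons.
func parseOptions(opts []string) map[string]string {
	out := make(map[string]string, len(opts))
	for _, o := range opts {
		if k, v, ok := strings.Cut(o, ":"); ok {
			out[k] = v
		}
	}
	return out
}

// routeTimeoutSeconds returns ds4_route_timeout, or the documented
// default of 60 when it is absent or not a number.
func routeTimeoutSeconds(o map[string]string) int {
	if n, err := strconv.Atoi(o["ds4_route_timeout"]); err == nil {
		return n
	}
	return 60
}

func main() {
	o := parseOptions([]string{
		"ds4_role:coordinator",
		"ds4_layers:0:19",
		"ds4_listen:0.0.0.0:1234",
	})
	fmt.Println("distributed:", o["ds4_role"] == "coordinator")
	fmt.Println("layers:", o["ds4_layers"], "listen:", o["ds4_listen"])
	fmt.Println("route timeout:", routeTimeoutSeconds(o))
}
```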
## Importer
`core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by


@@ -716,6 +716,32 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -1556,6 +1582,32 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1569,6 +1621,32 @@ include:
backend: "whisper"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-crispasr'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-parakeet-cpp'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -2850,6 +2928,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -2864,6 +2956,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-crispasr'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2877,6 +2983,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-crispasr'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2890,6 +3009,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-crispasr'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2904,6 +3036,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2918,6 +3064,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-crispasr'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
@@ -2931,6 +3091,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-crispasr'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2944,6 +3117,128 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-crispasr'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
runs-on: 'ubuntu-latest'
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# parakeet-cpp
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-parakeet-cpp'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-parakeet-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-parakeet-cpp'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-parakeet-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-parakeet-cpp'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
runs-on: 'ubuntu-latest'
skip-drivers: 'false'
backend: "parakeet-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# acestep-cpp
- build-type: ''
cuda-major-version: ""
@@ -3976,6 +4271,14 @@ includeDarwin:
tag-suffix: "-metal-darwin-arm64-whisper"
build-type: "metal"
lang: "go"
- backend: "crispasr"
tag-suffix: "-metal-darwin-arm64-crispasr"
build-type: "metal"
lang: "go"
- backend: "parakeet-cpp"
tag-suffix: "-metal-darwin-arm64-parakeet-cpp"
build-type: "metal"
lang: "go"
- backend: "acestep-cpp"
tag-suffix: "-metal-darwin-arm64-acestep-cpp"
build-type: "metal"

@@ -3,6 +3,7 @@ package main
import (
"context"
"encoding/json"
"errors"
"fmt"
"os"
"strconv"
@@ -113,6 +114,17 @@ func main() {
fmt.Println("Searching for trending models on HuggingFace...")
rawModels, err := client.GetTrending(searchTerm, limit)
if err != nil {
if errors.Is(err, hfapi.ErrRateLimited) {
fmt.Printf("HuggingFace API is rate limited after retries, skipping this run: %v\n", err)
writeSummary(AddedModelSummary{
SearchTerm: searchTerm,
TotalFound: 0,
ModelsAdded: 0,
Quantization: quantization,
ProcessingTime: time.Since(startTime).String(),
})
return
}
fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
os.Exit(1)
}
@@ -277,4 +289,3 @@ func truncateString(s string, maxLen int) string {
}
return s[:maxLen] + "..."
}

@@ -30,6 +30,14 @@ jobs:
variable: "WHISPER_CPP_VERSION"
branch: "master"
file: "backend/go/whisper/Makefile"
- repository: "CrispStrobe/CrispASR"
variable: "CRISPASR_VERSION"
branch: "main"
file: "backend/go/crispasr/Makefile"
- repository: "mudler/parakeet.cpp"
variable: "PARAKEET_VERSION"
branch: "master"
file: "backend/go/parakeet-cpp/Makefile"
- repository: "leejet/stable-diffusion.cpp"
variable: "STABLEDIFFUSION_GGML_VERSION"
branch: "master"

@@ -18,7 +18,7 @@ jobs:
if: ${{ github.actor != 'dependabot[bot]' }}
- name: Run Gosec Security Scanner
if: ${{ github.actor != 'dependabot[bot]' }}
uses: securego/gosec@v2.22.9
uses: securego/gosec@v2.27.1
with:
# Let the report content, not the exit code, trigger a failure through GitHub's security features.
args: '-no-fail -fmt sarif -out results.sarif ./...'

@@ -46,6 +46,7 @@ jobs:
speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
whisper: ${{ steps.detect.outputs.whisper }}
parakeet-cpp: ${{ steps.detect.outputs.parakeet-cpp }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
@@ -633,6 +634,26 @@ jobs:
- name: Build whisper backend image and run transcription gRPC e2e tests
run: |
make test-extra-backend-whisper-transcription
# Parakeet ASR via the parakeet-cpp backend (C++/ggml port of NeMo
# Parakeet). Drives AudioTranscription (offline, with word timestamps) on
# tdt_ctc-110m + the JFK 11s clip.
tests-parakeet-cpp-grpc-transcription:
needs: detect-changes
if: needs.detect-changes.outputs.parakeet-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build parakeet-cpp backend image and run transcription gRPC e2e tests
run: |
make test-extra-backend-parakeet-cpp-transcription
# VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
# TTSStream (PCM chunks) on the e2e-backends harness.
tests-sherpa-onnx-grpc-tts:

@@ -56,6 +56,20 @@ linters:
# are exempt — see linters.exclusions.rules below.
- pattern: '^os\.(Getenv|LookupEnv|Environ)$'
msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
# Outbound HTTP must go through pkg/httpclient, which refuses redirects
# by default and sets a TLS floor. The std-library default client and
# the http.Get/Post/... convenience helpers follow redirects (up to 10)
# and, on a cross-host redirect, forward custom credential headers such
# as Anthropic's x-api-key to the redirect target — leaking the secret
# (GHSA-3mj3-57v2-4636). forbidigo can't precisely match the
# `&http.Client{}` composite literal without also flagging legitimate
# `*http.Client` type references, so that form is enforced by
# convention + review; these two patterns catch the implicit-default
# client, which is the common footgun.
- pattern: '^http\.DefaultClient$'
msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.DefaultClient — the std client follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
- pattern: '^http\.(Get|Post|PostForm|Head)$'
msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.Get/Post/PostForm/Head — these use http.DefaultClient, which follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
exclusions:
paths:
# Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
@@ -95,3 +109,18 @@ linters:
- path: _test\.go$
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# pkg/httpclient is the sanctioned home for outbound HTTP clients; it
# necessarily references net/http directly.
- path: ^pkg/httpclient/
text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
linters: [forbidigo]
# Tests drive local httptest servers where redirect/TLS hardening is
# irrelevant; the std client is fine there.
- path: _test\.go$
text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
linters: [forbidigo]
# Vendored upstream whisper.cpp Go bindings are a separate module and
# cannot import pkg/httpclient.
- path: ^backend/go/whisper/sources/
text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
linters: [forbidigo]

@@ -35,6 +35,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
## Quick Reference
- **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
- **Logging**: Use `github.com/mudler/xlog` (same API as slog)
- **Go style**: Prefer `any` over `interface{}`
- **Comments**: Explain *why*, not *what*

@@ -266,6 +266,12 @@ The e2e tests run LocalAI in a Docker container and exercise the API:
make test-e2e
```
### React UI tests and coverage
The React UI (`core/http/react-ui/`) is covered by Playwright e2e specs, gated by a **monotonic line-coverage ratchet** (`make test-ui-coverage-check`, run in CI and pre-commit). The metric is non-deterministic — a fast local box reads higher than a slow CI runner for the same code — so a small tolerance is unavoidable.
**If your change lowers UI coverage, raise it back by adding specs — do not widen the tolerance or hand-lower the baseline.** A *render-smoke* spec (navigate to a page, assert its header is visible) cheaply covers an entire lazy page. See `core/http/react-ui/e2e/page-render-smoke.spec.js` and the full policy in [.agents/building-and-testing.md](.agents/building-and-testing.md#react-ui-coverage).
### Running E2E container tests
These tests build a standard LocalAI Docker image and run it with pre-configured model configs to verify that most endpoints work correctly:

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
GOCMD=go
GOTEST=$(GOCMD) test
@@ -309,13 +309,20 @@ run-e2e-aio: protogen-go
@echo 'Running e2e AIO tests'
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio
# Distributed architecture e2e (PostgreSQL + NATS via testcontainers).
# Includes NatsJWT specs (JWT-enabled NATS). Requires Docker.
# VLLMMultinode is excluded here; use test-e2e-vllm-multinode for that.
test-e2e-distributed: protogen-go
@echo 'Running distributed e2e tests (label Distributed, incl. NatsJWT)'
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='Distributed && !VLLMMultinode' --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e/distributed
# vLLM multi-node DP smoke (CPU). Builds local-ai:tests and the
# cpu-vllm backend from the current working tree, then drives a
# head + headless follower via testcontainers-go and asserts a chat
# completion. BuildKit caches both images, so re-runs only rebuild
# what changed. The test lives under tests/e2e/distributed and is
# selected by the VLLMMultinode label so it doesn't run alongside
# the other distributed-suite tests by default.
# test-e2e-distributed.
test-e2e-vllm-multinode: docker-build-e2e extract-backend-vllm protogen-go
@echo 'Running e2e vLLM multi-node DP test'
LOCALAI_IMAGE=local-ai \
@@ -991,6 +998,19 @@ test-extra-backend-whisper-transcription: docker-build-whisper
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
## Audio transcription wrapper for the parakeet-cpp (parakeet.cpp ggml port)
## backend. Mirrors test-extra-backend-whisper-transcription: drives the
## AudioTranscription / AudioTranscriptionStream RPCs against a published
## Parakeet GGUF using the JFK 11s clip from whisper.cpp's CI samples. Not
## part of the default test suite - run explicitly once the pinned model URL
## is reachable.
test-extra-backend-parakeet-cpp-transcription: docker-build-parakeet-cpp
BACKEND_IMAGE=local-ai-backend:parakeet-cpp \
BACKEND_TEST_MODEL_URL=https://huggingface.co/mudler/parakeet-cpp-gguf/resolve/main/tdt_ctc-110m-f16.gguf \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
## LocalVQE audio transform (joint AEC + noise suppression + dereverb).
## Exercises the audio_transform capability end-to-end: batch transform
## of a real WAV fixture and bidi streaming of synthetic silent frames.
@@ -1149,6 +1169,8 @@ BACKEND_HUGGINGFACE = huggingface|golang|.|false|true
BACKEND_SILERO_VAD = silero-vad|golang|.|false|true
BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|true
BACKEND_WHISPER = whisper|golang|.|false|true
BACKEND_CRISPASR = crispasr|golang|.|false|true
BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
@@ -1236,6 +1258,8 @@ $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
$(eval $(call generate-docker-build-target,$(BACKEND_SILERO_VAD)))
$(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
$(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
$(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
$(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
$(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
@@ -1285,7 +1309,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
########################################################
### Mock Backend for E2E Tests
@@ -1313,6 +1337,13 @@ build-ui-test-server: build-mock-backend react-ui protogen-go
test-ui-e2e: build-ui-test-server
cd core/http/react-ui && npm install && npx playwright install --with-deps chromium && npx playwright test
## Optional Playwright worker count for the UI e2e targets below. Pass
## UI_TEST_WORKERS=N (e.g. `make test-ui-coverage UI_TEST_WORKERS=20`) to
## override Playwright's default (cores/2). Empty by default so Playwright
## picks its own worker count.
UI_TEST_WORKERS ?=
PLAYWRIGHT_WORKERS_FLAG = $(if $(UI_TEST_WORKERS),--workers=$(UI_TEST_WORKERS),)
## Fast Playwright e2e run used by the pre-commit hook on React UI changes.
## Force-rebuilds the (non-instrumented) dist so the suite tests the working
## tree — not a stale dist the `react-ui` skip-guard would leave — re-embeds
@@ -1322,22 +1353,24 @@ test-ui-e2e: build-ui-test-server
test-ui: build-mock-backend protogen-go
cd core/http/react-ui && bun install && bun run build
$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui
cd core/http/react-ui && sh $(CURDIR)/scripts/ensure-playwright-browser.sh && bunx playwright test
cd core/http/react-ui && sh $(CURDIR)/scripts/ensure-playwright-browser.sh && bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG)
## React UI code coverage from the Playwright e2e suite. Builds an
## istanbul-instrumented bundle (COVERAGE=true), re-embeds it into the
## ui-test-server (the dist is //go:embed'ed at compile time), runs the
## Playwright specs which harvest window.__coverage__ via the coverage
## fixture — and writes an nyc report to core/http/react-ui/coverage/.
## Removes the instrumented dist afterwards so normal builds aren't served
## instrumented assets.
## React UI code coverage from the Playwright e2e suite. Builds a
## NON-instrumented bundle with source maps (COVERAGE_V8=true), re-embeds it
## into the ui-test-server (the dist is //go:embed'ed at compile time), runs the
## Playwright specs which collect native Chromium V8 coverage (PW_V8_COVERAGE=1)
## — far cheaper than istanbul's build-time counters (~40% faster end-to-end) —
## convert it to istanbul via v8-to-istanbul in the coverage fixture, and write
## an nyc report to core/http/react-ui/coverage/. Removes the dist afterwards so
## normal builds aren't served source-mapped assets. (The legacy istanbul path
## still exists: `bun run build:coverage` + unset PW_V8_COVERAGE.)
test-ui-coverage: build-mock-backend protogen-go
trap 'rm -rf "$(CURDIR)/core/http/react-ui/dist"' EXIT; \
( cd core/http/react-ui && bun install && bun run build:coverage ) && \
( cd core/http/react-ui && bun install && bun run build:coverage-v8 ) && \
$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui && \
( cd core/http/react-ui && rm -rf .nyc_output coverage && \
sh $(CURDIR)/scripts/ensure-playwright-browser.sh && \
bunx playwright test && bun run coverage:report )
PW_V8_COVERAGE=1 bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG) && bun run coverage:report )
## UI coverage baseline (committed) and the strict gate that compares against
## it — the React mirror of test-coverage-baseline / test-coverage-check.

@@ -31,12 +31,18 @@
**LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
- **Drop-in API compatibility** — OpenAI, Anthropic, ElevenLabs APIs
- **36+ backends** — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
- **Any hardware** — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
- **Multi-user ready** — API key auth, user quotas, role-based access
- **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
- **Privacy-first** — your data never leaves your infrastructure
**A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
- **Composable by design**: backends are separate and pulled on demand, so you install only what your model needs
- **Open and extensible**: load any model, or build your own backend in any language against an open interface
- **Drop-in API compatibility**: OpenAI, Anthropic, and ElevenLabs APIs across every backend
- **Any model, any modality**: LLMs, vision, voice, image, and video behind one API
- **Any hardware**: NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
- **Multi-user ready**: API key auth, user quotas, role-based access
- **Built-in AI agents**: autonomous agents with tool use, RAG, MCP, and skills
- **Privacy-first**: your data never leaves your infrastructure
![A small LocalAI core with backends (llama.cpp, vLLM, MLX, whisper.cpp, stable-diffusion, kokoro, parakeet.cpp...) plugged in as separate on-demand images](docs/static/images/diagrams/composable-core.png)
Created by [Ettore Di Giacinto](https://github.com/mudler) and maintained by the [LocalAI team](#team).

@@ -537,6 +537,15 @@ message TTSRequest {
string dst = 3;
string voice = 4;
optional string language = 5;
// instructions is a free-form, per-request style/voice description (maps to
// the OpenAI `instructions` field). Backends that support expressive synthesis
// (e.g. Qwen3-TTS CustomVoice/VoiceDesign) prefer this over the static YAML
// option when set; backends that don't simply ignore it.
optional string instructions = 6;
// params carries optional, backend-specific per-request generation parameters
// (e.g. Chatterbox exaggeration/cfg_weight/temperature). Values are strings and
// coerced by the backend; unset leaves the backend's configured defaults.
map<string, string> params = 7;
}
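
The `params` comment above says values arrive as strings and are coerced by the backend, with unset keys leaving the configured defaults. A minimal sketch of that coercion on the receiving side follows. `floatParam` is a hypothetical helper, and falling back to the default on an unparsable value is an assumption; the proto comment only specifies the unset case.

```go
package main

import (
	"fmt"
	"strconv"
)

// floatParam returns params[key] parsed as a float64. It returns def when the
// key is absent (the documented behavior) or when the value is not a valid
// number (an assumption made for this sketch).
func floatParam(params map[string]string, key string, def float64) float64 {
	raw, ok := params[key]
	if !ok {
		return def
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return def
	}
	return v
}

func main() {
	// Keys mirror the Chatterbox examples named in the proto comment.
	params := map[string]string{"exaggeration": "0.7", "cfg_weight": "oops"}

	fmt.Println(floatParam(params, "exaggeration", 0.5)) // set and valid
	fmt.Println(floatParam(params, "cfg_weight", 0.5))   // set but invalid
	fmt.Println(floatParam(params, "temperature", 0.8))  // unset
}
```

Keeping the wire type as `map<string, string>` lets new backend-specific knobs be added without a schema change, at the cost of each backend validating its own values.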
message VADRequest {

@@ -2,6 +2,7 @@ ds4/
build/
package/
grpc-server
ds4-worker
*.o
backend.pb.cc
backend.pb.h

@@ -60,6 +60,11 @@ elseif(DS4_GPU STREQUAL "cpu")
set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
endif()
# ds4.c now references ds4_distributed.c (distributed inference was split into
# its own translation unit upstream). It is a single GPU-agnostic object shared
# by every GPU mode, so link it in regardless of DS4_GPU.
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_distributed.o")
add_executable(${TARGET}
grpc-server.cpp
dsml_parser.cpp
@@ -99,3 +104,36 @@ if(DS4_NATIVE)
target_compile_options(${TARGET} PRIVATE -march=native)
endif()
endif()
# ds4-worker: standalone distributed worker. Links the same ds4 engine objects
# (including ds4_distributed.o) but has NO gRPC/protobuf dependency - it speaks
# ds4's own TCP transport via ds4_dist_run(). Buildable wherever the engine
# objects build, even on hosts without protobuf/grpc dev headers.
add_executable(ds4-worker worker_main.c)
target_include_directories(ds4-worker PRIVATE ${DS4_DIR})
foreach(obj ${DS4_OBJS})
target_sources(ds4-worker PRIVATE ${obj})
set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
endforeach()
# worker_main.c is C, but the engine objects built by nvcc (ds4_cuda.o) and the
# Metal path (ds4_metal.o, Obj-C++) reference the C++ runtime (libstdc++). Force
# the C++ linker driver so those symbols resolve; the C driver would not link
# libstdc++ and the CUDA/Metal builds fail with undefined std:: references.
set_target_properties(ds4-worker PROPERTIES LINKER_LANGUAGE CXX)
target_link_libraries(ds4-worker PRIVATE Threads::Threads m)
if(DS4_GPU STREQUAL "cuda")
target_link_libraries(ds4-worker PRIVATE CUDA::cudart CUDA::cublas)
elseif(DS4_GPU STREQUAL "metal")
target_link_libraries(ds4-worker PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
elseif(DS4_GPU STREQUAL "cpu")
target_compile_definitions(ds4-worker PRIVATE DS4_NO_GPU)
endif()
if(DS4_NATIVE)
if(APPLE)
target_compile_options(ds4-worker PRIVATE -mcpu=native)
else()
target_compile_options(ds4-worker PRIVATE -march=native)
endif()
endif()

@@ -1,10 +1,10 @@
# ds4 backend Makefile.
#
# Upstream pin lives below as DS4_VERSION?=e8e8779b261c10f36ad6270ba732c8f0be5b62e3
# Upstream pin lives below as DS4_VERSION?=477c0e82e2699b35a65fd0a1ed6fe66b41087dfe
# (.github/bump_deps.sh) can find and update it - matches the
# llama-cpp / ik-llama-cpp / turboquant convention.
DS4_VERSION?=e8e8779b261c10f36ad6270ba732c8f0be5b62e3
DS4_VERSION?=477c0e82e2699b35a65fd0a1ed6fe66b41087dfe
DS4_REPO?=https://github.com/antirez/ds4
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
@@ -18,16 +18,19 @@ UNAME_S := $(shell uname -s)
CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
# ds4_distributed.o is a GPU-agnostic translation unit that ds4.c/ds4_cpu.o now
# reference (upstream split distributed inference into its own .c). The same
# object is shared by every GPU mode, so it is appended unconditionally below.
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS += -DDS4_GPU=cuda
DS4_OBJ_TARGET := ds4.o ds4_cuda.o
DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o
else ifeq ($(UNAME_S),Darwin)
CMAKE_ARGS += -DDS4_GPU=metal
DS4_OBJ_TARGET := ds4.o ds4_metal.o
DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o
else
# CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
CMAKE_ARGS += -DDS4_GPU=cpu
DS4_OBJ_TARGET := ds4_cpu.o
DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o
endif
ifneq ($(NATIVE),true)
@@ -52,17 +55,18 @@ ds4:
# the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
ds4/ds4.o: ds4
ifeq ($(BUILD_TYPE),cublas)
+$(MAKE) -C ds4 ds4.o ds4_cuda.o
+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o
else ifeq ($(UNAME_S),Darwin)
+$(MAKE) -C ds4 ds4.o ds4_metal.o
+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o
else
+$(MAKE) -C ds4 ds4_cpu.o
+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o
endif
grpc-server: ds4/ds4.o
mkdir -p $(BUILD_DIR)
cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
cp $(BUILD_DIR)/grpc-server grpc-server
cp $(BUILD_DIR)/ds4-worker ds4-worker
package: grpc-server
bash package.sh
@@ -71,7 +75,7 @@ test:
@echo "ds4 backend: e2e coverage at tests/e2e-backends/ (BACKEND_BINARY mode)"
clean:
rm -rf $(BUILD_DIR) grpc-server package
rm -rf $(BUILD_DIR) grpc-server ds4-worker package
if [ -d ds4 ]; then $(MAKE) -C ds4 clean; fi
purge: clean

@@ -23,8 +23,11 @@ extern "C" {
#include <atomic>
#include <chrono>
#include <climits>
#include <csignal>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>
#include <memory>
#include <mutex>
@@ -51,6 +54,12 @@ ds4_session *g_session = nullptr;
int g_ctx_size = 32768;
std::string g_kv_cache_dir; // empty disables disk cache
// Distributed coordinator state. g_distributed is set true when LoadModel is
// given 'ds4_role:coordinator'; generation then waits for the worker route to
// form before running. Single-node behavior is unchanged when unset.
bool g_distributed = false;
int g_route_timeout_sec = 60;
std::atomic<Server *> g_server{nullptr};
// Parse a "key:value" option string. Returns empty when no colon.
@@ -60,6 +69,77 @@ static std::pair<std::string, std::string> split_option(const std::string &opt)
return {opt.substr(0, colon), opt.substr(colon + 1)};
}
// Parse a positive base-10 integer. Returns false (without throwing) on empty,
// trailing garbage, non-positive, or overflow - unlike std::stoi.
static bool parse_positive_int(const std::string &s, int *out) {
if (s.empty()) return false;
char *end = nullptr;
long v = std::strtol(s.c_str(), &end, 10);
if (!end || *end != '\0' || v <= 0 || v > INT_MAX) return false;
*out = static_cast<int>(v);
return true;
}
// Parse a ds4 layer spec "START:END" or "START:output" into the engine's
// distributed layer fields. Returns false on malformed input.
static bool parse_layers_spec(const std::string &spec, ds4_distributed_layers *out) {
auto colon = spec.find(':');
if (colon == std::string::npos) return false;
std::string lhs = spec.substr(0, colon);
std::string rhs = spec.substr(colon + 1);
if (lhs.empty() || rhs.empty()) return false;
char *end = nullptr;
long start = std::strtol(lhs.c_str(), &end, 10);
if (!end || *end != '\0' || start < 0) return false;
out->start = static_cast<uint32_t>(start);
out->has_output = false;
if (rhs == "output") {
out->has_output = true;
out->end = out->start; // engine treats has_output as "through final layer"
} else {
long e = std::strtol(rhs.c_str(), &end, 10);
if (!end || *end != '\0' || e < start) return false;
out->end = static_cast<uint32_t>(e);
}
out->set = true;
return true;
}
// When acting as a distributed coordinator, block until the worker route
// covers all layers (ds4_session_distributed_route_ready == 1) or the timeout
// elapses. Returns an empty string on success, or an error message to return
// to the client. No-op when not distributed.
//
// Takes the g_engine_mu lock by reference and RELEASES it during each poll
// sleep. The wait can span up to g_route_timeout_sec seconds while workers
// connect; holding g_engine_mu the whole time would block the Status/Health
// readiness probes (they also lock g_engine_mu), making LocalAI's loader treat
// a still-starting worker as hung.
static std::string wait_route_ready(std::unique_lock<std::mutex> &lock) {
if (!g_distributed) return "";
char err[256] = {0};
const int deadline_polls = g_route_timeout_sec * 10; // 100ms per poll
for (int i = 0; i <= deadline_polls; ++i) {
int ready = ds4_session_distributed_route_ready(g_session, err, sizeof(err));
if (ready == 1) return "";
if (ready < 0) {
return std::string("ds4 distributed route error: ") +
(err[0] ? err : "unknown");
}
// Release the lock while sleeping so Status/Health and other RPCs can
// interleave during worker startup.
lock.unlock();
struct timespec ts = {0, 100L * 1000L * 1000L}; // 100ms
nanosleep(&ts, nullptr);
lock.lock();
// A concurrent Free() may have torn down the engine while we slept.
if (!g_engine || !g_session) {
return "ds4: model unloaded while waiting for distributed route";
}
}
return "ds4 distributed route incomplete: workers not connected (layers uncovered)";
}
static void append_token_text(ds4_engine *engine, int token, std::string &out) {
size_t len = 0;
const char *text = ds4_token_text(engine, token, &len);
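The hunk above adds two small parsers: one for positive integers and one for the "START:END" / "START:output" layer spec. The following is a minimal standalone sketch of the same validation idea, not the backend's actual code. `LayerRange`, `parse_u32`, and `parse_layer_range` are hypothetical names, and `LayerRange` only models the fields the diff assigns on `ds4_distributed_layers`. Unlike the diff, this sketch checks `errno` for range errors and writes to the output only on success.

```cpp
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for the engine's ds4_distributed_layers struct;
// only the fields assigned in the diff are modelled.
struct LayerRange {
    uint32_t start = 0;
    uint32_t end = 0;
    bool has_output = false;
    bool set = false;
};

// Base-10 parse that rejects empty input, trailing garbage, negative
// values, and values above UINT32_MAX. Never throws.
static bool parse_u32(const std::string &s, uint32_t *out) {
    if (s.empty()) return false;
    errno = 0;
    char *end = nullptr;
    const long long v = std::strtoll(s.c_str(), &end, 10);
    if (errno != 0 || end == s.c_str() || *end != '\0') return false;
    if (v < 0 || v > static_cast<long long>(UINT32_MAX)) return false;
    *out = static_cast<uint32_t>(v);
    return true;
}

// Parses "START:END" or "START:output". On failure *out is left untouched.
static bool parse_layer_range(const std::string &spec, LayerRange *out) {
    const auto colon = spec.find(':');
    if (colon == std::string::npos) return false;
    LayerRange r;
    if (!parse_u32(spec.substr(0, colon), &r.start)) return false;
    const std::string rhs = spec.substr(colon + 1);
    if (rhs == "output") {
        r.has_output = true;
        r.end = r.start; // mirrors the diff: has_output means "through final layer"
    } else if (!parse_u32(rhs, &r.end) || r.end < r.start) {
        return false;
    }
    r.set = true;
    *out = r;
    return true;
}
```

Building into a local `LayerRange` and copying it out at the end keeps a rejected spec from leaving a half-filled struct behind, which is one way to make the "returns false on malformed input" contract easier to rely on.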
@@ -377,6 +457,11 @@ public:
backend::Result *result) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
// Reset distributed state so a model swap (a second LoadModel without
// ds4_role) doesn't inherit a stale coordinator configuration.
g_distributed = false;
g_route_timeout_sec = 60;
if (g_engine) {
if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
ds4_engine_close(g_engine);
@@ -394,12 +479,23 @@ public:
std::string mtp_path;
int mtp_draft = 0;
float mtp_margin = 3.0f;
std::string ds4_role, ds4_layers, ds4_listen;
for (const auto &opt : request->options()) {
auto [k, v] = split_option(opt);
if (k == "mtp_path") mtp_path = v;
else if (k == "mtp_draft") mtp_draft = std::stoi(v);
else if (k == "mtp_margin") mtp_margin = std::stof(v);
else if (k == "kv_cache_dir") g_kv_cache_dir = v;
else if (k == "ds4_role") ds4_role = v;
else if (k == "ds4_layers") ds4_layers = v;
else if (k == "ds4_listen") ds4_listen = v;
else if (k == "ds4_route_timeout") {
if (!parse_positive_int(v, &g_route_timeout_sec)) {
result->set_success(false);
result->set_message("ds4: ds4_route_timeout must be a positive integer");
return GStatus::OK;
}
}
}
g_kv_cache.SetDir(g_kv_cache_dir);
@@ -422,6 +518,49 @@ public:
opt.backend = DS4_BACKEND_CUDA;
#endif
// Coordinator wiring. 'ds4_role:coordinator' enables layer-split
// distributed inference: this process listens on ds4_listen and owns
// the ds4_layers slice; workers dial in (see `local-ai worker
// ds4-distributed`). Absent ds4_role => unchanged single-node path.
// Must be static: opt.distributed.listen_host is a const char* the
// engine retains past this call, so it cannot point at a local that
// goes out of scope (otherwise a future "simplify to local" refactor
// reintroduces a dangling pointer).
static std::string s_listen_host;
if (ds4_role == "coordinator") {
if (ds4_layers.empty() || ds4_listen.empty()) {
result->set_success(false);
result->set_message("ds4: ds4_role:coordinator requires ds4_layers and ds4_listen");
return GStatus::OK;
}
// host:port for IPv4/hostname; IPv6 literals are unsupported (the
// first colon would split inside the address).
auto host_port = split_option(ds4_listen); // "host:port" -> {host, port}
if (host_port.second.empty()) {
result->set_success(false);
result->set_message("ds4: ds4_listen must be host:port");
return GStatus::OK;
}
int listen_port = 0;
if (!parse_positive_int(host_port.second, &listen_port)) {
result->set_success(false);
result->set_message("ds4: ds4_listen port must be a positive integer");
return GStatus::OK;
}
ds4_distributed_layers layers = {};
if (!parse_layers_spec(ds4_layers, &layers)) {
result->set_success(false);
result->set_message("ds4: invalid ds4_layers (want START:END or START:output)");
return GStatus::OK;
}
s_listen_host = host_port.first;
opt.distributed.role = DS4_DISTRIBUTED_COORDINATOR;
opt.distributed.layers = layers;
opt.distributed.listen_host = s_listen_host.c_str();
opt.distributed.listen_port = listen_port;
g_distributed = true;
}
int rc = ds4_engine_open(&g_engine, &opt);
if (rc != 0 || !g_engine) {
result->set_success(false);
@@ -458,10 +597,13 @@ public:
GStatus Predict(ServerContext *, const backend::PredictOptions *request,
backend::Reply *reply) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
std::unique_lock<std::mutex> lock(g_engine_mu);
if (!g_engine || !g_session) {
return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
}
if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
return GStatus(StatusCode::UNAVAILABLE, route_err);
}
ds4_tokens prompt = {};
build_prompt(g_engine, request, &prompt);
int n_predict = request->tokens() > 0 ? request->tokens() : 256;
@@ -554,10 +696,13 @@ public:
GStatus PredictStream(ServerContext *, const backend::PredictOptions *request,
ServerWriter<backend::Reply> *writer) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
std::unique_lock<std::mutex> lock(g_engine_mu);
if (!g_engine || !g_session) {
return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
}
if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
return GStatus(StatusCode::UNAVAILABLE, route_err);
}
ds4_tokens prompt = {};
build_prompt(g_engine, request, &prompt);
int n_predict = request->tokens() > 0 ? request->tokens() : 256;
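The switch from `std::lock_guard` to `std::unique_lock` in Predict and PredictStream exists so that `wait_route_ready` can drop the engine mutex while it sleeps. The sketch below isolates that pattern under stated assumptions: `wait_until_ready` and its callback parameters are hypothetical, the error strings are placeholders, and it uses a `steady_clock` deadline with `std::this_thread::sleep_for` rather than the diff's poll counter and `nanosleep`.

```cpp
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

// Polls `ready` while holding `lock`, but releases the lock for each sleep so
// other threads that need the same mutex (for example a health probe) can run.
// `ready` returns 1 when done, 0 to keep waiting, and a negative value on a
// hard error. `still_valid` is re-checked after re-locking, because shared
// state may have been torn down while the lock was released.
// Returns an empty string on success, otherwise a short error message.
static std::string wait_until_ready(std::unique_lock<std::mutex> &lock,
                                    const std::function<int()> &ready,
                                    const std::function<bool()> &still_valid,
                                    std::chrono::milliseconds timeout,
                                    std::chrono::milliseconds poll) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const int r = ready();
        if (r == 1) return "";
        if (r < 0) return "route error";
        if (std::chrono::steady_clock::now() >= deadline) return "timed out";
        lock.unlock();
        std::this_thread::sleep_for(poll);
        lock.lock();
        if (!still_valid()) return "unloaded while waiting";
    }
}
```

A wall-clock deadline also counts the time spent re-acquiring the mutex, whereas a fixed poll count only bounds the sleeps; which behavior is preferable depends on how contended the mutex is expected to be.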


@@ -5,7 +5,8 @@ REPO_ROOT="${CURDIR}/../../.."
mkdir -p "$CURDIR/package/lib"
cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
cp -rfv "$CURDIR/run.sh" "$CURDIR/package/"
cp -avf "$CURDIR/ds4-worker" "$CURDIR/package/"
cp -rfv "$CURDIR/run.sh" "$CURDIR/package/"
UNAME_S=$(uname -s)
if [ "$UNAME_S" = "Darwin" ]; then


@@ -0,0 +1,126 @@
// ds4-worker: standalone distributed worker for the LocalAI ds4 backend.
//
// A ds4 distributed worker owns a slice of the model's transformer layers,
// dials the coordinator, and serves activations for its slice. It does NOT
// speak backend.proto - it speaks ds4's own TCP transport via ds4_dist_run().
// This binary is intentionally minimal (no HTTP/web/kvstore/linenoise): it
// only needs the engine objects + ds4_distributed.o, which the backend already
// builds. It is launched by `local-ai worker ds4-distributed`.
//
// Usage:
// ds4-worker --role worker --model <gguf> --layers 20:output \
// --coordinator <host> <port> [--cpu|--cuda|--metal] [-c CTX] [-t N]
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <limits.h>
#include "ds4.h"
#include "ds4_distributed.h"
static const char *need_arg(int *i, int argc, char **argv, const char *flag) {
if (*i + 1 >= argc) {
fprintf(stderr, "ds4-worker: missing value for %s\n", flag);
exit(2);
}
return argv[++(*i)];
}
static int parse_int_arg(const char *s, const char *flag) {
char *end = NULL;
long v = strtol(s, &end, 10);
if (!s[0] || *end || v <= 0 || v > INT_MAX) {
fprintf(stderr, "ds4-worker: invalid value for %s: %s\n", flag, s);
exit(2);
}
return (int)v;
}
static ds4_backend default_backend(void) {
#if defined(DS4_NO_GPU)
return DS4_BACKEND_CPU;
#elif defined(__APPLE__)
return DS4_BACKEND_METAL;
#else
return DS4_BACKEND_CUDA;
#endif
}
int main(int argc, char **argv) {
signal(SIGPIPE, SIG_IGN);
ds4_engine_options opt = {0};
opt.backend = default_backend();
int ctx_size = 32768;
for (int i = 1; i < argc; i++) {
const char *arg = argv[i];
if (!strcmp(arg, "-h") || !strcmp(arg, "--help")) {
fprintf(stdout, "ds4-worker: standalone ds4 distributed worker\n");
ds4_dist_usage(stdout);
fprintf(stdout, " -m, --model PATH model GGUF (the worker loads only its --layers slice)\n");
fprintf(stdout, " -c, --ctx N context size (default 32768)\n");
fprintf(stdout, " -t, --threads N CPU threads\n");
fprintf(stdout, " --cpu|--cuda|--metal backend override\n");
return 0;
}
char dist_err[256] = {0};
ds4_dist_cli_parse_result dist_parse =
ds4_dist_parse_cli_arg(arg, &i, argc, argv, &opt.distributed,
dist_err, sizeof(dist_err));
if (dist_parse == DS4_DIST_CLI_ERROR) {
fprintf(stderr, "ds4-worker: %s\n",
dist_err[0] ? dist_err : "invalid distributed option");
return 2;
}
if (dist_parse == DS4_DIST_CLI_MATCHED) continue;
if (!strcmp(arg, "-m") || !strcmp(arg, "--model")) {
opt.model_path = need_arg(&i, argc, argv, arg);
} else if (!strcmp(arg, "-c") || !strcmp(arg, "--ctx")) {
ctx_size = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
} else if (!strcmp(arg, "-t") || !strcmp(arg, "--threads")) {
opt.n_threads = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
} else if (!strcmp(arg, "--cpu")) {
opt.backend = DS4_BACKEND_CPU;
} else if (!strcmp(arg, "--cuda")) {
opt.backend = DS4_BACKEND_CUDA;
} else if (!strcmp(arg, "--metal")) {
opt.backend = DS4_BACKEND_METAL;
} else {
fprintf(stderr, "ds4-worker: unknown option: %s\n", arg);
return 2;
}
}
if (opt.distributed.role != DS4_DISTRIBUTED_WORKER) {
fprintf(stderr, "ds4-worker: --role worker is required\n");
return 2;
}
if (!opt.model_path) {
fprintf(stderr, "ds4-worker: --model is required\n");
return 2;
}
char prep_err[256] = {0};
if (ds4_dist_prepare_engine_options(&opt.distributed, &opt,
prep_err, sizeof(prep_err)) != 0) {
fprintf(stderr, "ds4-worker: %s\n", prep_err);
return 2;
}
ds4_engine *engine = NULL;
if (ds4_engine_open(&engine, &opt) != 0 || !engine) {
fprintf(stderr, "ds4-worker: failed to open engine\n");
return 1;
}
ds4_dist_generation_options gen = {0};
gen.ctx_size = ctx_size;
int rc = ds4_dist_run(engine, &opt.distributed, &gen);
ds4_engine_close(engine);
return rc;
}


@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=d2da6da05c73aeb658a3d1751f386c24e6963856
IK_LLAMA_VERSION?=1520eda980564241434b791ce2bbbd128c4be9ea
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=


@@ -1,5 +1,5 @@
LLAMA_VERSION?=0d18aaa9d1a8af3df9abccd828e22eeaac7f840b
LLAMA_VERSION?=7c158fbb4aec1bdc9c81d6ca0e785139f4826fae
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=


@@ -573,8 +573,12 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// checkpoint_min_step: minimum spacing between context checkpoints in
// tokens (0 disables the minimum). Match upstream's default (256). This
// field was renamed from `checkpoint_every_nt` in llama.cpp; the semantics
// also shifted from a fixed cadence to a minimum spacing.
// also shifted from a fixed cadence to a minimum spacing. The turboquant
// fork branched before the field existed, so skip it on the legacy path
// (LOCALAI_LEGACY_LLAMA_CPP_SPEC is injected by patch-grpc-server.sh).
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
params.checkpoint_min_step = 256;
#endif
// decode options. Options take the form optname:optvalue, or just optname for booleans.
for (int i = 0; i < request->options_size(); i++) {
@@ -748,11 +752,18 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.cache_idle_slots = false;
}
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// --- minimum context-checkpoint spacing (upstream -cms / --checkpoint-min-step) ---
// 0 disables the minimum-spacing gate. Old option names (`checkpoint_every_nt`,
// `checkpoint_every_n_tokens`) are kept as aliases for backward compatibility
// with existing user configs: upstream renamed the field and shifted its
// semantics from a fixed cadence to a minimum spacing.
//
// Gated out for the turboquant fork, which lacks common_params::
// checkpoint_min_step. The leading `}` closing the cache_idle_slots
// branch is removed with this block; the next `} else if` (n_ubatch)
// then closes cache_idle_slots, so braces stay balanced under both
// preprocessor branches.
} else if (!strcmp(optname, "checkpoint_min_step") || !strcmp(optname, "checkpoint_min_spacing") ||
!strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
if (optval != NULL) {
@@ -762,6 +773,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// If conversion fails, keep default value (256)
}
}
#endif
// --- physical batch size (upstream -ub / --ubatch-size) ---
// Note: line ~482 already aliases n_ubatch to n_batch as a default; this
@@ -1165,9 +1177,15 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.tensor_buft_overrides.push_back({nullptr, nullptr});
}
}
// The draft tensor_buft_overrides are only populated under the modern
// (post-#22838) layout, whose population code is itself gated by
// LOCALAI_LEGACY_LLAMA_CPP_SPEC above. The turboquant fork lacks
// common_params_speculative::draft entirely, so skip the sentinel there too.
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
if (!params.speculative.draft.tensor_buft_overrides.empty()) {
params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
}
#endif
// TODO: Add yarn
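The comments in the hunks above describe gating one `} else if` branch behind `LOCALAI_LEGACY_LLAMA_CPP_SPEC` while keeping braces balanced in both preprocessor configurations. The sketch below shows that shape in isolation. `Params`, `apply_option`, and the macro `LEGACY_SPEC_SKETCH` are hypothetical, and unlike the real fork, the struct field stays present here so that only the branch is gated.

```cpp
#include <cstring>

// Hypothetical parameter struct. In the real legacy build the
// checkpoint_min_step field would not exist at all.
struct Params {
    bool cache_idle_slots = true;
    int checkpoint_min_step = 256;
    int n_ubatch = 512;
};

// Define LEGACY_SPEC_SKETCH before this point to compile the legacy variant.
static void apply_option(Params &p, const char *name, int value) {
    if (!std::strcmp(name, "cache_idle_slots")) {
        p.cache_idle_slots = value != 0;
#ifndef LEGACY_SPEC_SKETCH
    // The gated region starts with the `}` that closes the previous branch.
    } else if (!std::strcmp(name, "checkpoint_min_step")) {
        p.checkpoint_min_step = value;
#endif
    // With the region removed, this `}` closes cache_idle_slots instead,
    // so the chain is balanced either way.
    } else if (!std::strcmp(name, "n_ubatch")) {
        p.n_ubatch = value;
    }
}
```

Because the gated region opens with a closing brace and ends just before the next one, every branch in the chain can be added or removed by the preprocessor without touching its neighbors.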
@@ -1926,6 +1944,17 @@ public:
body_json["chat_template_kwargs"]["enable_thinking"] = (et_it->second == "true");
}
// Pass reasoning_effort via chat_template_kwargs too: it is the lever
// that jinja templates such as gpt-oss (Harmony) and LFM2.5 read,
// unlike enable_thinking, which those templates ignore.
auto re_it = metadata.find("reasoning_effort");
if (re_it != metadata.end() && !re_it->second.empty()) {
if (!body_json.contains("chat_template_kwargs")) {
body_json["chat_template_kwargs"] = json::object();
}
body_json["chat_template_kwargs"]["reasoning_effort"] = re_it->second;
}
// Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
SRV_DBG("[CONVERSATION DEBUG] PredictStream: Full body_json before oaicompat_chat_params_parse:\n%s\n", body_json.dump(2).c_str());
@@ -2186,7 +2215,15 @@ public:
// content element — attaching to both would duplicate the first
// token since oaicompat_msg_diffs is the same for both.
json first_res_json = first_result->to_json();
if (first_res_json.is_array()) {
// Upstream llama.cpp (ggml-org/llama.cpp#23884) now emits an initial
// "begin" partial whose to_json() returns null, used only to signal the
// HTTP layer to flush 200 status headers before any token. gRPC has no
// such concept, so there is nothing to emit — the real tokens arrive in
// the loop below. Feeding this null into build_reply_from_json would
// throw (uncaught) and surface as a generic RPC error.
if (first_res_json.is_null()) {
// skip the begin-of-stream marker
} else if (first_res_json.is_array()) {
for (const auto & res : first_res_json) {
auto reply = build_reply_from_json(res, first_result.get());
// Skip chat deltas for role-init elements (have "role" in
@@ -2216,7 +2253,10 @@ public:
}
json res_json = result->to_json();
if (res_json.is_array()) {
if (res_json.is_null()) {
// begin-of-stream marker (see note above) — nothing to emit
continue;
} else if (res_json.is_array()) {
for (const auto & res : res_json) {
auto reply = build_reply_from_json(res, result.get());
bool is_role_init = res.contains("choices") && !res["choices"].empty() &&
@@ -2708,6 +2748,17 @@ public:
body_json["chat_template_kwargs"]["enable_thinking"] = (predict_et_it->second == "true");
}
// Pass reasoning_effort via chat_template_kwargs too: it is the lever
// that jinja templates such as gpt-oss (Harmony) and LFM2.5 read,
// unlike enable_thinking, which those templates ignore.
auto predict_re_it = predict_metadata.find("reasoning_effort");
if (predict_re_it != predict_metadata.end() && !predict_re_it->second.empty()) {
if (!body_json.contains("chat_template_kwargs")) {
body_json["chat_template_kwargs"] = json::object();
}
body_json["chat_template_kwargs"]["reasoning_effort"] = predict_re_it->second;
}
// Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
SRV_DBG("[CONVERSATION DEBUG] Predict: Full body_json before oaicompat_chat_params_parse:\n%s\n", body_json.dump(2).c_str());
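Both Predict and PredictStream now copy `reasoning_effort` from the request metadata into `chat_template_kwargs` when it is present and non-empty. The sketch below mirrors that rule with plain `std::map` containers instead of the JSON type the backend uses; `Metadata`, `TemplateKwargs`, and `forward_reasoning_effort` are hypothetical names introduced only for illustration.

```cpp
#include <map>
#include <string>

// Stand-ins for the request metadata and the template kwargs object.
using Metadata = std::map<std::string, std::string>;
using TemplateKwargs = std::map<std::string, std::string>;

// Copies reasoning_effort into the template kwargs. A missing or empty
// value leaves the kwargs untouched, so templates fall back to their default.
static void forward_reasoning_effort(const Metadata &metadata,
                                     TemplateKwargs &kwargs) {
    const auto it = metadata.find("reasoning_effort");
    if (it != metadata.end() && !it->second.empty()) {
        kwargs["reasoning_effort"] = it->second;
    }
}
```

Skipping empty values means a client that sends the key without a level does not override whatever default the chat template applies.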


@@ -124,8 +124,11 @@ fi
# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
# grpc-server option parser skips the new option-handler blocks (ngram_mod,
# ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
# draft.tensor_buft_overrides) introduced for the post-#22838 layout. Those
# blocks reference struct fields that simply do not exist in the fork.
# draft.tensor_buft_overrides) introduced for the post-#22838 layout, the
# draft.tensor_buft_overrides sentinel termination, and the
# common_params::checkpoint_min_step default/option (added with the
# 35c9b1f3 bump). Those blocks reference struct fields that simply do not
# exist in the fork.
if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
else


@@ -192,6 +192,61 @@ var _ = Describe("Forward", func() {
Expect(<-gotAuth).To(Equal("Bearer sk-real"), "caller-supplied Basic header must be replaced")
})
It("refuses to follow upstream redirects and never leaks the key to the redirect target", func() {
// A 3xx from the configured upstream means misconfiguration or a
// hijacked/spoofed host. Following it would replay the request —
// and the injected API key — to the Location host. Anthropic's
// x-api-key is NOT stripped by Go on cross-host redirects, so this
// would be a credential leak. The proxy must refuse the redirect.
sinkHit := make(chan string, 1)
sink := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sinkHit <- r.Header.Get("x-api-key")
w.WriteHeader(http.StatusOK)
}))
defer sink.Close()
redirector := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
http.Redirect(w, r, sink.URL, http.StatusFound)
}))
defer redirector.Close()
GinkgoT().Setenv("CLOUD_PROXY_REDIRECT_KEY", "ant-secret")
cp := NewCloudProxy()
Expect(cp.Load(&pb.ModelOptions{
Proxy: &pb.ProxyOptions{
UpstreamUrl: redirector.URL,
Mode: modePassthrough,
Provider: providerAnthropic,
ApiKeyEnv: "CLOUD_PROXY_REDIRECT_KEY",
},
})).To(Succeed())
addr := "test://forward-no-redirect"
grpc.Provide(addr, cp)
c := grpc.NewClient(addr, true, nil, false)
stream, err := c.Forward(context.Background())
Expect(err).NotTo(HaveOccurred())
Expect(stream.Send(&pb.ForwardRequest{
Path: "/v1/messages",
Method: "POST",
})).To(Succeed())
Expect(stream.CloseSend()).To(Succeed())
// Drain the stream; a refused redirect surfaces as a non-EOF error.
var streamErr error
for {
if _, err := stream.Recv(); err != nil {
if !errors.Is(err, io.EOF) {
streamErr = err
}
break
}
}
Expect(streamErr).To(HaveOccurred(), "refused redirect must surface as an error")
Expect(sinkHit).NotTo(Receive(), "the redirect target must never be contacted")
})
It("handles concurrent calls without interference", func() {
// CloudProxy explicitly omits base.SingleThread — independent
// Forward streams must not block each other or leak state.


@@ -11,9 +11,12 @@ import (
"strings"
"sync/atomic"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/xlog"
"github.com/mudler/LocalAI/pkg/grpc/base"
"github.com/mudler/LocalAI/pkg/grpc/grpcerrors"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/httpclient"
)
// Mirror of core/config.Proxy{Mode,Provider}* — backends don't
@@ -48,10 +51,15 @@ type proxyConfig struct {
}
func NewCloudProxy() *CloudProxy {
// No Client-level Timeout — that would bound streaming SSE
// responses too, which can legitimately last minutes. Per-request
// deadlines come from the gRPC stream context.
return &CloudProxy{client: &http.Client{}}
// httpclient.New refuses redirects outright: the proxy talks to a
// single configured upstream API (OpenAI/Anthropic/...) that answers
// directly, so a 3xx means misconfiguration, a hijacked upstream, or
// DNS trickery — never normal operation. Following it would replay the
// request, including the operator's x-api-key (which Go does NOT strip
// on cross-host redirects), to an unvetted host and leak the key
// (GHSA-3mj3-57v2-4636). It also imposes no body deadline, so streaming
// SSE responses that legitimately last minutes are not truncated.
return &CloudProxy{client: httpclient.New()}
}
func (c *CloudProxy) Load(opts *pb.ModelOptions) error {
@@ -138,7 +146,7 @@ func resolveAPIKey(envName, filePath string) (string, error) {
func (c *CloudProxy) PredictRich(opts *pb.PredictOptions) (reply *pb.Reply, err error) {
cfg := c.cfg.Load()
if cfg == nil {
return nil, errors.New("cloud-proxy: model not loaded")
return nil, grpcerrors.ModelNotLoaded("cloud-proxy")
}
if cfg.mode != modeTranslate {
return nil, fmt.Errorf("cloud-proxy: Predict only valid in translate mode (have %s)", cfg.mode)
@@ -168,7 +176,7 @@ func (c *CloudProxy) PredictRich(opts *pb.PredictOptions) (reply *pb.Reply, err
func (c *CloudProxy) PredictStreamRich(opts *pb.PredictOptions, results chan<- *pb.Reply) (err error) {
cfg := c.cfg.Load()
if cfg == nil {
return errors.New("cloud-proxy: model not loaded")
return grpcerrors.ModelNotLoaded("cloud-proxy")
}
if cfg.mode != modeTranslate {
return fmt.Errorf("cloud-proxy: PredictStream only valid in translate mode (have %s)", cfg.mode)
@@ -262,7 +270,7 @@ func (c *CloudProxy) Forward(ctx context.Context, in <-chan *pb.ForwardRequest,
cfg := c.cfg.Load()
if cfg == nil {
return errors.New("cloud-proxy: model not loaded")
return grpcerrors.ModelNotLoaded("cloud-proxy")
}
if cfg.mode != modePassthrough {
return fmt.Errorf("cloud-proxy: Forward only valid in passthrough mode (have %s)", cfg.mode)
@@ -426,4 +434,3 @@ func isHopByHopHeader(name string) bool {
}
return false
}

backend/go/crispasr/.gitignore (new file, vendored)

@@ -0,0 +1,5 @@
sources
build*
libgocrispasr*.so
crispasr
package


@@ -0,0 +1,30 @@
cmake_minimum_required(VERSION 3.12)
project(gocrispasr LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
add_subdirectory(./sources/CrispASR)
add_library(gocrispasr MODULE cpp/crispasr_shim.cpp)
target_include_directories(gocrispasr PRIVATE
${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/include
${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/ggml/include)
# Link the same backend set as crispasr-cli (examples/cli/CMakeLists.txt) so
# the session API can dispatch to every compiled-in architecture, not just
# whisper. crispasr is the referencer; the backend static libs supply the
# per-architecture symbols; ggml is the math/runtime base.
target_link_libraries(gocrispasr PRIVATE
crispasr
parakeet canary canary_ctc cohere granite_speech granite_nle
voxtral voxtral4b qwen3_asr qwen3_tts orpheus chatterbox indextts
kokoro voxcpm2_tts m2m100 t5_translate wav2vec2-ggml vibevoice
silero-lid pyannote-seg funasr paraformer sensevoice
crisp_audio
ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
target_link_libraries(gocrispasr PRIVATE stdc++fs)
endif()
set_property(TARGET gocrispasr PROPERTY CXX_STANDARD 17)
set_target_properties(gocrispasr PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})


@@ -0,0 +1,132 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# CrispASR version (pinned commit SHA)
CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
CRISPASR_VERSION?=13d54e110e1538e0f0bc3af0680b9ab246cfb48d
SO_TARGET?=libgocrispasr.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
# Keep the build lean: no tests/examples/server/SDL2/curl/ffmpeg (the FROM scratch
# image cannot satisfy those runtime deps). All ASR/TTS model backends stay enabled.
CMAKE_ARGS+=-DCRISPASR_BUILD_TESTS=OFF -DCRISPASR_BUILD_EXAMPLES=OFF -DCRISPASR_BUILD_SERVER=OFF
CMAKE_ARGS+=-DCRISPASR_SDL2=OFF -DCRISPASR_CURL=OFF -DCRISPASR_FFMPEG=OFF
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DGGML_HIPBLAS=ON
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON
else ifeq ($(OS),Darwin)
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
endif
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
endif
sources/CrispASR:
mkdir -p sources/CrispASR
cd sources/CrispASR && \
git init && \
git remote add origin $(CRISPASR_REPO) && \
git fetch origin && \
git checkout $(CRISPASR_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# CrispASR's src/CMakeLists.txt locates its vendored llama.cpp
# (crispasr-llama-core, used by the chat C-ABI) via ${CMAKE_SOURCE_DIR},
# which assumes CrispASR is the top-level CMake project. We add_subdirectory
# it, so ${CMAKE_SOURCE_DIR} is THIS backend dir and the talk-llama sources
# aren't found. Rewrite to ${PROJECT_SOURCE_DIR} (the crispasr project root),
# which is correct both standalone and as a subproject. Idempotent.
sed -i 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt
# Detect OS
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S),Linux)
VARIANT_TARGETS = libgocrispasr-avx.so libgocrispasr-avx2.so libgocrispasr-avx512.so libgocrispasr-fallback.so
else
VARIANT_TARGETS = libgocrispasr-fallback.so
endif
crispasr: main.go gocrispasr.go $(VARIANT_TARGETS)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o crispasr ./
package: crispasr
bash package.sh
build: package
clean: purge
rm -rf libgocrispasr*.so package sources/CrispASR crispasr
purge:
rm -rf build*
ifeq ($(UNAME_S),Linux)
libgocrispasr-avx.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:avx${RESET})
SO_TARGET=libgocrispasr-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
rm -rfv build*
libgocrispasr-avx2.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:avx2${RESET})
SO_TARGET=libgocrispasr-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom
rm -rfv build*
libgocrispasr-avx512.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:avx512${RESET})
SO_TARGET=libgocrispasr-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom
rm -rfv build*
endif
libgocrispasr-fallback.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:fallback${RESET})
SO_TARGET=libgocrispasr-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
rm -rfv build*
libgocrispasr-custom: CMakeLists.txt cpp/crispasr_shim.cpp cpp/crispasr_shim.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \
cmake --build . --config Release -j$(JOBS) && \
cd .. && \
mv build-$(SO_TARGET)/libgocrispasr.so ./$(SO_TARGET)
test: crispasr
CGO_ENABLED=0 $(GOCMD) test -v ./...
all: crispasr package


@@ -0,0 +1,253 @@
#include "crispasr_shim.h"
#include "ggml-backend.h"
#include "crispasr.h"
#include <atomic>
#include <vector>
// Opaque session types. crispasr.h declares `struct crispasr_session;` but not
// the result type nor the open/transcribe/result accessors — those are
// CA_EXPORT extern "C" symbols in src/crispasr_c_api.cpp, so we forward-declare
// exactly the ones we use. Signatures verified against
// sources/CrispASR/src/crispasr_c_api.cpp.
struct crispasr_session_result;
extern "C" {
crispasr_session *crispasr_session_open(const char *model_path, int n_threads);
crispasr_session *crispasr_session_open_explicit(const char *model_path,
const char *backend_name,
int n_threads);
int crispasr_session_set_codec_path(crispasr_session *s, const char *path);
void crispasr_session_close(crispasr_session *s);
const char *crispasr_session_backend(crispasr_session *s);
int crispasr_session_set_translate(crispasr_session *s, int enable);
crispasr_session_result *crispasr_session_transcribe_lang(
crispasr_session *s, const float *pcm, int n_samples, const char *language);
int crispasr_session_result_n_segments(crispasr_session_result *r);
const char *crispasr_session_result_segment_text(crispasr_session_result *r,
int i);
int64_t crispasr_session_result_segment_t0(crispasr_session_result *r, int i);
int64_t crispasr_session_result_segment_t1(crispasr_session_result *r, int i);
void crispasr_session_result_free(crispasr_session_result *r);
float *crispasr_session_synthesize(crispasr_session *s, const char *text,
int *out_n_samples);
void crispasr_pcm_free(float *pcm);
int crispasr_session_set_speaker_name(crispasr_session *s, const char *name);
int crispasr_session_set_voice(crispasr_session *s, const char *path,
const char *ref_text_or_null);
}
static crispasr_session *g_session = nullptr;
static crispasr_session_result *g_result = nullptr;
static struct whisper_vad_context *vctx;
static std::vector<float> flat_segs;
static std::atomic<int> g_abort{0};
extern "C" void set_abort(int v) {
g_abort.store(v, std::memory_order_relaxed);
}
static void ggml_log_cb(enum ggml_log_level level, const char *log,
void *data) {
const char *level_str;
if (!log) {
return;
}
switch (level) {
case GGML_LOG_LEVEL_DEBUG:
level_str = "DEBUG";
break;
case GGML_LOG_LEVEL_INFO:
level_str = "INFO";
break;
case GGML_LOG_LEVEL_WARN:
level_str = "WARN";
break;
case GGML_LOG_LEVEL_ERROR:
level_str = "ERROR";
break;
default: /* Potential future-proofing */
level_str = "?????";
break;
}
fprintf(stderr, "[%-5s] ", level_str);
fputs(log, stderr);
fflush(stderr);
}
int load_model(const char *const model_path, int threads,
const char *backend_name) {
whisper_log_set(ggml_log_cb, nullptr);
ggml_backend_load_all();
if (backend_name && *backend_name) {
g_session =
crispasr_session_open_explicit(model_path, backend_name, threads);
} else {
g_session = crispasr_session_open(model_path, threads);
}
if (g_session == nullptr) {
fprintf(stderr, "error: failed to open CrispASR session for model\n");
return 1;
}
fprintf(stderr, "info: CrispASR backend selected: %s\n",
crispasr_session_backend(g_session));
return 0;
}
// set_codec_path forwards a companion file (qwen3-tts codec, orpheus SNAC,
// chatterbox s3gen, or mimo-asr tokenizer) to the active session. Returns 0 on
// success or when the active backend needs no companion, negative on failure,
// and -1 when no session is open.
int set_codec_path(const char *path) {
return g_session ? crispasr_session_set_codec_path(g_session, path) : -1;
}
int load_model_vad(const char *const model_path) {
whisper_log_set(ggml_log_cb, nullptr);
ggml_backend_load_all();
struct whisper_vad_context_params vcparams =
whisper_vad_default_context_params();
// XXX: Overridden to false in upstream due to performance?
// vcparams.use_gpu = true;
vctx = whisper_vad_init_from_file_with_params(model_path, vcparams);
if (vctx == nullptr) {
fprintf(stderr, "error: Failed to init model as VAD\n");
return 1;
}
return 0;
}
int vad(float pcmf32[], size_t pcmf32_len, float **segs_out,
size_t *segs_out_len) {
if (vctx == nullptr) {
fprintf(stderr, "error: VAD model not loaded\n");
return 1;
}
if (!whisper_vad_detect_speech(vctx, pcmf32, pcmf32_len)) {
fprintf(stderr, "error: failed to detect speech\n");
return 1;
}
struct whisper_vad_params params = whisper_vad_default_params();
struct whisper_vad_segments *segs =
whisper_vad_segments_from_probs(vctx, params);
size_t segn = whisper_vad_segments_n_segments(segs);
// fprintf(stderr, "Got segments %zd\n", segn);
flat_segs.clear();
for (int i = 0; i < segn; i++) {
flat_segs.push_back(whisper_vad_segments_get_segment_t0(segs, i));
flat_segs.push_back(whisper_vad_segments_get_segment_t1(segs, i));
}
// fprintf(stderr, "setting out variables: %p=%p -> %p, %p=%zx -> %zx\n",
// segs_out, *segs_out, flat_segs.data(), segs_out_len, *segs_out_len,
// flat_segs.size());
*segs_out = flat_segs.data();
*segs_out_len = flat_segs.size();
// fprintf(stderr, "freeing segs\n");
whisper_vad_free_segments(segs);
// fprintf(stderr, "returning\n");
return 0;
}
// threads, diarize and prompt are accepted for Go-side API parity but unused
// in Phase 1: the thread count is fixed at session open, and diarization and
// the initial prompt are separate CrispASR features not yet wired through the
// session ASR path.
int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
char *prompt) {
(void)threads;
(void)diarize;
(void)prompt;
if (!g_session) {
return 1;
}
// Reset stale abort flag from any prior cancelled call. set_abort remains
// best-effort: the session transcribe call is blocking and exposes no abort
// hook, so a mid-decode abort cannot interrupt it.
g_abort.store(0, std::memory_order_relaxed);
crispasr_session_set_translate(g_session, translate ? 1 : 0);
if (g_result) {
crispasr_session_result_free(g_result);
g_result = nullptr;
}
const char *language = (lang && *lang) ? lang : nullptr;
g_result = crispasr_session_transcribe_lang(g_session, pcmf32, (int)pcmf32_len,
language);
if (!g_result) {
fprintf(stderr, "error: transcription failed\n");
return 1;
}
*segs_out_len = crispasr_session_result_n_segments(g_result);
return 0;
}
const char *get_segment_text(int i) {
if (!g_result) {
return "";
}
return crispasr_session_result_segment_text(g_result, i);
}
int64_t get_segment_t0(int i) {
if (!g_result) {
return 0;
}
return crispasr_session_result_segment_t0(g_result, i);
}
int64_t get_segment_t1(int i) {
if (!g_result) {
return 0;
}
return crispasr_session_result_segment_t1(g_result, i);
}
const char *get_backend(void) {
return g_session ? crispasr_session_backend(g_session) : "";
}
// TTS uses the already-open session (crispasr_session_open auto-detects a TTS
// model). Output is 24 kHz mono float PCM (upstream CrispASR convention),
// malloc'd by the C API; the caller must release it via tts_free.
float *tts_synthesize(const char *text, int *out_n_samples) {
if (out_n_samples) *out_n_samples = 0;
if (!g_session || !text) return nullptr;
return crispasr_session_synthesize(g_session, text, out_n_samples);
}
void tts_free(float *pcm) {
if (pcm) crispasr_pcm_free(pcm);
}
int tts_set_voice(const char *name) {
if (!g_session || !name || !*name) return 0;
return crispasr_session_set_speaker_name(g_session, name);
}
// tts_set_voice_file loads a voice from a file: a .gguf path selects a voice
// pack, a .wav path with a non-empty ref_text performs zero-shot voice cloning
// (the C API returns -2 when ref_text is required but missing). Returns -1 when
// no session is open or path is null.
int tts_set_voice_file(const char *path, const char *ref_text) {
if (!g_session || !path) return -1;
const char *ref = (ref_text && *ref_text) ? ref_text : nullptr;
return crispasr_session_set_voice(g_session, path, ref);
}
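
The vad function in the bridge above hands its result back as a flat float array laid out as [t0, t1, t0, t1, ...]. A minimal sketch of how a caller might rebuild start/end pairs from that layout follows. The Segment type and pairSegments helper are hypothetical names used only for illustration, and the centisecond unit is an assumption inferred from the Go wrapper's division by 100.

```go
package main

import "fmt"

// Segment is a hypothetical start/end pair in seconds (not a LocalAI type).
type Segment struct {
	Start, End float32
}

// pairSegments rebuilds segments from a flat [t0, t1, t0, t1, ...] buffer.
// Assumption: values are centiseconds, so dividing by 100 yields seconds.
// A trailing unpaired value is ignored.
func pairSegments(flat []float32) []Segment {
	out := make([]Segment, 0, len(flat)/2)
	for i := 0; i+1 < len(flat); i += 2 {
		out = append(out, Segment{Start: flat[i] / 100, End: flat[i+1] / 100})
	}
	return out
}

func main() {
	for _, s := range pairSegments([]float32{50, 150, 300, 425}) {
		fmt.Printf("%.2f -> %.2f\n", s.Start, s.End)
	}
}
```

Stepping the index by two and checking i+1 keeps the loop safe even if the buffer ever carries an odd number of values.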


@@ -0,0 +1,23 @@
#include <cstddef>
#include <cstdint>
extern "C" {
int load_model(const char *const model_path, int threads,
const char *backend_name);
int set_codec_path(const char *path);
int load_model_vad(const char *const model_path);
int vad(float pcmf32[], size_t pcmf32_size, float **segs_out,
size_t *segs_out_len);
int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
char *prompt);
const char *get_segment_text(int i);
int64_t get_segment_t0(int i);
int64_t get_segment_t1(int i);
const char *get_backend(void);
void set_abort(int v);
float *tts_synthesize(const char *text, int *out_n_samples); // 24kHz mono float, malloc'd; NULL on failure
void tts_free(float *pcm);
int tts_set_voice(const char *name); // best-effort speaker selection; 0 ok
int tts_set_voice_file(const char *path, const char *ref_text); // load voice pack (.gguf) or zero-shot clone (.wav + ref_text)
}


@@ -0,0 +1,497 @@
package main
import (
"context"
"fmt"
"os"
"path/filepath"
"strings"
"sync"
"unsafe"
"github.com/go-audio/audio"
"github.com/go-audio/wav"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/utils"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
)
var (
CppLoadModel func(modelPath string, threads int, backendName string) int
CppSetCodecPath func(path string) int
CppLoadModelVAD func(modelPath string) int
CppVAD func(pcmf32 []float32, pcmf32Size uintptr, segsOut unsafe.Pointer, segsOutLen unsafe.Pointer) int
CppTranscribe func(threads uint32, lang string, translate bool, diarize bool, pcmf32 []float32, pcmf32Len uintptr, segsOutLen unsafe.Pointer, prompt string) int
CppGetSegmentText func(i int) string
CppGetSegmentStart func(i int) int64
CppGetSegmentEnd func(i int) int64
CppGetBackend func() string
CppSetAbort func(v int)
CppTTSSynthesize func(text string, outNSamples unsafe.Pointer) uintptr
CppTTSFree func(ptr uintptr)
CppTTSSetVoice func(name string) int
CppTTSSetVoiceFile func(path string, refText string) int
)
type CrispASR struct {
base.SingleThread
}
// splitOption splits a "prefix:value" model option into its key and value,
// matching the convention used by other backends (see sherpa-onnx). It returns
// ok=false when the option carries no ':' separator.
func splitOption(oo string) (key, value string, ok bool) {
parts := strings.SplitN(oo, ":", 2)
if len(parts) != 2 {
return "", "", false
}
return parts[0], parts[1], true
}
func (w *CrispASR) Load(opts *pb.ModelOptions) error {
vadOnly := false
backendName := ""
codecPath := ""
speakerName := ""
voicePath := ""
voiceRefText := ""
for _, oo := range opts.Options {
if oo == "vad_only" {
vadOnly = true
continue
}
switch key, value, ok := splitOption(oo); {
case ok && key == "backend":
backendName = value
case ok && key == "codec":
codecPath = value
case ok && key == "speaker":
speakerName = value
case ok && key == "voice":
voicePath = value
case ok && key == "voice_text":
voiceRefText = value
default:
fmt.Fprintf(os.Stderr, "Unrecognized option: %v\n", oo)
}
}
if vadOnly {
if ret := CppLoadModelVAD(opts.ModelFile); ret != 0 {
return fmt.Errorf("crispasr: failed to load VAD model %q", opts.ModelFile)
}
return nil
}
// Resolve a relative companion path against the model directory so a config
// can reference a sibling codec/tokenizer file by name alone.
if codecPath != "" && !filepath.IsAbs(codecPath) {
codecPath = filepath.Join(filepath.Dir(opts.ModelFile), codecPath)
}
// A voice file (.gguf pack or .wav prompt) is resolved against the model
// directory just like the codec, so a config can reference a sibling file.
if voicePath != "" && !filepath.IsAbs(voicePath) {
voicePath = filepath.Join(filepath.Dir(opts.ModelFile), voicePath)
}
if ret := CppLoadModel(opts.ModelFile, int(opts.Threads), backendName); ret != 0 {
return fmt.Errorf("crispasr: failed to load model %q", opts.ModelFile)
}
// Load the companion file (codec/tokenizer/s3gen) after the session is open.
// rc==0 means success or "not applicable" for the active backend; only a
// negative code is fatal.
if codecPath != "" {
if rc := CppSetCodecPath(codecPath); rc < 0 {
return fmt.Errorf("crispasr: failed to load companion file %q (rc=%d)", codecPath, rc)
}
fmt.Fprintf(os.Stderr, "CrispASR companion file loaded: %s\n", codecPath)
}
// Apply the Load-time default voice. A baked speaker (speaker:) is selected
// by name and is best-effort: a backend that can't honor it is logged, not
// fatal. A voice file (voice:) is a hard requirement once configured, so a
// negative rc fails Load.
if speakerName != "" {
if rc := CppTTSSetVoice(speakerName); rc != 0 {
fmt.Fprintf(os.Stderr, "crispasr: speaker %q not applied (rc=%d)\n", speakerName, rc)
}
}
if voicePath != "" {
if rc := CppTTSSetVoiceFile(voicePath, voiceRefText); rc < 0 {
return fmt.Errorf("crispasr: failed to load voice %q (rc=%d)", voicePath, rc)
}
fmt.Fprintf(os.Stderr, "CrispASR voice loaded: %s\n", voicePath)
}
fmt.Fprintf(os.Stderr, "CrispASR backend selected: %s\n", CppGetBackend())
return nil
}
func (w *CrispASR) VAD(req *pb.VADRequest) (pb.VADResponse, error) {
audio := req.Audio
// We expect 0xdeadbeef to be overwritten and if we see it in a stack trace we know it wasn't
segsPtr, segsLen := uintptr(0xdeadbeef), uintptr(0xdeadbeef)
segsPtrPtr, segsLenPtr := unsafe.Pointer(&segsPtr), unsafe.Pointer(&segsLen)
if ret := CppVAD(audio, uintptr(len(audio)), segsPtrPtr, segsLenPtr); ret != 0 {
return pb.VADResponse{}, fmt.Errorf("Failed VAD")
}
// Happens when CPP vector has not had any elements pushed to it
if segsPtr == 0 {
return pb.VADResponse{
Segments: []*pb.VADSegment{},
}, nil
}
// unsafeptr warning is caused by segsPtr being on the stack and therefore being subject to stack copying AFAICT
// however the stack shouldn't have grown between setting segsPtr and now, also the memory pointed to is allocated by C++
segs := unsafe.Slice((*float32)(unsafe.Pointer(segsPtr)), segsLen) //nolint:govet // segsPtr addresses C++-owned heap memory passed back through the cgo-free purego boundary; the uintptr->Pointer round-trip is intentional and the buffer outlives this read.
vadSegments := []*pb.VADSegment{}
for i := range len(segs) >> 1 {
s := segs[2*i] / 100
t := segs[2*i+1] / 100
vadSegments = append(vadSegments, &pb.VADSegment{
Start: s,
End: t,
})
}
return pb.VADResponse{
Segments: vadSegments,
}, nil
}
func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
if err := ctx.Err(); err != nil {
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
dir, err := os.MkdirTemp("", "crispasr")
if err != nil {
return pb.TranscriptResult{}, err
}
defer func() { _ = os.RemoveAll(dir) }()
convertedPath := filepath.Join(dir, "converted.wav")
if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil {
return pb.TranscriptResult{}, err
}
fh, err := os.Open(convertedPath)
if err != nil {
return pb.TranscriptResult{}, err
}
defer func() { _ = fh.Close() }()
d := wav.NewDecoder(fh)
buf, err := d.FullPCMBuffer()
if err != nil {
return pb.TranscriptResult{}, err
}
data := buf.AsFloat32Buffer().Data
var duration float32
if buf.Format != nil && buf.Format.SampleRate > 0 {
duration = float32(len(data)) / float32(buf.Format.SampleRate)
}
segsLen := uintptr(0xdeadbeef)
segsLenPtr := unsafe.Pointer(&segsLen)
// Watcher: flips the C-side abort flag when ctx is cancelled. The
// goroutine is joined synchronously (close(done) signals it to exit,
// wg.Wait() blocks until it has) so a late CppSetAbort(1) cannot fire
// after the function returns and corrupt the next transcription call.
done := make(chan struct{})
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
select {
case <-ctx.Done():
CppSetAbort(1)
case <-done:
}
}()
defer func() {
close(done)
wg.Wait()
}()
ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt)
if ret == 2 {
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
if ret != 0 {
return pb.TranscriptResult{}, fmt.Errorf("Failed Transcribe")
}
segments := []*pb.TranscriptSegment{}
text := ""
for i := range int(segsLen) {
// segment start/end conversion factor taken from https://github.com/ggml-org/whisper.cpp/blob/master/examples/cli/cli.cpp#L895
s := CppGetSegmentStart(i) * (10000000)
t := CppGetSegmentEnd(i) * (10000000)
// The session result can emit bytes that aren't valid UTF-8 (e.g. a
// multibyte codepoint split across token boundaries); protobuf string
// fields reject those at marshal time. Scrub before the value escapes
// cgo. The session result is segment+word based and exposes no token
// IDs, so Tokens is left empty.
txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
segment := &pb.TranscriptSegment{
Id: int32(i),
Text: txt,
Start: s, End: t,
}
segments = append(segments, segment)
text += " " + strings.TrimSpace(txt)
}
return pb.TranscriptResult{
Segments: segments,
Text: strings.TrimSpace(text),
Language: opts.Language,
Duration: duration,
}, nil
}
// AudioTranscriptionStream runs the session transcribe to completion and then
// emits one delta per non-empty segment, followed by a final TranscriptResult.
// Progressive/real-time streaming isn't available via the session API (there
// is no per-decode callback), so deltas are emitted per-segment after the
// blocking decode returns rather than as segments are produced. The offline
// AudioTranscription is unchanged; both paths share the session and the
// SingleThread concurrency model.
func (w *CrispASR) AudioTranscriptionStream(ctx context.Context, opts *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
defer close(results)
if err := ctx.Err(); err != nil {
return status.Error(codes.Canceled, "transcription cancelled")
}
dir, err := os.MkdirTemp("", "crispasr")
if err != nil {
return err
}
defer func() { _ = os.RemoveAll(dir) }()
convertedPath := filepath.Join(dir, "converted.wav")
if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil {
return err
}
fh, err := os.Open(convertedPath)
if err != nil {
return err
}
defer func() { _ = fh.Close() }()
d := wav.NewDecoder(fh)
buf, err := d.FullPCMBuffer()
if err != nil {
return err
}
data := buf.AsFloat32Buffer().Data
var duration float32
if buf.Format != nil && buf.Format.SampleRate > 0 {
duration = float32(len(data)) / float32(buf.Format.SampleRate)
}
// Same abort-watcher pattern as AudioTranscription. Joined synchronously
// so a late CppSetAbort(1) cannot fire after this function returns.
// Best-effort only: the session transcribe is blocking with no abort hook.
done := make(chan struct{})
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
select {
case <-ctx.Done():
CppSetAbort(1)
case <-done:
}
}()
defer func() {
close(done)
wg.Wait()
}()
segsLen := uintptr(0xdeadbeef)
segsLenPtr := unsafe.Pointer(&segsLen)
ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt)
if ret == 2 {
return status.Error(codes.Canceled, "transcription cancelled")
}
if ret != 0 {
return fmt.Errorf("Failed Transcribe")
}
// Walk the segments once: emit a delta per non-empty segment and build the
// final TranscriptResult.Segments alongside. The first delta has no leading
// space and subsequent ones are prefixed with a single space, so
// concat(deltas) == final.Text exactly, matching the e2e contract.
segments := []*pb.TranscriptSegment{}
var assembled strings.Builder
for i := range int(segsLen) {
s := CppGetSegmentStart(i) * 10000000
t := CppGetSegmentEnd(i) * 10000000
txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
segments = append(segments, &pb.TranscriptSegment{
Id: int32(i),
Text: txt,
Start: s, End: t,
})
trimmed := strings.TrimSpace(txt)
if trimmed == "" {
continue
}
var delta string
if assembled.Len() == 0 {
delta = trimmed
} else {
delta = " " + trimmed
}
results <- &pb.TranscriptStreamResponse{Delta: delta}
assembled.WriteString(delta)
}
final := &pb.TranscriptResult{
Segments: segments,
Text: assembled.String(),
Language: opts.Language,
Duration: duration,
}
results <- &pb.TranscriptStreamResponse{FinalResult: final}
return nil
}
// synthesize returns 24 kHz mono float32 PCM for text via the open session.
func (w *CrispASR) synthesize(text string) ([]float32, error) {
if text == "" {
return nil, fmt.Errorf("crispasr: TTS requires non-empty text")
}
var n int32
ptr := CppTTSSynthesize(text, unsafe.Pointer(&n))
if ptr == 0 || n <= 0 {
return nil, fmt.Errorf("crispasr: synthesis failed (the loaded model may not be a supported TTS backend, or needs extra config e.g. orpheus SNAC codec)")
}
defer CppTTSFree(ptr)
src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
out := make([]float32, int(n)) // copy out of C memory before free
copy(out, src)
return out, nil
}
// setVoice applies a per-call speaker/voice override (best effort). CrispASR
// returns a negative code when the active backend can't honor the name; we log
// it rather than fail, so an unknown voice falls back to the default speaker.
func setVoice(voice string) {
v := strings.TrimSpace(voice)
if v == "" {
return
}
if rc := CppTTSSetVoice(v); rc != 0 {
fmt.Fprintf(os.Stderr, "crispasr: voice %q not applied by the active TTS backend (rc=%d); using default\n", v, rc)
}
}
func (w *CrispASR) TTS(req *pb.TTSRequest) error {
if req.Dst == "" {
return fmt.Errorf("crispasr: TTS requires a destination path")
}
setVoice(req.Voice)
pcm, err := w.synthesize(req.Text)
if err != nil {
return err
}
return writeWAV24k(req.Dst, pcm)
}
// TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
// (native streaming) synth, so we synthesize the whole utterance, encode it to
// a 24 kHz WAV, and emit the encoded bytes as a single chunk. The gRPC server
// wrapper (pkg/grpc/server.go:TTSStream) ranges over the channel until it is
// closed, so this method owns the close - mirrors vibevoice-cpp's TTSStream.
func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
defer close(results)
if req.Text == "" {
return fmt.Errorf("crispasr: TTSStream requires text")
}
setVoice(req.Voice)
pcm, err := w.synthesize(req.Text)
if err != nil {
return err
}
tmp, err := os.CreateTemp("", "crispasr-tts-stream-*.wav")
if err != nil {
return fmt.Errorf("crispasr: tempfile: %w", err)
}
dst := tmp.Name()
if err := tmp.Close(); err != nil {
return fmt.Errorf("crispasr: close tempfile: %w", err)
}
defer func() { _ = os.Remove(dst) }()
if err := writeWAV24k(dst, pcm); err != nil {
return err
}
encoded, err := os.ReadFile(dst)
if err != nil {
return fmt.Errorf("crispasr: read tempfile: %w", err)
}
results <- encoded
return nil
}
// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst.
func writeWAV24k(dst string, pcm []float32) error {
f, err := os.Create(dst)
if err != nil {
return fmt.Errorf("crispasr: create %q: %w", dst, err)
}
enc := wav.NewEncoder(f, 24000, 16, 1, 1)
ints := make([]int, len(pcm))
for i, s := range pcm {
if s > 1 {
s = 1
} else if s < -1 {
s = -1
}
ints[i] = int(s * 32767)
}
buf := &audio.IntBuffer{
Format: &audio.Format{NumChannels: 1, SampleRate: 24000},
Data: ints,
SourceBitDepth: 16,
}
if err := enc.Write(buf); err != nil {
_ = enc.Close()
_ = f.Close()
return fmt.Errorf("crispasr: encode WAV: %w", err)
}
if err := enc.Close(); err != nil {
_ = f.Close()
return fmt.Errorf("crispasr: finalize WAV: %w", err)
}
if err := f.Close(); err != nil {
return fmt.Errorf("crispasr: close %q: %w", dst, err)
}
return nil
}
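
The sample conversion inside writeWAV24k above can be looked at in isolation. The sketch below assumes nothing beyond what that function already does: clamp each float sample to [-1, 1], scale by 32767, and truncate toward zero. The helper name floatToPCM16 is hypothetical and not part of the backend.

```go
package main

import "fmt"

// floatToPCM16 clamps each sample to [-1, 1] and scales it to the signed
// 16-bit range. int() truncates toward zero, matching the loop in
// writeWAV24k. (Hypothetical helper, shown for illustration only.)
func floatToPCM16(pcm []float32) []int {
	out := make([]int, len(pcm))
	for i, s := range pcm {
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		out[i] = int(s * 32767)
	}
	return out
}

func main() {
	// Out-of-range inputs are clamped rather than wrapped around.
	fmt.Println(floatToPCM16([]float32{0, 0.5, 1.5, -2}))
}
```

Clamping before scaling matters because a sample slightly above 1.0 would otherwise overflow the 16-bit range and show up as a loud click in the encoded file.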


@@ -0,0 +1,193 @@
package main
import (
"context"
"os"
"path/filepath"
"strings"
"sync"
"testing"
"github.com/ebitengine/purego"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
)
func TestCrispASR(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "CrispASR Backend Suite")
}
var (
libLoadOnce sync.Once
libLoadErr error
)
// ensureLibLoaded mirrors main.go's bootstrap so a Go test can drive the
// bridge without spinning up the gRPC server. Skips the current spec when the
// shared library isn't present (e.g. running before the backend has been built).
func ensureLibLoaded() {
libLoadOnce.Do(func() {
libName := os.Getenv("CRISPASR_LIBRARY")
if libName == "" {
libName = "./libgocrispasr-fallback.so"
}
if _, err := os.Stat(libName); err != nil {
libLoadErr = err
return
}
gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
libLoadErr = err
return
}
purego.RegisterLibFunc(&CppLoadModel, gosd, "load_model")
purego.RegisterLibFunc(&CppSetCodecPath, gosd, "set_codec_path")
purego.RegisterLibFunc(&CppTranscribe, gosd, "transcribe")
purego.RegisterLibFunc(&CppGetSegmentText, gosd, "get_segment_text")
purego.RegisterLibFunc(&CppGetSegmentStart, gosd, "get_segment_t0")
purego.RegisterLibFunc(&CppGetSegmentEnd, gosd, "get_segment_t1")
purego.RegisterLibFunc(&CppGetBackend, gosd, "get_backend")
purego.RegisterLibFunc(&CppSetAbort, gosd, "set_abort")
purego.RegisterLibFunc(&CppTTSSynthesize, gosd, "tts_synthesize")
purego.RegisterLibFunc(&CppTTSFree, gosd, "tts_free")
purego.RegisterLibFunc(&CppTTSSetVoice, gosd, "tts_set_voice")
purego.RegisterLibFunc(&CppTTSSetVoiceFile, gosd, "tts_set_voice_file")
})
if libLoadErr != nil {
Skip("crispasr library not loadable: " + libLoadErr.Error())
}
}
// fixturesOrSkip returns the model + audio paths or skips the spec if either
// env var is unset. The test never runs in default CI — it requires a real
// whisper model and a long audio file (~3 minutes) on disk.
func fixturesOrSkip() (string, string) {
modelPath := os.Getenv("CRISPASR_MODEL_PATH")
audioPath := os.Getenv("CRISPASR_AUDIO_PATH")
if modelPath == "" || audioPath == "" {
Skip("set CRISPASR_MODEL_PATH and CRISPASR_AUDIO_PATH to run this spec")
}
return modelPath, audioPath
}
// ttsModelOrSkip returns the TTS model path or skips the spec when the env var
// is unset. Like the transcription fixtures, this never runs in default CI — it
// needs a real TTS model (e.g. a vibevoice GGUF) on disk.
func ttsModelOrSkip() string {
modelPath := os.Getenv("CRISPASR_TTS_MODEL_PATH")
if modelPath == "" {
Skip("set CRISPASR_TTS_MODEL_PATH to run this spec")
}
return modelPath
}
var _ = Describe("CrispASR", func() {
Context("AudioTranscription cancellation", func() {
It("returns codes.Canceled on a pre-cancelled context and still succeeds afterwards", func() {
modelPath, audioPath := fixturesOrSkip()
ensureLibLoaded()
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
// The session transcribe is blocking and exposes no abort hook, so
// a mid-decode cancel can't interrupt it. The contract we can rely
// on is the pre-call ctx.Err() check: a context cancelled before
// the call must yield codes.Canceled without starting a decode.
ctx, cancel := context.WithCancel(context.Background())
cancel()
_, err := w.AudioTranscription(ctx, &pb.TranscriptRequest{
Dst: audioPath,
Threads: 4,
Language: "en",
})
Expect(err).To(HaveOccurred(), "expected pre-cancelled context to fail")
st, ok := status.FromError(err)
Expect(ok).To(BeTrue(), "expected gRPC status error, got %v", err)
Expect(st.Code()).To(Equal(codes.Canceled), "expected codes.Canceled, got %v", err)
// Subsequent transcription must succeed — proves g_abort reset.
res, err := w.AudioTranscription(context.Background(), &pb.TranscriptRequest{
Dst: audioPath,
Threads: 4,
Language: "en",
})
Expect(err).ToNot(HaveOccurred(), "post-cancel transcription failed")
Expect(res.Text).ToNot(BeEmpty(), "post-cancel transcription returned empty text")
})
})
Context("AudioTranscriptionStream", func() {
It("emits one delta per non-empty segment for a multi-segment clip", func() {
modelPath, audioPath := fixturesOrSkip()
ensureLibLoaded()
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
results := make(chan *pb.TranscriptStreamResponse, 64)
done := make(chan error, 1)
go func() {
done <- w.AudioTranscriptionStream(context.Background(), &pb.TranscriptRequest{
Dst: audioPath,
Threads: 4,
Language: "en",
Stream: true,
}, results)
}()
var deltas []string
var assembled strings.Builder
var finalText string
var finalSegmentCount int
for chunk := range results {
if d := chunk.GetDelta(); d != "" {
deltas = append(deltas, d)
assembled.WriteString(d)
}
if final := chunk.GetFinalResult(); final != nil {
finalText = final.GetText()
finalSegmentCount = len(final.GetSegments())
}
}
Expect(<-done).ToNot(HaveOccurred())
// One delta per non-empty segment is emitted after the blocking
// decode returns (the session API has no per-decode callback), so a
// multi-segment clip MUST produce >=2 delta events, and
// concat(deltas) MUST equal final.Text exactly.
Expect(len(deltas)).To(BeNumerically(">=", 2),
"expected multiple deltas from a multi-segment clip, got %d (assembled=%q)",
len(deltas), assembled.String())
Expect(finalSegmentCount).To(BeNumerically(">=", 2),
"expected final to carry multiple segments")
Expect(assembled.String()).To(Equal(finalText),
"concat(deltas) must equal final.Text")
})
})
Context("TTS", func() {
It("synthesizes a non-empty WAV", func() {
ttsModel := ttsModelOrSkip()
ensureLibLoaded()
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{ModelFile: ttsModel})).To(Succeed())
dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR.", Dst: dst})).To(Succeed())
info, err := os.Stat(dst)
Expect(err).ToNot(HaveOccurred(), "synthesized WAV should exist at %q", dst)
// A real 24 kHz mono WAV is a 44-byte header plus samples; anything
// this small would mean an empty/failed synth.
Expect(info.Size()).To(BeNumerically(">", 1024),
"expected a non-trivial WAV, got %d bytes", info.Size())
})
})
})


@@ -0,0 +1,58 @@
package main
// Note: this is started internally by LocalAI and a server is allocated for each model
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
libName := os.Getenv("CRISPASR_LIBRARY")
if libName == "" {
libName = "./libgocrispasr-fallback.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CppLoadModel, "load_model"},
{&CppSetCodecPath, "set_codec_path"},
{&CppLoadModelVAD, "load_model_vad"},
{&CppVAD, "vad"},
{&CppTranscribe, "transcribe"},
{&CppGetSegmentText, "get_segment_text"},
{&CppGetSegmentStart, "get_segment_t0"},
{&CppGetSegmentEnd, "get_segment_t1"},
{&CppGetBackend, "get_backend"},
{&CppSetAbort, "set_abort"},
{&CppTTSSynthesize, "tts_synthesize"},
{&CppTTSFree, "tts_free"},
{&CppTTSSetVoice, "tts_set_voice"},
{&CppTTSSetVoiceFile, "tts_set_voice_file"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &CrispASR{}); err != nil {
panic(err)
}
}
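
Both main above and the test bootstrap pick the shared library the same way: the CRISPASR_LIBRARY environment variable wins, otherwise a fallback file name is used. The sketch below isolates that lookup so it can be exercised without loading anything. resolveLibrary is a hypothetical helper, and passing the environment lookup as a function is an illustrative choice, not something the backend does.

```go
package main

import (
	"fmt"
	"os"
)

// fallbackLib is the default name used when no override is set.
const fallbackLib = "./libgocrispasr-fallback.so"

// resolveLibrary returns the value of CRISPASR_LIBRARY when it is set and
// non-empty, and the fallback library name otherwise.
// (Hypothetical helper; getenv is injected so the lookup can be tested.)
func resolveLibrary(getenv func(string) string) string {
	if name := getenv("CRISPASR_LIBRARY"); name != "" {
		return name
	}
	return fallbackLib
}

func main() {
	fmt.Println(resolveLibrary(os.Getenv))
}
```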

backend/go/crispasr/package.sh Executable file

@@ -0,0 +1,65 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
# This script is used in the final stage of the Dockerfile
set -e
CURDIR=$(dirname "$(realpath "$0")")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/crispasr $CURDIR/package/
cp -fv $CURDIR/libgocrispasr-*.so $CURDIR/package/
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ "$(uname -s)" = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
# The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

backend/go/crispasr/run.sh Executable file

@@ -0,0 +1,52 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath "$0")")
cd /
echo "CPU info:"
if [ "$(uname)" != "Darwin" ]; then
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
fi
LIBRARY="$CURDIR/libgocrispasr-fallback.so"
if [ "$(uname)" != "Darwin" ]; then
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/libgocrispasr-avx.so ]; then
LIBRARY="$CURDIR/libgocrispasr-avx.so"
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/libgocrispasr-avx2.so ]; then
LIBRARY="$CURDIR/libgocrispasr-avx2.so"
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/libgocrispasr-avx512.so ]; then
LIBRARY="$CURDIR/libgocrispasr-avx512.so"
fi
fi
fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export CRISPASR_LIBRARY=$LIBRARY
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using library: $LIBRARY"
exec $CURDIR/lib/ld.so $CURDIR/crispasr "$@"
fi
echo "Using library: $LIBRARY"
exec $CURDIR/crispasr "$@"
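The variant selection in run.sh above can be restated as a small Go sketch. It assumes the same rule the script follows: start from the fallback library and let each later, more capable variant override the earlier one when the CPU flag is present and the file is shipped. `pickLibrary` and the `exists` callback are hypothetical names introduced only for illustration; they are not part of the backend.

```go
package main

import (
	"fmt"
	"strings"
)

// pickLibrary mirrors the shell logic: begin with the fallback and let each
// more capable variant win when its CPU flag is present and its file exists.
// Hypothetical helper, not part of the crispasr backend.
func pickLibrary(cpuFlags string, exists func(name string) bool) string {
	have := map[string]bool{}
	for _, f := range strings.Fields(cpuFlags) {
		have[f] = true
	}
	lib := "libgocrispasr-fallback.so"
	// Ordered from least to most capable; the last match wins.
	variants := []struct{ flag, file string }{
		{"avx", "libgocrispasr-avx.so"},
		{"avx2", "libgocrispasr-avx2.so"},
		{"avx512f", "libgocrispasr-avx512.so"},
	}
	for _, v := range variants {
		if have[v.flag] && exists(v.file) {
			lib = v.file
		}
	}
	return lib
}

func main() {
	// Assume only the AVX and AVX2 builds were packaged.
	shipped := map[string]bool{
		"libgocrispasr-avx.so":  true,
		"libgocrispasr-avx2.so": true,
	}
	exists := func(name string) bool { return shipped[name] }
	fmt.Println(pickLibrary("fpu sse2 avx avx2 avx512f", exists))
}
```

In this run the CPU reports avx512f but no AVX-512 build is shipped, so the sketch settles on the AVX2 library, which is the same outcome the script's `[ -e ... ]` guards produce.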


@@ -9,7 +9,7 @@ JOBS?=$(shell nproc --ignore=1)
# LocalVQE upstream version pin. Bump to a specific commit when picking up
# a new release; `main` works for development but is not reproducible.
LOCALVQE_REPO?=https://github.com/localai-org/LocalVQE
LOCALVQE_VERSION?=72bfb4c6
LOCALVQE_VERSION?=b0f0378a450e87c871b85689554801601ca56d98
# LocalVQE handles CPU feature selection internally (it ships the multiple
# libggml-cpu-*.so variants and its loader picks the best one at runtime
@@ -27,7 +27,8 @@ endif
# LocalVQE upstream supports CPU + Vulkan only. Other BUILD_TYPE values
# fall through to the default CPU build — Vulkan is already as fast as the
# specialised GPU paths would be on this 1.3 M-parameter model.
# specialised GPU paths would be on these small (1.3 M–4.8 M parameter)
# models.
ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON -DLOCALVQE_VULKAN=ON
else ifeq ($(OS),Darwin)


@@ -3,7 +3,6 @@ package main
import (
"encoding/binary"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
@@ -11,6 +10,7 @@ import (
"strings"
"unsafe"
"github.com/go-audio/wav"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/xlog"
@@ -46,24 +46,24 @@ const (
// through the options builder (CppOptionsNew + setters + CppNewWithOptions)
// — the bare localvqe_new path doesn't expose backend / device selection.
var (
CppOptionsNew func() uintptr
CppOptionsFree func(opts uintptr)
CppOptionsSetModelPath func(opts uintptr, modelPath string) int32
CppOptionsSetBackend func(opts uintptr, backend string) int32
CppOptionsSetDevice func(opts uintptr, device int32) int32
CppNewWithOptions func(opts uintptr) uintptr
CppFree func(ctx uintptr)
CppProcessF32 func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
CppProcessS16 func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
CppProcessFrameF32 func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
CppProcessFrameS16 func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
CppReset func(ctx uintptr)
CppLastError func(ctx uintptr) string
CppSampleRate func(ctx uintptr) int32
CppHopLength func(ctx uintptr) int32
CppFFTSize func(ctx uintptr) int32
CppSetNoiseGate func(ctx uintptr, enabled int32, thresholdDBFS float32) int32
CppGetNoiseGate func(ctx uintptr, enabledOut, thresholdDBFSOut uintptr) int32
)
// LocalVQE speaks gRPC against LocalVQE's flat C ABI. The streaming
@@ -490,11 +490,14 @@ func (v *LocalVQE) applyStreamConfig(cfg *pb.AudioTransformStreamConfig) error {
// ---- WAV I/O ----------------------------------------------------------
//
// Minimal mono PCM WAV reader/writer. Only handles the subset LocalVQE
// cares about (mono, 16-bit signed, no extensible chunks). For broader
// audio support the HTTP layer's `audio.NormalizeAudioFile` already
// converts arbitrary input to a canonical WAV before we see it; this
// reader just decodes the canonical shape.
// Reader/writer for the mono 16-bit PCM shape LocalVQE works with. Decoding
// goes through the shared go-audio/wav decoder (as the whisper and parakeet
// backends do) so RIFF chunk walking is handled robustly — an 18/40-byte
// extensible `fmt ` chunk, or JUNK/bext/LIST metadata before or after `data`
// (e.g. ffmpeg's trailing "Lavf" tag), is skipped rather than spliced into
// the PCM stream as an audible click. The HTTP layer normalises arbitrary
// input to WAV before we see it, but that WAV is ffmpeg output and is not
// guaranteed to be the canonical 44-byte layout.
func readMonoWAVf32(path string) ([]float32, int, error) {
f, err := os.Open(path)
@@ -502,35 +505,26 @@ func readMonoWAVf32(path string) ([]float32, int, error) {
return nil, 0, err
}
defer func() { _ = f.Close() }()
header := make([]byte, 44)
if _, err := io.ReadFull(f, header); err != nil {
return nil, 0, err
buf, err := wav.NewDecoder(f).FullPCMBuffer()
if err != nil {
return nil, 0, fmt.Errorf("decode WAV: %w", err)
}
if string(header[0:4]) != "RIFF" || string(header[8:12]) != "WAVE" {
if buf == nil || buf.Format == nil {
return nil, 0, fmt.Errorf("not a WAV file")
}
channels := binary.LittleEndian.Uint16(header[22:24])
sampleRate := binary.LittleEndian.Uint32(header[24:28])
bitsPerSample := binary.LittleEndian.Uint16(header[34:36])
if channels != 1 {
return nil, 0, fmt.Errorf("only mono WAV supported (got %d channels)", channels)
if buf.Format.NumChannels != 1 {
return nil, 0, fmt.Errorf("only mono WAV supported (got %d channels)", buf.Format.NumChannels)
}
if bitsPerSample != 16 {
return nil, 0, fmt.Errorf("only 16-bit PCM supported (got %d bits)", bitsPerSample)
if buf.SourceBitDepth != 16 {
return nil, 0, fmt.Errorf("only 16-bit PCM supported (got %d bits)", buf.SourceBitDepth)
}
rest, err := io.ReadAll(f)
if err != nil {
return nil, 0, err
if len(buf.Data) == 0 {
return nil, 0, fmt.Errorf("WAV has no audio data")
}
n := len(rest) / 2
out := make([]float32, n)
for i := 0; i < n; i++ {
s := int16(binary.LittleEndian.Uint16(rest[i*2 : i*2+2]))
out[i] = float32(s) / 32768.0
}
return out, int(sampleRate), nil
// AsFloat32Buffer normalises by 2^(bitDepth-1) == /32768 for 16-bit,
// matching the model's expected [-1, 1) input range.
return buf.AsFloat32Buffer().Data, buf.Format.SampleRate, nil
}
func writeMonoWAVf32(path string, samples []float32, sampleRate int) error {
@@ -546,13 +540,13 @@ func writeMonoWAVf32(path string, samples []float32, sampleRate int) error {
binary.LittleEndian.PutUint32(header[4:8], 36+dataLen)
copy(header[8:12], []byte("WAVE"))
copy(header[12:16], []byte("fmt "))
binary.LittleEndian.PutUint32(header[16:20], 16) // fmt chunk size
binary.LittleEndian.PutUint16(header[20:22], 1) // PCM
binary.LittleEndian.PutUint16(header[22:24], 1) // mono
binary.LittleEndian.PutUint32(header[24:28], uint32(sampleRate))
binary.LittleEndian.PutUint32(header[28:32], uint32(sampleRate*2)) // byte rate
binary.LittleEndian.PutUint16(header[32:34], 2) // block align
binary.LittleEndian.PutUint16(header[34:36], 16) // bits per sample
copy(header[36:40], []byte("data"))
binary.LittleEndian.PutUint32(header[40:44], dataLen)
if _, err := f.Write(header); err != nil {
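To illustrate why the fixed 44-byte header read was fragile, here is a minimal sketch of RIFF chunk walking using only the standard library. It is not the go-audio/wav implementation that the change adopts; `findDataChunk` and `chunk` are hypothetical helpers, and the sketch assumes the whole file fits in memory.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// findDataChunk walks the RIFF sub-chunks of a WAV byte slice and returns the
// body of the first `data` chunk, skipping any other chunk it meets.
// Hypothetical helper for illustration only.
func findDataChunk(b []byte) ([]byte, error) {
	if len(b) < 12 || string(b[0:4]) != "RIFF" || string(b[8:12]) != "WAVE" {
		return nil, errors.New("not a RIFF/WAVE file")
	}
	off := 12
	for off+8 <= len(b) {
		id := string(b[off : off+4])
		size := int(binary.LittleEndian.Uint32(b[off+4 : off+8]))
		body := off + 8
		if size < 0 || body+size > len(b) {
			return nil, fmt.Errorf("chunk %q overruns the file", id)
		}
		if id == "data" {
			// Honour the declared size so trailing chunks are not read as PCM.
			return b[body : body+size], nil
		}
		// Chunks are word-aligned: an odd-sized body is followed by one pad byte.
		off = body + size + (size & 1)
	}
	return nil, errors.New("no data chunk")
}

// chunk builds one word-aligned sub-chunk (id + size + body + optional pad).
func chunk(id string, body []byte) []byte {
	out := make([]byte, 8, 8+len(body)+1)
	copy(out[0:4], id)
	binary.LittleEndian.PutUint32(out[4:8], uint32(len(body)))
	out = append(out, body...)
	if len(body)%2 == 1 {
		out = append(out, 0)
	}
	return out
}

func main() {
	// The RIFF size field is left at zero; this walker does not consult it.
	wav := []byte("RIFF\x00\x00\x00\x00WAVE")
	wav = append(wav, chunk("fmt ", make([]byte, 16))...)
	wav = append(wav, chunk("LIST", []byte("odd"))...) // odd length exercises the pad byte
	wav = append(wav, chunk("data", []byte{1, 2, 3, 4})...)
	wav = append(wav, chunk("LIST", []byte("trailer"))...)
	pcm, err := findDataChunk(wav)
	fmt.Println(pcm, err)
}
```

A reader that assumes `data` starts at byte 44 would decode the `LIST` bytes in this file as samples, which is the audible click the comment above describes.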


@@ -1,7 +1,9 @@
package main
import (
"encoding/binary"
"os"
"path/filepath"
"testing"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
@@ -92,6 +94,147 @@ var _ = Describe("LocalVQE-cpp", func() {
})
})
Context("readMonoWAVf32 chunk parsing", func() {
// chunk builds a word-aligned RIFF sub-chunk (id + size + body + pad).
chunk := func(id string, body []byte) []byte {
out := append([]byte(id), 0, 0, 0, 0)
binary.LittleEndian.PutUint32(out[4:8], uint32(len(body)))
out = append(out, body...)
if len(body)&1 == 1 {
out = append(out, 0) // pad byte for odd-sized chunks
}
return out
}
// fmtBody returns a PCM `fmt ` chunk body. extra bytes simulate the
// 18/40-byte extensible form (cbSize + extension).
fmtBody := func(channels, bits uint16, rate uint32, extra int) []byte {
b := make([]byte, 16+extra)
binary.LittleEndian.PutUint16(b[0:2], 1) // PCM
binary.LittleEndian.PutUint16(b[2:4], channels)
binary.LittleEndian.PutUint32(b[4:8], rate)
binary.LittleEndian.PutUint32(b[8:12], rate*uint32(channels)*uint32(bits)/8)
binary.LittleEndian.PutUint16(b[12:14], channels*bits/8)
binary.LittleEndian.PutUint16(b[14:16], bits)
if extra >= 2 {
binary.LittleEndian.PutUint16(b[16:18], uint16(extra-2)) // cbSize
}
return b
}
// pcm encodes int16 samples little-endian.
pcm := func(samples ...int16) []byte {
b := make([]byte, len(samples)*2)
for i, s := range samples {
binary.LittleEndian.PutUint16(b[i*2:i*2+2], uint16(s))
}
return b
}
riff := func(chunks ...[]byte) []byte {
body := []byte("WAVE")
for _, c := range chunks {
body = append(body, c...)
}
out := append([]byte("RIFF"), 0, 0, 0, 0)
binary.LittleEndian.PutUint32(out[4:8], uint32(len(body)))
return append(out, body...)
}
writeWAV := func(b []byte) string {
p := filepath.Join(GinkgoT().TempDir(), "in.wav")
Expect(os.WriteFile(p, b, 0o600)).To(Succeed())
return p
}
// A canonical sample run with distinct values so any off-by-one /
// misalignment shows up as wrong numbers, not just wrong length.
samples := []int16{1000, -2000, 3000, -4000, 5000, -6000}
expectSamples := func(got []float32) {
Expect(got).To(HaveLen(len(samples)))
for i, s := range samples {
Expect(got[i]).To(BeNumerically("~", float32(s)/32768.0, 1e-6))
}
}
It("reads a canonical 44-byte WAV", func() {
p := writeWAV(riff(chunk("fmt ", fmtBody(1, 16, 16000, 0)), chunk("data", pcm(samples...))))
out, sr, err := readMonoWAVf32(p)
Expect(err).ToNot(HaveOccurred())
Expect(sr).To(Equal(16000))
expectSamples(out)
})
It("ignores a LIST/JUNK chunk placed before data (no leading-impulse splice)", func() {
p := writeWAV(riff(
chunk("fmt ", fmtBody(1, 16, 16000, 0)),
chunk("JUNK", []byte("padding-bytes-here!")), // odd length → exercises pad
chunk("LIST", []byte("INFOISFTLavf60.0")),
chunk("data", pcm(samples...)),
))
out, sr, err := readMonoWAVf32(p)
Expect(err).ToNot(HaveOccurred())
Expect(sr).To(Equal(16000))
expectSamples(out) // not corrupted by the preceding chunks
})
It("honours the data chunk size and drops a trailing metadata chunk", func() {
p := writeWAV(riff(
chunk("fmt ", fmtBody(1, 16, 16000, 0)),
chunk("data", pcm(samples...)),
chunk("LIST", []byte("INFOISFTLavf60.16.100")), // ffmpeg trailer tag
))
out, _, err := readMonoWAVf32(p)
Expect(err).ToNot(HaveOccurred())
expectSamples(out) // trailing LIST bytes not decoded as PCM
})
It("handles the 18-byte extensible fmt chunk", func() {
p := writeWAV(riff(chunk("fmt ", fmtBody(1, 16, 16000, 2)), chunk("data", pcm(samples...))))
out, sr, err := readMonoWAVf32(p)
Expect(err).ToNot(HaveOccurred())
Expect(sr).To(Equal(16000))
expectSamples(out)
})
It("rejects non-mono input", func() {
p := writeWAV(riff(chunk("fmt ", fmtBody(2, 16, 16000, 0)), chunk("data", pcm(samples...))))
_, _, err := readMonoWAVf32(p)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("mono"))
})
It("rejects non-16-bit input", func() {
p := writeWAV(riff(chunk("fmt ", fmtBody(1, 8, 16000, 0)), chunk("data", pcm(samples...))))
_, _, err := readMonoWAVf32(p)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("16-bit"))
})
It("rejects a non-WAV file", func() {
p := writeWAV([]byte("not a riff file at all"))
_, _, err := readMonoWAVf32(p)
Expect(err).To(HaveOccurred())
})
It("errors when the data chunk is missing", func() {
// fmt but no data: the decoder must fail rather than return an
// empty (or garbage) sample slice. The exact message is the
// decoder's, so just assert it errors.
p := writeWAV(riff(chunk("fmt ", fmtBody(1, 16, 16000, 0))))
_, _, err := readMonoWAVf32(p)
Expect(err).To(HaveOccurred())
})
It("round-trips through writeMonoWAVf32", func() {
p := filepath.Join(GinkgoT().TempDir(), "rt.wav")
in := []float32{0.1, -0.2, 0.3, -0.4}
Expect(writeMonoWAVf32(p, in, 16000)).To(Succeed())
out, sr, err := readMonoWAVf32(p)
Expect(err).ToNot(HaveOccurred())
Expect(sr).To(Equal(16000))
Expect(out).To(HaveLen(len(in)))
for i := range in {
Expect(out[i]).To(BeNumerically("~", in[i], 1e-4))
}
})
})
Context("model-gated integration (LOCALVQE_MODEL_PATH)", func() {
It("load + sample rate + hop + fft", func() {
path := modelPathOrSkip()

11
backend/go/parakeet-cpp/.gitignore vendored Normal file

@@ -0,0 +1,11 @@
.cache/
sources/
build/
package/
parakeet-cpp-grpc
# build artifacts staged in-tree by the Makefile (cp from sources/) or
# symlinked for local dev; the real sources live in parakeet.cpp upstream.
*.so
*.so.*
parakeet_capi.h
compile_commands.json


@@ -0,0 +1,93 @@
# parakeet-cpp backend Makefile.
#
# The upstream pin lives below as PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
# so that the bump script (.github/bump_deps.sh) can find and update it; this
# matches the whisper.cpp / ds4 / vibevoice-cpp convention.
#
# Local dev shortcut: if you already have an out-of-tree parakeet.cpp
# build, you can symlink the .so + header into this directory and skip
# the clone/cmake steps entirely, e.g.:
#
# ln -sf /path/to/parakeet.cpp/build-shared/libparakeet.so .
# ln -sf /path/to/parakeet.cpp/include/parakeet_capi.h .
# go build -o parakeet-cpp-grpc .
#
# That's what the L0 smoke test uses. The default target below does the
# proper clone-at-pin + cmake build so CI doesn't need a side-checkout.
PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
BUILD_TYPE?=
NATIVE?=false
# Build ggml statically into libparakeet.so (PIC) so the shared lib is
# self-contained: dlopen needs no libggml*.so alongside it, only system libs
# (libstdc++/libgomp/libc) that the runtime image already provides.
CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DPARAKEET_SHARED=ON -DPARAKEET_BUILD_CLI=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
# parakeet.cpp gates its GGML backends behind PARAKEET_GGML_* options and does
# set(GGML_CUDA ${PARAKEET_GGML_CUDA} CACHE BOOL "" FORCE), so a bare -DGGML_CUDA=ON
# is overwritten back to OFF and the build silently falls back to CPU. Forward the
# PARAKEET_GGML_* options instead. (openblas is not gated, so -DGGML_BLAS passes through.)
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DPARAKEET_GGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DPARAKEET_GGML_HIP=ON
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DPARAKEET_GGML_VULKAN=ON
endif
.PHONY: parakeet-cpp-grpc package build clean purge test all
all: parakeet-cpp-grpc
# Clone the upstream parakeet.cpp source at the pinned commit. Directory
# acts as the target so make only re-clones when missing. After a
# PARAKEET_VERSION bump, run 'make purge && make' to refetch.
sources/parakeet.cpp:
mkdir -p sources/parakeet.cpp
cd sources/parakeet.cpp && \
git init -q && \
git remote add origin $(PARAKEET_REPO) && \
git fetch --depth 1 origin $(PARAKEET_VERSION) && \
git checkout FETCH_HEAD && \
git submodule update --init --recursive --depth 1 --single-branch
# Build the shared lib + header out-of-tree, then stage them next to the
# Go sources so purego.Dlopen("libparakeet.so") and the cgo-less build
# both pick them up.
libparakeet.so: sources/parakeet.cpp
cmake -B sources/parakeet.cpp/build-shared -S sources/parakeet.cpp $(CMAKE_ARGS)
cmake --build sources/parakeet.cpp/build-shared --config Release -j$(JOBS)
cp -fv sources/parakeet.cpp/build-shared/libparakeet.so* ./ 2>/dev/null || true
cp -fv sources/parakeet.cpp/include/parakeet_capi.h ./
parakeet-cpp-grpc: libparakeet.so main.go goparakeetcpp.go
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o parakeet-cpp-grpc .
package: parakeet-cpp-grpc
bash package.sh
build: package
# Test target. Smoke test is gated on PARAKEET_BACKEND_TEST_MODEL +
# PARAKEET_BACKEND_TEST_WAV; without them the spec auto-skips.
test:
LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1
clean: purge
rm -rf libparakeet.so* parakeet_capi.h package parakeet-cpp-grpc
purge:
rm -rf sources/parakeet.cpp


@@ -0,0 +1,79 @@
package main
import "time"
// batchRequest is one in-flight unary transcription waiting to be batched.
// In production pcm/decoder are set; tag is an opaque marker used by tests.
type batchRequest struct {
pcm []float32
decoder int32
tag string
reply chan batchReply
}
// batchReply carries one per-item JSON object string (an element of the C-API's
// JSON array) or an error back to the waiting handler goroutine.
type batchReply struct {
json string
err error
}
// batcher coalesces concurrent batchRequests into batched runBatch calls. A
// single run() goroutine is the sole caller of runBatch, so runBatch (which in
// production calls the thread-unsafe C engine) is never entered concurrently.
type batcher struct {
submit chan *batchRequest
maxSize int
maxWait time.Duration
runBatch func(reqs []*batchRequest) // must deliver a reply to every req
}
func newBatcher(maxSize int, maxWait time.Duration, runBatch func([]*batchRequest)) *batcher {
if maxSize < 1 {
maxSize = 1
}
return &batcher{
submit: make(chan *batchRequest),
maxSize: maxSize,
maxWait: maxWait,
runBatch: runBatch,
}
}
// run is the dispatcher loop: accumulate submitted requests until either maxSize
// is reached or maxWait elapses since the first queued request, then dispatch.
// Exits when stop is closed (draining any partially-filled batch first).
func (b *batcher) run(stop <-chan struct{}) {
for {
var first *batchRequest
select {
case first = <-b.submit:
case <-stop:
return
}
batch := []*batchRequest{first}
// maxSize==1 disables batching: dispatch immediately (passthrough).
if b.maxSize == 1 {
b.runBatch(batch)
continue
}
timer := time.NewTimer(b.maxWait)
fill:
for len(batch) < b.maxSize {
select {
case r := <-b.submit:
batch = append(batch, r)
case <-timer.C:
break fill
case <-stop:
timer.Stop()
b.runBatch(batch)
return
}
}
timer.Stop()
b.runBatch(batch)
}
}
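The size-or-window rule that the dispatcher loop applies can be seen in isolation in the sketch below. It is a simplified stand-in, not the batcher itself: `collect`, `never`, and `window` are hypothetical names, the items are plain strings, and there is no stop channel.

```go
package main

import (
	"fmt"
	"time"
)

const (
	never  = time.Hour             // a window long enough that only size can trigger
	window = 20 * time.Millisecond // a short window for the timeout case
)

// collect blocks for the first item, then keeps filling the batch until it
// holds maxSize items or maxWait has elapsed since the first item arrived.
// Hypothetical helper illustrating the rule; not the batcher type above.
func collect(in <-chan string, maxSize int, maxWait time.Duration) []string {
	batch := []string{<-in}
	if maxSize <= 1 {
		return batch // batching disabled: pass the single item through
	}
	timer := time.NewTimer(maxWait)
	defer timer.Stop()
	for len(batch) < maxSize {
		select {
		case item := <-in:
			batch = append(batch, item)
		case <-timer.C:
			return batch
		}
	}
	return batch
}

func main() {
	in := make(chan string, 3)
	in <- "a"
	in <- "b"
	in <- "c"
	fmt.Println(collect(in, 2, never))  // the size limit closes the batch
	fmt.Println(collect(in, 8, window)) // the wait window closes the batch
}
```

The timer starts when the first item is taken, so a lone request waits at most one window before it is dispatched; that is the latency cost the loop trades for larger batches.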


@@ -0,0 +1,108 @@
package main
import (
"sync"
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("batcher", func() {
echoReply := func(reqs []*batchRequest) {
for _, r := range reqs {
r.reply <- batchReply{json: r.tag}
}
}
It("coalesces concurrent submits into batches", func() {
var mu sync.Mutex
var sizes []int
run := func(reqs []*batchRequest) {
mu.Lock()
sizes = append(sizes, len(reqs))
mu.Unlock()
echoReply(reqs)
}
b := newBatcher(4, 50*time.Millisecond, run)
stop := make(chan struct{})
go b.run(stop)
defer close(stop)
const N = 4
var wg sync.WaitGroup
got := make([]string, N)
for i := 0; i < N; i++ {
wg.Add(1)
go func(i int) {
defer wg.Done()
rep := make(chan batchReply, 1)
b.submit <- &batchRequest{tag: string(rune('a' + i)), reply: rep}
got[i] = (<-rep).json
}(i)
}
wg.Wait()
mu.Lock()
defer mu.Unlock()
total, maxBatch := 0, 0
for _, s := range sizes {
total += s
if s > maxBatch {
maxBatch = s
}
}
Expect(total).To(Equal(N))
Expect(maxBatch).To(BeNumerically(">=", 2), "expected at least one batch to coalesce >1 request")
})
It("dispatches when max size is reached", func() {
dispatched := make(chan int, 8)
run := func(reqs []*batchRequest) {
dispatched <- len(reqs)
echoReply(reqs)
}
b := newBatcher(2, time.Hour, run) // huge window: only size can trigger
stop := make(chan struct{})
go b.run(stop)
defer close(stop)
for i := 0; i < 2; i++ {
rep := make(chan batchReply, 1)
b.submit <- &batchRequest{tag: "x", reply: rep}
go func(rep chan batchReply) { <-rep }(rep)
}
Eventually(dispatched, "2s").Should(Receive(Equal(2)))
})
It("dispatches when the wait window elapses", func() {
dispatched := make(chan int, 8)
run := func(reqs []*batchRequest) {
dispatched <- len(reqs)
echoReply(reqs)
}
b := newBatcher(8, 20*time.Millisecond, run) // size unreachable; window fires
stop := make(chan struct{})
go b.run(stop)
defer close(stop)
rep := make(chan batchReply, 1)
b.submit <- &batchRequest{tag: "x", reply: rep}
go func() { <-rep }()
Eventually(dispatched, "2s").Should(Receive(Equal(1)))
})
It("bypasses batching when max size is 1", func() {
dispatched := make(chan int, 8)
run := func(reqs []*batchRequest) {
dispatched <- len(reqs)
echoReply(reqs)
}
b := newBatcher(1, time.Hour, run) // size 1 => immediate dispatch
stop := make(chan struct{})
go b.run(stop)
defer close(stop)
rep := make(chan batchReply, 1)
b.submit <- &batchRequest{tag: "x", reply: rep}
go func() { <-rep }()
Eventually(dispatched, "2s").Should(Receive(Equal(1)))
})
})


@@ -0,0 +1,556 @@
package main
import (
"context"
"encoding/json"
"errors"
"fmt"
"os"
"path/filepath"
"strconv"
"strings"
"sync"
"time"
"unsafe"
"github.com/go-audio/wav"
"github.com/mudler/LocalAI/pkg/grpc/base"
"github.com/mudler/LocalAI/pkg/grpc/grpcerrors"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/utils"
"github.com/mudler/xlog"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
)
// purego-bound entry points from libparakeet.so. Names match
// parakeet_capi.h exactly so a `nm libparakeet.so | grep parakeet_capi`
// is enough to spot drift.
//
// Functions that return char* are declared as uintptr so we can call
// parakeet_capi_free_string on the same pointer after copying: the
// C-API contract is "caller owns and must free the returned buffer".
var (
CppAbiVersion func() int32
CppLoad func(ggufPath string) uintptr
CppFree func(ctx uintptr)
CppTranscribePath func(ctx uintptr, wavPath string, decoder int32) uintptr
CppTranscribePathJSON func(ctx uintptr, wavPath string, decoder int32) uintptr
CppFreeString func(s uintptr)
CppLastError func(ctx uintptr) string
// Batched JSON transcription: takes a concatenated float buffer of clips
// plus their per-clip sample counts (sum(nSamples)==len(samplesConcat))
// and returns a malloc'd char* JSON ARRAY of per-clip {"text","words",
// "tokens"} objects (uintptr, freed via CppFreeString). purego passes the
// Go slices as the base pointer of their backing array (kept alive for the
// call), matching the CppStreamFeed pcm []float32 binding pattern; the C
// side reads them as const float*/const int*.
CppTranscribePcmBatchJSON func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32) uintptr
// Cache-aware streaming (RNN-T) entry points. stream_begin returns 0 for
// non-streaming models. feed/finalize return a malloc'd char* (uintptr,
// freed via CppFreeString); feed writes 1 to *eouOut on an <EOU>/<EOB>.
CppStreamBegin func(ctx uintptr) uintptr
CppStreamFeed func(s uintptr, pcm []float32, nSamples int32, eouOut unsafe.Pointer) uintptr
CppStreamFinalize func(s uintptr) uintptr
CppStreamFree func(s uintptr)
)
// streamChunkSamples is how much 16 kHz mono PCM we hand to stream_feed per
// call (1 s). The session buffers internally and decodes once a full
// cache-aware encoder chunk is available, so this only bounds how often we
// poll for newly-finalized text, not the model's actual chunk size.
const streamChunkSamples = 16000
// transcriptJSON mirrors the document returned by
// parakeet_capi_transcribe_path_json (see parakeet_capi.h):
//
// {"text":"...",
// "words":[{"w":"...","start":0.480,"end":0.640,"conf":0.9100}, ...],
// "tokens":[{"id":123,"t":0.480,"conf":0.9100}, ...]}
//
// "start"/"end"/"t" are seconds; "conf" is confidence in (0,1].
type transcriptJSON struct {
Text string `json:"text"`
Words []transcriptWord `json:"words"`
Tokens []transcriptToken `json:"tokens"`
}
type transcriptWord struct {
W string `json:"w"`
Start float64 `json:"start"`
End float64 `json:"end"`
Conf float64 `json:"conf"`
}
type transcriptToken struct {
ID int32 `json:"id"`
T float64 `json:"t"`
Conf float64 `json:"conf"`
}
// ParakeetCpp owns a single loaded parakeet_ctx. The C engine is a
// thread-unsafe singleton (mirrors whisper.cpp / vibevoice.cpp). Rather than
// serialize every call through base.SingleThread, we route unary
// transcription through an in-process batcher (its sole dispatcher goroutine
// is the only caller of the engine on that path) and guard the shared engine
// with engineMu so a streaming session and a batched-unary dispatch never
// touch it concurrently.
type ParakeetCpp struct {
base.Base
ctxPtr uintptr
engineMu sync.Mutex // sole guard of the one C engine (dispatcher + streaming)
bat *batcher
batStop chan struct{}
}
// Load is the LocalAI gRPC entry point for LoadModel: it calls
// parakeet_capi_load with the GGUF path and stashes the resulting
// opaque context pointer for AudioTranscription.
func (p *ParakeetCpp) Load(opts *pb.ModelOptions) error {
if opts.ModelFile == "" {
return errors.New("parakeet-cpp: ModelFile is required")
}
ctx := CppLoad(opts.ModelFile)
if ctx == 0 {
// No ctx to ask for last_error (the C-API's last-error buffer
// lives on the ctx that was never returned). Surface the path
// so the operator at least knows which load failed.
return fmt.Errorf("parakeet-cpp: parakeet_capi_load failed for %q", opts.ModelFile)
}
p.ctxPtr = ctx
// Dynamic batching knobs (model YAML options:, key:value form). Batching is
// OFF by default (batch_max_size:1): each request runs on its own. On GPU,
// raising batch_max_size coalesces concurrent requests into one batched
// engine call and improves throughput under load; leave it at 1 on CPU and
// for low-concurrency setups, where batching only adds latency.
maxSize := optInt(opts, "batch_max_size", 1)
maxWaitMs := optInt(opts, "batch_max_wait_ms", 15)
if maxWaitMs < 0 {
maxWaitMs = 0
}
if CppTranscribePcmBatchJSON != nil {
p.batStop = make(chan struct{})
p.bat = newBatcher(maxSize, time.Duration(maxWaitMs)*time.Millisecond, p.runBatch)
go p.bat.run(p.batStop) // dispatcher runs until Free closes batStop
if maxSize > 1 {
xlog.Info("parakeet-cpp: dynamic batching enabled",
"batch_max_size", maxSize, "batch_max_wait_ms", maxWaitMs)
} else {
xlog.Info("parakeet-cpp: dynamic batching off (batch_max_size=1); " +
"set batch_max_size>1 to coalesce concurrent requests on GPU")
}
} else {
xlog.Info("parakeet-cpp: batched C-API not present in libparakeet.so; " +
"batching disabled, using per-request transcription")
}
return nil
}
// optInt reads an integer model option (key:value form) from ModelOptions,
// returning def when absent or unparseable. The options array carries the
// model YAML's options: entries (see core/config; siblings such as
// acestep-cpp parse the same key:value form via strings.Cut on ":").
func optInt(opts *pb.ModelOptions, key string, def int) int {
for _, o := range opts.GetOptions() {
k, v, ok := strings.Cut(o, ":")
if ok && strings.TrimSpace(k) == key {
if n, err := strconv.Atoi(strings.TrimSpace(v)); err == nil {
return n
}
}
}
return def
}
// runBatch is the dispatcher's batch handler and the ONLY caller of the C
// engine on the unary path. It concatenates the batch PCM, calls the batched
// JSON C-API under engineMu, splits the JSON array, and replies to each request.
func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
// Observability: the actual coalesced batch size per engine call. Debug-level
// so it stays silent in normal operation but lets operators confirm/tune batching.
xlog.Debug("parakeet-cpp: dispatching batch", "size", len(reqs))
nSamples := make([]int32, len(reqs))
total := 0
for i, r := range reqs {
nSamples[i] = int32(len(r.pcm))
total += len(r.pcm)
}
concat := make([]float32, 0, total)
for _, r := range reqs {
concat = append(concat, r.pcm...)
}
var dec int32
if len(reqs) > 0 {
dec = reqs[0].decoder
}
p.engineMu.Lock()
cstr := CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
p.engineMu.Unlock()
if cstr == 0 {
err := fmt.Errorf("parakeet-cpp: batch transcribe failed: %s", CppLastError(p.ctxPtr))
for _, r := range reqs {
r.reply <- batchReply{err: err}
}
return
}
raw := goStringFromCPtr(cstr)
CppFreeString(cstr)
var docs []json.RawMessage
if err := json.Unmarshal([]byte(raw), &docs); err != nil || len(docs) != len(reqs) {
e := fmt.Errorf("parakeet-cpp: batch json: got %d results for %d reqs (%v)", len(docs), len(reqs), err)
for _, r := range reqs {
r.reply <- batchReply{err: e}
}
return
}
for i, r := range reqs {
r.reply <- batchReply{json: string(docs[i])}
}
}
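The data layout used by runBatch (one concatenated sample buffer plus per-clip counts going in, one JSON array element per clip coming out) can be sketched without the C engine. `packClips` and `splitReplies` are hypothetical helpers written for this illustration, and the JSON literal is made-up sample input rather than real engine output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// packClips flattens clips into one buffer plus per-clip sample counts, so
// that the counts sum to len(concat). Hypothetical helper.
func packClips(clips [][]float32) (concat []float32, counts []int32) {
	counts = make([]int32, len(clips))
	for i, c := range clips {
		counts[i] = int32(len(c))
		concat = append(concat, c...)
	}
	return concat, counts
}

// splitReplies decodes a JSON array and checks that it carries exactly one
// element per clip, leaving each element undecoded. Hypothetical helper.
func splitReplies(raw string, want int) ([]json.RawMessage, error) {
	var docs []json.RawMessage
	if err := json.Unmarshal([]byte(raw), &docs); err != nil {
		return nil, fmt.Errorf("decode batch json: %w", err)
	}
	if len(docs) != want {
		return nil, fmt.Errorf("got %d results for %d clips", len(docs), want)
	}
	return docs, nil
}

func main() {
	concat, counts := packClips([][]float32{{0.1, 0.2}, {0.3}})
	fmt.Println(len(concat), counts)

	// Made-up reply with one object per clip.
	docs, err := splitReplies(`[{"text":"hi"},{"text":"yo"}]`, len(counts))
	fmt.Println(len(docs), err)
}
```

Keeping each element as a `json.RawMessage` lets every waiting request decode only its own object, and the length check turns a mismatched reply into an error for the whole batch rather than misrouting results.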
// AudioTranscription decodes the wav at opts.Dst to 16 kHz mono PCM and
// submits it to the in-process batcher, which coalesces concurrent requests
// into a single batched engine call (parakeet_capi_transcribe_pcm_batch_json)
// with the default decoder (decoder=0, which selects the right head per
// architecture: transducer for tdt/rnnt/hybrid, CTC for ctc) and shapes the
// per-word timestamps into a LocalAI TranscriptResult.
//
// Parakeet emits word- and token-level timestamps but no native segment
// boundaries, so we synthesise a single whole-clip segment spanning the first
// word start to the last word end. Word-level timings are attached only when
// the caller opts in via timestamp_granularities=["word"] (matching the
// OpenAI API, whose default is segment-level); token ids always populate
// Segment.Tokens.
//
// translate/diarize/prompt/temperature/language/threads are not applicable to
// parakeet and are ignored; streaming is handled by AudioTranscriptionStream
// (L2).
func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
if p.ctxPtr == 0 {
return pb.TranscriptResult{}, grpcerrors.ModelNotLoaded("parakeet-cpp")
}
if opts.Dst == "" {
return pb.TranscriptResult{}, errors.New("parakeet-cpp: TranscriptRequest.dst (audio path) is required")
}
// Fallback when the batched C-API is unavailable: transcribe from a file
// path (original behavior, no batching). The C library's audio loader only
// understands 16 kHz mono WAV/PCM, so convert the input first - otherwise
// any non-WAV upload (MP3, etc.) fails with "failed to load audio". This
// mirrors what every other audio backend (whisper, crispasr) does via
// utils.AudioToWav before handing the file to the engine.
if p.bat == nil {
converted, cleanup, err := convertToWavMono16k(opts.Dst)
if err != nil {
return pb.TranscriptResult{}, err
}
defer cleanup()
cstr := CppTranscribePathJSON(p.ctxPtr, converted, 0)
if cstr == 0 {
return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: transcribe_path_json failed: %s", CppLastError(p.ctxPtr))
}
raw := goStringFromCPtr(cstr)
CppFreeString(cstr)
var doc transcriptJSON
if err := json.Unmarshal([]byte(raw), &doc); err != nil {
return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
}
return transcriptResultFromDoc(doc, opts), nil
}
// Batched path: decode to PCM, submit to the batcher, wait for this request's
// JSON element. The dispatcher is the sole engine caller on this path; both
// sends honour ctx cancellation.
pcm, _, err := decodeWavMono16k(opts.Dst)
if err != nil {
return pb.TranscriptResult{}, err
}
rep := make(chan batchReply, 1)
select {
case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, reply: rep}:
case <-ctx.Done():
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
var res batchReply
select {
case res = <-rep:
case <-ctx.Done():
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
if res.err != nil {
return pb.TranscriptResult{}, res.err
}
var doc transcriptJSON
if err := json.Unmarshal([]byte(res.json), &doc); err != nil {
return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
}
return transcriptResultFromDoc(doc, opts), nil
}
// transcriptResultFromDoc maps a decoded transcriptJSON to a TranscriptResult,
// synthesising a single whole-clip segment and attaching word timings only when
// the caller requested word granularity. Shared by the batched and direct paths.
func transcriptResultFromDoc(doc transcriptJSON, opts *pb.TranscriptRequest) pb.TranscriptResult {
text := strings.TrimSpace(doc.Text)
words := make([]*pb.TranscriptWord, 0, len(doc.Words))
for _, w := range doc.Words {
words = append(words, &pb.TranscriptWord{Start: secondsToNanos(w.Start), End: secondsToNanos(w.End), Text: w.W})
}
tokens := make([]int32, 0, len(doc.Tokens))
for _, t := range doc.Tokens {
tokens = append(tokens, t.ID)
}
var segStart, segEnd int64
if len(words) > 0 {
segStart = words[0].Start
segEnd = words[len(words)-1].End
}
seg := &pb.TranscriptSegment{Id: 0, Start: segStart, End: segEnd, Text: text, Tokens: tokens}
if wordsRequested(opts.TimestampGranularities) {
seg.Words = words
}
return pb.TranscriptResult{Text: text, Segments: []*pb.TranscriptSegment{seg}}
}
// wordsRequested reports whether the caller asked for word-level timestamps.
// The OpenAI transcription API gates word timings behind
// timestamp_granularities[] containing "word" and defaults to segment-level
// otherwise; we follow that contract.
func wordsRequested(granularities []string) bool {
for _, g := range granularities {
if strings.EqualFold(strings.TrimSpace(g), "word") {
return true
}
}
return false
}
// secondsToNanos converts the C-API's fractional-second timestamps into the
// int64 nanoseconds LocalAI carries on TranscriptSegment/TranscriptWord, the
// same nanosecond convention the whisper backend uses.
func secondsToNanos(sec float64) int64 {
return int64(sec * 1e9)
}
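Review note: a small standalone check of the two helpers above. The function bodies are copied from the diff; only `main` is new. Note that `int64(sec * 1e9)` truncates toward zero rather than rounding, so the sample values below were picked to be exactly representable.

```go
package main

import (
	"fmt"
	"strings"
)

// wordsRequested is copied from the diff: true when "word" is among the
// requested granularities, ignoring case and surrounding whitespace.
func wordsRequested(granularities []string) bool {
	for _, g := range granularities {
		if strings.EqualFold(strings.TrimSpace(g), "word") {
			return true
		}
	}
	return false
}

// secondsToNanos is copied from the diff: fractional seconds to int64 nanoseconds.
func secondsToNanos(sec float64) int64 {
	return int64(sec * 1e9)
}

func main() {
	fmt.Println(wordsRequested(nil))                         // false
	fmt.Println(wordsRequested([]string{"segment", " Word "})) // true
	fmt.Println(secondsToNanos(1.5))                         // 1500000000
	fmt.Println(secondsToNanos(0.25))                        // 250000000
}
```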
// AudioTranscriptionStream drives the cache-aware streaming RNN-T over the
// audio at opts.Dst: it decodes the file to 16 kHz mono PCM, feeds it in
// chunks to parakeet_capi_stream_feed, and emits each newly-finalized text
// run as a TranscriptStreamResponse delta. <EOU>/<EOB> events close the
// current segment; a closing FinalResult carries the full transcript and the
// per-utterance segments.
//
// stream_begin returns 0 for models that are not cache-aware streaming models
// (e.g. nvidia/parakeet_realtime_eou_120m-v1 is one that qualifies). For the
// rest we fall back to a single offline transcription emitted as one delta plus
// a closing FinalResult. This matches how LocalAI (and the whisper backend)
// serves non-streaming models, so the streaming endpoint works for every model.
func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
defer close(results)
if p.ctxPtr == 0 {
return grpcerrors.ModelNotLoaded("parakeet-cpp")
}
if opts.Dst == "" {
return errors.New("parakeet-cpp: TranscriptRequest.dst (audio path) is required")
}
if err := ctx.Err(); err != nil {
return status.Error(codes.Canceled, "transcription cancelled")
}
stream := CppStreamBegin(p.ctxPtr)
if stream == 0 {
// Not a cache-aware streaming model: run a normal offline
// transcription and emit it as one delta + a closing final result.
res, err := p.AudioTranscription(ctx, opts)
if err != nil {
return err
}
if t := strings.TrimSpace(res.Text); t != "" {
results <- &pb.TranscriptStreamResponse{Delta: t}
}
results <- &pb.TranscriptStreamResponse{FinalResult: &res}
return nil
}
defer CppStreamFree(stream)
// The C engine is a single shared context: a streaming session and a batched
// unary dispatch must never touch it at once, so hold engineMu for the whole
// stream. This lock is intentionally taken AFTER the non-streaming fallback
// above returns: that fallback goes through AudioTranscription -> the batcher
// -> runBatch, which itself acquires engineMu, so locking here first would
// deadlock. Do not hoist this lock above the fallback.
p.engineMu.Lock()
defer p.engineMu.Unlock()
data, duration, err := decodeWavMono16k(opts.Dst)
if err != nil {
return err
}
var (
full strings.Builder
segText strings.Builder
segments []*pb.TranscriptSegment
segID int32
)
flushSegment := func() {
t := strings.TrimSpace(segText.String())
segText.Reset()
if t == "" {
return
}
segments = append(segments, &pb.TranscriptSegment{Id: segID, Text: t})
segID++
}
// emitDelta consumes the malloc'd char* returned by feed/finalize: frees
// it, accumulates the text, and sends a delta when non-empty. A 0 return
// is an error (vs the "" empty-but-non-NULL no-new-text case).
emitDelta := func(ret uintptr) error {
if ret == 0 {
msg := CppLastError(p.ctxPtr)
if msg == "" {
msg = "unknown error"
}
return fmt.Errorf("parakeet-cpp: stream feed/finalize failed: %s", msg)
}
delta := goStringFromCPtr(ret)
CppFreeString(ret)
if delta == "" {
return nil
}
full.WriteString(delta)
segText.WriteString(delta)
results <- &pb.TranscriptStreamResponse{Delta: delta}
return nil
}
for off := 0; off < len(data); off += streamChunkSamples {
if err := ctx.Err(); err != nil {
return status.Error(codes.Canceled, "transcription cancelled")
}
end := min(off+streamChunkSamples, len(data))
chunk := data[off:end]
var eou int32
ret := CppStreamFeed(stream, chunk, int32(len(chunk)), unsafe.Pointer(&eou))
if err := emitDelta(ret); err != nil {
return err
}
if eou != 0 {
flushSegment()
}
}
// Flush the streaming tail (final encoder chunk).
if err := emitDelta(CppStreamFinalize(stream)); err != nil {
return err
}
flushSegment()
text := strings.TrimSpace(full.String())
if len(segments) == 0 && text != "" {
segments = append(segments, &pb.TranscriptSegment{Id: 0, Text: text})
}
results <- &pb.TranscriptStreamResponse{
FinalResult: &pb.TranscriptResult{
Text: text,
Segments: segments,
Duration: duration,
},
}
return nil
}
// convertToWavMono16k converts an arbitrary audio file to a 16 kHz mono WAV in
// a fresh temp dir and returns the path together with a cleanup func the caller
// must defer. WAV inputs already at 16 kHz/mono/16-bit are passed through by
// utils.AudioToWav (hardlink/copy), everything else is transcoded via ffmpeg.
// Used by the direct (non-batched) transcription path, which hands a file path
// to the C library's WAV-only audio loader.
func convertToWavMono16k(path string) (string, func(), error) {
dir, err := os.MkdirTemp("", "parakeet")
if err != nil {
return "", func() {}, err
}
cleanup := func() { _ = os.RemoveAll(dir) }
converted := filepath.Join(dir, "converted.wav")
if err := utils.AudioToWav(path, converted); err != nil {
cleanup()
return "", func() {}, err
}
return converted, cleanup, nil
}
// decodeWavMono16k converts any input audio to 16 kHz mono PCM and returns the
// float samples plus the clip duration in seconds. Mirrors the whisper
// backend: utils.AudioToWav (ffmpeg) normalises rate/channels, go-audio
// decodes the PCM.
func decodeWavMono16k(path string) ([]float32, float32, error) {
converted, cleanup, err := convertToWavMono16k(path)
if err != nil {
return nil, 0, err
}
defer cleanup()
fh, err := os.Open(converted)
if err != nil {
return nil, 0, err
}
defer func() { _ = fh.Close() }()
buf, err := wav.NewDecoder(fh).FullPCMBuffer()
if err != nil {
return nil, 0, err
}
data := buf.AsFloat32Buffer().Data
var duration float32
if buf.Format != nil && buf.Format.SampleRate > 0 {
duration = float32(len(data)) / float32(buf.Format.SampleRate)
}
return data, duration, nil
}
// Free releases the underlying parakeet_ctx. Called by LocalAI when the
// model is unloaded.
func (p *ParakeetCpp) Free() error {
// Stop the dispatcher before releasing the engine so no in-flight runBatch
// can touch a freed ctx (close leak / use-after-free on reload).
if p.batStop != nil {
close(p.batStop)
p.batStop = nil
}
if p.ctxPtr != 0 {
CppFree(p.ctxPtr)
p.ctxPtr = 0
}
return nil
}
// goStringFromCPtr copies a NUL-terminated C string into Go memory.
// cptr is the raw pointer returned by purego from the C-API (a malloc'd
// buffer the caller owns); callers must free it via CppFreeString after
// the copy lands.
//
// The uintptr->unsafe.Pointer conversion below trips go vet's unsafeptr
// check, which can't distinguish a C-owned heap pointer from Go-managed
// memory. It is safe here: the pointer addresses a malloc'd C buffer the
// Go GC neither tracks nor moves, and we dereference it immediately to
// copy the bytes out, the same pattern (and the same tolerated warning)
// as the whisper backend's unsafe.Slice over segsPtr.
func goStringFromCPtr(cptr uintptr) string {
if cptr == 0 {
return ""
}
p := unsafe.Pointer(cptr) //nolint:govet // C-owned malloc'd buffer, not Go-GC memory (see doc above)
n := 0
for *(*byte)(unsafe.Add(p, n)) != 0 {
n++
}
return string(unsafe.Slice((*byte)(p), n))
}

@@ -0,0 +1,221 @@
package main
import (
"context"
"os"
"path/filepath"
"strings"
"sync"
"testing"
"github.com/ebitengine/purego"
"github.com/go-audio/audio"
"github.com/go-audio/wav"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestParakeetCpp(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "parakeet-cpp Backend Suite")
}
var (
libLoadOnce sync.Once
libLoadErr error
)
// ensureLibLoaded mirrors main.go's bootstrap so a Go test can drive
// the C-API bridge without spinning up the gRPC server. Skips the
// current spec when libparakeet.so isn't loadable from cwd
// ($LD_LIBRARY_PATH or a symlink in ./).
func ensureLibLoaded() {
libLoadOnce.Do(func() {
libName := os.Getenv("PARAKEET_LIBRARY")
if libName == "" {
libName = "libparakeet.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
libLoadErr = err
return
}
purego.RegisterLibFunc(&CppAbiVersion, lib, "parakeet_capi_abi_version")
purego.RegisterLibFunc(&CppLoad, lib, "parakeet_capi_load")
purego.RegisterLibFunc(&CppFree, lib, "parakeet_capi_free")
purego.RegisterLibFunc(&CppTranscribePath, lib, "parakeet_capi_transcribe_path")
purego.RegisterLibFunc(&CppTranscribePathJSON, lib, "parakeet_capi_transcribe_path_json")
if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppTranscribePcmBatchJSON, lib, "parakeet_capi_transcribe_pcm_batch_json")
}
purego.RegisterLibFunc(&CppStreamBegin, lib, "parakeet_capi_stream_begin")
purego.RegisterLibFunc(&CppStreamFeed, lib, "parakeet_capi_stream_feed")
purego.RegisterLibFunc(&CppStreamFinalize, lib, "parakeet_capi_stream_finalize")
purego.RegisterLibFunc(&CppStreamFree, lib, "parakeet_capi_stream_free")
purego.RegisterLibFunc(&CppFreeString, lib, "parakeet_capi_free_string")
purego.RegisterLibFunc(&CppLastError, lib, "parakeet_capi_last_error")
})
if libLoadErr != nil {
Skip("libparakeet.so not loadable: " + libLoadErr.Error())
}
}
// fixturesOrSkip returns the model + audio paths or skips the spec if
// either env var is unset. The smoke test never runs in default CI; it
// needs a real parakeet GGUF and a 16 kHz mono WAV on disk.
func fixturesOrSkip() (string, string) {
modelPath := os.Getenv("PARAKEET_BACKEND_TEST_MODEL")
audioPath := os.Getenv("PARAKEET_BACKEND_TEST_WAV")
if modelPath == "" || audioPath == "" {
Skip("set PARAKEET_BACKEND_TEST_MODEL and PARAKEET_BACKEND_TEST_WAV to run this spec")
}
return modelPath, audioPath
}
// writeMono16kWav writes `samples` frames of 16 kHz mono 16-bit silence to
// path. The result is already in AudioToWav's target format, so the conversion
// helper copies it through without invoking ffmpeg.
func writeMono16kWav(path string, samples int) {
GinkgoHelper()
f, err := os.Create(path)
Expect(err).ToNot(HaveOccurred())
enc := wav.NewEncoder(f, 16000, 16, 1, 1)
buf := &audio.IntBuffer{
Format: &audio.Format{NumChannels: 1, SampleRate: 16000},
SourceBitDepth: 16,
Data: make([]int, samples),
}
Expect(enc.Write(buf)).To(Succeed())
Expect(enc.Close()).To(Succeed())
Expect(f.Close()).To(Succeed())
}
var _ = Describe("ParakeetCpp", func() {
Context("AudioTranscription", func() {
It("transcribes a WAV via the parakeet C-API", func() {
modelPath, audioPath := fixturesOrSkip()
ensureLibLoaded()
p := &ParakeetCpp{}
Expect(p.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
defer func() { _ = p.Free() }()
res, err := p.AudioTranscription(context.Background(), &pb.TranscriptRequest{
Dst: audioPath,
})
Expect(err).ToNot(HaveOccurred())
Expect(strings.TrimSpace(res.Text)).ToNot(BeEmpty(),
"expected non-empty transcript for %s", audioPath)
Expect(res.Segments).To(HaveLen(1),
"synthesises a single whole-clip segment")
Expect(res.Segments[0].Text).To(Equal(res.Text),
"single segment text must equal the top-level text")
// Default (no granularities) is segment-level: no per-word timings.
Expect(res.Segments[0].Words).To(BeEmpty(),
"word timings are opt-in via timestamp_granularities")
})
It("emits word-level timestamps when granularity=word", func() {
modelPath, audioPath := fixturesOrSkip()
ensureLibLoaded()
p := &ParakeetCpp{}
Expect(p.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
defer func() { _ = p.Free() }()
res, err := p.AudioTranscription(context.Background(), &pb.TranscriptRequest{
Dst: audioPath,
TimestampGranularities: []string{"word"},
})
Expect(err).ToNot(HaveOccurred())
Expect(res.Segments).To(HaveLen(1))
seg := res.Segments[0]
Expect(seg.Words).ToNot(BeEmpty(),
"expected per-word timestamps with granularity=word")
// Monotonic, non-negative timings spanning the segment.
Expect(seg.Words[0].Start).To(BeNumerically(">=", int64(0)))
Expect(seg.End).To(BeNumerically(">=", seg.Start))
Expect(seg.Words[len(seg.Words)-1].End).To(Equal(seg.End),
"segment end tracks the last word")
})
})
Context("convertToWavMono16k", func() {
// The non-batched transcription path hands a file path to the C
// library's WAV-only audio loader, so it must convert first.
// utils.AudioToWav passes an already-16kHz/mono/16-bit WAV through
// without ffmpeg, which lets us exercise the helper (and the
// regression: the direct path used to skip conversion entirely)
// without a model, the C library, or ffmpeg.
It("returns a decodable 16kHz mono WAV copy and cleans it up", func() {
dir := GinkgoT().TempDir()
src := filepath.Join(dir, "input.wav")
writeMono16kWav(src, 16000) // 1s of silence at 16 kHz
converted, cleanup, err := convertToWavMono16k(src)
Expect(err).ToNot(HaveOccurred())
// It must produce a fresh temp file, not return the original path.
Expect(converted).ToNot(Equal(src))
Expect(converted).To(BeAnExistingFile())
pcm, _, err := decodeWavMono16k(converted)
Expect(err).ToNot(HaveOccurred())
Expect(pcm).To(HaveLen(16000), "round-trips the sample count")
cleanup()
Expect(converted).ToNot(BeAnExistingFile(), "cleanup removes the temp dir")
})
It("errors on a non-existent input rather than passing the path through", func() {
_, _, err := convertToWavMono16k(filepath.Join(GinkgoT().TempDir(), "missing.mp3"))
Expect(err).To(HaveOccurred())
})
})
Context("AudioTranscriptionStream", func() {
It("streams deltas and a closing FinalResult from a cache-aware model", func() {
// Streaming needs a cache-aware streaming model (e.g.
// realtime_eou); the offline test model would fail stream_begin.
modelPath := os.Getenv("PARAKEET_BACKEND_TEST_STREAM_MODEL")
audioPath := os.Getenv("PARAKEET_BACKEND_TEST_WAV")
if modelPath == "" || audioPath == "" {
Skip("set PARAKEET_BACKEND_TEST_STREAM_MODEL (cache-aware streaming model) and PARAKEET_BACKEND_TEST_WAV")
}
ensureLibLoaded()
p := &ParakeetCpp{}
Expect(p.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
defer func() { _ = p.Free() }()
results := make(chan *pb.TranscriptStreamResponse, 64)
errCh := make(chan error, 1)
go func() {
errCh <- p.AudioTranscriptionStream(context.Background(),
&pb.TranscriptRequest{Dst: audioPath}, results)
}()
var deltas []string
var final *pb.TranscriptResult
for r := range results {
if r.Delta != "" {
deltas = append(deltas, r.Delta)
}
if r.FinalResult != nil {
final = r.FinalResult
}
}
Expect(<-errCh).ToNot(HaveOccurred())
Expect(final).ToNot(BeNil(), "expected a closing FinalResult")
Expect(strings.TrimSpace(final.Text)).ToNot(BeEmpty(),
"expected a non-empty streamed transcript")
Expect(final.Segments).ToNot(BeEmpty(),
"FinalResult always carries at least one segment")
// The concatenated deltas reconstruct the final transcript.
Expect(strings.TrimSpace(strings.Join(deltas, ""))).To(Equal(strings.TrimSpace(final.Text)),
"deltas should reconstruct the final text")
})
})
})

@@ -0,0 +1,75 @@
package main
// Started internally by LocalAI - one gRPC server per loaded model.
//
// Loads libparakeet.so via purego and registers the flat C-API entry
// points declared in parakeet_capi.h. The library name can be overridden
// with PARAKEET_LIBRARY (mirrors the WHISPER_LIBRARY / VIBEVOICECPP_LIBRARY
// convention in the sibling backends); the default looks for the .so next
// to this binary.
import (
"flag"
"fmt"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
libName := os.Getenv("PARAKEET_LIBRARY")
if libName == "" {
libName = "libparakeet.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(fmt.Errorf("parakeet-cpp: dlopen %q: %w", libName, err))
}
// Bound 1:1 to parakeet_capi.h. The C-API returns malloc'd char*
// buffers from transcribe_*; we register those as uintptr so we get
// the raw pointer back and can call parakeet_capi_free_string on it
// (purego's string return would copy and forget the original pointer,
// leaking it on every call).
libFuncs := []LibFuncs{
{&CppAbiVersion, "parakeet_capi_abi_version"},
{&CppLoad, "parakeet_capi_load"},
{&CppFree, "parakeet_capi_free"},
{&CppTranscribePath, "parakeet_capi_transcribe_path"},
{&CppTranscribePathJSON, "parakeet_capi_transcribe_path_json"},
{&CppStreamBegin, "parakeet_capi_stream_begin"},
{&CppStreamFeed, "parakeet_capi_stream_feed"},
{&CppStreamFinalize, "parakeet_capi_stream_finalize"},
{&CppStreamFree, "parakeet_capi_stream_free"},
{&CppFreeString, "parakeet_capi_free_string"},
{&CppLastError, "parakeet_capi_last_error"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
// The batched-JSON entry point exists only in newer libparakeet.so (ABI >= 2).
// Probe with Dlsym and register only if present, so the backend still loads
// against an older library (it falls back to per-request transcription).
if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppTranscribePcmBatchJSON, lib, "parakeet_capi_transcribe_pcm_batch_json")
}
fmt.Fprintf(os.Stderr, "[parakeet-cpp] ABI=%d\n", CppAbiVersion())
flag.Parse()
if err := grpc.StartServer(*addr, &ParakeetCpp{}); err != nil {
panic(err)
}
}
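Review note: the Dlsym probe above makes the batched entry point optional, and the backend falls back to per-request transcription when it is absent. A library-free sketch of that "bind if present, otherwise fall back" shape follows. Everything in it is hypothetical: `library` is a plain map standing in for a dlopen'd handle, and `bind`/`transcribe` are illustrative names, not purego or LocalAI APIs.

```go
package main

import "fmt"

// library is a hypothetical stand-in for a dlopen'd handle: it maps a symbol
// name to an implementation, and a missing key models a missing symbol.
type library map[string]func(clips []string) []string

// backend keeps the optional batched entry point; batch stays nil when the
// library does not export it.
type backend struct {
	batch  func(clips []string) []string
	single func(clip string) string
}

// bind registers the batched entry point only when the symbol is present.
func bind(lib library) *backend {
	b := &backend{single: func(clip string) string { return "single:" + clip }}
	if fn, ok := lib["transcribe_batch"]; ok && fn != nil {
		b.batch = fn
	}
	return b
}

// transcribe prefers the batched entry point and otherwise loops per clip.
func (b *backend) transcribe(clips []string) []string {
	if b.batch != nil {
		return b.batch(clips)
	}
	out := make([]string, 0, len(clips))
	for _, c := range clips {
		out = append(out, b.single(c))
	}
	return out
}

func main() {
	older := bind(library{})
	newer := bind(library{"transcribe_batch": func(clips []string) []string {
		out := make([]string, 0, len(clips))
		for _, c := range clips {
			out = append(out, "batch:"+c)
		}
		return out
	}})
	fmt.Println(older.transcribe([]string{"a", "b"})) // [single:a single:b]
	fmt.Println(newer.transcribe([]string{"a", "b"})) // [batch:a batch:b]
}
```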

@@ -0,0 +1,23 @@
#!/bin/bash
#
# L0 packaging stub: copy the binary, run.sh and libparakeet.so* into
# package/. The full ldd walk (libc, libstdc++, libgomp, GPU runtimes,
# arch detection) lands in L3, mirroring backend/go/whisper/package.sh.
set -e
CURDIR=$(dirname "$(realpath "$0")")
mkdir -p "$CURDIR/package/lib"
cp -avf "$CURDIR/parakeet-cpp-grpc" "$CURDIR/package/"
cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
# libparakeet.so + any soname symlinks (libparakeet.so.X, libparakeet.so.X.Y).
cp -avf "$CURDIR"/libparakeet.so* "$CURDIR/package/lib/" 2>/dev/null || {
echo "ERROR: libparakeet.so not found in $CURDIR, run 'make' first" >&2
exit 1
}
echo "L0 package layout (full ldd walk lands in L3):"
ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"

backend/go/parakeet-cpp/run.sh Executable file

@@ -0,0 +1,16 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath "$0")")
export LD_LIBRARY_PATH="$CURDIR/lib:$CURDIR:${LD_LIBRARY_PATH:-}"
# If a self-contained ld.so was packaged, route through it so the
# packaged libc / libstdc++ are used instead of the host's (matches the
# whisper backend's runtime layout).
if [ -f "$CURDIR/lib/ld.so" ]; then
echo "Using lib/ld.so"
exec "$CURDIR/lib/ld.so" "$CURDIR/parakeet-cpp-grpc" "$@"
fi
exec "$CURDIR/parakeet-cpp-grpc" "$@"

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# qwen3-tts.cpp version
QWEN3TTS_REPO?=https://github.com/predict-woo/qwen3-tts.cpp
-QWEN3TTS_CPP_VERSION?=7a762e2ad4bacc6fdda81d81bf10a09ffb546f29
+QWEN3TTS_CPP_VERSION?=136e5d36c17083da0321fd96512dc7b263f94a44
SO_TARGET?=libgoqwen3ttscpp.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

@@ -4,6 +4,7 @@ import (
"fmt"
"os"
"path/filepath"
"strings"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
@@ -21,6 +22,43 @@ type Qwen3TtsCpp struct {
threads int
}
// languageNameAliases maps common full language names to the canonical
// two-letter code understood by the C++ language_to_id table.
var languageNameAliases = map[string]string{
"english": "en",
"russian": "ru",
"chinese": "zh",
"japanese": "ja",
"korean": "ko",
"german": "de",
"french": "fr",
"spanish": "es",
"italian": "it",
"portuguese": "pt",
}
// normalizeLanguage coerces a caller-supplied language into the canonical code
// the model expects. It lowercases, trims, strips any region/locale suffix
// (en-US, en_US, ja.JP -> en/ja), and resolves common full names (english -> en).
// An empty input stays empty so the C++ side applies its English default; an
// unrecognized value is returned normalized so C++ can log it and default.
func normalizeLanguage(lang string) string {
lang = strings.ToLower(strings.TrimSpace(lang))
if lang == "" {
return ""
}
// Strip region/locale suffix: keep the segment before the first separator.
if i := strings.IndexAny(lang, "-_."); i >= 0 {
lang = lang[:i]
}
if code, ok := languageNameAliases[lang]; ok {
return code
}
return lang
}
func (q *Qwen3TtsCpp) Load(opts *pb.ModelOptions) error {
// ModelFile is the model directory path (containing GGUF files)
modelDir := opts.ModelFile
@@ -54,7 +92,7 @@ func (q *Qwen3TtsCpp) TTS(req *pb.TTSRequest) error {
dst := req.Dst
language := ""
if req.Language != nil {
-language = *req.Language
+language = normalizeLanguage(*req.Language)
}
// Synthesis parameters with sensible defaults

@@ -0,0 +1,53 @@
package main
import (
"testing"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestLanguageNormalization(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "qwen3-tts-cpp language normalization")
}
var _ = Describe("normalizeLanguage", func() {
DescribeTable("maps caller input to the canonical model language code",
func(input, expected string) {
Expect(normalizeLanguage(input)).To(Equal(expected))
},
// Canonical codes pass through unchanged
Entry("canonical en", "en", "en"),
Entry("canonical zh", "zh", "zh"),
Entry("canonical pt", "pt", "pt"),
// Case-insensitive
Entry("uppercase", "EN", "en"),
Entry("mixed case", "Ja", "ja"),
// Surrounding whitespace
Entry("trims whitespace", " en ", "en"),
// Region/locale stripping
Entry("BCP-47 region", "en-US", "en"),
Entry("underscore region", "en_US", "en"),
Entry("dotted locale", "ja.JP", "ja"),
Entry("region + case", "ZH-CN", "zh"),
// Full-name aliases
Entry("english name", "english", "en"),
Entry("chinese name cased", "Chinese", "zh"),
Entry("japanese name", "japanese", "ja"),
Entry("russian name", "russian", "ru"),
Entry("portuguese name", "portuguese", "pt"),
// Empty stays empty (C++ applies the English default)
Entry("empty", "", ""),
Entry("whitespace only", " ", ""),
// Unknown values pass through normalized so C++ can log + default
Entry("unknown code", "klingon", "klingon"),
Entry("unknown with region", "xx-YY", "xx"),
)
})
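Review note: a standalone run of normalizeLanguage outside Ginkgo, for quick experimentation. The alias map and function body are copied from the diff; only `main` and the extra `Portuguese-BR` input are new.

```go
package main

import (
	"fmt"
	"strings"
)

// languageNameAliases is copied from the diff.
var languageNameAliases = map[string]string{
	"english": "en", "russian": "ru", "chinese": "zh", "japanese": "ja",
	"korean": "ko", "german": "de", "french": "fr", "spanish": "es",
	"italian": "it", "portuguese": "pt",
}

// normalizeLanguage is copied from the diff: lowercase, trim, strip the
// region/locale suffix, then resolve full names to their two-letter code.
func normalizeLanguage(lang string) string {
	lang = strings.ToLower(strings.TrimSpace(lang))
	if lang == "" {
		return ""
	}
	if i := strings.IndexAny(lang, "-_."); i >= 0 {
		lang = lang[:i]
	}
	if code, ok := languageNameAliases[lang]; ok {
		return code
	}
	return lang
}

func main() {
	for _, in := range []string{"en-US", " Chinese ", "ja.JP", "Portuguese-BR", "klingon", ""} {
		fmt.Printf("%q -> %q\n", in, normalizeLanguage(in))
	}
	// "en-US" -> "en"
	// " Chinese " -> "zh"
	// "ja.JP" -> "ja"
	// "Portuguese-BR" -> "pt"
	// "klingon" -> "klingon"
	// "" -> ""
}
```

Suffix stripping runs before alias lookup, which is why a full name with a region suffix still resolves.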

@@ -11,7 +11,7 @@ JOBS?=$(shell nproc --ignore=1)
# build; leaving this on `master` always picks up the latest C-API surface
# (incl. the per-detection accessor functions used by gorfdetrcpp.go).
RFDETR_REPO?=https://github.com/mudler/rf-detr.cpp.git
-RFDETR_VERSION?=main
+RFDETR_VERSION?=65c0ffcc9a9bc9dae38252f63d0417c9845a6cf7
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# stablediffusion.cpp (ggml)
STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=92dc7268fc4ffb0c0cc0bd52dfcefea91326e797
+STABLEDIFFUSION_GGML_VERSION?=1f9ee88e09c258053fa59d5e05e23dfb10fa0b13
CMAKE_ARGS+=-DGGML_MAX_NAME=128

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# whisper.cpp version
WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=27101c01dcac1676e2b6422256233cd0f1f9ae28
+WHISPER_CPP_VERSION?=99613cb720b65036237d44b52f753b51f75c2797
SO_TARGET?=libgowhisper.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

@@ -122,6 +122,62 @@
nvidia-cuda-12: "cuda12-whisper"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-whisper"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-whisper"
- &crispasr
name: "crispasr"
alias: "crispasr"
license: mit
icon: https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg
description: |
CrispASR unified speech engine (whisper.cpp fork on ggml) supporting many ASR architectures (Parakeet, Canary, Voxtral, Qwen3-ASR, Granite, Wav2Vec2, Moonshine, OmniASR, FireRedASR, and more).
urls:
- https://github.com/CrispStrobe/CrispASR
tags:
- audio-transcription
- CPU
- GPU
- CUDA
- HIP
capabilities:
default: "cpu-crispasr"
nvidia: "cuda12-crispasr"
intel: "intel-sycl-f16-crispasr"
metal: "metal-crispasr"
amd: "rocm-crispasr"
vulkan: "vulkan-crispasr"
nvidia-l4t: "nvidia-l4t-arm64-crispasr"
nvidia-cuda-13: "cuda13-crispasr"
nvidia-cuda-12: "cuda12-crispasr"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-crispasr"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-crispasr"
- &parakeetcpp
name: "parakeet-cpp"
alias: "parakeet-cpp"
license: mit
icon: https://avatars.githubusercontent.com/u/95302084
description: |
parakeet.cpp is a C++/ggml port of NVIDIA NeMo Parakeet automatic speech recognition (ASR) models.
It supports the tdt, ctc, rnnt and hybrid decoder families as well as cache-aware streaming transcription,
and runs on CPU, NVIDIA CUDA, AMD ROCm/HIP, Intel SYCL and NVIDIA Jetson (L4T) targets.
urls:
- https://github.com/mudler/parakeet.cpp
tags:
- audio-transcription
- CPU
- GPU
- CUDA
- HIP
capabilities:
default: "cpu-parakeet-cpp"
nvidia: "cuda12-parakeet-cpp"
intel: "intel-sycl-f16-parakeet-cpp"
metal: "metal-parakeet-cpp"
amd: "rocm-parakeet-cpp"
vulkan: "vulkan-parakeet-cpp"
nvidia-l4t: "nvidia-l4t-arm64-parakeet-cpp"
nvidia-cuda-13: "cuda13-parakeet-cpp"
nvidia-cuda-12: "cuda12-parakeet-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-parakeet-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-parakeet-cpp"
- &voxtral
name: "voxtral"
alias: "voxtral"
@@ -1928,6 +1984,246 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-whisper"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-whisper
## crispasr
- !!merge <<: *crispasr
name: "crispasr-development"
capabilities:
default: "cpu-crispasr-development"
nvidia: "cuda12-crispasr-development"
intel: "intel-sycl-f16-crispasr-development"
metal: "metal-crispasr-development"
amd: "rocm-crispasr-development"
vulkan: "vulkan-crispasr-development"
nvidia-l4t: "nvidia-l4t-arm64-crispasr-development"
nvidia-cuda-13: "cuda13-crispasr-development"
nvidia-cuda-12: "cuda12-crispasr-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-crispasr-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-crispasr-development"
- !!merge <<: *crispasr
name: "nvidia-l4t-arm64-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-crispasr"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-crispasr
- !!merge <<: *crispasr
name: "nvidia-l4t-arm64-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-crispasr"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-crispasr
- !!merge <<: *crispasr
name: "cuda13-nvidia-l4t-arm64-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-crispasr"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-crispasr
- !!merge <<: *crispasr
name: "cuda13-nvidia-l4t-arm64-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-crispasr"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-crispasr
- !!merge <<: *crispasr
name: "cpu-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-crispasr"
mirrors:
- localai/localai-backends:latest-cpu-crispasr
- !!merge <<: *crispasr
name: "metal-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-crispasr"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-crispasr
- !!merge <<: *crispasr
name: "metal-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-crispasr"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-crispasr
- !!merge <<: *crispasr
name: "cpu-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-crispasr"
mirrors:
- localai/localai-backends:master-cpu-crispasr
- !!merge <<: *crispasr
name: "cuda12-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-crispasr
- !!merge <<: *crispasr
name: "rocm-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-crispasr
- !!merge <<: *crispasr
name: "intel-sycl-f32-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-crispasr
- !!merge <<: *crispasr
name: "intel-sycl-f16-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-crispasr
- !!merge <<: *crispasr
name: "vulkan-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-crispasr
- !!merge <<: *crispasr
name: "vulkan-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-crispasr"
mirrors:
- localai/localai-backends:master-gpu-vulkan-crispasr
- !!merge <<: *crispasr
name: "cuda12-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-crispasr"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-crispasr
- !!merge <<: *crispasr
name: "rocm-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-crispasr"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-crispasr
- !!merge <<: *crispasr
name: "intel-sycl-f32-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-crispasr"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-crispasr
- !!merge <<: *crispasr
name: "intel-sycl-f16-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-crispasr"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-crispasr
- !!merge <<: *crispasr
name: "cuda13-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-crispasr
- !!merge <<: *crispasr
name: "cuda13-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-crispasr"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-crispasr
## parakeet-cpp
- !!merge <<: *parakeetcpp
name: "parakeet-cpp-development"
capabilities:
default: "cpu-parakeet-cpp-development"
nvidia: "cuda12-parakeet-cpp-development"
intel: "intel-sycl-f16-parakeet-cpp-development"
metal: "metal-parakeet-cpp-development"
amd: "rocm-parakeet-cpp-development"
vulkan: "vulkan-parakeet-cpp-development"
nvidia-l4t: "nvidia-l4t-arm64-parakeet-cpp-development"
nvidia-cuda-13: "cuda13-parakeet-cpp-development"
nvidia-cuda-12: "cuda12-parakeet-cpp-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-parakeet-cpp-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-parakeet-cpp-development"
- !!merge <<: *parakeetcpp
name: "nvidia-l4t-arm64-parakeet-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-parakeet-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "nvidia-l4t-arm64-parakeet-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-parakeet-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "cuda13-nvidia-l4t-arm64-parakeet-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-parakeet-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "cuda13-nvidia-l4t-arm64-parakeet-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-parakeet-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "cpu-parakeet-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-parakeet-cpp"
mirrors:
- localai/localai-backends:latest-cpu-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "cpu-parakeet-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-parakeet-cpp"
mirrors:
- localai/localai-backends:master-cpu-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "metal-parakeet-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-parakeet-cpp"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "metal-parakeet-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-parakeet-cpp"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "cuda12-parakeet-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-parakeet-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "cuda12-parakeet-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-parakeet-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "rocm-parakeet-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-parakeet-cpp"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "rocm-parakeet-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-parakeet-cpp"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "intel-sycl-f32-parakeet-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-parakeet-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "intel-sycl-f32-parakeet-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-parakeet-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "intel-sycl-f16-parakeet-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-parakeet-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "intel-sycl-f16-parakeet-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-parakeet-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "vulkan-parakeet-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-parakeet-cpp"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "vulkan-parakeet-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-parakeet-cpp"
mirrors:
- localai/localai-backends:master-gpu-vulkan-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "cuda13-parakeet-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-parakeet-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-parakeet-cpp
- !!merge <<: *parakeetcpp
name: "cuda13-parakeet-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp
## stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "cpu-stablediffusion-ggml"

View File

@@ -37,6 +37,20 @@ def is_int(s):
     except ValueError:
         return False

+def coerce_param_value(value):
+    """Coerce a TTSRequest.params value (string on the wire) to the type the
+    Chatterbox generate() kwargs expect (float/int/bool), matching how static
+    YAML options are coerced at load time. Non-string values pass through."""
+    if not isinstance(value, str):
+        return value
+    if is_float(value):
+        return float(value)
+    if is_int(value):
+        return int(value)
+    if value.lower() in ["true", "false"]:
+        return value.lower() == "true"
+    return value
+
 def split_text_at_word_boundary(text, max_length=250):
     """
     Split text at word boundaries without truncating words.
@@ -191,6 +205,14 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
             # add options to kwargs
             kwargs.update(self.options)

+            # Merge per-request params (TTSRequest.params), overriding the static
+            # YAML options. This exposes Chatterbox generation knobs (e.g.
+            # exaggeration, cfg_weight, temperature) per request. Values arrive as
+            # strings on the wire and are coerced to float/int/bool.
+            if hasattr(request, "params") and request.params:
+                for key, value in request.params.items():
+                    kwargs[key] = coerce_param_value(value)
+
             # Check if text exceeds 250 characters
             # (chatterbox does not support long text)
             # https://github.com/resemble-ai/chatterbox/issues/60

View File

@@ -1,6 +1,6 @@
 --extra-index-url https://download.pytorch.org/whl/cpu
-transformers==4.48.3
+transformers==5.0.0rc3
 accelerate
-torch==2.4.1
+torch==2.7.1+cpu
 torchaudio==2.4.1
 coqui-tts

View File

@@ -1,5 +1,5 @@
-torch==2.4.1
+torch==2.7.1+cpu
 torchaudio==2.4.1
-transformers==4.48.3
+transformers==5.0.0rc3
 accelerate
 coqui-tts

View File

@@ -1,6 +1,6 @@
 --extra-index-url https://download.pytorch.org/whl/rocm7.0
-torch==2.10.0+rocm7.0
+torch==2.7.1+cpu
 torchaudio==2.10.0+rocm7.0
-transformers==4.48.3
+transformers==5.0.0rc3
 accelerate
 coqui-tts

View File

@@ -1,8 +1,8 @@
 --extra-index-url https://download.pytorch.org/whl/xpu
-torch==2.8.0+xpu
+torch==2.7.1+cpu
 torchaudio==2.8.0+xpu
 optimum[openvino]
 setuptools
-transformers==4.48.3
+transformers==5.0.0rc3
 accelerate
 coqui-tts

View File

@@ -1,4 +1,4 @@
-torch==2.7.1
-transformers==4.48.3
+torch==2.7.1+cpu
+transformers==5.0.0rc3
 accelerate
 coqui-tts

View File

@@ -1,3 +1,4 @@
--extra-index-url https://download.pytorch.org/whl/cu130
torch
texterrors==1.1.6
nemo_toolkit[asr]

View File

@@ -47,6 +47,26 @@ def is_int(s):
         return False


+def coerce_param_value(value):
+    """Coerce a string param value (from the TTSRequest.params map, which is
+    string-typed on the wire) into the most specific Python type the model
+    generation kwargs expect: bool, int, float, else the original string."""
+    if not isinstance(value, str):
+        return value
+    lowered = value.strip().lower()
+    if lowered in ("true", "false"):
+        return lowered == "true"
+    try:
+        return int(value)
+    except ValueError:
+        pass
+    try:
+        return float(value)
+    except ValueError:
+        pass
+    return value
+
+
 _ONE_DAY_IN_SECONDS = 60 * 60 * 24

 # If MAX_WORKERS are specified in the environment use it, otherwise default to 1
@@ -322,6 +342,19 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
         return backend_pb2.Result(message="Model loaded successfully", success=True)

+    def _effective_instruct(self, request):
+        """Resolve the instruction/style string for this request, preferring the
+        per-request TTSRequest.instructions value and falling back to the static
+        YAML `instruct` option. Empty string means "no instruction"."""
+        req_instruct = (
+            request.instructions
+            if hasattr(request, "instructions") and request.instructions
+            else ""
+        )
+        if req_instruct:
+            return req_instruct
+        return self.options.get("instruct", "") or ""
+
     def _detect_mode(self, request):
         """Detect which mode to use based on request parameters."""
         # Priority: VoiceClone > VoiceDesign > CustomVoice
@@ -338,8 +371,8 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
         if self.audio_path or self.voices:
             return "VoiceClone"

-        # VoiceDesign: instruct option is provided
-        if "instruct" in self.options and self.options["instruct"]:
+        # VoiceDesign: instruct provided per-request or via YAML option
+        if self._effective_instruct(request):
             return "VoiceDesign"

         # Default to CustomVoice
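
The precedence rule introduced by `_effective_instruct` (per-request value first, static YAML option second, empty string meaning "no instruction") can be restated in a few lines. The following is a minimal Go sketch of that rule only; `effectiveInstruct` and its parameters are hypothetical names for illustration and are not part of the backend, which implements this in Python.

```go
package main

import "fmt"

// effectiveInstruct is a hypothetical helper: it prefers the per-request
// instruction and falls back to the static "instruct" option. Reading a
// missing key (or a nil map) yields "", which means "no instruction".
func effectiveInstruct(requestInstructions string, options map[string]string) string {
	if requestInstructions != "" {
		return requestInstructions
	}
	return options["instruct"]
}

func main() {
	staticOptions := map[string]string{"instruct": "calm narrator"}

	// Per-request value wins over the static option.
	fmt.Println(effectiveInstruct("excited announcer", staticOptions))
	// Empty per-request value falls back to the static option.
	fmt.Println(effectiveInstruct("", staticOptions))
	// Neither is set: the result is the empty string.
	fmt.Printf("%q\n", effectiveInstruct("", nil))
}
```

Because mode detection now calls the same resolver, a request that carries only a per-request instruction would be classified as VoiceDesign even when the YAML has no `instruct` option, which appears to be the intent of the change.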
@@ -690,10 +723,20 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
             if do_sample is not None:
                 generation_kwargs["do_sample"] = do_sample

-            instruct = self.options.get("instruct", "")
+            # Prefer the per-request instruction (TTSRequest.instructions) over the
+            # static YAML `instruct` option. This lets clients set a different style
+            # (CustomVoice emotion) or designed voice (VoiceDesign) per request.
+            instruct = self._effective_instruct(request)
             if instruct is not None and instruct != "":
                 generation_kwargs["instruct"] = instruct

+            # Merge any per-request backend-specific params (TTSRequest.params).
+            # Values arrive as strings on the wire; coerce to int/float/bool so the
+            # model receives the types it expects. These override YAML-derived kwargs.
+            if hasattr(request, "params") and request.params:
+                for key, value in request.params.items():
+                    generation_kwargs[key] = coerce_param_value(value)
+
             # Generate audio based on mode
             if mode == "VoiceClone":
                 # VoiceClone mode
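
Since `TTSRequest.params` is string-typed on the wire, the receiving side has to decide on an order in which to try the candidate types. The sketch below restates the bool, then int, then float, then string order of this file's `coerce_param_value` in Go. It is an illustration under assumptions: `coerceParamValue` is a hypothetical name, and Go's `strconv` parsers do not accept exactly the same literals as Python's `int()` and `float()` (surrounding whitespace, for example), so edge cases may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// coerceParamValue is a hypothetical Go analogue of the Python helper above.
// It tries bool, then int, then float, and otherwise keeps the string.
func coerceParamValue(value string) any {
	lowered := strings.ToLower(strings.TrimSpace(value))
	if lowered == "true" || lowered == "false" {
		return lowered == "true"
	}
	// Try int before float so that "3" stays an integer.
	if i, err := strconv.Atoi(value); err == nil {
		return i
	}
	if f, err := strconv.ParseFloat(value, 64); err == nil {
		return f
	}
	return value
}

func main() {
	for _, raw := range []string{"true", "3", "0.7", "cheerful"} {
		v := coerceParamValue(raw)
		fmt.Printf("%q -> %v (%T)\n", raw, v, v)
	}
}
```

The order matters: trying float first would turn `"3"` into `3.0`, which is what the Chatterbox variant earlier in this diff would do if its `is_float` helper accepts integer literals.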

View File

@@ -1,4 +1,4 @@
-grpcio==1.80.0
+grpcio==1.81.0
 protobuf==7.35.0
 certifi
 setuptools

View File

@@ -3,5 +3,5 @@
 # on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
 # instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
 # so uv consults this index alongside PyPI.
---extra-index-url https://wheels.vllm.ai/0.21.0/cu130
-vllm==0.21.0
+--extra-index-url https://wheels.vllm.ai/0.22.1/cu130
+vllm==0.22.1

View File

@@ -1,4 +1,4 @@
-grpcio==1.80.0
+grpcio==1.81.0
 protobuf
 certifi
 setuptools

View File

@@ -17,6 +17,7 @@ import (
"time"
"github.com/mudler/LocalAI/internal"
"github.com/mudler/LocalAI/pkg/httpclient"
)
// Release represents a LocalAI release
@@ -67,9 +68,7 @@ func NewReleaseManager() *ReleaseManager {
 		CurrentVersion: internal.PrintableVersion(),
 		ChecksumsPath: checksumsPath,
 		MetadataPath: metadataPath,
-		HTTPClient: &http.Client{
-			Timeout: 30 * time.Second,
-		},
+		HTTPClient: httpclient.NewWithTimeout(30*time.Second, httpclient.WithFollowRedirects()),
 	}
 }

View File

@@ -90,6 +90,8 @@ type Application struct {
// LocalAI Assistant in-process MCP server. nil when DisableLocalAIAssistant
// is set; otherwise initialised in start() after galleryService.
localAIAssistant *mcpTools.LocalAIAssistantHolder
shutdownOnce sync.Once
}
func newApplication(appConfig *config.ApplicationConfig) *Application {
@@ -320,6 +322,24 @@ func (a *Application) IsDistributed() bool {
return a.distributed != nil
}
// Shutdown stops backend gRPC processes and distributed services
// synchronously on the caller's stack. The context-cancel goroutine wired
// in New does the same work asynchronously, which races test-binary exit
// and CLI shutdown — orphaning spawned mock-backend / llama.cpp / etc.
// children to init. Callers that need a guarantee that cleanup has
// finished before they proceed (AfterSuite/AfterEach, signal handlers)
// must call this. Safe to call multiple times.
func (a *Application) Shutdown() error {
var err error
a.shutdownOnce.Do(func() {
a.distributed.Shutdown()
if a.modelLoader != nil {
err = a.modelLoader.StopAllGRPC()
}
})
return err
}
// waitForHealthyWorker blocks until at least one healthy backend worker is registered.
// This prevents the agent pool from failing during startup when workers haven't connected yet.
func (a *Application) waitForHealthyWorker() {

View File

@@ -16,7 +16,9 @@ import (
"github.com/mudler/LocalAI/core/services/jobs"
"github.com/mudler/LocalAI/core/services/messaging"
"github.com/mudler/LocalAI/core/services/nodes"
"github.com/mudler/LocalAI/core/services/nodes/prefixcache"
"github.com/mudler/LocalAI/core/services/storage"
"github.com/mudler/LocalAI/pkg/distributedhdr"
"github.com/mudler/LocalAI/pkg/sanitize"
"github.com/mudler/xlog"
"gorm.io/gorm"
@@ -100,7 +102,12 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
 	xlog.Info("Distributed instance", "id", cfg.Distributed.InstanceID)

 	// Connect to NATS
-	natsClient, err := messaging.New(cfg.Distributed.NatsURL)
+	natsAuth := cfg.Distributed.NatsAuthConfig()
+	if natsAuth.RequireAuth && (natsAuth.ServiceUserJWT == "" || natsAuth.ServiceUserSeed == "") {
+		return nil, fmt.Errorf("LOCALAI_NATS_REQUIRE_AUTH requires LOCALAI_NATS_SERVICE_JWT and LOCALAI_NATS_SERVICE_SEED")
+	}
+	natsOpts := cfg.Distributed.NatsMessagingOptions("", "")
+	natsClient, err := messaging.New(cfg.Distributed.NatsURL, natsOpts...)
 	if err != nil {
 		return nil, fmt.Errorf("connecting to NATS: %w", err)
 	}
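
The guard added in this hunk fails fast when authentication is required but either half of the credential pair is missing, rather than letting the NATS connection attempt fail later. Below is a minimal, self-contained Go sketch of that check. The struct only copies the three fields the hunk reads; the `natsAuthConfig` type, the `validate` method and the error text are assumptions for illustration, not the project's API.

```go
package main

import (
	"errors"
	"fmt"
)

// natsAuthConfig is a hypothetical stand-in holding only the fields the
// hunk above inspects.
type natsAuthConfig struct {
	RequireAuth     bool
	ServiceUserJWT  string
	ServiceUserSeed string
}

var errMissingCreds = errors.New("NATS auth required but service JWT or seed is missing")

// validate rejects a configuration that demands auth without providing
// both the JWT and the seed. Configurations that do not require auth pass.
func (c natsAuthConfig) validate() error {
	if c.RequireAuth && (c.ServiceUserJWT == "" || c.ServiceUserSeed == "") {
		return errMissingCreds
	}
	return nil
}

func main() {
	fmt.Println(natsAuthConfig{RequireAuth: true, ServiceUserJWT: "jwt"}.validate())
	fmt.Println(natsAuthConfig{RequireAuth: true, ServiceUserJWT: "jwt", ServiceUserSeed: "seed"}.validate())
	fmt.Println(natsAuthConfig{}.validate())
}
```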
@@ -240,6 +247,84 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
cfg.Distributed.BackendUpgradeTimeoutOrDefault(),
)
// Prefix-cache-aware routing. Enabled by default; an operator can opt out
// with --distributed-prefix-cache=false, which leaves prefixProvider and
// pressure nil so the SmartRouter and reconciler behave exactly as the
// round-robin floor (true no-op). When enabled we build the local index,
// wrap it in a NATS-backed Sync (publishes our observations, applies peers'
// via the subscriptions below), install the extraction hook used by
// core/backend/llm.go, and run a background eviction ticker on the app ctx.
var prefixProvider prefixcache.Provider
var pressure *prefixcache.Pressure
var prefixCfg prefixcache.Config
if !cfg.Distributed.PrefixCacheDisabled {
prefixCfg = prefixcache.DefaultConfig()
if cfg.Distributed.PrefixCacheTTL > 0 {
prefixCfg.TTL = cfg.Distributed.PrefixCacheTTL
}
if err := prefixCfg.Validate(); err != nil {
return nil, fmt.Errorf("invalid prefix-cache configuration: %w", err)
}
idx := prefixcache.NewIndex(prefixCfg)
prefixSync := prefixcache.NewSync(idx, natsClient)
pressure = prefixcache.NewPressure(prefixCfg.PressureWindow)
prefixProvider = prefixSync
// Invalidate the prefix-cache index whenever a replica row is removed.
// SetReplicaRemovedHook fires from the single chokepoint all removal paths
// funnel through (RemoveNodeModel / RemoveAllNodeModelReplicas), so this
// one hook covers every path: reconciler scale-down, probe reaper,
// health-monitor reap, RemoteUnloaderAdapter, and the router. Registering
// it only inside this enabled block keeps the disabled path a true no-op
// (the registry stays hook-less).
registry.SetReplicaRemovedHook(func(model, node string, replica int) {
if replica < 0 {
prefixSync.InvalidateNode(model, node)
} else {
prefixSync.Invalidate(model, prefixcache.ReplicaKey{NodeID: node, Replica: replica})
}
})
distributedhdr.PrefixChainHook = func(model, prompt string) []uint64 {
return prefixcache.ExtractChain(model, prompt, prefixCfg)
}
// Apply peers' observations/invalidations to the same Sync. ApplyObserve
// and ApplyInvalidate update only the local index and do not re-publish,
// so there is no broadcast loop.
if _, err := messaging.SubscribeJSON(natsClient, messaging.SubjectPrefixCacheObserve, func(ev messaging.PrefixCacheObserveEvent) {
prefixSync.ApplyObserve(ev, time.Now())
}); err != nil {
return nil, fmt.Errorf("subscribing to %s: %w", messaging.SubjectPrefixCacheObserve, err)
}
if _, err := messaging.SubscribeJSON(natsClient, messaging.SubjectPrefixCacheInvalidate, func(ev messaging.PrefixCacheInvalidateEvent) {
prefixSync.ApplyInvalidate(ev)
}); err != nil {
return nil, fmt.Errorf("subscribing to %s: %w", messaging.SubjectPrefixCacheInvalidate, err)
}
// Background eviction: sweep idle entries on the app context. Stopped
// when the app context is cancelled (mirrors the reconciler loop which
// also runs on options.Context). TTL/2 keeps stale entries from
// outliving their idle window by more than half a TTL.
evictInterval := prefixCfg.TTL / 2
go func() {
ticker := time.NewTicker(evictInterval)
defer ticker.Stop()
for {
select {
case <-cfg.Context.Done():
return
case <-ticker.C:
prefixSync.Evict(time.Now())
}
}
}()
xlog.Info("Prefix-cache-aware routing enabled", "ttl", prefixCfg.TTL, "evictInterval", evictInterval)
} else {
xlog.Info("Prefix-cache-aware routing disabled: using round-robin routing")
}
// All dependencies ready — build SmartRouter with all options at once
var conflictResolver nodes.ConcurrencyConflictResolver
if configLoader != nil {
@@ -252,6 +337,9 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
AuthToken: routerAuthToken,
DB: authDB,
ConflictResolver: conflictResolver,
PrefixProvider: prefixProvider,
PrefixConfig: prefixCfg,
Pressure: pressure,
})
// Create ReplicaReconciler for auto-scaling model replicas. Adapter +
@@ -268,6 +356,8 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
Interval: 30 * time.Second,
ScaleDownDelay: 5 * time.Minute,
ProbeStaleAfter: 2 * time.Minute,
Pressure: pressure,
PressureThreshold: prefixCfg.PressureScaleThreshold,
})
// Create ModelRouterAdapter to wire into ModelLoader

View File

@@ -449,13 +449,15 @@ func New(opts ...config.AppOption) (*Application, error) {
 	application.ModelLoader().SetBackendLoggingEnabled(options.EnableBackendLogging)

-	// turn off any process that was started by GRPC if the context is canceled
+	// Safety-net cleanup if the application context is cancelled without
+	// the caller invoking Shutdown directly. This is fire-and-forget — it
+	// races binary exit and is unreliable in tests; the deterministic path
+	// is application.Shutdown(), which Shutdown's sync.Once dedupes with
+	// this goroutine.
 	go func() {
 		<-options.Context.Done()
 		xlog.Debug("Context canceled, shutting down")
-		application.distributed.Shutdown()
-		err := application.ModelLoader().StopAllGRPC()
-		if err != nil {
+		if err := application.Shutdown(); err != nil {
 			xlog.Error("error while stopping all grpc backends", "error", err)
 		}
 	}()
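
The change routes both the explicit `Shutdown()` call and the context-cancel goroutine through one `sync.Once`, so whichever arrives first does the cleanup and the other becomes a no-op. The following Go sketch shows that pattern in isolation. The `app` type and `stopBackends` method are hypothetical stand-ins; `stopBackends` plays the role of `StopAllGRPC` and always fails so the error path is visible.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// app is a hypothetical stand-in for the Application type in the diff.
type app struct {
	shutdownOnce sync.Once
	stopCalls    int
}

// stopBackends stands in for the real cleanup and always reports an error
// so the example can show which caller receives it.
func (a *app) stopBackends() error {
	a.stopCalls++
	return errors.New("backend refused to stop")
}

// Shutdown runs cleanup at most once. Calls after the first skip the
// closure, so their local err stays nil.
func (a *app) Shutdown() error {
	var err error
	a.shutdownOnce.Do(func() {
		err = a.stopBackends()
	})
	return err
}

func main() {
	a := &app{}
	fmt.Println(a.Shutdown()) // first call runs cleanup and reports its error
	fmt.Println(a.Shutdown()) // second call is a no-op and returns nil
	fmt.Println(a.stopCalls)  // cleanup ran exactly once
}
```

Two properties follow from `sync.Once`: concurrent callers block until the first call's cleanup has finished, and only the first caller sees the cleanup error. With the code in this diff, if the safety-net goroutine wins the race, a later explicit `Shutdown()` returns nil even if stopping the backends failed.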

View File

@@ -123,14 +123,14 @@ var _ = Describe("X-LocalAI-Node ctx propagation contract", func() {
 	})

 	It("ModelTTS forwards the request context to the SmartRouter", func() {
-		_, _, err := backend.ModelTTS(reqCtx, "hello", "", "", loader, appCfg, modelCfg)
+		_, _, err := backend.ModelTTS(reqCtx, "hello", "", "", "", nil, loader, appCfg, modelCfg)
 		Expect(err).To(HaveOccurred())
 		Expect(err.Error()).To(ContainSubstring("router short-circuit (test)"))
 		stampViaRouterCtx()
 	})

 	It("ModelTTSStream forwards the request context to the SmartRouter", func() {
-		err := backend.ModelTTSStream(reqCtx, "hello", "", "", loader, appCfg, modelCfg, func([]byte) error { return nil })
+		err := backend.ModelTTSStream(reqCtx, "hello", "", "", "", nil, loader, appCfg, modelCfg, func([]byte) error { return nil })
 		Expect(err).To(HaveOccurred())
 		Expect(err.Error()).To(ContainSubstring("router short-circuit (test)"))
 		stampViaRouterCtx()

View File

@@ -19,6 +19,7 @@ import (
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/pkg/distributedhdr"
"github.com/mudler/LocalAI/pkg/grpc/proto"
model "github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/utils"
@@ -94,6 +95,22 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
}
}
// Make the rendered prompt's prefix chain available to the distributed router
// for prefix-cache-aware node selection. No-op in single-process mode. The
// model id MUST match the id ModelOptions feeds to model.WithModelID, so both
// use the shared config.ModelConfig.ModelID() helper (Name with a fallback to
// Model) or the chain salt and the tracking key would diverge.
//
// s is empty for UseTokenizerTemplate models (the backend tokenizes the
// structured messages itself), so fall back to a prefix-stable serialization
// of the messages - otherwise prefix routing would silently degrade to
// round-robin for the bulk of modern chat models.
chainSource := s
if chainSource == "" {
chainSource = messagesPrefixSource(messages)
}
ctx = distributedhdr.MaybeWithPrefixChain(ctx, c.ModelID(), chainSource)
opts := ModelOptions(*c, o, model.WithContext(ctx))
inferenceModel, err := loader.Load(opts...)
if err != nil {
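
The comment in this hunk requires that the prefix chain and the model loader agree on one model id, which is why both now go through `ModelID()`. Based on the description "Name with a fallback to Model", and on the inline logic that `ModelOptions` drops in the next file, the helper presumably looks like the sketch below. The `modelConfig` type here is a hypothetical two-field stand-in for `config.ModelConfig`, not the real struct.

```go
package main

import "fmt"

// modelConfig is a hypothetical stand-in with only the two fields that
// matter for the id.
type modelConfig struct {
	Name  string
	Model string
}

// ModelID returns Name, falling back to Model when Name is empty.
func (c modelConfig) ModelID() string {
	if c.Name != "" {
		return c.Name
	}
	return c.Model
}

func main() {
	fmt.Println(modelConfig{Name: "chat", Model: "chat.gguf"}.ModelID())
	fmt.Println(modelConfig{Model: "chat.gguf"}.ModelID())
}
```

Keeping the fallback in one method removes the risk the comment describes: two call sites that each re-implement it could drift, and the chain salt would then no longer match the tracking key.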

View File

@@ -34,16 +34,11 @@ func recordModelLoadFailure(appConfig *config.ApplicationConfig, modelName, back
 }

 func ModelOptions(c config.ModelConfig, so *config.ApplicationConfig, opts ...model.Option) []model.Option {
-	name := c.Name
-	if name == "" {
-		name = c.Model
-	}
 	defOpts := []model.Option{
 		model.WithBackendString(c.Backend),
 		model.WithModel(c.Model),
 		model.WithContext(so.Context),
-		model.WithModelID(name),
+		model.WithModelID(c.ModelID()),
 	}

 	threads := 1
@@ -244,13 +239,13 @@ func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
 	if c.Backend == "cloud-proxy" {
 		opts.Proxy = &pb.ProxyOptions{
-			UpstreamUrl: c.Proxy.UpstreamURL,
-			Mode: c.Proxy.Mode,
-			Provider: c.Proxy.Provider,
-			ApiKeyEnv: c.Proxy.APIKeyEnv,
-			ApiKeyFile: c.Proxy.APIKeyFile,
-			UpstreamModel: c.Proxy.UpstreamModel,
-			RequestTimeoutSeconds: int32(c.Proxy.RequestTimeoutSeconds),
+			UpstreamUrl: c.Proxy.UpstreamURL,
+			Mode: c.Proxy.Mode,
+			Provider: c.Proxy.Provider,
+			ApiKeyEnv: c.Proxy.APIKeyEnv,
+			ApiKeyFile: c.Proxy.APIKeyFile,
+			UpstreamModel: c.Proxy.UpstreamModel,
+			RequestTimeoutSeconds: int32(c.Proxy.RequestTimeoutSeconds),
 		}
 	}
@@ -328,6 +323,12 @@ func gRPCPredictOpts(c config.ModelConfig, modelPath string) *pb.PredictOptions
metadata["enable_thinking"] = "true"
}
}
// Forward the effective reasoning effort so the backend can pass it to the
// jinja chat template (chat_template_kwargs.reasoning_effort) — the lever
// models like gpt-oss / LFM2.5 actually read, distinct from enable_thinking.
if c.ReasoningEffort != "" {
metadata["reasoning_effort"] = c.ReasoningEffort
}
pbOpts.Metadata = metadata
// Logprobs and TopLogprobs are set by the caller if provided

View File

@@ -4,6 +4,7 @@ import (
"encoding/json"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/reasoning"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
@@ -42,3 +43,57 @@ var _ = Describe("grpcModelOpts EngineArgs", func() {
Expect(opts.EngineArgs).To(BeEmpty())
})
})
// Guards the DisableReasoning -> enable_thinking metadata conversion that the
// per-request reasoning_effort feature (issue #10072) relies on: the request
// merge sets ReasoningConfig.DisableReasoning, and gRPCPredictOpts is where it
// becomes the gRPC PredictOptions.Metadata the backend reads.
var _ = Describe("gRPCPredictOpts enable_thinking metadata", func() {
// withReasoning builds a fully-defaulted config (gRPCPredictOpts dereferences
// many pointer fields) and overrides only the reasoning toggle.
withReasoning := func(disable *bool) config.ModelConfig {
cfg := config.ModelConfig{}
cfg.SetDefaults()
cfg.ReasoningConfig = reasoning.Config{DisableReasoning: disable}
return cfg
}
disabled := true
enabled := false
It("emits enable_thinking=false when reasoning is disabled", func() {
opts := gRPCPredictOpts(withReasoning(&disabled), "/tmp/models")
Expect(opts.Metadata).To(HaveKeyWithValue("enable_thinking", "false"))
})
It("emits enable_thinking=true when reasoning is enabled", func() {
opts := gRPCPredictOpts(withReasoning(&enabled), "/tmp/models")
Expect(opts.Metadata).To(HaveKeyWithValue("enable_thinking", "true"))
})
It("omits enable_thinking when reasoning is unset", func() {
opts := gRPCPredictOpts(withReasoning(nil), "/tmp/models")
Expect(opts.Metadata).ToNot(HaveKey("enable_thinking"))
})
})
// Guards forwarding the effective reasoning_effort into PredictOptions.Metadata,
// where the backend passes it to the jinja chat template (chat_template_kwargs)
// so models like gpt-oss / LFM2.5 honor it.
var _ = Describe("gRPCPredictOpts reasoning_effort metadata", func() {
withEffort := func(effort string) config.ModelConfig {
cfg := config.ModelConfig{}
cfg.SetDefaults()
cfg.ReasoningEffort = effort
return cfg
}
It("forwards reasoning_effort when set", func() {
opts := gRPCPredictOpts(withEffort("none"), "/tmp/models")
Expect(opts.Metadata).To(HaveKeyWithValue("reasoning_effort", "none"))
})
It("omits reasoning_effort when empty", func() {
opts := gRPCPredictOpts(withEffort(""), "/tmp/models")
Expect(opts.Metadata).ToNot(HaveKey("reasoning_effort"))
})
})
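
Taken together, the two spec groups above pin a small mapping from the model config to string metadata: a tri-state reasoning toggle becomes `enable_thinking`, and a non-empty effort is forwarded as `reasoning_effort`. The sketch below restates only the behaviour those specs assert. `predictMetadata` is a hypothetical function, and the real `gRPCPredictOpts` sets many other fields that are not shown in this diff.

```go
package main

import "fmt"

// predictMetadata is a hypothetical reduction of the metadata logic the
// specs exercise. A nil disableReasoning means "unset", so no
// enable_thinking key is emitted at all.
func predictMetadata(disableReasoning *bool, reasoningEffort string) map[string]string {
	metadata := map[string]string{}
	if disableReasoning != nil {
		if *disableReasoning {
			metadata["enable_thinking"] = "false"
		} else {
			metadata["enable_thinking"] = "true"
		}
	}
	// Forward the effort only when one was provided.
	if reasoningEffort != "" {
		metadata["reasoning_effort"] = reasoningEffort
	}
	return metadata
}

func main() {
	disable := true
	fmt.Println(predictMetadata(&disable, "none"))
	fmt.Println(predictMetadata(nil, ""))
}
```

The pointer carries three states (disabled, enabled, unset), which a plain `bool` could not express. That is why the specs include a case where the key is omitted entirely.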

View File

@@ -0,0 +1,36 @@
package backend
import (
"strings"
"github.com/mudler/LocalAI/core/schema"
)
// messagesPrefixSource builds a deterministic, prefix-stable serialization of a
// chat conversation for prefix-cache-aware routing. It is the fallback used when
// the frontend did not render a prompt string: models with
// config.TemplateConfig.UseTokenizerTemplate tokenize the structured messages
// backend-side, so the frontend's rendered prompt is empty and a chain built
// from it would always be empty - silently degrading prefix routing to
// round-robin for the bulk of modern chat models.
//
// Messages are emitted head-first in turn order (role line + content line per
// message), so two conversations sharing a leading system prompt and early turns
// share a leading byte prefix. That is exactly what ExtractChain hashes into a
// shared chain prefix, landing both requests on the same cache-warm replica.
func messagesPrefixSource(messages schema.Messages) string {
var b strings.Builder
for _, m := range messages {
b.WriteString(m.Role)
b.WriteByte('\n')
content := m.StringContent
if content == "" {
if s, ok := m.Content.(string); ok {
content = s
}
}
b.WriteString(content)
b.WriteByte('\n')
}
return b.String()
}
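
To see why head-first serialization helps routing, it is enough to measure how long a byte prefix two conversations share. The sketch below is an illustration under assumptions: `message`, `prefixSource` and `commonPrefixLen` are hypothetical, the serializer mirrors only the role-line plus content-line layout described above, and the real chain is built by `prefixcache.ExtractChain`, whose hashing is not shown in this diff.

```go
package main

import (
	"fmt"
	"strings"
)

// message is a hypothetical reduced chat message with string content only.
type message struct{ Role, Content string }

// prefixSource writes one role line and one content line per message, in
// turn order, matching the layout described in the comment above.
func prefixSource(msgs []message) string {
	var b strings.Builder
	for _, m := range msgs {
		b.WriteString(m.Role)
		b.WriteByte('\n')
		b.WriteString(m.Content)
		b.WriteByte('\n')
	}
	return b.String()
}

// commonPrefixLen returns the number of leading bytes a and b share.
func commonPrefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	a := prefixSource([]message{{"system", "Shared system prompt."}, {"user", "Question A"}})
	b := prefixSource([]message{{"system", "Shared system prompt."}, {"user", "Question B"}})
	n := commonPrefixLen(a, b)
	fmt.Printf("shared %d of %d bytes: %q\n", n, len(a), a[:n])
}
```

The two conversations diverge only at the final character of the user turn, so everything before it, including the whole system prompt, falls in the shared prefix.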

View File

@@ -0,0 +1,53 @@
package backend
import (
"strings"
"github.com/mudler/LocalAI/core/schema"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("messagesPrefixSource", func() {
mk := func(role, content string) schema.Message {
return schema.Message{Role: role, StringContent: content}
}
It("serializes messages head-first in turn order", func() {
got := messagesPrefixSource(schema.Messages{
mk("system", "You are helpful."),
mk("user", "Hi"),
})
Expect(got).To(Equal("system\nYou are helpful.\nuser\nHi\n"))
})
It("is deterministic across calls for the same conversation", func() {
conv := schema.Messages{mk("system", "S"), mk("user", "U")}
Expect(messagesPrefixSource(conv)).To(Equal(messagesPrefixSource(conv)))
})
It("shares a leading byte prefix when the system prompt is shared", func() {
shared := "system\nShared system prompt.\nuser\n"
a := messagesPrefixSource(schema.Messages{mk("system", "Shared system prompt."), mk("user", "Question A")})
b := messagesPrefixSource(schema.Messages{mk("system", "Shared system prompt."), mk("user", "Question B")})
Expect(strings.HasPrefix(a, shared)).To(BeTrue())
Expect(strings.HasPrefix(b, shared)).To(BeTrue())
})
It("does NOT share a prefix when the system prompt differs", func() {
a := messagesPrefixSource(schema.Messages{mk("system", "Prompt A"), mk("user", "Q")})
b := messagesPrefixSource(schema.Messages{mk("system", "Prompt B"), mk("user", "Q")})
Expect(strings.HasPrefix(a, "system\nPrompt A")).To(BeTrue())
Expect(strings.HasPrefix(b, "system\nPrompt B")).To(BeTrue())
})
It("returns empty for no messages", func() {
Expect(messagesPrefixSource(nil)).To(Equal(""))
})
It("falls back to Content when StringContent is empty", func() {
got := messagesPrefixSource(schema.Messages{{Role: "user", Content: "plain"}})
Expect(got).To(Equal("user\nplain\n"))
})
})

View File

@@ -20,11 +20,32 @@ import (
"github.com/mudler/LocalAI/pkg/utils"
)
+// newTTSRequest assembles the gRPC TTSRequest from the per-request inputs. The
+// optional instructions string is only attached when non-empty so backends can
+// distinguish "no per-request instruction" (fall back to YAML) from an explicit
+// empty one. params is forwarded as-is (nil when unset).
+func newTTSRequest(text, modelPath, voice, dst, language, instructions string, params map[string]string) *proto.TTSRequest {
+	req := &proto.TTSRequest{
+		Text: text,
+		Model: modelPath,
+		Voice: voice,
+		Dst: dst,
+		Language: &language,
+		Params: params,
+	}
+	if instructions != "" {
+		req.Instructions = &instructions
+	}
+	return req
+}
+
 func ModelTTS(
 	ctx context.Context,
 	text,
 	voice,
-	language string,
+	language,
+	instructions string,
+	params map[string]string,
 	loader *model.ModelLoader,
 	appConfig *config.ApplicationConfig,
 	modelConfig config.ModelConfig,
@@ -74,13 +95,9 @@ func ModelTTS(
 		startTime = time.Now()
 	}

-	res, err := ttsModel.TTS(ctx, &proto.TTSRequest{
-		Text: text,
-		Model: modelPath,
-		Voice: voice,
-		Dst: filePath,
-		Language: &language,
-	})
+	ttsRequest := newTTSRequest(text, modelPath, voice, filePath, language, instructions, params)
+
+	res, err := ttsModel.TTS(ctx, ttsRequest)

 	if appConfig.EnableTracing {
 		errStr := ""
@@ -128,7 +145,9 @@ func ModelTTSStream(
 	ctx context.Context,
 	text,
 	voice,
-	language string,
+	language,
+	instructions string,
+	params map[string]string,
 	loader *model.ModelLoader,
 	appConfig *config.ApplicationConfig,
 	modelConfig config.ModelConfig,
@@ -177,12 +196,10 @@ func ModelTTSStream(
 	var totalPCMBytes int
 	snippetCapped := false

-	err = ttsModel.TTSStream(ctx, &proto.TTSRequest{
-		Text: text,
-		Model: modelPath,
-		Voice: voice,
-		Language: &language,
-	}, func(reply *proto.Reply) {
+	// Streaming TTS writes to the HTTP response, not a file, so dst is empty.
+	ttsRequest := newTTSRequest(text, modelPath, voice, "", language, instructions, params)
+	err = ttsModel.TTSStream(ctx, ttsRequest, func(reply *proto.Reply) {
 		// First message contains sample rate info
 		if !headerSent && len(reply.Message) > 0 {
 			var info map[string]any

42
core/backend/tts_test.go Normal file
View File

@@ -0,0 +1,42 @@
package backend
// Specs for the TTSRequest assembly that carries the per-request
// instructions/params from the OpenAI `instructions` field (and the LocalAI
// `params` extension) through to the gRPC boundary. Before this plumbing the
// instruction value was dropped before reaching the backend; these specs pin
// that it now survives, and that the empty case stays backward compatible.
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("newTTSRequest", func() {
It("attaches the instructions when a per-request value is set", func() {
req := newTTSRequest("hi", "/m", "alloy", "/out.wav", "en", "cheerful narrator", nil)
Expect(req.Instructions).ToNot(BeNil())
Expect(req.GetInstructions()).To(Equal("cheerful narrator"))
Expect(req.GetText()).To(Equal("hi"))
Expect(req.GetVoice()).To(Equal("alloy"))
Expect(req.GetDst()).To(Equal("/out.wav"))
Expect(req.GetLanguage()).To(Equal("en"))
})
It("leaves instructions unset when empty so backends fall back to YAML", func() {
req := newTTSRequest("hi", "/m", "", "/out.wav", "", "", nil)
Expect(req.Instructions).To(BeNil())
Expect(req.GetInstructions()).To(Equal(""))
})
It("forwards per-request params through to the backend", func() {
params := map[string]string{"exaggeration": "0.7", "cfg_weight": "0.3"}
req := newTTSRequest("hi", "/m", "", "/out.wav", "", "", params)
Expect(req.GetParams()).To(HaveKeyWithValue("exaggeration", "0.7"))
Expect(req.GetParams()).To(HaveKeyWithValue("cfg_weight", "0.3"))
})
It("leaves params nil when none are supplied", func() {
req := newTTSRequest("hi", "/m", "", "/out.wav", "", "", nil)
Expect(req.GetParams()).To(BeNil())
})
})
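
Reviewer sketch: the specs above pin behavior but the diff does not show the body of `newTTSRequest`. A minimal version that would satisfy them might look like the following. The `TTSRequest` struct here is a hypothetical stand-in for the generated `proto.TTSRequest`, and always taking `&language` is an assumption carried over from the replaced struct literal; the real helper may differ.

```go
package main

// TTSRequest is a hypothetical stand-in for the generated proto.TTSRequest;
// only the fields exercised by the specs are modelled here.
type TTSRequest struct {
	Text         string
	Model        string
	Voice        string
	Dst          string
	Language     *string
	Instructions *string // nil means "not set", so the backend falls back to YAML
	Params       map[string]string
}

// newTTSRequest (assumed shape) sets Instructions only when a per-request
// value is present and passes params through unchanged, so a nil map stays nil.
func newTTSRequest(text, model, voice, dst, language, instructions string, params map[string]string) *TTSRequest {
	req := &TTSRequest{
		Text:     text,
		Model:    model,
		Voice:    voice,
		Dst:      dst,
		Language: &language,
		Params:   params,
	}
	if instructions != "" {
		req.Instructions = &instructions
	}
	return req
}
```

Using a pointer for the optional field is what lets the backend tell "explicitly empty" apart from "not provided", which is the backward-compatibility property the second spec checks.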


@@ -52,10 +52,28 @@ type AgentWorkerCMD struct {
Subject string `env:"LOCALAI_AGENT_SUBJECT" default:"agent.execute" help:"NATS subject for agent execution" group:"distributed"`
Queue string `env:"LOCALAI_AGENT_QUEUE" default:"agent-workers" help:"NATS queue group name" group:"distributed"`
NatsJWT string `env:"LOCALAI_NATS_JWT" help:"NATS user JWT override (defaults to nats_jwt from registration)" group:"distributed"`
NatsUserSeed string `env:"LOCALAI_NATS_USER_SEED" help:"NATS user seed override (defaults to nats_user_seed from registration)" group:"distributed"`
NatsServiceJWT string `env:"LOCALAI_NATS_SERVICE_JWT" help:"Fallback NATS service JWT when registration does not mint agent JWT" group:"distributed"`
NatsServiceSeed string `env:"LOCALAI_NATS_SERVICE_SEED" help:"Fallback NATS service seed paired with LOCALAI_NATS_SERVICE_JWT" group:"distributed"`
NatsRequireAuth bool `env:"LOCALAI_NATS_REQUIRE_AUTH" default:"false" help:"Require NATS JWT+seed to connect" group:"distributed"`
// DistributedRequireAuth is the umbrella switch; for the agent worker (which
// has no file-transfer server) it implies NATS auth is required.
DistributedRequireAuth bool `env:"LOCALAI_DISTRIBUTED_REQUIRE_AUTH" default:"false" help:"Umbrella switch implying --nats-require-auth (agent workers have no file-transfer server)" group:"distributed"`
NatsTLSCA string `env:"LOCALAI_NATS_TLS_CA" type:"existingfile" help:"PEM file for NATS server CA (private PKI)" group:"distributed"`
NatsTLSCert string `env:"LOCALAI_NATS_TLS_CERT" type:"existingfile" help:"Client certificate for NATS mTLS" group:"distributed"`
NatsTLSKey string `env:"LOCALAI_NATS_TLS_KEY" type:"existingfile" help:"Client private key for NATS mTLS" group:"distributed"`
// Timeouts
MCPCIJobTimeout string `env:"LOCALAI_MCP_CI_JOB_TIMEOUT" default:"10m" help:"Timeout for MCP CI job execution" group:"distributed"`
}
// natsAuthRequired reports whether NATS JWT credentials must be present — the
// granular flag or the umbrella (LOCALAI_DISTRIBUTED_REQUIRE_AUTH).
func (cmd *AgentWorkerCMD) natsAuthRequired() bool {
return cmd.NatsRequireAuth || cmd.DistributedRequireAuth
}
func (cmd *AgentWorkerCMD) Run(ctx *cliContext.Context) error {
xlog.Info("Starting agent worker", "nats", sanitize.URL(cmd.NatsURL), "register_to", cmd.RegisterTo)
@@ -81,15 +99,30 @@ func (cmd *AgentWorkerCMD) Run(ctx *cliContext.Context) error {
registrationBody["token"] = cmd.RegistrationToken
}
nodeID, apiToken, err := regClient.RegisterWithRetry(context.Background(), registrationBody, 10)
// Context cancelled on shutdown — used by registration waits, heartbeat, and
// other background goroutines.
shutdownCtx, shutdownCancel := context.WithCancel(context.Background())
defer shutdownCancel()
// Acquire credentials via (re)registration. When the bus requires auth and no
// static fallback is configured, wait through admin approval until the
// frontend mints credentials rather than starting unauthenticated.
credMgr := workerregistry.NewNATSCredentialManager(
func(ctx context.Context) (*workerregistry.RegisterResponse, error) {
return regClient.RegisterFull(ctx, registrationBody)
},
cmd.natsAuthRequired() && cmd.NatsJWT == "" && cmd.NatsServiceJWT == "",
)
res, err := credMgr.Acquire(shutdownCtx)
if err != nil {
return fmt.Errorf("registration failed: %w", err)
}
nodeID := res.ID
xlog.Info("Registered with frontend", "nodeID", nodeID, "frontend", cmd.RegisterTo)
// Use provisioned API token if none was set
if cmd.APIToken == "" {
cmd.APIToken = apiToken
cmd.APIToken = res.APIToken
}
// Start heartbeat
@@ -98,14 +131,40 @@ func (cmd *AgentWorkerCMD) Run(ctx *cliContext.Context) error {
xlog.Warn("invalid heartbeat interval, using default 10s", "input", cmd.HeartbeatInterval, "error", err)
}
heartbeatInterval = cmp.Or(heartbeatInterval, 10*time.Second)
// Context cancelled on shutdown — used by heartbeat and other background goroutines
shutdownCtx, shutdownCancel := context.WithCancel(context.Background())
defer shutdownCancel()
go regClient.HeartbeatLoop(shutdownCtx, nodeID, heartbeatInterval, func() map[string]any { return map[string]any{} })
// Connect to NATS
natsClient, err := messaging.New(cmd.NatsURL)
// Resolve NATS credentials with precedence: explicit env override, then
// frontend-minted (auto-refreshed before expiry), then service fallback.
// Each static source must supply JWT and seed together.
natsTLS := messaging.TLSFiles{CA: cmd.NatsTLSCA, Cert: cmd.NatsTLSCert, Key: cmd.NatsTLSKey}
var natsOpts []messaging.Option
switch {
case cmd.NatsJWT != "" || cmd.NatsUserSeed != "":
if (cmd.NatsJWT == "") != (cmd.NatsUserSeed == "") {
return fmt.Errorf("LOCALAI_NATS_JWT and LOCALAI_NATS_USER_SEED must be set together")
}
natsOpts = append(natsOpts, messaging.WithUserJWT(cmd.NatsJWT, cmd.NatsUserSeed))
case credMgr.HasCredentials():
natsOpts = append(natsOpts, messaging.WithUserJWTProvider(credMgr.Provider()))
go func() {
if err := credMgr.RefreshLoop(shutdownCtx); err != nil {
xlog.Error("NATS credential refresh permanently failed; shutting down agent worker", "error", err)
shutdownCancel()
}
}()
case cmd.NatsServiceJWT != "" || cmd.NatsServiceSeed != "":
if (cmd.NatsServiceJWT == "") != (cmd.NatsServiceSeed == "") {
return fmt.Errorf("LOCALAI_NATS_SERVICE_JWT and LOCALAI_NATS_SERVICE_SEED must be set together")
}
natsOpts = append(natsOpts, messaging.WithUserJWT(cmd.NatsServiceJWT, cmd.NatsServiceSeed))
case cmd.natsAuthRequired():
return fmt.Errorf("NATS JWT+seed required: enable frontend minting or set LOCALAI_NATS_* env vars")
}
if natsTLS.Enabled() {
natsOpts = append(natsOpts, messaging.WithTLS(natsTLS))
}
natsClient, err := messaging.New(cmd.NatsURL, natsOpts...)
if err != nil {
return fmt.Errorf("connecting to NATS: %w", err)
}
@@ -183,17 +242,25 @@ func (cmd *AgentWorkerCMD) Run(ctx *cliContext.Context) error {
xlog.Info("Agent worker ready, waiting for jobs", "subject", cmd.Subject, "queue", cmd.Queue)
// Wait for shutdown
// Wait for an OS signal or an internal fatal condition (e.g. NATS
// credentials became unrenewable), so the worker restarts and re-acquires
// rather than lingering unable to serve.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
<-sigCh
var runErr error
select {
case <-sigCh:
case <-shutdownCtx.Done():
runErr = fmt.Errorf("agent worker shutting down: NATS credentials unavailable")
xlog.Error("Internal shutdown requested", "error", runErr)
}
xlog.Info("Shutting down agent worker")
shutdownCancel() // stop heartbeat loop immediately
dispatcher.Stop()
mcpTools.CloseAllMCPSessions()
regClient.GracefulDeregister(nodeID)
return nil
return runErr
}
// handleMCPToolRequest handles a NATS request-reply for MCP tool execution.
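
Reviewer sketch: the credential precedence in the `switch` above (explicit override, then frontend-minted, then service fallback, then fail if auth is required) can be expressed as a pure function, which makes the ordering easy to test in isolation. `natsInputs` and `resolveNATSSource` are hypothetical names for illustration, not part of this change.

```go
package main

import "errors"

// natsInputs is a hypothetical flattening of the flags and state consulted
// by the agent worker when choosing NATS credentials.
type natsInputs struct {
	JWT, Seed               string // explicit env override
	ServiceJWT, ServiceSeed string // static service fallback
	Minted                  bool   // frontend-minted credentials are available
	RequireAuth             bool   // granular flag OR umbrella switch
}

// resolveNATSSource returns which credential source wins, mirroring the
// order of the switch: override, minted, service, then anonymous.
func resolveNATSSource(in natsInputs) (string, error) {
	switch {
	case in.JWT != "" || in.Seed != "":
		// A half-configured pair is an error rather than a silent fallthrough.
		if (in.JWT == "") != (in.Seed == "") {
			return "", errors.New("JWT and seed must be set together")
		}
		return "override", nil
	case in.Minted:
		return "minted", nil
	case in.ServiceJWT != "" || in.ServiceSeed != "":
		if (in.ServiceJWT == "") != (in.ServiceSeed == "") {
			return "", errors.New("service JWT and seed must be set together")
		}
		return "service", nil
	case in.RequireAuth:
		return "", errors.New("NATS JWT+seed required")
	}
	return "anonymous", nil
}
```

Note that a lone override JWT fails fast even when minted credentials exist, because the first case matches before the minted case is considered.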


@@ -145,19 +145,31 @@ type RunCMD struct {
DefaultAPIKeyExpiry string `env:"LOCALAI_DEFAULT_API_KEY_EXPIRY" help:"Default expiry for API keys (e.g. 90d, 1y; empty = no expiry)" group:"auth"`
// Distributed / Horizontal Scaling
Distributed bool `env:"LOCALAI_DISTRIBUTED" default:"false" help:"Enable distributed mode (requires PostgreSQL + NATS)" group:"distributed"`
InstanceID string `env:"LOCALAI_INSTANCE_ID" help:"Unique instance ID for distributed mode (auto-generated UUID if empty)" group:"distributed"`
NatsURL string `env:"LOCALAI_NATS_URL" help:"NATS server URL (e.g., nats://localhost:4222)" group:"distributed"`
StorageURL string `env:"LOCALAI_STORAGE_URL" help:"S3-compatible storage endpoint URL (e.g., http://minio:9000)" group:"distributed"`
StorageBucket string `env:"LOCALAI_STORAGE_BUCKET" default:"localai" help:"S3 bucket name for object storage" group:"distributed"`
StorageRegion string `env:"LOCALAI_STORAGE_REGION" default:"us-east-1" help:"S3 region" group:"distributed"`
StorageAccessKey string `env:"LOCALAI_STORAGE_ACCESS_KEY" help:"S3 access key ID" group:"distributed"`
StorageSecretKey string `env:"LOCALAI_STORAGE_SECRET_KEY" help:"S3 secret access key" group:"distributed"`
RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token that backend nodes must provide to register (empty = no auth required)" group:"distributed"`
AutoApproveNodes bool `env:"LOCALAI_AUTO_APPROVE_NODES" default:"false" help:"Auto-approve new worker nodes (skip admin approval)" group:"distributed"`
BackendInstallTimeout string `env:"LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT" help:"NATS round-trip timeout for backend.install requests sent to worker nodes (default 15m). Increase for slow links pulling multi-GB images." group:"distributed"`
BackendUpgradeTimeout string `env:"LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT" help:"NATS round-trip timeout for backend.upgrade requests (default 15m)." group:"distributed"`
ExposeNodeHeader bool `env:"LOCALAI_EXPOSE_NODE_HEADER" default:"false" help:"Set the X-LocalAI-Node response header on inference responses (OpenAI chat/completions/embeddings, Anthropic /v1/messages, Ollama /api/chat,/api/generate,/api/embed) with the ID of the worker that served the request. Disabled by default: the node ID reveals internal topology and should not be exposed on a public endpoint. Best-effort: under heavy concurrency the header may reflect a recent routing decision rather than this exact request's." group:"distributed"`
Distributed bool `env:"LOCALAI_DISTRIBUTED" default:"false" help:"Enable distributed mode (requires PostgreSQL + NATS)" group:"distributed"`
InstanceID string `env:"LOCALAI_INSTANCE_ID" help:"Unique instance ID for distributed mode (auto-generated UUID if empty)" group:"distributed"`
NatsURL string `env:"LOCALAI_NATS_URL" help:"NATS server URL (e.g., nats://localhost:4222)" group:"distributed"`
StorageURL string `env:"LOCALAI_STORAGE_URL" help:"S3-compatible storage endpoint URL (e.g., http://minio:9000)" group:"distributed"`
StorageBucket string `env:"LOCALAI_STORAGE_BUCKET" default:"localai" help:"S3 bucket name for object storage" group:"distributed"`
StorageRegion string `env:"LOCALAI_STORAGE_REGION" default:"us-east-1" help:"S3 region" group:"distributed"`
StorageAccessKey string `env:"LOCALAI_STORAGE_ACCESS_KEY" help:"S3 access key ID" group:"distributed"`
StorageSecretKey string `env:"LOCALAI_STORAGE_SECRET_KEY" help:"S3 secret access key" group:"distributed"`
RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token that backend nodes must provide to register (empty = no auth required)" group:"distributed"`
RegistrationRequireAuth bool `env:"LOCALAI_REGISTRATION_REQUIRE_AUTH" default:"false" help:"Fail startup when distributed mode is enabled but LOCALAI_REGISTRATION_TOKEN is empty (node endpoints and worker file-transfer server would otherwise be unauthenticated)" group:"distributed"`
DistributedRequireAuth bool `env:"LOCALAI_DISTRIBUTED_REQUIRE_AUTH" default:"false" help:"Umbrella switch: require BOTH NATS JWT credentials and a registration token when distributed mode is enabled (implies --nats-require-auth and --registration-require-auth)" group:"distributed"`
AutoApproveNodes bool `env:"LOCALAI_AUTO_APPROVE_NODES" default:"false" help:"Auto-approve new worker nodes (skip admin approval)" group:"distributed"`
DistributedPrefixCache bool `env:"LOCALAI_DISTRIBUTED_PREFIX_CACHE" default:"true" help:"Enable prefix-cache-aware routing in distributed mode (default true). When false, routing falls back to round-robin." group:"distributed"`
DistributedPrefixCacheTTL string `env:"LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL" help:"Idle-timeout for prefix-cache index entries; also drives the background eviction cadence (every TTL/2). Default 5m." group:"distributed"`
BackendInstallTimeout string `env:"LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT" help:"NATS round-trip timeout for backend.install requests sent to worker nodes (default 15m). Increase for slow links pulling multi-GB images." group:"distributed"`
BackendUpgradeTimeout string `env:"LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT" help:"NATS round-trip timeout for backend.upgrade requests (default 15m)." group:"distributed"`
NatsAccountSeed string `env:"LOCALAI_NATS_ACCOUNT_SEED" help:"NATS account signing seed (SU...) used to mint per-node worker JWTs at registration" group:"distributed"`
NatsServiceJWT string `env:"LOCALAI_NATS_SERVICE_JWT" help:"NATS user JWT for the frontend (and agent workers) to publish control-plane messages" group:"distributed"`
NatsServiceSeed string `env:"LOCALAI_NATS_SERVICE_SEED" help:"NATS user signing seed (SU...) paired with LOCALAI_NATS_SERVICE_JWT" group:"distributed"`
NatsWorkerJWTTTL string `env:"LOCALAI_NATS_WORKER_JWT_TTL" help:"Lifetime of minted per-node NATS JWTs (e.g. 24h, default 24h)" group:"distributed"`
NatsRequireAuth bool `env:"LOCALAI_NATS_REQUIRE_AUTH" default:"false" help:"Require NATS JWT credentials (service JWT + account seed) when distributed mode is enabled" group:"distributed"`
NatsTLSCA string `env:"LOCALAI_NATS_TLS_CA" type:"existingfile" help:"PEM file for NATS server CA (private PKI); use with tls:// in --nats-url" group:"distributed"`
NatsTLSCert string `env:"LOCALAI_NATS_TLS_CERT" type:"existingfile" help:"Client certificate for NATS mTLS" group:"distributed"`
NatsTLSKey string `env:"LOCALAI_NATS_TLS_KEY" type:"existingfile" help:"Client private key for NATS mTLS" group:"distributed"`
ExposeNodeHeader bool `env:"LOCALAI_EXPOSE_NODE_HEADER" default:"false" help:"Set the X-LocalAI-Node response header on inference responses (OpenAI chat/completions/embeddings, Anthropic /v1/messages, Ollama /api/chat,/api/generate,/api/embed) with the ID of the worker that served the request. Disabled by default: the node ID reveals internal topology and should not be exposed on a public endpoint. Best-effort: under heavy concurrency the header may reflect a recent routing decision rather than this exact request's." group:"distributed"`
Version bool
@@ -281,9 +293,53 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
if r.RegistrationToken != "" {
opts = append(opts, config.WithRegistrationToken(r.RegistrationToken))
}
if r.RegistrationRequireAuth {
opts = append(opts, config.EnableRegistrationRequireAuth)
}
if r.DistributedRequireAuth {
opts = append(opts, config.EnableDistributedRequireAuth)
}
if r.NatsAccountSeed != "" {
opts = append(opts, config.WithNatsAccountSeed(r.NatsAccountSeed))
}
if r.NatsServiceJWT != "" {
opts = append(opts, config.WithNatsServiceJWT(r.NatsServiceJWT))
}
if r.NatsServiceSeed != "" {
opts = append(opts, config.WithNatsServiceSeed(r.NatsServiceSeed))
}
if r.NatsWorkerJWTTTL != "" {
d, err := time.ParseDuration(r.NatsWorkerJWTTTL)
if err != nil {
return fmt.Errorf("invalid LOCALAI_NATS_WORKER_JWT_TTL %q: %w", r.NatsWorkerJWTTTL, err)
}
opts = append(opts, config.WithNatsWorkerJWTTTL(d))
}
if r.NatsRequireAuth {
opts = append(opts, config.EnableNatsRequireAuth)
}
if r.NatsTLSCA != "" {
opts = append(opts, config.WithNatsTLSCA(r.NatsTLSCA))
}
if r.NatsTLSCert != "" {
opts = append(opts, config.WithNatsTLSCert(r.NatsTLSCert))
}
if r.NatsTLSKey != "" {
opts = append(opts, config.WithNatsTLSKey(r.NatsTLSKey))
}
if r.AutoApproveNodes {
opts = append(opts, config.EnableAutoApproveNodes)
}
if !r.DistributedPrefixCache {
opts = append(opts, config.DisablePrefixCache)
}
if r.DistributedPrefixCacheTTL != "" {
d, err := time.ParseDuration(r.DistributedPrefixCacheTTL)
if err != nil {
return fmt.Errorf("invalid LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL %q: %w", r.DistributedPrefixCacheTTL, err)
}
opts = append(opts, config.WithPrefixCacheTTL(d))
}
if r.ExposeNodeHeader {
opts = append(opts, config.WithExposeNodeHeader(true))
}
@@ -577,12 +633,8 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
}
signals.RegisterGracefulTerminationHandler(func() {
if err := app.ModelLoader().StopAllGRPC(); err != nil {
xlog.Error("error while stopping all grpc backends", "error", err)
}
// Clean up distributed services (idempotent — safe if already called)
if d := app.Distributed(); d != nil {
d.Shutdown()
if err := app.Shutdown(); err != nil {
xlog.Error("error while shutting down application", "error", err)
}
})
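
Reviewer sketch: `LOCALAI_NATS_WORKER_JWT_TTL` and `LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL` are both parsed with `time.ParseDuration` using the same pattern. A shared helper along these lines could remove the duplication. `parseOptionalDuration` and `defaultWorkerJWTTTL` are hypothetical names; the 24h value comes from the flag's help text, and where the real default is applied is not shown in this diff.

```go
package main

import (
	"fmt"
	"time"
)

// defaultWorkerJWTTTL mirrors the "default 24h" stated in the flag help text.
const defaultWorkerJWTTTL = 24 * time.Hour

// parseOptionalDuration returns def when raw is empty, and otherwise parses
// raw with time.ParseDuration, naming the offending variable in the error.
func parseOptionalDuration(envName, raw string, def time.Duration) (time.Duration, error) {
	if raw == "" {
		return def, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid %s %q: %w", envName, raw, err)
	}
	return d, nil
}
```

One consequence worth knowing: `time.ParseDuration` accepts units up to hours only, so a day-style value such as `1d` is rejected for these two settings and has to be written as `24h`.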


@@ -62,7 +62,7 @@ func (t *TTSCMD) Run(ctx *cliContext.Context) error {
options.Backend = t.Backend
options.Model = t.Model
filePath, _, err := backend.ModelTTS(context.Background(), text, t.Voice, t.Language, ml, opts, options)
filePath, _, err := backend.ModelTTS(context.Background(), text, t.Voice, t.Language, "", nil, ml, opts, options)
if err != nil {
return err
}


@@ -14,4 +14,5 @@ type Worker struct {
LLamaCPP LLamaCPP `cmd:"" name:"llama-cpp-rpc" help:"Starts a llama.cpp worker in standalone mode"`
MLXDistributed MLXDistributed `cmd:"" name:"mlx-distributed" help:"Starts an MLX distributed worker in standalone mode (requires --hostfile and --rank)"`
VLLMDistributed VLLMDistributed `cmd:"" name:"vllm" help:"Starts a vLLM data-parallel follower process. Multi-node DP for a single model: head runs the existing vllm backend with engine_args.data_parallel_size>1, followers run this command."`
DS4Distributed DS4Distributed `cmd:"" name:"ds4-distributed" help:"Starts a ds4 distributed worker in standalone mode: owns a layer slice and dials the coordinator (pass ds4-worker args after --)"`
}


@@ -0,0 +1,108 @@
package worker
import (
"context"
"encoding/json"
"errors"
"fmt"
"os"
"path/filepath"
"strings"
"syscall"
cliContext "github.com/mudler/LocalAI/core/cli/context"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
"github.com/mudler/xlog"
)
type DS4Distributed struct {
WorkerFlags `embed:""`
ExtraDS4Args string `name:"ds4-args" env:"LOCALAI_EXTRA_DS4_ARGS,EXTRA_DS4_ARGS" help:"Arguments passed to ds4-worker (e.g. '--role worker --model m.gguf --layers 20:output --coordinator HOST PORT')"`
}
const (
ds4WorkerBinaryName = "ds4-worker"
ds4GalleryName = "ds4"
)
// ds4WorkerArgs builds the argv for syscall.Exec when launching ds4-worker
// directly: the binary path followed by the space-split extra args. An empty
// extra string yields a bare invocation.
func ds4WorkerArgs(binary, extra string) []string {
args := []string{binary}
args = append(args, strings.Fields(extra)...)
return args
}
func findDS4Backend(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
backends, err := gallery.ListSystemBackends(systemState)
if err != nil {
xlog.Warn("Failed listing system backends", "error", err)
return "", err
}
backend, ok := backends.Get(ds4GalleryName)
if !ok {
ml := model.NewModelLoader(systemState)
var gals []config.Gallery
if err := json.Unmarshal([]byte(galleries), &gals); err != nil {
xlog.Error("failed loading galleries", "error", err)
return "", err
}
if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, ds4GalleryName, nil, true, requireIntegrity); err != nil {
xlog.Error("ds4 backend not found, failed to install it", "error", err)
return "", err
}
backends, err = gallery.ListSystemBackends(systemState)
if err != nil {
return "", err
}
backend, ok = backends.Get(ds4GalleryName)
if !ok {
return "", errors.New("ds4 backend not found after install")
}
}
backendPath := filepath.Dir(backend.RunFile)
if backendPath == "" {
return "", errors.New("ds4 backend not found, install it first")
}
return filepath.Join(backendPath, ds4WorkerBinaryName), nil
}
func (r *DS4Distributed) Run(ctx *cliContext.Context) error {
if r.ExtraDS4Args == "" && len(os.Args) < 4 {
return fmt.Errorf("usage: local-ai worker ds4-distributed -- --role worker --model <gguf> --layers <START:END|START:output> --coordinator <host> <port>")
}
systemState, err := system.GetSystemState(
system.WithBackendPath(r.BackendsPath),
system.WithBackendSystemPath(r.BackendsSystemPath),
)
if err != nil {
return err
}
worker, err := findDS4Backend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
if err != nil {
return err
}
// ds4 bundles its own dynamic loader (lib/ld.so) for glibc compatibility,
// like backend/cpp/ds4/run.sh does for grpc-server. Launch ds4-worker via
// that loader when present; otherwise exec it directly. (This is a
// deliberate divergence from worker_llamacpp.go, which has no bundled loader.)
backendPath := filepath.Dir(worker)
env := os.Environ()
loader := filepath.Join(backendPath, "lib", "ld.so")
if _, statErr := os.Stat(loader); statErr == nil {
env = append(env, "LD_LIBRARY_PATH="+filepath.Join(backendPath, "lib")+":"+os.Getenv("LD_LIBRARY_PATH"))
args := append([]string{loader}, ds4WorkerArgs(worker, r.ExtraDS4Args)...)
return syscall.Exec(loader, args, env)
}
return syscall.Exec(worker, ds4WorkerArgs(worker, r.ExtraDS4Args), env)
}
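
Reviewer sketch: the two `syscall.Exec` branches above differ only in which path is executed and whether the loader is prepended to argv. The following isolates that decision so it can be checked without touching the filesystem. `ds4WorkerArgs` is copied from the diff; `ds4ExecArgv` and its `loaderPresent` parameter are hypothetical.

```go
package main

import "strings"

// ds4WorkerArgs is copied from the change: binary path followed by the
// whitespace-split extra args.
func ds4WorkerArgs(binary, extra string) []string {
	args := []string{binary}
	args = append(args, strings.Fields(extra)...)
	return args
}

// ds4ExecArgv (hypothetical) returns the path to exec and the argv to pass.
// With a bundled loader the loader is executed and the worker binary becomes
// its first argument; otherwise the worker is executed directly.
func ds4ExecArgv(worker, loader, extra string, loaderPresent bool) (string, []string) {
	argv := ds4WorkerArgs(worker, extra)
	if loaderPresent {
		return loader, append([]string{loader}, argv...)
	}
	return worker, argv
}
```

Because splitting uses `strings.Fields`, shell-style quoting is not honored: an extra string like `--model 'my model.gguf'` yields the three tokens `--model`, `'my`, and `model.gguf'`. Model paths containing spaces would need a different splitting strategy.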


@@ -0,0 +1,28 @@
package worker
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("ds4 worker CLI", func() {
It("uses the ds4 backend gallery name and worker binary name", func() {
Expect(ds4GalleryName).To(Equal("ds4"))
Expect(ds4WorkerBinaryName).To(Equal("ds4-worker"))
})
It("assembles direct exec args as [binary, extra-split...]", func() {
args := ds4WorkerArgs("/b/ds4-worker", "--role worker --model m.gguf --layers 20:output --coordinator 10.0.0.1 1234")
Expect(args).To(Equal([]string{
"/b/ds4-worker",
"--role", "worker",
"--model", "m.gguf",
"--layers", "20:output",
"--coordinator", "10.0.0.1", "1234",
}))
})
It("drops empty extra args to a bare binary invocation", func() {
Expect(ds4WorkerArgs("/b/ds4-worker", "")).To(Equal([]string{"/b/ds4-worker"}))
})
})


@@ -96,7 +96,7 @@ func (r *VLLMDistributed) Run(ctx *cliContext.Context) error {
FrontendURL: r.RegisterTo,
RegistrationToken: r.RegistrationToken,
}
nodeID, _, regErr := regClient.RegisterWithRetry(context.Background(), r.registrationBody(), 10)
nodeID, _, _, _, regErr := regClient.RegisterWithRetry(context.Background(), r.registrationBody(), 10)
if regErr != nil {
return fmt.Errorf("registering with frontend: %w", regErr)
}


@@ -15,6 +15,8 @@ import (
"time"
"github.com/mudler/xlog"
"github.com/mudler/LocalAI/pkg/httpclient"
)
// RegistrationClient talks to the frontend's /api/node/* endpoints.
@@ -37,7 +39,7 @@ func (c *RegistrationClient) httpTimeout() time.Duration {
// httpClient returns the shared HTTP client, initializing it on first use.
func (c *RegistrationClient) httpClient() *http.Client {
c.clientOnce.Do(func() {
c.client = &http.Client{Timeout: c.httpTimeout()}
c.client = httpclient.NewWithTimeout(c.httpTimeout())
})
return c.client
}
@@ -56,65 +58,77 @@ func (c *RegistrationClient) setAuth(req *http.Request) {
// RegisterResponse is the JSON body returned by /api/node/register.
type RegisterResponse struct {
ID string `json:"id"`
APIToken string `json:"api_token,omitempty"`
ID string `json:"id"`
Status string `json:"status,omitempty"` // "pending" until an admin approves the node
APIToken string `json:"api_token,omitempty"`
NatsJWT string `json:"nats_jwt,omitempty"`
NatsUserSeed string `json:"nats_user_seed,omitempty"`
}
// Register sends a single registration request and returns the node ID and
// (optionally) an auto-provisioned API token.
func (c *RegistrationClient) Register(ctx context.Context, body map[string]any) (string, string, error) {
// RegisterFull sends a single registration request and returns the full
// response (node ID, approval status, and optional API token / NATS creds).
// Re-registration is idempotent: the frontend preserves the node row and mints
// a fresh NATS JWT each call, so this doubles as the credential-refresh call.
func (c *RegistrationClient) RegisterFull(ctx context.Context, body map[string]any) (*RegisterResponse, error) {
jsonBody, _ := json.Marshal(body)
url := c.baseURL() + "/api/node/register"
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(jsonBody))
if err != nil {
return "", "", fmt.Errorf("creating request: %w", err)
return nil, fmt.Errorf("creating request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
c.setAuth(req)
resp, err := c.httpClient().Do(req)
if err != nil {
return "", "", fmt.Errorf("posting to %s: %w", url, err)
return nil, fmt.Errorf("posting to %s: %w", url, err)
}
defer resp.Body.Close()
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return "", "", fmt.Errorf("registration failed with status %d", resp.StatusCode)
return nil, fmt.Errorf("registration failed with status %d", resp.StatusCode)
}
var result RegisterResponse
if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
return "", "", fmt.Errorf("decoding response: %w", err)
return nil, fmt.Errorf("decoding response: %w", err)
}
return result.ID, result.APIToken, nil
return &result, nil
}
// Register sends a single registration request and returns the node ID and
// optional credentials (API token for agent workers, NATS JWT when configured).
func (c *RegistrationClient) Register(ctx context.Context, body map[string]any) (nodeID, apiToken, natsJWT, natsSeed string, err error) {
res, err := c.RegisterFull(ctx, body)
if err != nil {
return "", "", "", "", err
}
return res.ID, res.APIToken, res.NatsJWT, res.NatsUserSeed, nil
}
// RegisterWithRetry retries registration with exponential backoff.
func (c *RegistrationClient) RegisterWithRetry(ctx context.Context, body map[string]any, maxRetries int) (string, string, error) {
func (c *RegistrationClient) RegisterWithRetry(ctx context.Context, body map[string]any, maxRetries int) (nodeID, apiToken, natsJWT, natsSeed string, err error) {
backoff := 2 * time.Second
maxBackoff := 30 * time.Second
var nodeID, apiToken string
var err error
for attempt := 1; attempt <= maxRetries; attempt++ {
nodeID, apiToken, err = c.Register(ctx, body)
nodeID, apiToken, natsJWT, natsSeed, err = c.Register(ctx, body)
if err == nil {
return nodeID, apiToken, nil
return nodeID, apiToken, natsJWT, natsSeed, nil
}
if attempt == maxRetries {
return "", "", fmt.Errorf("failed after %d attempts: %w", maxRetries, err)
return "", "", "", "", fmt.Errorf("failed after %d attempts: %w", maxRetries, err)
}
xlog.Warn("Registration failed, retrying", "attempt", attempt, "next_retry", backoff, "error", err)
select {
case <-ctx.Done():
return "", "", ctx.Err()
return "", "", "", "", ctx.Err()
case <-time.After(backoff):
}
backoff = min(backoff*2, maxBackoff)
}
return nodeID, apiToken, err
return nodeID, apiToken, natsJWT, natsSeed, err
}
// Heartbeat sends a single heartbeat POST with the given body.
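
Reviewer sketch: `RegisterWithRetry` starts at a 2s backoff, doubles after each failure, and caps at 30s. The helper below reproduces just that schedule so the worst-case wait can be reasoned about. `backoffSchedule` is a hypothetical name, and the constants are the values hard-coded in the function above.

```go
package main

import "time"

const (
	initialBackoff = 2 * time.Second
	maxBackoff     = 30 * time.Second
)

// backoffSchedule returns the waits taken between attempts: the first wait is
// initial, each later wait doubles, and no wait exceeds ceiling.
func backoffSchedule(initial, ceiling time.Duration, waits int) []time.Duration {
	out := make([]time.Duration, 0, waits)
	backoff := initial
	for i := 0; i < waits; i++ {
		out = append(out, backoff)
		backoff = min(backoff*2, ceiling)
	}
	return out
}
```

With `maxRetries` set to 10 there are at most nine waits (none follows the final attempt): 2s, 4s, 8s, 16s, then five waits of 30s, for 180s in total, not counting the HTTP timeouts of the attempts themselves.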


@@ -0,0 +1,200 @@
package workerregistry
import (
"context"
"fmt"
"sync"
"time"
"github.com/mudler/LocalAI/pkg/natsauth"
"github.com/mudler/xlog"
)
// statusPending mirrors nodes.StatusPending. It is duplicated rather than
// imported so the lightweight registration client does not pull in the nodes
// package (and its gorm/DB dependencies).
const statusPending = "pending"
// defaultMaxAttempts bounds how many times Acquire registers (and how many
// consecutive times RefreshLoop may fail) before giving up. It is high enough
// to ride out a slow admin approval or a transient frontend outage, but finite
// so an unauthorized/unapprovable worker exits and surfaces the problem (via a
// non-zero exit and the resulting restart) rather than waiting forever.
const defaultMaxAttempts = 100
// RegisterFunc performs one idempotent registration round-trip.
type RegisterFunc func(ctx context.Context) (*RegisterResponse, error)
// NATSCredentialManager acquires NATS credentials at startup — waiting through
// admin approval when required — and refreshes them before the minted JWT
// expires, by re-registering (which mints a fresh JWT). The live NATS
// connection adopts a refreshed JWT on its next reconnect via Provider. Safe
// for concurrent use.
//
// It addresses two failure modes: a worker that needs credentials but registers
// while still pending approval (it would otherwise give up and never connect),
// and a long-running worker whose 24h JWT expires with no way to renew it.
type NATSCredentialManager struct {
register RegisterFunc
requireCreds bool // block until credentials are present (frontend minting in use)
// Tunables; defaults set by NewNATSCredentialManager, overridable in tests.
initialBackoff time.Duration
maxBackoff time.Duration
maxAttempts int // bound on Acquire attempts / consecutive refresh failures (<=0 = unlimited)
refreshLead float64 // refresh once this fraction of the JWT lifetime has elapsed
refreshRetry time.Duration
expiryOf func(jwt string) (time.Time, bool)
mu sync.RWMutex
jwt string
seed string
nodeID string
}
// NewNATSCredentialManager builds a manager over register. When requireCreds is
// true, Acquire blocks until the node is approved and credentials are minted.
func NewNATSCredentialManager(register RegisterFunc, requireCreds bool) *NATSCredentialManager {
return &NATSCredentialManager{
register: register,
requireCreds: requireCreds,
initialBackoff: 2 * time.Second,
maxBackoff: 30 * time.Second,
maxAttempts: defaultMaxAttempts,
refreshLead: 0.75,
refreshRetry: 30 * time.Second,
expiryOf: jwtExpiry,
}
}
// jwtExpiry decodes the expiry of a minted user JWT. ok is false when the token
// is empty/undecodable or carries no expiry (e.g. a non-expiring service JWT).
func jwtExpiry(token string) (time.Time, bool) {
if token == "" {
return time.Time{}, false
}
uc, err := natsauth.DecodeUserClaims(token)
if err != nil || uc.Expires == 0 {
return time.Time{}, false
}
return time.Unix(uc.Expires, 0), true
}
func (m *NATSCredentialManager) store(res *RegisterResponse) {
m.mu.Lock()
defer m.mu.Unlock()
m.nodeID = res.ID
if res.NatsJWT != "" && res.NatsUserSeed != "" {
m.jwt, m.seed = res.NatsJWT, res.NatsUserSeed
}
}
// Current returns the latest NATS credentials (both empty until acquired).
func (m *NATSCredentialManager) Current() (jwt, seed string) {
m.mu.RLock()
defer m.mu.RUnlock()
return m.jwt, m.seed
}
// NodeID returns the node ID from the most recent registration.
func (m *NATSCredentialManager) NodeID() string {
m.mu.RLock()
defer m.mu.RUnlock()
return m.nodeID
}
// Provider returns a callback compatible with messaging.WithUserJWTProvider,
// supplying the current credentials on each (re)connect.
func (m *NATSCredentialManager) Provider() func() (string, string) {
return m.Current
}
// HasCredentials reports whether complete NATS credentials have been obtained.
func (m *NATSCredentialManager) HasCredentials() bool {
jwt, seed := m.Current()
return jwt != "" && seed != ""
}
// Acquire registers and, when requireCreds is set, keeps re-registering with
// exponential backoff until the node is approved (status != pending) and
// credentials are minted. Without requireCreds it returns the first successful
// response (the historical one-shot behavior, preserved for anonymous NATS).
func (m *NATSCredentialManager) Acquire(ctx context.Context) (*RegisterResponse, error) {
backoff := m.initialBackoff
var lastReason error
for attempt := 1; m.maxAttempts <= 0 || attempt <= m.maxAttempts; attempt++ {
res, err := m.register(ctx)
switch {
case err != nil:
lastReason = err
xlog.Warn("Registration failed, retrying", "attempt", attempt, "next_retry", backoff, "error", err)
case !m.requireCreds:
m.store(res)
return res, nil
case res.Status == statusPending:
lastReason = fmt.Errorf("node %s still pending admin approval", res.ID)
xlog.Info("Node pending admin approval; waiting", "node", res.ID, "attempt", attempt, "next_retry", backoff)
case res.NatsJWT == "" || res.NatsUserSeed == "":
lastReason = fmt.Errorf("node %s approved but NATS credentials not minted", res.ID)
xlog.Info("Node approved but NATS credentials not yet minted; waiting", "node", res.ID, "attempt", attempt, "next_retry", backoff)
default:
m.store(res)
return res, nil
}
select {
case <-ctx.Done():
return nil, ctx.Err()
case <-time.After(backoff):
}
backoff = min(backoff*2, m.maxBackoff)
}
return nil, fmt.Errorf("giving up acquiring NATS credentials after %d attempts: %w", m.maxAttempts, lastReason)
}
// RefreshLoop re-registers to mint a fresh JWT before the current one expires,
// updating the credentials returned by Current/Provider so the NATS connection
// adopts them on its next reconnect. It returns nil when ctx is cancelled or
// when the current credential has no expiry (nothing to refresh), and a non-nil
// error after maxAttempts consecutive refresh failures — letting the caller
// exit the worker so it restarts and re-acquires (or surfaces the outage)
// rather than silently drifting toward an expired, unrenewable JWT.
func (m *NATSCredentialManager) RefreshLoop(ctx context.Context) error {
failures := 0
for {
jwt, _ := m.Current()
exp, ok := m.expiryOf(jwt)
if !ok {
xlog.Debug("NATS credential has no expiry; refresh loop exiting")
return nil
}
wait := max(time.Duration(float64(time.Until(exp))*m.refreshLead), 0)
select {
case <-ctx.Done():
return nil
case <-time.After(wait):
}
res, err := m.register(ctx)
if err == nil && res.NatsJWT != "" && res.NatsUserSeed != "" {
m.store(res)
failures = 0
xlog.Info("Refreshed NATS credentials", "node", res.ID)
continue
}
failures++
if err != nil {
xlog.Warn("NATS credential refresh failed; will retry", "attempt", failures, "error", err)
} else {
xlog.Warn("NATS credential refresh returned no credentials; will retry", "attempt", failures)
}
if m.maxAttempts > 0 && failures >= m.maxAttempts {
return fmt.Errorf("NATS credential refresh failed %d times in a row", failures)
}
// Back off before retrying so a persistent failure near expiry does not spin.
select {
case <-ctx.Done():
return nil
case <-time.After(m.refreshRetry):
}
}
}


@@ -0,0 +1,198 @@
package workerregistry
import (
"context"
"sync"
"testing"
"time"
"github.com/mudler/LocalAI/pkg/natsauth"
"github.com/nats-io/nkeys"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestWorkerRegistry(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "WorkerRegistry")
}
// fakeRegister returns a sequence of canned responses/errors, one per call, and
// records how many times it was invoked. The last entry repeats once exhausted.
type fakeRegister struct {
mu sync.Mutex
steps []step
calls int
}
type step struct {
res *RegisterResponse
err error
}
func (f *fakeRegister) fn() RegisterFunc {
return func(context.Context) (*RegisterResponse, error) {
f.mu.Lock()
defer f.mu.Unlock()
i := f.calls
f.calls++
if i >= len(f.steps) {
i = len(f.steps) - 1
}
return f.steps[i].res, f.steps[i].err
}
}
func (f *fakeRegister) count() int {
f.mu.Lock()
defer f.mu.Unlock()
return f.calls
}
var _ = Describe("NATSCredentialManager", func() {
approved := func(jwt, seed string) *RegisterResponse {
return &RegisterResponse{ID: "node-1", Status: "healthy", NatsJWT: jwt, NatsUserSeed: seed}
}
pending := &RegisterResponse{ID: "node-1", Status: "pending"}
Describe("Acquire (#4 — wait through admin approval)", func() {
It("keeps re-registering until the node is approved and credentials are minted", func() {
f := &fakeRegister{steps: []step{
{res: pending}, // not approved yet
{res: approved("", "")}, // approved but JWT not minted yet
{res: approved("jwt-1", "seed-1")}, // finally minted
}}
m := NewNATSCredentialManager(f.fn(), true /* requireCreds */)
m.initialBackoff = time.Millisecond
m.maxBackoff = time.Millisecond
res, err := m.Acquire(context.Background())
Expect(err).ToNot(HaveOccurred())
Expect(res.ID).To(Equal("node-1"))
Expect(f.count()).To(Equal(3))
jwt, seed := m.Current()
Expect(jwt).To(Equal("jwt-1"))
Expect(seed).To(Equal("seed-1"))
Expect(m.HasCredentials()).To(BeTrue())
Expect(m.NodeID()).To(Equal("node-1"))
})
It("returns immediately on the first success when credentials are not required (anonymous NATS)", func() {
f := &fakeRegister{steps: []step{{res: pending}}}
m := NewNATSCredentialManager(f.fn(), false /* requireCreds */)
res, err := m.Acquire(context.Background())
Expect(err).ToNot(HaveOccurred())
Expect(res.Status).To(Equal("pending"))
Expect(f.count()).To(Equal(1))
Expect(m.HasCredentials()).To(BeFalse())
})
It("aborts when the context is cancelled while waiting for approval", func() {
f := &fakeRegister{steps: []step{{res: pending}}}
m := NewNATSCredentialManager(f.fn(), true)
m.initialBackoff = 10 * time.Millisecond
ctx, cancel := context.WithCancel(context.Background())
cancel()
_, err := m.Acquire(ctx)
Expect(err).To(MatchError(context.Canceled))
})
It("gives up after a bounded number of attempts so the worker exits and alerts", func() {
f := &fakeRegister{steps: []step{{res: pending}}} // never approved
m := NewNATSCredentialManager(f.fn(), true)
m.initialBackoff = time.Millisecond
m.maxBackoff = time.Millisecond
m.maxAttempts = 5
_, err := m.Acquire(context.Background())
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("after 5 attempts"))
Expect(err.Error()).To(ContainSubstring("pending admin approval"))
Expect(f.count()).To(Equal(5))
})
})
Describe("RefreshLoop (#5 — renew before the JWT expires)", func() {
It("re-registers before expiry and updates the credentials served to new connections", func() {
f := &fakeRegister{steps: []step{{res: approved("jwt-2", "seed-2")}}}
m := NewNATSCredentialManager(f.fn(), true)
m.refreshLead = 0.5
m.refreshRetry = time.Millisecond
// jwt-1 expires soon; jwt-2 is long-lived so the loop then idles.
m.expiryOf = func(jwt string) (time.Time, bool) {
switch jwt {
case "jwt-1":
return time.Now().Add(40 * time.Millisecond), true
case "jwt-2":
return time.Now().Add(time.Hour), true
default:
return time.Time{}, false
}
}
m.store(approved("jwt-1", "seed-1"))
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
go func() { _ = m.RefreshLoop(ctx) }()
Eventually(func() string {
jwt, _ := m.Current()
return jwt
}, "2s", "10ms").Should(Equal("jwt-2"))
})
It("returns an error after the bounded number of consecutive failures so the caller can exit", func() {
f := &fakeRegister{steps: []step{{err: context.DeadlineExceeded}}} // refresh always fails
m := NewNATSCredentialManager(f.fn(), true)
m.refreshLead = 0.5
m.refreshRetry = time.Millisecond
m.maxAttempts = 3
m.expiryOf = func(string) (time.Time, bool) { return time.Now().Add(time.Millisecond), true }
m.store(approved("jwt-1", "seed-1"))
errCh := make(chan error, 1)
go func() { errCh <- m.RefreshLoop(context.Background()) }()
Eventually(errCh, "2s").Should(Receive(MatchError(ContainSubstring("3 times in a row"))))
})
It("exits promptly when the current credential has no expiry (nothing to refresh)", func() {
f := &fakeRegister{steps: []step{{res: approved("x", "y")}}}
m := NewNATSCredentialManager(f.fn(), true)
m.expiryOf = func(string) (time.Time, bool) { return time.Time{}, false }
m.store(approved("static", "seed"))
done := make(chan struct{})
go func() { _ = m.RefreshLoop(context.Background()); close(done) }()
Eventually(done, "1s").Should(BeClosed())
Expect(f.count()).To(Equal(0)) // never tried to re-register
})
})
Describe("jwtExpiry default", func() {
It("decodes the expiry of a real minted worker JWT", func() {
akp, err := nkeys.CreateAccount()
Expect(err).ToNot(HaveOccurred())
seed, err := akp.Seed()
Expect(err).ToNot(HaveOccurred())
cfg := natsauth.Config{AccountSeed: string(seed), WorkerJWTTTL: time.Hour}
token, _, err := cfg.MintWorkerJWT("node-1", "backend")
Expect(err).ToNot(HaveOccurred())
exp, ok := jwtExpiry(token)
Expect(ok).To(BeTrue())
Expect(exp).To(BeTemporally("~", time.Now().Add(time.Hour), 2*time.Minute))
})
It("reports no expiry for an empty or undecodable token", func() {
_, ok := jwtExpiry("")
Expect(ok).To(BeFalse())
_, ok = jwtExpiry("not-a-jwt")
Expect(ok).To(BeFalse())
})
})
})


@@ -6,6 +6,8 @@ import (
"fmt"
"io"
"net/http"
"github.com/mudler/LocalAI/pkg/httpclient"
)
// Define a struct to hold the store API client
@@ -47,7 +49,7 @@ type FindResponse struct {
func NewStoreClient(baseUrl string) *StoreClient {
return &StoreClient{
BaseURL: baseUrl,
Client: &http.Client{},
Client: httpclient.New(),
}
}


@@ -22,9 +22,11 @@ const (
UsecaseRerank = "rerank"
UsecaseDetection = "detection"
UsecaseVAD = "vad"
UsecaseAudioTransform = "audio_transform"
UsecaseDiarization = "diarization"
UsecaseRealtimeAudio = "realtime_audio"
UsecaseAudioTransform = "audio_transform"
UsecaseDiarization = "diarization"
UsecaseRealtimeAudio = "realtime_audio"
UsecaseFaceRecognition = "face_recognition"
UsecaseSpeakerRecognition = "speaker_recognition"
)
// GRPCMethod identifies a Backend service RPC from backend.proto.
@@ -47,6 +49,11 @@ const (
MethodAudioTransform GRPCMethod = "AudioTransform"
MethodDiarize GRPCMethod = "Diarize"
MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
MethodFaceVerify GRPCMethod = "FaceVerify"
MethodFaceAnalyze GRPCMethod = "FaceAnalyze"
MethodVoiceVerify GRPCMethod = "VoiceVerify"
MethodVoiceEmbed GRPCMethod = "VoiceEmbed"
MethodVoiceAnalyze GRPCMethod = "VoiceAnalyze"
)
// UsecaseInfo describes a single known_usecase value and how it maps
@@ -154,6 +161,16 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
GRPCMethod: MethodAudioToAudioStream,
Description: "Self-contained any-to-any audio model for the Realtime API — accepts microphone audio and emits speech + transcript (+ optional function calls) from a single backend via the AudioToAudioStream RPC.",
},
UsecaseFaceRecognition: {
Flag: FLAG_FACE_RECOGNITION,
GRPCMethod: MethodFaceVerify,
Description: "Face recognition — verify identity, analyze attributes (age/gender/emotion) via FaceVerify and FaceAnalyze RPCs.",
},
UsecaseSpeakerRecognition: {
Flag: FLAG_SPEAKER_RECOGNITION,
GRPCMethod: MethodVoiceVerify,
Description: "Speaker recognition — verify identity, embed and analyze voice via VoiceVerify, VoiceEmbed and VoiceAnalyze RPCs.",
},
}
// BackendCapability describes which gRPC methods and usecases a backend supports.
@@ -198,6 +215,13 @@ var BackendCapabilities = map[string]BackendCapability{
AcceptsVideos: true,
Description: "vLLM engine — high-throughput LLM serving with optional multimodal",
},
"sglang": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodTokenizeString},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTokenize, UsecaseVision},
DefaultUsecases: []string{UsecaseChat},
AcceptsImages: true,
Description: "SGLang — fast LLM inference with structured generation and optional vision",
},
"vllm-omni": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodGenerateImage, MethodGenerateVideo, MethodTTS},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseImage, UsecaseVideo, UsecaseTTS, UsecaseVision},
@@ -291,6 +315,12 @@ var BackendCapabilities = map[string]BackendCapability{
DefaultUsecases: []string{UsecaseTranscript},
Description: "NVIDIA NeMo speech recognition",
},
"parakeet-cpp": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "NVIDIA NeMo Parakeet ASR (parakeet.cpp)",
},
"qwen-asr": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
@@ -309,6 +339,18 @@ var BackendCapabilities = map[string]BackendCapability{
DefaultUsecases: []string{UsecaseTranscript, UsecaseTTS},
Description: "VibeVoice — bidirectional speech (transcription and synthesis)",
},
"vibevoice-cpp": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS, MethodTTSStream},
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS},
DefaultUsecases: []string{UsecaseTranscript, UsecaseTTS},
Description: "VibeVoice C++ — bidirectional speech, C++ backend with streaming TTS",
},
"sherpa-onnx": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS, MethodTTSStream, MethodVAD},
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS, UsecaseVAD},
DefaultUsecases: []string{UsecaseTranscript},
Description: "Sherpa-ONNX — multi-model speech toolkit (ASR, TTS, VAD)",
},
// --- TTS backends ---
"piper": {
@@ -353,6 +395,12 @@ var BackendCapabilities = map[string]BackendCapability{
DefaultUsecases: []string{UsecaseTTS},
Description: "Qwen TTS",
},
"qwen3-tts-cpp": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Qwen3 TTS C++ — text-to-speech, C++ backend",
},
"faster-qwen3-tts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
@@ -434,6 +482,27 @@ var BackendCapabilities = map[string]BackendCapability{
DefaultUsecases: []string{UsecaseDetection},
Description: "RF-DETR object detection",
},
"rfdetr-cpp": {
GRPCMethods: []GRPCMethod{MethodDetect},
PossibleUsecases: []string{UsecaseDetection},
DefaultUsecases: []string{UsecaseDetection},
Description: "RF-DETR C++ object detection",
},
// --- Face and speaker recognition backends ---
"insightface": {
GRPCMethods: []GRPCMethod{MethodEmbedding, MethodDetect, MethodFaceVerify, MethodFaceAnalyze},
PossibleUsecases: []string{UsecaseEmbeddings, UsecaseDetection, UsecaseFaceRecognition},
DefaultUsecases: []string{UsecaseFaceRecognition},
AcceptsImages: true,
Description: "InsightFace — face detection, embedding, verification and attribute analysis",
},
"speaker-recognition": {
GRPCMethods: []GRPCMethod{MethodVoiceVerify, MethodVoiceEmbed, MethodVoiceAnalyze},
PossibleUsecases: []string{UsecaseSpeakerRecognition},
DefaultUsecases: []string{UsecaseSpeakerRecognition},
Description: "Speaker recognition — voice identity verification and analysis",
},
"silero-vad": {
GRPCMethods: []GRPCMethod{MethodVAD},
PossibleUsecases: []string{UsecaseVAD},
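A capability table of this shape can be queried to decide whether a backend declares a given RPC. The sketch below uses a trimmed copy of the table with a single entry. The `supports` helper and the lowercase `capabilities` variable are hypothetical names introduced for illustration, not functions from the repository.

```go
package main

import "fmt"

type GRPCMethod string

const (
	MethodFaceVerify   GRPCMethod = "FaceVerify"
	MethodVoiceVerify  GRPCMethod = "VoiceVerify"
	MethodVoiceEmbed   GRPCMethod = "VoiceEmbed"
	MethodVoiceAnalyze GRPCMethod = "VoiceAnalyze"
)

// BackendCapability is reduced here to the one field the lookup needs.
type BackendCapability struct {
	GRPCMethods []GRPCMethod
}

// capabilities is a trimmed, illustrative copy of the table above.
var capabilities = map[string]BackendCapability{
	"speaker-recognition": {
		GRPCMethods: []GRPCMethod{MethodVoiceVerify, MethodVoiceEmbed, MethodVoiceAnalyze},
	},
}

// supports reports whether backend declares method. Unknown backends
// report false. Hypothetical helper, not part of the repository.
func supports(backend string, method GRPCMethod) bool {
	bc, ok := capabilities[backend]
	if !ok {
		return false
	}
	for _, m := range bc.GRPCMethods {
		if m == method {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(supports("speaker-recognition", MethodVoiceVerify)) // true
	fmt.Println(supports("speaker-recognition", MethodFaceVerify))  // false
	fmt.Println(supports("unknown", MethodVoiceVerify))             // false
}
```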


@@ -5,6 +5,8 @@ import (
"fmt"
"time"
"github.com/mudler/LocalAI/core/services/messaging"
"github.com/mudler/LocalAI/pkg/natsauth"
"github.com/mudler/xlog"
)
@@ -16,7 +18,29 @@ type DistributedConfig struct {
NatsURL string // --nats-url / LOCALAI_NATS_URL
StorageURL string // --storage-url / LOCALAI_STORAGE_URL (S3 endpoint)
RegistrationToken string // --registration-token / LOCALAI_REGISTRATION_TOKEN (required token for node registration)
AutoApproveNodes bool // --auto-approve-nodes / LOCALAI_AUTO_APPROVE_NODES (skip admin approval for new workers)
// RegistrationRequireAuth fails startup when distributed mode is enabled but
// RegistrationToken is empty. The default (false) keeps the historical
// fail-open behavior with a loud warning; production should set it so the
// node-register endpoints and the worker file-transfer server cannot run
// unauthenticated. Mirrors NatsRequireAuth for the NATS bus.
RegistrationRequireAuth bool // LOCALAI_REGISTRATION_REQUIRE_AUTH
// RequireAuth is the umbrella switch (LOCALAI_DISTRIBUTED_REQUIRE_AUTH) for
// distributed-mode auth: when true it implies BOTH NatsRequireAuth and
// RegistrationRequireAuth, so a single knob locks down the bus and the
// registration/file-transfer layer together. The granular flags remain
// available to enforce just one layer.
RequireAuth bool // LOCALAI_DISTRIBUTED_REQUIRE_AUTH
AutoApproveNodes bool // --auto-approve-nodes / LOCALAI_AUTO_APPROVE_NODES (skip admin approval for new workers)
// NATS JWT auth (optional; see pkg/natsauth and docs/features/distributed-mode.md)
NatsAccountSeed string // LOCALAI_NATS_ACCOUNT_SEED — account signing seed to mint per-node worker JWTs
NatsServiceJWT string // LOCALAI_NATS_SERVICE_JWT — user JWT for frontends / agent workers
NatsServiceSeed string // LOCALAI_NATS_SERVICE_SEED — signing seed paired with service JWT
NatsWorkerJWTTTL time.Duration // LOCALAI_NATS_WORKER_JWT_TTL — minted worker JWT lifetime (default 24h)
NatsRequireAuth bool // LOCALAI_NATS_REQUIRE_AUTH — fail startup if NATS credentials are missing
NatsTLSCA string // LOCALAI_NATS_TLS_CA — PEM file for private CA (server verify)
NatsTLSCert string // LOCALAI_NATS_TLS_CERT — client cert for NATS mTLS
NatsTLSKey string // LOCALAI_NATS_TLS_KEY — client key paired with NatsTLSCert
// S3 configuration (used when StorageURL is set)
StorageBucket string // --storage-bucket / LOCALAI_STORAGE_BUCKET
@@ -49,6 +73,17 @@ type DistributedConfig struct {
AgentWorkerConcurrency int `yaml:"agent_worker_concurrency" json:"agent_worker_concurrency" env:"LOCALAI_AGENT_WORKER_CONCURRENCY"`
JobWorkerConcurrency int `yaml:"job_worker_concurrency" json:"job_worker_concurrency" env:"LOCALAI_JOB_WORKER_CONCURRENCY"`
// PrefixCacheDisabled turns off prefix-cache-aware routing, falling back to
// round-robin (the floor). Prefix-cache routing is ON by default in
// distributed mode; this flag exists so operators can opt out. The CLI
// surfaces a default-true --distributed-prefix-cache enable flag and sets
// this when the operator passes --distributed-prefix-cache=false.
PrefixCacheDisabled bool
// PrefixCacheTTL is the idle-timeout for prefix-cache index entries and
// drives the background eviction cadence (eviction runs every TTL/2). Zero
// means use the prefixcache package default (5m).
PrefixCacheTTL time.Duration
}
// Validate checks that the distributed configuration is internally consistent.
@@ -65,10 +100,23 @@ func (c DistributedConfig) Validate() error {
(c.StorageAccessKey == "" && c.StorageSecretKey != "") {
return fmt.Errorf("storage-access-key and storage-secret-key must both be set or both empty")
}
// Warn about missing registration token (not an error)
// The registration token guards both the node HTTP register/heartbeat
// endpoints and the worker file-transfer server (which fails open on an
// empty token). Enforce it when registration auth is required (the granular
// flag or the umbrella); otherwise warn.
if c.RegistrationToken == "" {
xlog.Warn("distributed mode running without registration token — node endpoints are unprotected")
if c.RegistrationAuthRequired() {
return fmt.Errorf("registration auth is required (LOCALAI_REGISTRATION_REQUIRE_AUTH or LOCALAI_DISTRIBUTED_REQUIRE_AUTH) but LOCALAI_REGISTRATION_TOKEN is empty")
}
xlog.Warn("distributed mode running without registration token — node endpoints and the worker file-transfer server are unprotected; set LOCALAI_REGISTRATION_TOKEN, or LOCALAI_DISTRIBUTED_REQUIRE_AUTH=true to fail closed")
}
if err := c.NatsAuthConfig().Validate(); err != nil {
return err
}
if err := c.NatsTLSFiles().Validate(); err != nil {
return err
}
c.NatsAuthConfig().WarnIfInsecure(true)
// Check for negative durations
for name, d := range map[string]time.Duration{
FlagMCPToolTimeout: c.MCPToolTimeout,
@@ -112,6 +160,76 @@ func WithRegistrationToken(token string) AppOption {
}
}
func WithNatsAccountSeed(seed string) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.NatsAccountSeed = seed
}
}
func WithNatsServiceJWT(jwt string) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.NatsServiceJWT = jwt
}
}
func WithNatsServiceSeed(seed string) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.NatsServiceSeed = seed
}
}
func WithNatsWorkerJWTTTL(d time.Duration) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.NatsWorkerJWTTTL = d
}
}
var EnableNatsRequireAuth = func(o *ApplicationConfig) {
o.Distributed.NatsRequireAuth = true
}
// EnableRegistrationRequireAuth makes an empty registration token a hard error
// in distributed mode (see DistributedConfig.RegistrationRequireAuth).
var EnableRegistrationRequireAuth = func(o *ApplicationConfig) {
o.Distributed.RegistrationRequireAuth = true
}
// EnableDistributedRequireAuth is the umbrella switch implying both
// NatsRequireAuth and RegistrationRequireAuth (see DistributedConfig.RequireAuth).
var EnableDistributedRequireAuth = func(o *ApplicationConfig) {
o.Distributed.RequireAuth = true
}
// RegistrationAuthRequired reports whether an empty registration token must be
// treated as a fatal misconfiguration — the granular flag or the umbrella.
func (c DistributedConfig) RegistrationAuthRequired() bool {
return c.RegistrationRequireAuth || c.RequireAuth
}
// NatsAuthRequired reports whether NATS JWT credentials must be present — the
// granular flag or the umbrella.
func (c DistributedConfig) NatsAuthRequired() bool {
return c.NatsRequireAuth || c.RequireAuth
}
func WithNatsTLSCA(path string) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.NatsTLSCA = path
}
}
func WithNatsTLSCert(path string) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.NatsTLSCert = path
}
}
func WithNatsTLSKey(path string) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.NatsTLSKey = path
}
}
func WithStorageURL(url string) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.StorageURL = url
@@ -158,6 +276,20 @@ var EnableAutoApproveNodes = func(o *ApplicationConfig) {
o.Distributed.AutoApproveNodes = true
}
// DisablePrefixCache turns off prefix-cache-aware routing (falls back to
// round-robin). Prefix-cache routing is enabled by default in distributed mode.
var DisablePrefixCache = func(o *ApplicationConfig) {
o.Distributed.PrefixCacheDisabled = true
}
// WithPrefixCacheTTL sets the prefix-cache index idle-timeout (and the
// background eviction cadence, which runs every TTL/2).
func WithPrefixCacheTTL(d time.Duration) AppOption {
return func(o *ApplicationConfig) {
o.Distributed.PrefixCacheTTL = d
}
}
// Flag names for distributed timeout / interval configuration. These are
// the kebab-case identifiers kong derives from the matching RunCMD struct
// fields; they appear in Validate error messages and any other operator-
@@ -192,6 +324,44 @@ const (
// DefaultMaxUploadSize is the default maximum upload body size (50 GB).
const DefaultMaxUploadSize int64 = 50 << 30
// NatsTLSFiles returns NATS TLS/mTLS PEM paths for the messaging client.
func (c DistributedConfig) NatsTLSFiles() messaging.TLSFiles {
return messaging.TLSFiles{
CA: c.NatsTLSCA,
Cert: c.NatsTLSCert,
Key: c.NatsTLSKey,
}
}
// NatsMessagingOptions builds messaging client options (JWT + TLS) for distributed components.
// Pass explicit userJWT/userSeed when set (e.g. worker overrides); empty uses service JWT from config.
func (c DistributedConfig) NatsMessagingOptions(userJWT, userSeed string) []messaging.Option {
var opts []messaging.Option
jwt, seed := userJWT, userSeed
if jwt == "" && seed == "" {
auth := c.NatsAuthConfig()
jwt, seed = auth.ServiceUserJWT, auth.ServiceUserSeed
}
if jwt != "" && seed != "" {
opts = append(opts, messaging.WithUserJWT(jwt, seed))
}
if tls := c.NatsTLSFiles(); tls.Enabled() {
opts = append(opts, messaging.WithTLS(tls))
}
return opts
}
// NatsAuthConfig builds pkg/natsauth settings from distributed configuration.
func (c DistributedConfig) NatsAuthConfig() natsauth.Config {
return natsauth.Config{
AccountSeed: c.NatsAccountSeed,
ServiceUserJWT: c.NatsServiceJWT,
ServiceUserSeed: c.NatsServiceSeed,
WorkerJWTTTL: c.NatsWorkerJWTTTL,
RequireAuth: c.NatsAuthRequired(),
}
}
// BackendInstallTimeoutOrDefault returns the configured timeout or the default.
func (c DistributedConfig) BackendInstallTimeoutOrDefault() time.Duration {
return cmp.Or(c.BackendInstallTimeout, DefaultBackendInstallTimeout)
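The umbrella switch and the granular flag combine with a plain OR, and an empty token becomes fatal only when one of them is set. The sketch below models just the registration layer under that reading. `authConfig`, `validate`, and `errMissingToken` are hypothetical names, and the warning path is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// authConfig is a reduced, illustrative stand-in for DistributedConfig.
type authConfig struct {
	RegistrationToken       string
	RegistrationRequireAuth bool // granular flag
	RequireAuth             bool // umbrella flag
}

// registrationAuthRequired is true when either flag is set.
func (c authConfig) registrationAuthRequired() bool {
	return c.RegistrationRequireAuth || c.RequireAuth
}

var errMissingToken = errors.New("registration auth is required but the registration token is empty")

// validate fails closed: an empty token is an error only when auth is required.
func (c authConfig) validate() error {
	if c.RegistrationToken == "" && c.registrationAuthRequired() {
		return errMissingToken
	}
	return nil
}

func main() {
	fmt.Println(authConfig{RequireAuth: true}.validate())
	fmt.Println(authConfig{RequireAuth: true, RegistrationToken: "tok"}.validate())
	fmt.Println(authConfig{}.validate())
}
```

Keeping the granular flag alongside the umbrella lets an operator enforce one layer without the other, which matches the intent described in the struct comments.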


@@ -88,3 +88,66 @@ var _ = Describe("DistributedConfig.Validate negative-duration errors", func() {
Expect(c.Validate()).To(Succeed())
})
})
var _ = Describe("DistributedConfig.Validate registration auth", func() {
It("rejects an empty registration token when RegistrationRequireAuth is set", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
RegistrationRequireAuth: true,
}
err := c.Validate()
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("LOCALAI_REGISTRATION_REQUIRE_AUTH"))
Expect(err.Error()).To(ContainSubstring("LOCALAI_REGISTRATION_TOKEN"))
})
It("accepts a set registration token when RegistrationRequireAuth is set", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
RegistrationToken: "s3cret",
RegistrationRequireAuth: true,
}
Expect(c.Validate()).To(Succeed())
})
It("warns but succeeds with an empty token when neither require-auth flag is set", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
}
Expect(c.Validate()).To(Succeed())
})
It("rejects an empty token when the umbrella RequireAuth is set", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
RequireAuth: true,
// Provide NATS creds so only the registration-token gap remains.
NatsServiceJWT: "jwt",
NatsServiceSeed: "seed",
NatsAccountSeed: "acct",
}
err := c.Validate()
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("LOCALAI_DISTRIBUTED_REQUIRE_AUTH"))
Expect(err.Error()).To(ContainSubstring("LOCALAI_REGISTRATION_TOKEN"))
})
It("the umbrella implies NATS auth is required", func() {
c := config.DistributedConfig{
Enabled: true,
NatsURL: "nats://localhost:4222",
RegistrationToken: "tok", // registration layer satisfied
RequireAuth: true, // umbrella → NATS creds now required
}
Expect(c.NatsAuthRequired()).To(BeTrue())
Expect(c.RegistrationAuthRequired()).To(BeTrue())
// Missing NATS service JWT/seed must now be fatal.
err := c.Validate()
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("LOCALAI_NATS_REQUIRE_AUTH"))
})
})


@@ -9,10 +9,11 @@ import (
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"sort"
"strings"
"github.com/mudler/LocalAI/pkg/httpclient"
)
const (
@@ -55,7 +56,7 @@ var allowedFields = map[string]bool{
func main() {
fmt.Fprintf(os.Stderr, "Fetching %s ...\n", unslothURL)
resp, err := http.Get(unslothURL)
resp, err := httpclient.New(httpclient.WithFollowRedirects()).Get(unslothURL)
if err != nil {
fatal("fetch failed: %v", err)
}


@@ -128,6 +128,22 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Advanced: true,
Order: 21,
},
"reasoning_effort": {
Section: "llm",
Label: "Reasoning Effort",
Description: "Default reasoning effort, forwarded to the backend as the reasoning_effort chat_template_kwarg (jinja models like gpt-oss / LFM2.5 honor it). A per-request reasoning_effort overrides it. 'none' also turns thinking off.",
Component: "select",
Options: []FieldOption{
{Value: "", Label: "Unset (model default)"},
{Value: "none", Label: "none (disable thinking)"},
{Value: "minimal", Label: "minimal"},
{Value: "low", Label: "low"},
{Value: "medium", Label: "medium"},
{Value: "high", Label: "high"},
},
Advanced: true,
Order: 22,
},
"cache_type_k": {
Section: "llm",
Label: "KV Cache Type (K)",
@@ -277,6 +293,21 @@ func DefaultRegistry() map[string]FieldMetaOverride {
AutocompleteProvider: ProviderModelsVAD,
Order: 63,
},
"pipeline.reasoning_effort": {
Section: "pipeline",
Label: "Reasoning Effort",
Description: "Reasoning effort for the pipeline's LLM, forwarded to the backend as the reasoning_effort chat_template_kwarg (jinja models like gpt-oss / LFM2.5 honor it). Overrides the LLM model's own reasoning_effort. 'none' also turns thinking off.",
Component: "select",
Options: []FieldOption{
{Value: "", Label: "Default (model config)"},
{Value: "none", Label: "none (disable thinking)"},
{Value: "minimal", Label: "minimal"},
{Value: "low", Label: "low"},
{Value: "medium", Label: "medium"},
{Value: "high", Label: "high"},
},
Order: 64,
},
// --- Functions ---
"function.grammar.parallel_calls": {


@@ -63,6 +63,13 @@ type ModelConfig struct {
FunctionsConfig functions.FunctionsConfig `yaml:"function,omitempty" json:"function,omitempty"`
ReasoningConfig reasoning.Config `yaml:"reasoning,omitempty" json:"reasoning,omitempty"`
// ReasoningEffort is the default reasoning effort (none|minimal|low|medium|high)
// for this model. A per-request reasoning_effort overrides it. It is forwarded
// to the backend as the reasoning_effort chat_template_kwarg (see
// gRPCPredictOpts), so jinja-templated models that key on it — e.g. gpt-oss
// (Harmony) or LFM2.5 — honor it; "none" also toggles enable_thinking off.
ReasoningEffort string `yaml:"reasoning_effort,omitempty" json:"reasoning_effort,omitempty"`
FeatureFlag FeatureFlag `yaml:"feature_flags,omitempty" json:"feature_flags,omitempty"` // Feature Flag registry. We move fast, and features may break on a per model/backend basis. Registry for (usually temporary) flags that indicate aborting something early.
// LLM configs (GPT4ALL, Llama.cpp, ...)
LLMConfig `yaml:",inline" json:",inline"`
@@ -487,6 +494,40 @@ type Pipeline struct {
LLM string `yaml:"llm,omitempty" json:"llm,omitempty"`
Transcription string `yaml:"transcription,omitempty" json:"transcription,omitempty"`
VAD string `yaml:"vad,omitempty" json:"vad,omitempty"`
// ReasoningEffort sets the reasoning effort (none|minimal|low|medium|high) for
// the pipeline's LLM without editing the LLM model config. Overrides the LLM's
// own reasoning_effort. Unset leaves the LLM model config in charge.
ReasoningEffort string `yaml:"reasoning_effort,omitempty" json:"reasoning_effort,omitempty"`
}
// ApplyReasoningEffort resolves the effective reasoning effort — a per-request
// value (requestEffort) overrides the config's own ReasoningEffort default —
// stores it on the config so gRPCPredictOpts forwards it to the backend as the
// reasoning_effort chat_template_kwarg, and maps it onto the enable_thinking
// toggle the backend also reads:
// - "none" always disables thinking.
// - any explicit level enables it, UNLESS the config already disabled reasoning
// (an operator's explicit disable wins over a request asking to think).
//
// An empty requestEffort keeps the config's own default. With no effort set
// anywhere it is a no-op, leaving the model's reasoning settings untouched.
func (c *ModelConfig) ApplyReasoningEffort(requestEffort string) {
effort := requestEffort
if effort == "" {
effort = c.ReasoningEffort
}
c.ReasoningEffort = effort
switch strings.ToLower(effort) {
case "none":
disable := true
c.ReasoningConfig.DisableReasoning = &disable
case "minimal", "low", "medium", "high":
if c.ReasoningConfig.DisableReasoning == nil || !*c.ReasoningConfig.DisableReasoning {
// DisableReasoning=false means thinking is enabled.
disable := false
c.ReasoningConfig.DisableReasoning = &disable
}
}
}
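The precedence rules in ApplyReasoningEffort can be restated as a pure function: the request value wins over the config default, "none" always disables thinking, and an explicit level enables it unless reasoning was already disabled. The sketch below is a standalone restatement under that reading. `resolve` and its parameters are hypothetical names, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// resolve returns the effective effort and the resulting disable flag.
// A nil flag means "left untouched". Hypothetical helper for illustration.
func resolve(requestEffort, configEffort string, disabled *bool) (string, *bool) {
	effort := requestEffort
	if effort == "" {
		effort = configEffort // fall back to the config default
	}
	switch strings.ToLower(effort) {
	case "none":
		on := true
		return effort, &on // "none" always disables thinking
	case "minimal", "low", "medium", "high":
		if disabled == nil || !*disabled {
			off := false
			return effort, &off // explicit level enables thinking
		}
		// An existing disable wins over a request asking to think.
	}
	return effort, disabled
}

func main() {
	effort, disabled := resolve("none", "high", nil)
	fmt.Println(effort, *disabled) // none true

	effort, disabled = resolve("", "low", nil)
	fmt.Println(effort, *disabled) // low false

	operatorOff := true
	effort, disabled = resolve("high", "", &operatorOff)
	fmt.Println(effort, *disabled) // high true
}
```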
// @Description File configuration for model downloads
@@ -694,6 +735,18 @@ func (c *ModelConfig) IsModelURL() bool {
return uri.LooksLikeURL()
}
// ModelID returns the identifier used to reference this model across the
// system: the configured Name, falling back to Model when Name is empty.
// This is the single source of truth for the id fed to model.WithModelID and
// the prefix-cache chain salt; both MUST agree with the router's tracking key
// or the prefix-cache salt diverges silently.
func (c ModelConfig) ModelID() string {
if c.Name != "" {
return c.Name
}
return c.Model
}
// ModelFileName returns the filename of the model
// If the model is a URL, it will return the MD5 of the URL which is the filename
func (c *ModelConfig) ModelFileName() string {
@@ -732,6 +785,17 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
cfg.Proxy.Mode = ProxyModePassthrough
}
// When templating is delegated to the backend (use_tokenizer_template),
// the backend also owns tool-call grammar generation and parsing. Sending
// a LocalAI-generated grammar alongside overrides the backend's native
// (name-first) tool pipeline and makes it stream the tool-call JSON back as
// plain content (issue #10052). The GGUF auto-import path already couples
// these two flags; enforce it here so gallery and hand-written configs that
// set use_tokenizer_template directly stay consistent.
if cfg.TemplateConfig.UseTokenizerTemplate {
cfg.FunctionsConfig.GrammarConfig.NoGrammar = true
}
// Apply model-family-specific inference defaults before generic fallbacks.
// This ensures gallery-installed and runtime-loaded models get optimal parameters.
ApplyInferenceDefaults(cfg, cfg.Name, cfg.Model)


@@ -10,6 +10,23 @@ import (
)
var _ = Describe("Test cases for config related functions", func() {
Context("ModelID", func() {
It("returns Name when set", func() {
c := ModelConfig{Name: "my-name"}
c.Model = "my-model"
Expect(c.ModelID()).To(Equal("my-name"))
})
It("falls back to Model when Name is empty", func() {
c := ModelConfig{}
c.Model = "my-model"
Expect(c.ModelID()).To(Equal("my-model"))
})
It("returns empty string when both are empty", func() {
c := ModelConfig{}
Expect(c.ModelID()).To(Equal(""))
})
})
Context("Test Read configuration functions", func() {
It("Test Validate", func() {
tmp, err := os.CreateTemp("", "config.yaml")
@@ -471,4 +488,33 @@ concurrency_groups:
Expect(configs[0].GetConcurrencyGroups()).To(Equal([]string{"vram-heavy", "120b"}))
})
})
// When templating is delegated to the backend (use_tokenizer_template),
// the backend also owns tool-call grammar generation and parsing. A
// LocalAI-generated grammar sent alongside would override the backend's
// native (name-first) tool pipeline and make it stream the tool-call JSON
// back as plain content (issue #10052). SetDefaults must therefore couple
// the two: tokenizer template implies grammar generation is disabled.
Context("use_tokenizer_template couples with grammar disable (issue #10052)", func() {
It("disables Go grammar generation when the tokenizer template is used", func() {
cfg := &ModelConfig{
TemplateConfig: TemplateConfig{UseTokenizerTemplate: true},
}
Expect(cfg.FunctionsConfig.GrammarConfig.NoGrammar).To(BeFalse())
cfg.SetDefaults()
Expect(cfg.FunctionsConfig.GrammarConfig.NoGrammar).To(BeTrue(),
"use_tokenizer_template must imply grammar.disable so tools go to the backend's native pipeline")
})
It("leaves grammar generation enabled when the tokenizer template is not used", func() {
cfg := &ModelConfig{}
cfg.SetDefaults()
Expect(cfg.FunctionsConfig.GrammarConfig.NoGrammar).To(BeFalse(),
"models that template in Go still rely on the Go-generated grammar")
})
})
})


@@ -0,0 +1,52 @@
package config_test
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
)
// ApplyReasoningEffort resolves the effective reasoning effort (request value
// overrides the model config default), stores it on the config so it reaches the
// backend, and maps it onto the enable_thinking toggle.
var _ = Describe("ModelConfig.ApplyReasoningEffort", func() {
It("uses the request value over the config default", func() {
c := &config.ModelConfig{ReasoningEffort: "high"}
c.ApplyReasoningEffort("none")
Expect(c.ReasoningEffort).To(Equal("none"))
Expect(c.ReasoningConfig.DisableReasoning).ToNot(BeNil())
Expect(*c.ReasoningConfig.DisableReasoning).To(BeTrue())
})
It("falls back to the config default when the request omits it", func() {
c := &config.ModelConfig{ReasoningEffort: "none"}
c.ApplyReasoningEffort("")
Expect(c.ReasoningEffort).To(Equal("none"))
Expect(c.ReasoningConfig.DisableReasoning).ToNot(BeNil())
Expect(*c.ReasoningConfig.DisableReasoning).To(BeTrue())
})
It("enables thinking for an explicit effort level", func() {
c := &config.ModelConfig{}
c.ApplyReasoningEffort("medium")
Expect(c.ReasoningEffort).To(Equal("medium"))
Expect(c.ReasoningConfig.DisableReasoning).ToNot(BeNil())
Expect(*c.ReasoningConfig.DisableReasoning).To(BeFalse())
})
It("does not let a level override an operator's config-level disable", func() {
disabled := true
c := &config.ModelConfig{}
c.ReasoningConfig.DisableReasoning = &disabled
c.ApplyReasoningEffort("high")
Expect(*c.ReasoningConfig.DisableReasoning).To(BeTrue())
})
It("is a no-op on the toggle when no effort is set anywhere", func() {
c := &config.ModelConfig{}
c.ApplyReasoningEffort("")
Expect(c.ReasoningEffort).To(Equal(""))
Expect(c.ReasoningConfig.DisableReasoning).To(BeNil())
})
})
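
The test cases above pin down the observable behaviour of ApplyReasoningEffort, but its body is not part of this hunk. The following is a minimal sketch reconstructed only from those expectations, not the actual LocalAI implementation. The trimmed-down ModelConfig and ReasoningConfig types are stand-ins, and the handling of "none" when an operator has explicitly set the toggle to false is an assumption the tests do not cover.

```go
package main

import "fmt"

// ReasoningConfig and ModelConfig are simplified stand-ins for the real
// LocalAI types; only the fields the tests touch are kept.
type ReasoningConfig struct {
	DisableReasoning *bool
}

type ModelConfig struct {
	ReasoningEffort string
	ReasoningConfig ReasoningConfig
}

// ApplyReasoningEffort is a hypothetical reconstruction: the request value
// wins over the config default, the result is stored on the config, and it
// is then mapped onto the DisableReasoning toggle.
func (c *ModelConfig) ApplyReasoningEffort(requested string) {
	effort := requested
	if effort == "" {
		effort = c.ReasoningEffort // request omitted it: use the config default
	}
	c.ReasoningEffort = effort

	switch {
	case effort == "":
		// Nothing set anywhere: leave the toggle untouched.
	case effort == "none":
		// Assumption: "none" always disables reasoning.
		disabled := true
		c.ReasoningConfig.DisableReasoning = &disabled
	case c.ReasoningConfig.DisableReasoning == nil:
		// An explicit level enables thinking, but only when the operator
		// has not already set the toggle in the model config.
		disabled := false
		c.ReasoningConfig.DisableReasoning = &disabled
	}
}

func main() {
	c := &ModelConfig{ReasoningEffort: "high"}
	c.ApplyReasoningEffort("none")
	fmt.Println(c.ReasoningEffort, *c.ReasoningConfig.DisableReasoning)
}
```

Keeping the toggle as a *bool is what lets the sketch distinguish "operator never set it" (nil) from an explicit true or false, which is the distinction the last two test cases rely on.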


@@ -115,6 +115,10 @@ var defaultImporters = []Importer{
&NemoImporter{},
&FasterWhisperImporter{},
&QwenASRImporter{},
// ParakeetCppImporter matches only parakeet GGUFs (<arch>-<size>-<quant>.gguf);
// kept ahead of LlamaCPPImporter so its .gguf bundles aren't claimed by the
// generic GGUF importer.
&ParakeetCppImporter{},
// TTS (Batch 2)
&PiperImporter{},
&BarkImporter{},


@@ -0,0 +1,180 @@
package importers
import (
"encoding/json"
"path/filepath"
"strings"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/downloader"
hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
"go.yaml.in/yaml/v2"
)
var _ Importer = &ParakeetCppImporter{}
// ParakeetCppImporter recognises parakeet.cpp GGUF weights, the C++/ggml port
// of NVIDIA NeMo Parakeet. The signal is narrow on purpose: parakeet.cpp names
// its weights "<arch>-<size>-<quant>.gguf" (e.g. tdt_ctc-110m-f16.gguf,
// rnnt-0.6b-q4_k.gguf, realtime_eou_120m-v1-q8_0.gguf), so we only match a
// .gguf whose name carries a parakeet architecture token. That keeps us from
// claiming arbitrary llama-style GGUFs (the importer is registered before
// llama-cpp), and it deliberately does NOT match the upstream nvidia/parakeet-*
// NeMo repos (which ship .nemo checkpoints, not runnable GGUFs).
// preferences.backend="parakeet-cpp" forces the importer regardless.
type ParakeetCppImporter struct{}
func (i *ParakeetCppImporter) Name() string { return "parakeet-cpp" }
func (i *ParakeetCppImporter) Modality() string { return "asr" }
func (i *ParakeetCppImporter) AutoDetects() bool { return true }
func (i *ParakeetCppImporter) Match(details Details) bool {
preferences, err := details.Preferences.MarshalJSON()
if err != nil {
return false
}
preferencesMap := make(map[string]any)
if len(preferences) > 0 {
if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
return false
}
}
if b, ok := preferencesMap["backend"].(string); ok && b == "parakeet-cpp" {
return true
}
// Direct URL or path to a parakeet GGUF.
if isParakeetGGUF(filepath.Base(details.URI)) {
return true
}
// HF repo shipping at least one parakeet GGUF.
if details.HuggingFace != nil {
for _, f := range details.HuggingFace.Files {
if isParakeetGGUF(filepath.Base(f.Path)) {
return true
}
}
}
return false
}
func (i *ParakeetCppImporter) Import(details Details) (gallery.ModelConfig, error) {
preferences, err := details.Preferences.MarshalJSON()
if err != nil {
return gallery.ModelConfig{}, err
}
preferencesMap := make(map[string]any)
if len(preferences) > 0 {
if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
return gallery.ModelConfig{}, err
}
}
name, ok := preferencesMap["name"].(string)
if !ok {
name = filepath.Base(details.URI)
}
description, ok := preferencesMap["description"].(string)
if !ok {
description = "Imported from " + details.URI
}
// parakeet quants are near-lossless even at Q4_K (WER 0.0 vs NeMo on 110m),
// so default to the smallest, then fall back up the size ladder; the last
// file wins if none match (mirrors whisper / llama-cpp).
preferredQuants, _ := preferencesMap["quantizations"].(string)
quants := []string{"q4_k", "q5_k", "q6_k", "q8_0", "f16"}
if preferredQuants != "" {
quants = strings.Split(preferredQuants, ",")
}
cfg := gallery.ModelConfig{
Name: name,
Description: description,
}
modelConfig := config.ModelConfig{
Name: name,
Description: description,
Backend: "parakeet-cpp",
KnownUsecaseStrings: []string{"transcript"},
}
uri := downloader.URI(details.URI)
directGGUF := isParakeetGGUF(filepath.Base(details.URI))
switch {
case uri.LooksLikeURL() && directGGUF:
// Direct file URL (e.g. .../resolve/main/tdt_ctc-110m-f16.gguf). The
// exact file is known, no quant pick.
fileName, err := uri.FilenameFromUrl()
if err != nil {
return gallery.ModelConfig{}, err
}
target := filepath.Join("parakeet-cpp", "models", name, fileName)
cfg.Files = append(cfg.Files, gallery.File{
URI: details.URI,
Filename: target,
})
modelConfig.PredictionOptions = schema.PredictionOptions{
BasicModelRequest: schema.BasicModelRequest{Model: target},
}
case details.HuggingFace != nil:
// HF repo: collect every parakeet GGUF, pick the preferred quant, and
// nest under parakeet-cpp/models/<name>/ so a multi-quant repo doesn't
// collide on disk.
var ggufFiles []hfapi.ModelFile
for _, f := range details.HuggingFace.Files {
if isParakeetGGUF(filepath.Base(f.Path)) {
ggufFiles = append(ggufFiles, f)
}
}
if chosen, ok := pickPreferredGGMLFile(ggufFiles, quants); ok {
target := filepath.Join("parakeet-cpp", "models", name, filepath.Base(chosen.Path))
cfg.Files = append(cfg.Files, gallery.File{
URI: chosen.URL,
Filename: target,
SHA256: chosen.SHA256,
})
modelConfig.PredictionOptions = schema.PredictionOptions{
BasicModelRequest: schema.BasicModelRequest{Model: target},
}
}
default:
// Bare URI with no HF metadata (pref-only path): point at the basename
// so users can tweak the YAML after import.
modelConfig.PredictionOptions = schema.PredictionOptions{
BasicModelRequest: schema.BasicModelRequest{Model: filepath.Base(details.URI)},
}
}
data, err := yaml.Marshal(modelConfig)
if err != nil {
return gallery.ModelConfig{}, err
}
cfg.ConfigFile = string(data)
return cfg, nil
}
// isParakeetGGUF reports whether name is a parakeet.cpp GGUF: a .gguf file
// whose name carries a parakeet architecture token. The .gguf check is
// case-insensitive; the tokens cover the published naming
// (<arch>-<size>-<quant>.gguf) plus a generic "parakeet" fallback.
func isParakeetGGUF(name string) bool {
lower := strings.ToLower(name)
if !strings.HasSuffix(lower, ".gguf") {
return false
}
for _, tok := range []string{"tdt_ctc", "tdt-", "tdt_", "rnnt", "ctc-", "ctc_", "realtime_eou", "parakeet"} {
if strings.Contains(lower, tok) {
return true
}
}
return false
}
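
Import delegates quant selection to pickPreferredGGMLFile, which is defined outside this diff. The sketch below illustrates the selection rule described in the Import comment (walk the preference ladder in order, and let the last file win if nothing matches). The function name pickByQuant, its string-based signature, and the whitespace trimming are assumptions made for illustration; the real helper operates on hfapi.ModelFile values and may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// pickByQuant is a hypothetical stand-in for pickPreferredGGMLFile. It
// returns the first file whose name contains the earliest matching entry in
// quants; if no entry matches, the last file wins. ok is false only when
// files is empty.
func pickByQuant(files, quants []string) (chosen string, ok bool) {
	if len(files) == 0 {
		return "", false
	}
	for _, q := range quants {
		// Trimming is an addition of this sketch so that a preference
		// string such as "q8_0, f16" still matches after strings.Split.
		q = strings.ToLower(strings.TrimSpace(q))
		if q == "" {
			continue
		}
		for _, f := range files {
			if strings.Contains(strings.ToLower(f), q) {
				return f, true
			}
		}
	}
	return files[len(files)-1], true
}

func main() {
	files := []string{
		"tdt_ctc-110m-f16.gguf",
		"tdt_ctc-110m-q4_k.gguf",
		"tdt_ctc-110m-q8_0.gguf",
	}
	ladder := []string{"q4_k", "q5_k", "q6_k", "q8_0", "f16"}

	chosen, _ := pickByQuant(files, ladder)
	fmt.Println(chosen) // tdt_ctc-110m-q4_k.gguf

	chosen, _ = pickByQuant(files, strings.Split("q8_0", ","))
	fmt.Println(chosen) // tdt_ctc-110m-q8_0.gguf

	chosen, _ = pickByQuant(files, []string{"q2_k"})
	fmt.Println(chosen) // tdt_ctc-110m-q8_0.gguf (no match, last file wins)
}
```

Because matching is by substring, a ladder entry such as "q4_k" would also select a file named with "q4_k_m"; whether the real helper behaves the same way is not shown in this diff.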


@@ -0,0 +1,103 @@
package importers_test
import (
"encoding/json"
"fmt"
"github.com/mudler/LocalAI/core/gallery/importers"
hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// parakeetDetails builds Details carrying a synthetic HF file list so detection
// can be exercised without hitting the network.
func parakeetDetails(uri string, prefs string, files ...hfapi.ModelFile) importers.Details {
return importers.Details{
URI: uri,
Preferences: json.RawMessage(prefs),
HuggingFace: &hfapi.ModelDetails{Files: files},
}
}
var _ = Describe("ParakeetCppImporter", func() {
imp := &importers.ParakeetCppImporter{}
Context("Importer interface metadata", func() {
It("exposes name/modality/autodetect", func() {
Expect(imp.Name()).To(Equal("parakeet-cpp"))
Expect(imp.Modality()).To(Equal("asr"))
Expect(imp.AutoDetects()).To(BeTrue())
})
})
Context("detection (Match)", func() {
It("matches an HF repo shipping a parakeet GGUF", func() {
d := parakeetDetails("huggingface://mudler/parakeet-cpp-gguf", `{}`,
hfapi.ModelFile{Path: "tdt_ctc-110m-f16.gguf"},
hfapi.ModelFile{Path: "README.md"},
)
Expect(imp.Match(d)).To(BeTrue())
})
It("matches a direct URL to a parakeet GGUF", func() {
d := parakeetDetails("https://huggingface.co/mudler/parakeet-cpp-gguf/resolve/main/rnnt-0.6b-q4_k.gguf", `{}`)
Expect(imp.Match(d)).To(BeTrue())
})
It("honours preferences.backend=parakeet-cpp for arbitrary URIs", func() {
d := parakeetDetails("https://example.com/whatever", `{"backend": "parakeet-cpp"}`)
Expect(imp.Match(d)).To(BeTrue())
})
It("does NOT claim a generic llama-style GGUF", func() {
d := parakeetDetails("huggingface://someorg/some-llm-gguf", `{}`,
hfapi.ModelFile{Path: "llama-3-8b-instruct-q4_k_m.gguf"},
)
Expect(imp.Match(d)).To(BeFalse())
})
It("does NOT claim the upstream NeMo repo (.nemo, no GGUF)", func() {
d := parakeetDetails("huggingface://nvidia/parakeet-tdt_ctc-110m", `{}`,
hfapi.ModelFile{Path: "parakeet-tdt_ctc-110m.nemo"},
)
Expect(imp.Match(d)).To(BeFalse())
})
})
Context("import (Import)", func() {
It("picks the default quant (q4_k) from a multi-quant HF repo", func() {
d := parakeetDetails("huggingface://mudler/parakeet-cpp-gguf", `{"name":"parakeet-110m"}`,
hfapi.ModelFile{Path: "tdt_ctc-110m-f16.gguf", URL: "https://hf/f16", SHA256: "aaa"},
hfapi.ModelFile{Path: "tdt_ctc-110m-q4_k.gguf", URL: "https://hf/q4k", SHA256: "bbb"},
hfapi.ModelFile{Path: "tdt_ctc-110m-q8_0.gguf", URL: "https://hf/q8", SHA256: "ccc"},
)
cfg, err := imp.Import(d)
Expect(err).ToNot(HaveOccurred())
Expect(cfg.ConfigFile).To(ContainSubstring("backend: parakeet-cpp"), fmt.Sprintf("%+v", cfg))
Expect(cfg.ConfigFile).To(ContainSubstring("transcript"))
Expect(cfg.Files).To(HaveLen(1))
Expect(cfg.Files[0].URI).To(Equal("https://hf/q4k"), "default quant should be q4_k")
Expect(cfg.Files[0].Filename).To(ContainSubstring("parakeet-cpp/models/parakeet-110m/tdt_ctc-110m-q4_k.gguf"))
})
It("honours a preferred quantization override", func() {
d := parakeetDetails("huggingface://mudler/parakeet-cpp-gguf", `{"name":"p","quantizations":"q8_0"}`,
hfapi.ModelFile{Path: "tdt_ctc-110m-f16.gguf", URL: "https://hf/f16"},
hfapi.ModelFile{Path: "tdt_ctc-110m-q8_0.gguf", URL: "https://hf/q8"},
)
cfg, err := imp.Import(d)
Expect(err).ToNot(HaveOccurred())
Expect(cfg.Files).To(HaveLen(1))
Expect(cfg.Files[0].URI).To(Equal("https://hf/q8"))
})
It("uses the exact file for a direct GGUF URL", func() {
d := parakeetDetails("https://huggingface.co/mudler/parakeet-cpp-gguf/resolve/main/ctc-0.6b-q5_k.gguf", `{"name":"ctc"}`)
cfg, err := imp.Import(d)
Expect(err).ToNot(HaveOccurred())
Expect(cfg.Files).To(HaveLen(1))
Expect(cfg.Files[0].Filename).To(ContainSubstring("parakeet-cpp/models/ctc/ctc-0.6b-q5_k.gguf"))
})
})
})
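
The detection tests above reach isParakeetGGUF only through Match. As a quick way to see how wide the substring tokens are, the sketch below copies the function body from the importer in this diff and runs it over a few filenames. The name "wav2vec2-ctc-base.gguf" is a made-up example, not a file referenced anywhere in the PR; it is included to show that any GGUF whose name contains "ctc-" would be claimed.

```go
package main

import (
	"fmt"
	"strings"
)

// isParakeetGGUF is copied from the importer in this diff: a case-insensitive
// .gguf suffix check followed by a substring scan for architecture tokens.
func isParakeetGGUF(name string) bool {
	lower := strings.ToLower(name)
	if !strings.HasSuffix(lower, ".gguf") {
		return false
	}
	for _, tok := range []string{"tdt_ctc", "tdt-", "tdt_", "rnnt", "ctc-", "ctc_", "realtime_eou", "parakeet"} {
		if strings.Contains(lower, tok) {
			return true
		}
	}
	return false
}

func main() {
	names := []string{
		"tdt_ctc-110m-f16.gguf",           // true: published naming
		"RNNT-0.6B-Q4_K.GGUF",             // true: match is case-insensitive
		"llama-3-8b-instruct-q4_k_m.gguf", // false: no parakeet token
		"parakeet-tdt_ctc-110m.nemo",      // false: not a .gguf
		"wav2vec2-ctc-base.gguf",          // true: hypothetical name, hits "ctc-"
	}
	for _, n := range names {
		fmt.Printf("%-34s %v\n", n, isParakeetGGUF(n))
	}
}
```

The last case suggests that a negative test for a non-parakeet CTC model could be worth adding if such GGUFs are expected to pass through the importer chain.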

Some files were not shown because too many files have changed in this diff