When the models directory holds many GGUF files, startup parsed every
model's full GGUF — including the tokenizer vocab arrays
(tokenizer.ggml.tokens/scores/merges, often >100k entries) — once per
model while guessing defaults. On slow storage (e.g. a models directory
on a Docker volume) those hundreds of thousands of tiny reads dominate
boot time before the HTTP server comes up.
The default-guessing path and the VRAM metadata reader only consume
scalar metadata and array lengths, never the array contents. Parse with
SkipLargeMetadata (seek past large arrays) and UseMMap (fault in a few
header pages instead of issuing per-element read() syscalls). For a
256k-token vocab this cuts the parse from ~524k read() syscalls to 8.
The mapping is released when ParseGGUFFile returns.
Fixes#9790
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
Gemma4 MTP (ggml-org/llama.cpp#23398) registers the prediction head as a
separate `gemma4-assistant` architecture. That assistant GGUF still carries
`<arch>.nextn_predict_layers`, so the architecture-agnostic detection in
HasEmbeddedMTPHead matched it and appended the `spec_type:draft-mtp` defaults.
Unlike the DeepSeek/Qwen embedded-head models, an assistant checkpoint cannot
self-speculate: it is a draft model that requires a paired target context
(`ctx_other`) and throws if loaded alone. Auto-applying the self-spec defaults
to a standalone assistant import therefore produces a broken config.
Guard the detection against draft-only assistant architectures (the `-assistant`
suffix is upstream's naming convention) so importing one no longer yields a
self-speculation config. Two-model target+draft pairing remains expressible
manually via `draft_model:` and is left to a follow-up.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): real segment timestamps (NeMo-faithful)
Offline: replace the single synthetic whole-clip segment with multiple
segments grouped exactly like NeMo's get_segment_offsets - a new segment
after sentence-ending punctuation ('. ? !'), each carrying start/end and
its time-window token ids. The optional model option segment_gap_threshold
(NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split,
converted to seconds via the JSON frame_sec the engine now reports.
Per-segment words are still gated behind timestamp_granularities=["word"];
a zero-word document falls back to a single text segment.
Streaming: when libparakeet.so exposes the ABI v4 JSON entry points
(probed), drive parakeet_capi_stream_feed_json / _finalize_json and
accumulate the streamed per-word timestamps into per-utterance segments
(EOU stays the boundary), so streaming FinalResult segments now carry
start/end. Falls back to the text-only feed against an older library.
Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo
elif order, fallback), transcriptResultFromDoc (multi-segment, token
windows, word-granularity gate), and the streaming segmenter.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(audio): document parakeet-cpp segment timestamps + segment_gap_threshold
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(parakeet-cpp): update model-gated specs for multi-segment output
The offline AudioTranscription specs asserted the old single synthetic
segment (Segments HaveLen(1), Segments[0].Text == res.Text). With
NeMo-faithful segmentation a multi-sentence clip now yields multiple
punctuation-delimited segments, so assert the new contract instead:
one-or-more time-ordered segments, each with text and (under word
granularity) per-segment words whose span tracks the segment start/end.
Caught by running the model-gated suite on the dgx (GB10) against the
real tdt_ctc-110m + realtime_eou models.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(turboquant): drop obsolete legacy-spec shim after fork rebased
The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the
upstream common_params_speculative refactor (ggml-org/llama.cpp
#22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker
(#21962). The old fork-compat shim forced now-wrong legacy code paths,
breaking the build with errors like 'struct common_params_speculative has
no member named mparams_dft / type' and 'server_context_impl has no member
named model'.
Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared
grpc-server.cpp (stock llama-cpp and the modern fork both take the modern
path now), and narrow the one remaining gap (the fork still lacks
common_params::checkpoint_min_step) to a dedicated
LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by
patch-grpc-server.sh. The patch script now only adds the turbo2/3/4
KV-cache types and injects that one macro.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate)
The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that
ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant
build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by
apply-patches.sh) ports them:
- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through
to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks
cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the
HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the
un-aliased plain cudaEventCreate, matching this file's own usage elsewhere.
CUDA builds are unaffected.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* ci(turboquant): drop the ROCm/hipblas build flavor
The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin:
beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate),
its llama.cpp base fails to compile the flash-attention MMA f16 kernels for
head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero /
non-constant static asserts in fattn-mma-f16.cuh). That is a deep
ggml-on-ROCm kernel issue, not something a small fork patch can paper over.
Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still
ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path
compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The Hugo relearn theme does not provide an "alert" shortcode, so the
docs deploy failed at the Build site step:
failed to extract shortcode: template for shortcode "alert" not found
docs/content/features/distributed-mode.md:136
Convert the warning block to the theme-supported notice shortcode used
everywhere else in the docs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(stablediffusion-ggml): support Ideogram4 unconditional diffusion model
Bump stable-diffusion.cpp from 1f9ee88 to b9254dd, the upstream commit that
adds Ideogram4 support (leejet/stable-diffusion.cpp#1609). Ideogram4 derives
its classifier-free guidance from a separate unconditional diffusion model,
exposed upstream through the new sd_ctx_params_t.uncond_diffusion_model_path
field.
Wire that field into the gosd wrapper via a new uncond_diffusion_model_path
option. The _path suffix is deliberate: the Go loader only resolves options
whose name contains "path" to an absolute path under the model directory, so
this keeps the option consistent with diffusion_model_path and
high_noise_diffusion_model_path.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(gallery): add Ideogram4 stablediffusion-ggml models
Single-file GGUF weights for Ideogram4 are now published
(stduhpf/ideogram-4-gguf), so add the model to the gallery. Ideogram4 is a
text-to-image model with strong, accurate in-image text rendering, driven by
a Qwen3-VL-8B text encoder and real classifier-free guidance from a separate
unconditional diffusion model (the uncond_diffusion_model_path support added
in the preceding commit).
Two index entries, both built on gallery/virtual.yaml with the full config
inlined in overrides (same pattern as the other models, no dedicated template
file):
- ideogram-4-iq4nl-ggml (4-bit, ~11.6GB diffusion)
- ideogram-4-q8_0-ggml (8-bit, ~20GB diffusion)
Each bundles the diffusion + unconditional GGUF (stduhpf), the
Qwen3-VL-8B-Instruct text encoder (unsloth), and the FLUX.2 VAE (Comfy-Org
mirror, non-gated). cfg_scale is 7 to match the upstream Ideogram4 default,
since it performs real CFG unlike the guidance-distilled Flux/Z-Image models.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): honor request language (multilingual nemotron) on batched + streaming paths
Reads opts.GetLanguage() and threads it through to the new
parakeet_capi_transcribe_pcm_batch_json_lang and parakeet_capi_stream_begin_lang
C-API entry points, both probed with Dlsym so the backend still loads against an
older libparakeet.so (falling back to the non-lang paths, i.e. model default).
parakeet.cpp's batched C-API takes a single target_lang for the whole batch, so
the dispatcher only coalesces same-language requests: a request whose language
differs from the batch leader is held as a single carry-over and becomes the
leader of the next batch, never dropped and never left waiting (including on
shutdown). A new batcher test asserts no dispatched batch is ever mixed-language
and that every submitted request still receives a reply.
Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(gallery): add parakeet-cpp-nemotron-3.5-asr-streaming-0.6b; bump parakeet.cpp pin
Adds the multilingual prompt-conditioned streaming model to the gallery (q8_0
default, OpenMDW-1.1) and bumps the parakeet-cpp backend pin to the parakeet.cpp
commit that ships nemotron support plus batched causal subsampling and the
batched target_lang C-API.
Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
chore(parakeet-cpp): bump pin to banded long-audio attention (843600590)
Update PARAKEET_VERSION to mudler/parakeet.cpp@843600590f
(merge of parakeet.cpp#9). Brings NeMo rel_pos_local_attn banded/Longformer
attention with the chunk-matmul construction: long audio now uses O(T*window)
attention instead of global O(T^2), fixing the encoder OOM on long clips
(~16.6-min clip: 54GB->9.4GB peak, ~4x faster) at NeMo's full [128,128] window.
Short clips are unchanged (global path). No C-ABI change.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
chore: bump LocalAGI and localrecall (index-backed RRF hybrid search)
Bumps the agent stack to pull in the PostgreSQL hybrid-search fix:
- mudler/localrecall -> v0.6.3-...-9a3b3321a9cd (mudler/LocalRecall#46, merged)
- mudler/LocalAGI -> ...-14aed1ae4336 (mudler/LocalAGI#477, merged)
localrecall's hybrid search previously sorted on a wrapped scalar
similarity expression, which blinded the planner into a full sequential
scan over every row and exceeded the statement timeout on large
collections, returning an empty result set. It now uses the canonical
Reciprocal Rank Fusion pattern (index-backed candidate retrieval + FULL
OUTER JOIN + weighted RRF).
Fixes#10186
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat: forward reasoning_effort to the backend so jinja models honor it
reasoning_effort was only mapped to the binary enable_thinking toggle and
otherwise reached Go-side templates — it was never sent to the backend. So
jinja-templated models whose chat template keys on reasoning_effort (gpt-oss
Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and
kept emitting <think>.
Forward the effective reasoning_effort to the backend as a chat_template_kwarg
(mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions
metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort
and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort
(request value overrides config default, none->disable / level->enable, an
operator's reasoning.disable wins). request.go now uses that helper.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): set the pipeline LLM's reasoning_effort
Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime
model is built (per-session copy, overrides the LLM's own reasoning_effort),
and surface the resolved effort on the template input so Go-templated models
get it too. jinja models receive it via the backend metadata. This lets a
realtime pipeline disable thinking on models that only honor reasoning_effort
(e.g. LFM2.5), which enable_thinking can't.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The worker HTTP file-transfer server is authenticated by the registration
token via checkBearerToken, which fails open on an empty token: every
/v1/files, /v1/files-list and /v1/backend-logs request is then served
unauthenticated, granting read/write to the worker's models/staging/data
directories. The fail-open was also silent (the only auth log sat on the
unreachable reject branch), and the worker process never runs
DistributedConfig.Validate(), so the existing frontend warning did not
cover the component that exposes the server.
Mirror the NatsRequireAuth pattern: keep anonymous as the default but make
it loud and opt-in enforceable.
- Log a prominent warning when the file-transfer server starts tokenless.
- Add LOCALAI_REGISTRATION_REQUIRE_AUTH: DistributedConfig.Validate() errors
on an empty token (frontend) and the worker refuses to start (fail-fast,
before registration), so production can fail closed. Also satisfies the
F-003 suggestion to fail Validate() on distributed + empty token.
- Add LOCALAI_DISTRIBUTED_REQUIRE_AUTH umbrella switch implying both
RegistrationRequireAuth and NatsRequireAuth — one production knob locking
down the registration/file-transfer layer and the NATS bus together; the
granular flags remain available as single-layer overrides. Wired into the
frontend, supervisor worker, and agent worker (vLLM worker has neither a
NATS connection nor a file-transfer server, so it is left untouched).
- Document in distributed-mode.md (warning callout + flag tables).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(distributed): self-heal stale 'model not loaded' routing
In distributed mode the registry can list a model as loaded on a node
while the worker has evicted it (autonomous LRU eviction, an out-of-band
unload, etc.) yet the backend process survives. The router's cached-node
check only verifies the process is alive (probeHealth), so it routes there
and inference fails with "<backend>: model not loaded" — and stays broken
until the controller restarts and rebuilds its registry.
InFlightTrackingClient now reconciles this: when a tracked inference call
returns a model-not-loaded error, it drops the stale replica row
(RemoveNodeModel) so the next request reloads the model on a healthy node
instead of routing back to the evicted one. The original error is returned
unchanged; only the registry is corrected.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(distributed): typed model-not-loaded error via gRPC status code
Replace the controller-side error-string match with a shared, code-aware
helper. Go error types don't survive the gRPC boundary, so the signal is
carried as a status code (FailedPrecondition):
- pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor +
IsModelNotLoaded(err) checker (status-code first, message fallback for
backends not yet migrated).
- InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded.
- Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy,
rfdetr-cpp) to the typed constructor.
Acting on a false positive is harmless (the model is just reloaded).
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION already existed as
ModelConfigUsecase bitmask flags, and GuessUsecases already gate-checks
both backends by name — but BackendCapabilities had no entries for
either, so the UI could not classify them.
Also missing were the Method* constants for the five proto-defined RPCs
these backends implement (FaceVerify, FaceAnalyze, VoiceVerify,
VoiceEmbed, VoiceAnalyze) and the corresponding Usecase* strings
and UsecaseInfoMap entries needed to wire them into the rest of the
capability system.
Changes:
- Add MethodFaceVerify, MethodFaceAnalyze, MethodVoiceVerify,
MethodVoiceEmbed, MethodVoiceAnalyze GRPCMethod constants
- Add UsecaseFaceRecognition ("face_recognition") and
UsecaseSpeakerRecognition ("speaker_recognition") Usecase constants
- Add UsecaseInfoMap entries for both new usecases, referencing the
existing FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION flags
- Register insightface: Embedding + Detect + FaceVerify + FaceAnalyze
- Register speaker-recognition: VoiceVerify + VoiceEmbed + VoiceAnalyze
Follows up on #10107 which left these two out because they needed new
constants first.
Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
Distributed file-staging treated every model path field (ModelFile, etc.)
as a single regular file: it os.Open'd the path and streamed its fd as the
HTTP PUT body. For directory-based models — e.g. qwen3-tts-cpp, whose
weights and tokenizer ggufs live under one directory referenced by
parameters.model — opening the directory succeeds but reading its fd
returns EISDIR, so routing the model to a remote NATS worker failed with
"read /models/<model>: is a directory". Single-file models were unaffected,
so only multi-file pipelines (e.g. the realtime TTS stage) broke.
stageModelFiles now detects a directory path field and stages each
contained file individually (via the new stageDirectory helper), preserving
structure with the existing StagingKeyMapper and rewriting the field to the
remote directory (deriving ModelPath as before). countStageableFiles makes
the progress total count a directory's files so the staging tracker stays
accurate.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The qwen3-tts.cpp backend honored the request `language` field only via exact lowercase two-letter codes in the C++ language_to_id table, silently defaulting to English for anything else (en-US, EN, english, ...).
Add normalizeLanguage() in the Go handler: lowercase + trim, strip the region/locale suffix (en-US, pt_BR, zh-Hans -> en/pt/zh), and resolve common English full names (english -> en). The canonical codes match the existing C++ table, so no C++ change is needed. Covered by a pure-Go Ginkgo spec. Also document the language field and accepted forms under the Qwen3-TTS docs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it
was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest
nor the gRPC TTSRequest proto carried it, so backends could only read such a
value from static YAML options (identical for every request). This blocked
per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config
to a single designed voice.
Plumb a generic per-request instruction string end to end, plus an optional
backend-specific params map:
- proto: add `optional string instructions` and `map<string,string> params`
to TTSRequest.
- schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI
extension) to schema.TTSRequest.
- core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper
that attaches instructions only when non-empty (so backends can fall back to
YAML when unset); forward them from the /v1/audio/speech handler.
- qwen-tts: prefer the per-request instruction over the YAML `instruct` option
(used by both mode detection and generation) and merge per-request params.
- chatterbox: merge per-request params (coerced to float/int/bool) over YAML
options into generate() kwargs.
Fully backward compatible: empty instructions fall back to the YAML option and
backends that don't support style/voice instructions ignore the field.
Closes#10164
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): NATS JWT auth, TLS/mTLS options, and e2e coverage
Mint per-node NATS user JWTs at registration when LOCALAI_NATS_ACCOUNT_SEED
is set, and connect workers with scoped credentials from the register response.
Add optional LOCALAI_NATS_TLS_CA/CERT/KEY for private CA and mTLS alongside
tls:// URLs, plus test-e2e-distributed and NatsJWT container e2e specs.
Document JWT setup (nats-auth-setup.sh) and TLS env vars in distributed-mode.
Assisted-by: Grok:grok grok-build
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(distributed): correct NATS JWT scoping and harden client auth
The JWT-auth path added in 46467cc7 had several gaps that fail silently
under LOCALAI_NATS_REQUIRE_AUTH:
- Agent-worker minted JWTs did not allow the subjects the agent worker
actually subscribes to (jobs.mcp-ci.new and nodes.<id>.backend.stop),
so MCP-CI jobs and backend-stop session cleanup were silently dropped.
Scope the agent permission set to those subjects.
- NATS subscription permission violations were swallowed (Subscribe
returned a live-but-dead subscription). Confirm subscriptions with a
server round-trip so a denial surfaces synchronously, and log async
permission errors.
- The backend worker connected anonymously when given a JWT without its
paired seed; reject the unpaired credential instead.
- The documented service-user permissions in nats-auth-setup.sh omitted
prefixcache.>, which the frontend publishes and subscribes; add it.
Also: add a credential-provider hook to the messaging client (consumed by
the follow-up credential-lifecycle change), drop the always-nil error from
NatsMessagingOptions, run go mod tidy (jwt/v2 and nkeys are now direct),
and gofmt the feature's files.
Tests: an agent-JWT e2e spec that connects to the enforcing NATS server
and exercises every subscription the agent worker makes, plus permission
allow-list coverage unit tests.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(distributed): acquire and auto-refresh worker NATS credentials
Workers fetched NATS credentials once at startup, which broke two cases
under JWT auth: a worker that registered while still pending admin
approval never received a minted JWT (it connected unauthenticated and
gave up), and a long-running worker's 24h JWT expired with no way to renew
it.
Introduce workerregistry.NATSCredentialManager, built on idempotent
re-registration (the frontend preserves the node row and mints a fresh JWT
each call):
- Acquire re-registers through admin approval until the node is approved
and credentials are minted (or returns the first success when auth is
not required, preserving anonymous-NATS behavior).
- RefreshLoop re-registers before the JWT expires (~75% of its lifetime),
updating the credentials served to the connection.
- Both are bounded (default 100 attempts / consecutive failures) and
return an error on exhaustion, so an unapprovable or unrenewable worker
exits non-zero and surfaces the problem instead of hanging or drifting
toward an expired credential.
The messaging client gains WithUserJWTProvider, fetching credentials on
each (re)connect so the connection transparently adopts a refreshed JWT
when the server expires the old one. RegisterFull exposes the approval
status and full response; Register delegates to it.
Both the backend worker and the agent worker are wired to this: explicit
env credentials are used as-is, minted credentials are acquired-with-wait
and refreshed, and a permanent refresh failure shuts the worker down so it
restarts and re-acquires.
Tests cover Acquire (wait-through-pending, bounded give-up, context
cancel), RefreshLoop (refresh-before-expiry, bounded failure, no-expiry
exit) and jwtExpiry decoding. Docs updated in distributed-mode.md.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The direct (non-batched) transcription path handed the original upload
path straight to the C library via parakeet_capi_transcribe_path_json.
That loader only understands 16 kHz mono WAV/PCM, so any other format
(MP3, etc.) failed with "parakeet: failed to load audio: <file>".
Only the batched path converted the input (via decodeWavMono16k ->
utils.AudioToWav). Every other audio backend (whisper, crispasr)
converts unconditionally with utils.AudioToWav before handing the file
to its engine; the parakeet-cpp fallback was the lone exception.
Extract a convertToWavMono16k helper (reused by decodeWavMono16k) that
produces a 16 kHz mono WAV in a temp dir, and run the non-batched path
through it before calling the C loader. WAV inputs already in the target
format are passed through without ffmpeg.
Add specs covering the helper (decodable copy + cleanup, and an error on
a missing input) that need neither the model, the C library, nor ffmpeg.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
docs: fix distributed-mode diagram - workers coordinate via NATS, not PostgreSQL
The architecture diagram drew the worker-bound arrows from the PostgreSQL area of the control plane, implying workers connect to PostgreSQL. They do not: PostgreSQL is the frontends shared state, while workers coordinate over NATS (backend.install events) and receive LoadModel over gRPC from a frontend. Re-route the worker arrows to originate from the NATS chip.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>