Compare commits

..

225 Commits

Author SHA1 Message Date
Ettore Di Giacinto
cd56a05c3e ci(vllm): disable tests-vllm-grpc job (heterogeneous runners)
Both ubuntu-latest and bigger-runner have inconsistent CPU baselines:
some instances support the AVX-512 VNNI/BF16 instructions the prebuilt
vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of
vllm.model_executor.models.registry. The libnuma packaging fix doesn't
help when the wheel itself can't be loaded.

FROM_SOURCE=true compiles vllm against the actual host CPU and works
everywhere, but takes 30-50 minutes per run — too slow for a smoke
test on every PR.

Comment out the job for now. The test itself is intact and passes
locally; run it via 'make test-extra-backend-vllm' on a host with the
required SIMD baseline. Re-enable when:
  - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or
  - vllm publishes a CPU wheel with a wider baseline, or
  - we set up a docker layer cache that makes FROM_SOURCE acceptable

The detect-changes vllm output, the test harness changes (tests/
e2e-backends + tools cap), the make target (test-extra-backend-vllm),
the package.sh and the Dockerfile/install.sh plumbing all stay in
place.
2026-04-13 07:46:57 +00:00
Ettore Di Giacinto
d74cd56b14 feat(vllm): bundle libnuma/libgomp via package.sh
The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at
import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP).
Without these on the host, vllm._C silently fails to register its
torch ops and EngineCore crashes with:

  AttributeError: '_OpNamespace' '_C_utils' object has no attribute
    'init_cpu_threads_env'

Rather than asking every user to install libnuma1/libgomp1 on their
host (or every LocalAI base image to ship them), bundle them into
the backend image itself — same pattern fish-speech and the GPU libs
already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at
run time so the bundled copies are picked up automatically.

- backend/python/vllm/package.sh (new): copies libnuma.so.1 and
  libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib,
  preserving soname symlinks. Runs during Dockerfile.python's
  'Run backend-specific packaging' step (which already invokes
  package.sh if present).
- backend/Dockerfile.python: install libnuma1 + libgomp1 in the
  builder stage so package.sh has something to copy (the Ubuntu
  base image otherwise only has libgomp in the gcc dep chain).
- test-extra.yml: drop the workaround that installed these libs on
  the runner host — with the backend image self-contained, the
  runner no longer needs them, and the test now exercises the
  packaging path end-to-end the way a production host would.
2026-04-12 20:20:21 +00:00
Ettore Di Giacinto
017bdee4e4 ci(vllm): install libnuma1 + libgomp1 on bigger-runner
The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens
libnuma.so.1 at import time. When the runner host doesn't have it,
the extension silently fails to register its torch ops, so
EngineCore crashes on init_device with:

  AttributeError: '_OpNamespace' '_C_utils' object has no attribute
    'init_cpu_threads_env'

Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be
safe on stripped-down runners.
2026-04-12 20:18:13 +00:00
Ettore Di Giacinto
c4dc495ea1 ci(vllm): install make + build deps on bigger-runner
bigger-runner is a bare self-hosted runner used by backend.yml for
docker image builds — it has docker but not the usual ubuntu-latest
toolchain. The make-based test target needs make, build-essential
(cgo in 'go test'), and curl/unzip (the Makefile protoc target
downloads protoc from github releases).

protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the
install-go-tools target, which setup-go makes possible.
2026-04-12 20:08:09 +00:00
Ettore Di Giacinto
ea2bbabffd ci(vllm): use bigger-runner instead of source build
The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512
VNNI/BF16) that stock ubuntu-latest GitHub runners don't support —
vllm.model_executor.models.registry SIGILLs on import during LoadModel.

Source compilation works but takes 30-40 minutes per CI run, which is
too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the
bigger-runner self-hosted label (already used by backend.yml for the
llama-cpp CUDA build) — that hardware has the required SIMD baseline
and the prebuilt wheel runs cleanly.

FROM_SOURCE=true is kept as an opt-in escape hatch:
- install.sh still has the CPU source-build path for hosts that need it
- backend/Dockerfile.python still declares the ARG + ENV
- Makefile docker-build-backend still forwards the build-arg when set
Default CI path uses the fast prebuilt wheel; source build can be
re-enabled by exporting FROM_SOURCE=true in the environment.
2026-04-12 16:02:49 +00:00
Ettore Di Giacinto
329df11989 fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel
The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with
SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU
supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns
the model_executor.models.registry subprocess for introspection, so
LoadModel never reaches the actual inference path.

- install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide
  requirements-cpu-after.txt so installRequirements installs the base
  deps + torch CPU without pulling the prebuilt wheel, then clone vllm
  and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries
  target the host's actual CPU.
- backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose
  it as an ENV so install.sh sees it during `make`.
- Makefile docker-build-backend: forward FROM_SOURCE as --build-arg
  when set, so backends that need source builds can opt in.
- Makefile test-extra-backend-vllm: call docker-build-vllm via a
  recursive $(MAKE) invocation so FROM_SOURCE flows through.
- .github/workflows/test-extra.yml: set FROM_SOURCE=true on the
  tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only
  works on hosts that share the build-time SIMD baseline.

Answers 'did you test locally?': yes, end-to-end on my local machine
with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU
gap was not covered locally — this commit plugs that gap.
2026-04-12 15:14:42 +00:00
Ettore Di Giacinto
c7f444d18b ci(test-extra): run vllm e2e tests on CPU
Adds tests-vllm-grpc to the test-extra workflow, mirroring the
llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under
backend/python/vllm/ change (or on run-all), builds the local-ai
vllm container image, and runs the tests/e2e-backends harness with
BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes,
and the tools capability enabled.

Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm
wheel we pinned in requirements-cpu-after.txt. Frees disk space
before the build since the docker image + torch + vllm wheel is
sizeable.
2026-04-12 14:53:44 +00:00
Ettore Di Giacinto
e7f406169a test(e2e-backends): add tools capability + HF model name support
Extends tests/e2e-backends to cover backends that:
- Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of
  loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as
  ModelOptions.Model with no download/ModelFile.
- Parse tool calls into ChatDelta.tool_calls: new "tools" capability
  sends a Predict with a get_weather function definition and asserts
  the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate
  with OpenAI-style Messages so the backend can wire tools into the
  model's chat template.
- Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set
  e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time.

Adds make target test-extra-backend-vllm that:
- docker-build-vllm
- loads Qwen/Qwen2.5-0.5B-Instruct
- runs health,load,predict,stream,tools with tool_parser:hermes

Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those
standalone scripts were scaffolding used while bringing up the Python
backend; the e2e-backends harness now covers the same ground uniformly
alongside llama-cpp and ik-llama-cpp.
2026-04-12 14:51:58 +00:00
Ettore Di Giacinto
034a60bf76 ci(backend): build cpu-vllm container image
Add the cpu-vllm variant to the backend container build matrix so the
image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development)
is actually produced by CI.

Follows the same pattern as the other CPU python backends
(cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA.
backend_pr.yml auto-picks this up via its matrix filter from backend.yml.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
c99188f106 fix(vllm): tool parser constructor compat + e2e tool calling test
Concrete vLLM tool parsers override the abstract base's __init__ and
drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer).
Instantiating with tools= raised TypeError which was silently caught,
leaving chat_deltas.tool_calls empty.

Retry the constructor without the tools kwarg on TypeError — tools
aren't required by these parsers since extract_tool_calls finds tool
syntax in the raw model output directly.

Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU:
the backend correctly returns ToolCallDelta{name='get_weather',
arguments='{"location": "Paris, France"}'} in ChatDelta.

test_tool_calls.py is a standalone smoke test that spawns the gRPC
backend, sends a chat completion with tools, and asserts the response
contains a structured tool call.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
c2f73a987e fix(vllm): CPU build compatibility with vllm 0.14.1
Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict,
TokenizeString, Free all working).

- requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from
  GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU
  wheel whose torch dependency resolves against published PyTorch builds
  (torch==2.9.1+cpu). Later vllm CPU wheels currently require
  torch==2.10.0+cpu which is only available on the PyTorch test channel
  with incompatible torchvision.
- requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio
  so uv resolves them consistently from the PyTorch CPU index.
- install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv
  can mix the PyTorch index and PyPI for transitive deps (matches the
  existing intel profile behaviour).
- backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config
  so the old code path errored out with AttributeError on model load.
  Switch to the new get_tokenizer()/tokenizer accessor with a fallback
  to building the tokenizer directly from request.Model.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
b215843807 feat(vllm): CPU support + shared utils + vllm-omni feature parity
- Split vllm install per acceleration: move generic `vllm` out of
  requirements-after.txt into per-profile after files (cublas12, hipblas,
  intel) and add CPU wheel URL for cpu-after.txt
- requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index
- backend/index.yaml: register cpu-vllm / cpu-vllm-development variants
- New backend/python/common/vllm_utils.py: shared parse_options,
  messages_to_dicts, setup_parsers helpers (used by both vllm backends)
- vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template,
  wire native parsers via shared utils, emit ChatDelta with token counts,
  add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE
- Add test_cpu_inference.py: standalone script to validate CPU build with
  a small model (Qwen2.5-0.5B-Instruct)
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
6786f05c64 feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs
- Use vLLM's ToolParserManager/ReasoningParserManager to extract structured
  output (tool calls, reasoning content) instead of reimplementing parsing
- Convert proto Messages to dicts and pass tools to apply_chat_template
- Emit ChatDelta with content/reasoning_content/tool_calls in Reply
- Extract prompt_tokens, completion_tokens, and logprobs from output
- Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar
- Add TokenizeString and Free RPC methods
- Fix missing `time` import used by load_video()
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
6cf8263c30 feat(config): add vLLM parser defaults hook and importer auto-detection
Introduces parser_defaults.json mapping model families to vLLM
tool_parser/reasoning_parser names, with longest-pattern-first matching.

The vllmDefaults hook auto-fills tool_parser and reasoning_parser
options at load time for known families, while the VLLMImporter writes
the same values into generated YAML so users can review and edit them.

Adds tests covering MatchParserDefaults, hook registration via
SetDefaults, and the user-override behavior.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
a30719f04a refactor(config): introduce backend hook system and migrate llama-cpp defaults
Adds RegisterBackendHook/runBackendHooks so each backend can register
default-filling functions that run during ModelConfig.SetDefaults().

Migrates the existing GGUF guessing logic into hooks_llamacpp.go,
registered for both 'llama-cpp' and the empty backend (auto-detect).
Removes the old guesser.go shim.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
40b1c6f943 fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto
The ToProto conversion was dropping tool_call_id and reasoning_content
even though both proto and Go fields existed, breaking multi-turn tool
calling and reasoning passthrough to backends.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
9ca03cf9cc feat(backends): add ik-llama-cpp (#9326)
* feat(backends): add ik-llama-cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: add grpc e2e suite, hook to CI, update README

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-12 13:51:28 +02:00
Ettore Di Giacinto
151ad271f2 feat(rocm): bump to 7.x (#9323)
feat(rocm): bump to 7.2.1

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-12 08:51:30 +02:00
Ettore Di Giacinto
2865f0f8d3 feat(ux): backend management enhancement (#9325)
* feat: add PreferDevelopmentBackends setting, expose isMeta/isDevelopment in API

- Add PreferDevelopmentBackends config field, CLI flag, runtime setting
- Add IsDevelopment() method to GalleryBackend
- Use AvailableBackendsUnfiltered in UI API to show all backends
- Expose isMeta, isDevelopment, preferDevelopmentBackends in backend API response

* feat: upgrade banner with Upgrade All button, detect pre-existing backends

- Add upgrade banner on Backends page showing count and Upgrade All button
- Fix upgrade detection for backends installed before version tracking:
  flag as upgradeable when gallery has a version but installed has none
- Fix OCI digest check to flag backends with no stored digest as upgradeable
2026-04-12 00:35:22 +02:00
LocalAI [bot]
6fbda277c5 chore: ⬆️ Update ggml-org/llama.cpp to ff5ef8278615a2462b79b50abdf3cc95cfb31c6f (#9319)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-11 23:15:23 +02:00
Ettore Di Giacinto
7a0e6ae6d2 feat(qwen3tts.cpp): add new backend (#9316)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-11 23:14:26 +02:00
LocalAI [bot]
e4bfc42a2d chore: ⬆️ Update leejet/stable-diffusion.cpp to 6b675a5ede9b0edf0a0f44191e8b79d7ef27615a (#9320)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-11 23:07:30 +02:00
LocalAI [bot]
7edd3ea96f chore(model-gallery): ⬆️ update checksum (#9321)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-11 22:53:48 +02:00
LocalAI [bot]
b20a2f1cea feat(swagger): update swagger (#9318)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-11 22:31:36 +02:00
Ettore Di Giacinto
8ab0744458 feat: backend versioning, upgrade detection and auto-upgrade (#9315)
* feat: add backend versioning data model foundation

Add Version, URI, and Digest fields to BackendMetadata for tracking
installed backend versions and enabling upgrade detection. Add Version
field to GalleryBackend. Add UpgradeAvailable/AvailableVersion fields
to SystemBackend. Implement GetImageDigest() for lightweight OCI digest
lookups via remote.Head. Record version, URI, and digest at install time
in InstallBackend() and propagate version through meta backends.

* feat: add backend upgrade detection and execution logic

Add CheckBackendUpgrades() to compare installed backend versions/digests
against gallery entries, and UpgradeBackend() to perform atomic upgrades
with backup-based rollback on failure. Includes Agent A's data model
changes (Version/URI/Digest fields, GetImageDigest).

* feat: add AutoUpgradeBackends config and runtime settings

Add configuration and runtime settings for backend auto-upgrade:
- RuntimeSettings field for dynamic config via API/JSON
- ApplicationConfig field, option func, and roundtrip conversion
- CLI flag with LOCALAI_AUTO_UPGRADE_BACKENDS env var
- Config file watcher support for runtime_settings.json
- Tests for ToRuntimeSettings, ApplyRuntimeSettings, and roundtrip

* feat(ui): add backend version display and upgrade support

- Add upgrade check/trigger API endpoints to config and api module
- Backends page: version badge, upgrade indicator, upgrade button
- Manage page: version in metadata, context-aware upgrade/reinstall button
- Settings page: auto-upgrade backends toggle

* feat: add upgrade checker service, API endpoints, and CLI command

- UpgradeChecker background service: checks every 6h, auto-upgrades when enabled
- API endpoints: GET /backends/upgrades, POST /backends/upgrades/check, POST /backends/upgrade/:name
- CLI: `localai backends upgrade` command, version display in `backends list`
- BackendManager interface: add UpgradeBackend and CheckUpgrades methods
- Wire upgrade op through GalleryService backend handler
- Distributed mode: fan-out upgrade to worker nodes via NATS

* fix: use advisory lock for upgrade checker in distributed mode

In distributed mode with multiple frontend instances, use PostgreSQL
advisory lock (KeyBackendUpgradeCheck) so only one instance runs
periodic upgrade checks and auto-upgrades. Prevents duplicate
upgrade operations across replicas.

Standalone mode is unchanged (simple ticker loop).

* test: add e2e tests for backend upgrade API

- Test GET /api/backends/upgrades returns 200 (even with no upgrade checker)
- Test POST /api/backends/upgrade/:name accepts request and returns job ID
- Test full upgrade flow: trigger upgrade via API, wait for job completion,
  verify run.sh updated to v2 and metadata.json has version 2.0.0
- Test POST /api/backends/upgrades/check returns 200
- Fix nil check for applicationInstance in upgrade API routes
2026-04-11 22:31:15 +02:00
thelittlefireman
7c1865b307 Fix load of z-image-turbo (#9264)
* Fix load of z-image and improve speed

Signed-off-by: thelittlefireman <5165783+thelittlefireman@users.noreply.github.com>

* Remove diffusion_flash_attn from z-image-ggml.yaml

Removed 'diffusion_flash_attn' parameter from configuration.

Signed-off-by: thelittlefireman <5165783+thelittlefireman@users.noreply.github.com>

---------

Signed-off-by: thelittlefireman <5165783+thelittlefireman@users.noreply.github.com>
2026-04-11 08:42:13 +02:00
LocalAI [bot]
62a674ce12 chore: ⬆️ Update ggml-org/llama.cpp to e62fa13c2497b2cd1958cb496e9489e86bbd5182 (#9312)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-11 08:39:10 +02:00
LocalAI [bot]
c39213443b feat(swagger): update swagger (#9310)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-11 08:38:55 +02:00
LocalAI [bot]
606f462da4 chore: ⬆️ Update PABannier/sam3.cpp to 01832ef85fcc8eb6488f1d01cd247f07e96ff5a9 (#9311)
⬆️ Update PABannier/sam3.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-11 08:38:30 +02:00
Ettore Di Giacinto
5c35e85fe2 feat: allow to pin models and skip from reaping (#9309)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-11 08:38:17 +02:00
Leigh Phillips
062e0d0d00 feat: Add toggle mechanism to enable/disable models from loading on demand (#9304)
* feat: add toggle mechanism to enable/disable models from loading on demand

Implements #9303 - Adds ability to disable models from being auto-loaded
while keeping them in the collection.

Backend changes:
- Add Disabled field to ModelConfig struct with IsDisabled() getter
- New ToggleModelEndpoint handler (PUT /models/toggle/:name/:action)
- Request middleware returns 403 when disabled model is requested
- Capabilities endpoint exposes disabled status

Frontend changes:
- Toggle switch in System > Models table Actions column
- Visual indicators: dimmed row, red Disabled badge, muted icons
- Tooltip describes toggle function on hover
- Loading state while API call is in progress

* fix: remove extra closing brace causing syntax error in request middleware

* refactor: reorder Actions column - Stop button before toggle switch

* refactor: migrate from toggle to toggle-state per PR review feedback
2026-04-10 18:17:41 +02:00
LocalAI [bot]
d4cd6c284f chore: ⬆️ Update ggml-org/llama.cpp to d132f22fc92f36848f7ccf2fc9987cd0b0120825 (#9302)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-10 08:46:45 +02:00
Ettore Di Giacinto
3bb8b65d31 chore(qwen3-asr): pass prompt as context to transcribe (#9301)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-10 08:45:59 +02:00
Ettore Di Giacinto
9748a1cbc6 fix(streaming): skip chat deltas for role-init elements to prevent first token duplication (#9299)
When TASK_RESPONSE_TYPE_OAI_CHAT is used, the first streaming token
produces a JSON array with two elements: a role-init chunk and the
actual content chunk. The grpc-server loop called attach_chat_deltas
for both elements with the same raw_result pointer, stamping the first
token's ChatDelta.Content on both replies. The Go side accumulated both,
emitting the first content token twice to SSE clients.

Fix: in the array iteration loops in PredictStream, detect role-init
elements (delta has "role" key) and skip attach_chat_deltas for them.
Only content/reasoning elements get chat deltas attached.

Reasoning models are unaffected because their first token goes into
reasoning_content, not content.
2026-04-10 08:45:47 +02:00
LocalAI [bot]
6bc76dda6d feat(swagger): update swagger (#9300)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-10 01:05:53 +02:00
Ettore Di Giacinto
e1a6010874 fix(streaming): deduplicate tool call emissions during streaming (#9292)
The Go-side incremental JSON parser was emitting the same tool call on
every streaming token because it lacked the len > lastEmittedCount guard
that the XML parser had. On top of that, the post-streaming default:
case re-emitted all tool calls from index 0, duplicating everything.

This produced duplicate delta.tool_calls events causing clients to
accumulate arguments as "{args}{args}" — invalid JSON.

Fixes:
- JSON incremental parser: add len(jsonResults) > lastEmittedCount guard
  and loop from lastEmittedCount (matching the XML parser pattern)
- Post-streaming default: case: skip i < lastEmittedCount entries that
  were already emitted during streaming
- JSON parser: use blocking channel send (matching XML parser behavior)
2026-04-10 00:44:25 +02:00
Ettore Di Giacinto
706cf5d43c feat(sam.cpp): add sam.cpp detection backend (#9288)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-09 21:49:11 +02:00
Ettore Di Giacinto
13a6ed709c fix: thinking models with tools returning empty content (reasoning-only retry loop) (#9290)
When clients like Nextcloud or Home Assistant send requests with tools
to thinking models (e.g. Gemma 4 with <|channel>thought tags), the
response was empty despite the backend producing valid content.

Root cause: the C++ autoparser puts clean content in both the raw
Response and ChatDeltas. The Go-side PrependThinkingTokenIfNeeded
then prepends the thinking start token to the already-clean content,
causing ExtractReasoning to classify the entire response as unclosed
reasoning. This made cbRawResult empty, triggering a retry loop that
never succeeds.

Two fixes:
- inference.go: check ChatDeltas for content/tool_calls regardless of
  whether Response is empty, so skipCallerRetry fires correctly
- chat.go: when ChatDeltas have content but no tool calls, use that
  content directly instead of falling back to the empty cbRawResult
2026-04-09 18:30:31 +02:00
Ettore Di Giacinto
85be4ff03c feat(api): add ollama compatibility (#9284)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-09 14:15:14 +02:00
Ettore Di Giacinto
b0d9ce4905 Remove header from OpenAI Realtime API documentation
Removed the header from the Realtime API documentation.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-09 09:00:28 +02:00
LocalAI [bot]
7081b54c09 chore: ⬆️ Update leejet/stable-diffusion.cpp to e8323cabb0e4511ba18a50b1cb34cf1f87fc71ef (#9281)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-09 08:12:23 +02:00
Ettore Di Giacinto
2b05420f95 chore(llama.cpp): bump to 'd12cc3d1ca6bba741cd77887ac9c9ee18c8415c7' (#9282)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-09 08:12:05 +02:00
Ettore Di Giacinto
b64347b6aa chore: add gemma4 to the gallery
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 23:44:16 +00:00
Ettore Di Giacinto
e00ce981f0 fix: try to add whisperx and faster-whisper for more variants (#9278)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 21:23:38 +02:00
Ettore Di Giacinto
285f7d4340 chore: add embeddingemma
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 17:40:55 +00:00
Richard Palethorpe
ea6e850809 feat: Add Kokoros backend (#9212)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-08 19:23:16 +02:00
Ettore Di Giacinto
b7247fc148 fix(whisperx): add alias
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 14:40:08 +00:00
Ettore Di Giacinto
39c6b3ed66 feat: track files being staged (#9275)
This changeset makes visible when files are being staged, so users are
aware that the model "isn't ready yet" for requests.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 14:33:58 +02:00
Ettore Di Giacinto
0e9d1a6588 chore(ci): drop unnecessary test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 12:19:54 +00:00
Ettore Di Giacinto
510d6759fe fix(nodes): better detection if nodes goes down or model is not available (#9274)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 12:11:02 +02:00
Ettore Di Giacinto
154fa000d3 fix(autoscaling): extract load model from Route() and use as well when doing autoscale (#9270)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 08:27:51 +02:00
LocalAI [bot]
0526e60f8d chore: ⬆️ Update ggml-org/llama.cpp to 66c4f9ded01b29d9120255be1ed8d5835bcbb51d (#9269)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-08 08:27:38 +02:00
LocalAI [bot]
db600fb5b2 docs: ⬆️ update docs version mudler/LocalAI (#9268)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-08 08:27:27 +02:00
Richard Palethorpe
9ac1bdc587 feat(ui): Interactive model config editor with autocomplete (#9149)
* feat(ui): Add dynamic model editor with autocomplete

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(docs): Add link to longformat installation video

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-07 14:42:23 +02:00
dependabot[bot]
fdc9f7bf35 chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.64.0 to 0.65.0 (#9254)
chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus

Bumps [go.opentelemetry.io/otel/exporters/prometheus](https://github.com/open-telemetry/opentelemetry-go) from 0.64.0 to 0.65.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/open-telemetry/opentelemetry-go/compare/exporters/prometheus/v0.64.0...exporters/prometheus/v0.65.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/exporters/prometheus
  dependency-version: 0.65.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-07 00:39:52 +02:00
LocalAI [bot]
8e59346091 chore: ⬆️ Update leejet/stable-diffusion.cpp to 8afbeb6ba9702c15d41a38296f2ab1fe5c829fa0 (#9262)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-07 00:39:38 +02:00
LocalAI [bot]
e6e4e19633 chore: ⬆️ Update ace-step/acestep.cpp to e0c8d75a672fca5684c88c68dbf6d12f58754258 (#9261)
⬆️ Update ace-step/acestep.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-07 00:39:24 +02:00
Ettore Di Giacinto
505c417fa7 fix(gpu): better detection for MacOS and Thor (#9263)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-07 00:39:07 +02:00
LocalAI [bot]
17215f6fbc docs: ⬆️ update docs version mudler/LocalAI (#9260)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-07 00:38:50 +02:00
LocalAI [bot]
bccaba1f66 chore: ⬆️ Update ggml-org/llama.cpp to d0a6dfeb28a09831d904fc4d910ddb740da82834 (#9259)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-07 00:38:36 +02:00
Ettore Di Giacinto
0f9d516a6c fix(anthropic): do not emit empty tokens and fix SSE tool calls (#9258)
This fixes Claude Code compatibility

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-07 00:38:21 +02:00
dependabot[bot]
33b124c6f1 chore(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.12 to 1.32.14 (#9256)
chore(deps): bump github.com/aws/aws-sdk-go-v2/config

Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.32.12 to 1.32.14.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.32.12...config/v1.32.14)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/config
  dependency-version: 1.32.14
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 21:46:52 +02:00
dependabot[bot]
6b8007e88e chore(deps): bump github.com/jaypipes/ghw from 0.23.0 to 0.24.0 (#9250)
Bumps [github.com/jaypipes/ghw](https://github.com/jaypipes/ghw) from 0.23.0 to 0.24.0.
- [Release notes](https://github.com/jaypipes/ghw/releases)
- [Commits](https://github.com/jaypipes/ghw/compare/v0.23.0...v0.24.0)

---
updated-dependencies:
- dependency-name: github.com/jaypipes/ghw
  dependency-version: 0.24.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 21:46:18 +02:00
dependabot[bot]
b3837c2078 chore(deps): bump google.golang.org/grpc from 1.79.3 to 1.80.0 (#9253)
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.79.3 to 1.80.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.79.3...v1.80.0)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.80.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 21:45:50 +02:00
Ettore Di Giacinto
92f99b1ec3 fix(token): login via legacy api keys (#9249)
We were not checking against the api keys when db == nil.

This commit also cleanups now unused middleware

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-06 21:45:09 +02:00
LocalAI [bot]
ad232fdb1a docs: ⬆️ update docs version mudler/LocalAI (#9241)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-06 10:53:07 +02:00
LocalAI [bot]
11637b5a1b chore: ⬆️ Update leejet/stable-diffusion.cpp to 7397ddaa86f4e8837d5261724678cde0f36d4d89 (#9242)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-06 10:52:51 +02:00
LocalAI [bot]
0dda4fe6f0 chore: ⬆️ Update ggml-org/llama.cpp to 761797ffdf2ce3f118e82c663b1ad7d935fbd656 (#9243)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-06 10:52:38 +02:00
Ettore Di Giacinto
773489eeb1 fix(chat): do not retry if we had chatdeltas or tooldeltas from backend (#9244)
* fix(chat): do not retry if we had chatdeltas or tooldeltas from backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: use oai compat for llama.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: apply to non-streaming path too

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* map also other fields

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-06 10:52:23 +02:00
Ettore Di Giacinto
06fbe48b3f feat(llama.cpp): wire speculative decoding settings (#9238)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-05 14:56:30 +02:00
Ettore Di Giacinto
232e324a68 fix(autoparser): correctly pass by logprobs (#9239)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-05 09:39:22 +02:00
ER-EPR
39c954764c Update index.yaml and add Qwen3.5 model files (#9237)
* Update index.yaml

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Add mmproj files for Qwen3.5 models

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Update file paths for Qwen models in index.yaml

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Update index.yaml

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Refactor Qwen3-Reranker-0.6B entry in index.yaml

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Update qwen3.yaml configuration parameters

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

---------

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>
2026-04-05 09:21:21 +02:00
Ettore Di Giacinto
9b7d5513fc chore(gallery): add mmproj file for gemma4
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-05 02:02:52 +02:00
LocalAI [bot]
84cd8c0e7f chore: ⬆️ Update ggml-org/llama.cpp to b8635075ffe27b135c49afb9a8b5c434bd42c502 (#9231)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-04 23:02:58 +02:00
LocalAI [bot]
d990f2790c chore(model-gallery): ⬆️ update checksum (#9233)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-04 23:02:41 +02:00
Ettore Di Giacinto
53deeb1107 fix(reasoning): suppress partial tag tokens during autoparser warm-up
The C++ PEG parser needs a few tokens to identify the reasoning format
(e.g. "<|channel>thought\n" for Gemma 4). During this warm-up, the gRPC
layer was sending raw partial tag tokens to Go, which leaked into the
reasoning field.

- Clear reply.message in gRPC when autoparser is active but has no diffs
  yet, matching llama.cpp server behavior of only emitting classified output
- Prefer C++ autoparser chat deltas for reasoning/content in all streaming
  paths, falling back to Go-side extraction for backends without autoparser
  (e.g. vLLM)
- Override non-streaming no-tools result with chat delta content when available
- Guard PrependThinkingTokenIfNeeded against partial tag prefixes during
  streaming accumulation
- Reorder default thinking tokens so <|channel>thought is checked before
  <|think|> (Gemma 4 templates contain both)
2026-04-04 20:45:57 +00:00
Ettore Di Giacinto
c5a840f6af fix(reasoning): warm-up
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 20:25:24 +00:00
Ettore Di Giacinto
6d9d77d590 fix(reasoning): accumulate and strip reasoning tags from autoparser results (#9227)
fix(reasoning): acccumulate and strip reasoning tags from autoparser results

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 18:15:32 +02:00
Ettore Di Giacinto
6f304d1201 chore(refactor): use interface (#9226)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 17:29:37 +02:00
Richard Palethorpe
557d0f0f04 feat(api): Allow coding agents to interactively discover how to control and configure LocalAI (#9084)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-04 15:14:35 +02:00
Ettore Di Giacinto
b7e3589875 fix(anthropic): show null index when not present, default to 0 (#9225)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 15:13:17 +02:00
Ettore Di Giacinto
716ddd697b feat(autoparser): prefer chat deltas from backends when emitted (#9224)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 12:12:08 +02:00
Ettore Di Giacinto
223deb908d fix(nats): improve error handling (#9222)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 12:11:54 +02:00
Ettore Di Giacinto
9f8821bba8 feat(gemma4): add thinking support (#9221)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 12:11:38 +02:00
Ettore Di Giacinto
84e51b68ef fix(ui): pass by staticApiKeyRequired to show login when only api key is configured (#9220)
This fixes #9213

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 12:11:22 +02:00
LocalAI [bot]
7962dd16f7 chore: ⬆️ Update ggml-org/llama.cpp to d006858316d4650bb4da0c6923294ccd741caefd (#9215)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-04 09:44:39 +02:00
LocalAI [bot]
a1466b305a docs: ⬆️ update docs version mudler/LocalAI (#9214)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-04 09:44:25 +02:00
github-actions[bot]
57c0026715 chore: bump inference defaults from unsloth (#9219)
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-04 09:44:12 +02:00
Ettore Di Giacinto
1ed6b9e5ed fix(llama.cpp): correctly parse grpc header for bearer token auth
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-03 21:38:41 +00:00
LocalAI [bot]
e4ee74354f chore(model gallery): 🤖 add 1 new models via gallery agent (#9210)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-03 16:23:17 +02:00
Ettore Di Giacinto
8577bdcebc Update asset links in README.md
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-03 10:24:08 +02:00
Ettore Di Giacinto
0d489c7a0d Add guided tour and update screenshots section
Updated README to include a guided tour section with links to various assets and details about agents and usage metrics.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-03 10:23:03 +02:00
Ettore Di Giacinto
11dc54bda9 fix(docs): commit distribution.md
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-03 10:14:13 +02:00
Ettore Di Giacinto
7e0b73deaa fix(docs): fix broken references to distributed mode
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-03 09:46:06 +02:00
LocalAI [bot]
c0a023d13d chore: ⬆️ Update ggml-org/llama.cpp to a1cfb645307edc61a89e41557f290f441043d3c2 (#9203)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-03 08:30:15 +02:00
Loryan Strant
0d3ae1c295 docs: Update Home Assistant integrations list (#9206)
Update Home Assistant integrations list

Signed-off-by: Loryan Strant <51473494+loryanstrant@users.noreply.github.com>
2026-04-03 08:30:00 +02:00
LocalAI [bot]
e9f10f2f50 chore(model gallery): 🤖 add 1 new models via gallery agent (#9202)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-02 21:22:19 +02:00
Ettore Di Giacinto
b95b0b72ff chore(ci): fix gallery agent
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-02 18:02:18 +00:00
LocalAI [bot]
26f1b94f4d chore: ⬆️ Update ggml-org/llama.cpp to 95a6ebabb277c4cc18247e7bc2a5502133caca63 (#9199)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-02 08:53:16 +02:00
LocalAI [bot]
2d40725ca2 chore: ⬆️ Update leejet/stable-diffusion.cpp to 87ecb95cbc65dc8e58e3d88f4f4a59a0939796f5 (#9200)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-02 08:53:04 +02:00
Ettore Di Giacinto
6c635e8353 feat: add resume endpoint to undrain nodes (#9197)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-01 18:21:43 +02:00
LocalAI [bot]
cc5f33ce95 chore: ⬆️ Update ggml-org/llama.cpp to 0fcb3760b2b9a3a496ef14621a7e4dad7a8df90f (#9196)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-01 00:48:40 +02:00
LocalAI [bot]
ba7cdd532a chore: ⬆️ Update leejet/stable-diffusion.cpp to 09b12d5f6d51d862749e8e0ee8baac8f012089e2 (#9195)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-01 00:48:25 +02:00
Ettore Di Giacinto
6b6c136210 fix(inflight): count inflight from load model, but release afterwards (#9194)
This should fix the count of 1 in flight always showing in the node list

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 23:24:45 +02:00
Ettore Di Giacinto
e587ecc485 chore(ui): allow to unload forcefully
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 17:20:53 +00:00
Ettore Di Giacinto
f259036a27 feat(gpu): add jetson/tegra detection
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 15:45:07 +00:00
Ettore Di Giacinto
221ff0f28f feat(ui): show cluster status in home in distributed mode
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 15:37:58 +00:00
Ettore Di Giacinto
16d5cb00bd chore: css cleanups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 16:37:38 +02:00
Richard Palethorpe
952635fba6 feat(distributed): Avoid resending models to backend nodes (#9193)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-31 16:28:13 +02:00
Ettore Di Giacinto
3cc05af2e5 chore(nodes): restore offline nodes too
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 14:22:18 +00:00
Copilot
87a63316c7 stablediffusion-ggml: replace hand-maintained enum string arrays with upstream API calls (#9192)
* Initial plan

* Remove hand-maintained enum string arrays in gosd.cpp, use upstream API functions

Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/561fb489-89ed-4588-8f1e-7b967d91ba37

Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-31 14:53:38 +02:00
Richard Palethorpe
efdcbbe332 feat(api): Return 404 when model is not found except for model names in HF format (#9133)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-31 10:48:21 +02:00
Ettore Di Giacinto
b4fff9293d chore: small ui improvements in the node page
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 08:41:40 +00:00
dependabot[bot]
8180221b7e chore(deps): bump grpcio from 1.78.1 to 1.80.0 in /backend/python/common/template (#9176)
chore(deps): bump grpcio in /backend/python/common/template

Bumps [grpcio](https://github.com/grpc/grpc) from 1.78.1 to 1.80.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.78.1...v1.80.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.80.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-31 10:11:04 +02:00
dependabot[bot]
52a9755e08 chore(deps): bump grpcio from 1.78.1 to 1.80.0 in /backend/python/rerankers (#9181)
chore(deps): bump grpcio in /backend/python/rerankers

Bumps [grpcio](https://github.com/grpc/grpc) from 1.78.1 to 1.80.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.78.1...v1.80.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.80.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-31 10:10:50 +02:00
dependabot[bot]
a2a1d919f9 chore(deps): bump grpcio from 1.78.1 to 1.80.0 in /backend/python/coqui (#9182)
Bumps [grpcio](https://github.com/grpc/grpc) from 1.78.1 to 1.80.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.78.1...v1.80.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.80.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-31 10:10:35 +02:00
dependabot[bot]
a3d37931ec chore(deps): bump grpcio from 1.78.1 to 1.80.0 in /backend/python/vllm (#9177)
Bumps [grpcio](https://github.com/grpc/grpc) from 1.78.1 to 1.80.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.78.1...v1.80.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.80.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-31 10:10:17 +02:00
dependabot[bot]
5b2e25ebb0 chore(deps): bump grpcio from 1.78.1 to 1.80.0 in /backend/python/transformers (#9180)
chore(deps): bump grpcio in /backend/python/transformers

Bumps [grpcio](https://github.com/grpc/grpc) from 1.78.1 to 1.80.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.78.1...v1.80.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.80.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-31 10:10:03 +02:00
LocalAI [bot]
b0b37a472f chore: ⬆️ Update ggml-org/llama.cpp to 08f21453aec846867b39878500d725a05bd32683 (#9190)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-31 09:27:08 +02:00
Ettore Di Giacinto
3db12eaa7a fix(oauth/invite): do not register user (prending approval) without correct invite (#9189)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 08:29:07 +02:00
Ettore Di Giacinto
8862e3ce60 feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186)
* always enable parallel requests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: move tests to ginkgo

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(smart router): order by available vram

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 08:28:56 +02:00
LocalAI [bot]
80699a3f70 feat(swagger): update swagger (#9187)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-30 23:48:06 +02:00
dependabot[bot]
309a59f61e chore(deps): bump actions/upload-pages-artifact from 3 to 4 (#9179)
Bumps [actions/upload-pages-artifact](https://github.com/actions/upload-pages-artifact) from 3 to 4.
- [Release notes](https://github.com/actions/upload-pages-artifact/releases)
- [Commits](https://github.com/actions/upload-pages-artifact/compare/v3...v4)

---
updated-dependencies:
- dependency-name: actions/upload-pages-artifact
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 23:16:23 +02:00
dependabot[bot]
65c9380389 chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.62.0 to 0.64.0 (#9178)
chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus

Bumps [go.opentelemetry.io/otel/exporters/prometheus](https://github.com/open-telemetry/opentelemetry-go) from 0.62.0 to 0.64.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/open-telemetry/opentelemetry-go/compare/exporters/prometheus/v0.62.0...exporters/prometheus/v0.64.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/exporters/prometheus
  dependency-version: 0.64.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 23:16:09 +02:00
dependabot[bot]
79963c56bf chore(deps): bump github.com/pion/webrtc/v4 from 4.2.9 to 4.2.11 (#9185)
Bumps [github.com/pion/webrtc/v4](https://github.com/pion/webrtc) from 4.2.9 to 4.2.11.
- [Release notes](https://github.com/pion/webrtc/releases)
- [Commits](https://github.com/pion/webrtc/compare/v4.2.9...v4.2.11)

---
updated-dependencies:
- dependency-name: github.com/pion/webrtc/v4
  dependency-version: 4.2.11
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 23:15:54 +02:00
dependabot[bot]
7004ce0b78 chore(deps): bump github.com/nats-io/nats.go from 1.49.0 to 1.50.0 (#9183)
Bumps [github.com/nats-io/nats.go](https://github.com/nats-io/nats.go) from 1.49.0 to 1.50.0.
- [Release notes](https://github.com/nats-io/nats.go/releases)
- [Commits](https://github.com/nats-io/nats.go/compare/v1.49.0...v1.50.0)

---
updated-dependencies:
- dependency-name: github.com/nats-io/nats.go
  dependency-version: 1.50.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 23:15:39 +02:00
dependabot[bot]
702d0e0e4d chore(deps): bump google.golang.org/grpc from 1.79.1 to 1.79.3 (#9175)
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.79.1 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.79.1...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 23:15:25 +02:00
dependabot[bot]
d6de208d6c chore(deps): bump actions/configure-pages from 5 to 6 (#9174)
Bumps [actions/configure-pages](https://github.com/actions/configure-pages) from 5 to 6.
- [Release notes](https://github.com/actions/configure-pages/releases)
- [Commits](https://github.com/actions/configure-pages/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/configure-pages
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 23:15:10 +02:00
dependabot[bot]
7451145e0c chore(deps): bump actions/deploy-pages from 4 to 5 (#9172)
Bumps [actions/deploy-pages](https://github.com/actions/deploy-pages) from 4 to 5.
- [Release notes](https://github.com/actions/deploy-pages/releases)
- [Commits](https://github.com/actions/deploy-pages/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/deploy-pages
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 23:14:57 +02:00
dependabot[bot]
cfda3dd0df chore(deps): bump actions/checkout from 4 to 6 (#9173)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 23:14:43 +02:00
Richard Palethorpe
e0eb2fd734 chore(ci): Scope tests extras backend tests (#9170)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-30 17:46:07 +00:00
Ettore Di Giacinto
dd3376e0a9 chore(workers): improve logging, set header timeouts (#9171)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-30 17:26:55 +02:00
Richard Palethorpe
520e1ce3cd fix(kokoro): Download phonemization model during installation (#9165)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-30 15:08:48 +02:00
LocalAI [bot]
3d738164b7 chore: ⬆️ Update ggml-org/llama.cpp to 7c203670f8d746382247ed369fea7fbf10df8ae0 (#9160)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-30 08:27:26 +02:00
LocalAI [bot]
56db76599a chore: ⬆️ Update ggml-org/whisper.cpp to 95ea8f9bfb03a15db08a8989966fd1ae3361e20d (#9168)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-30 08:27:11 +02:00
LocalAI [bot]
ad57cdfefe chore: ⬆️ Update leejet/stable-diffusion.cpp to f16a110f8776398ef23a2a6b7b57522c2471637a (#9167)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-30 08:26:45 +02:00
Richard Palethorpe
c2f7d1c18b feat(ui): Add media history to studio pages (e.g. past images) (#9151)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-30 00:49:55 +02:00
ER-EPR
afe79568d6 fix: huggingface repo change the file name so Update index.yaml is needed (#9163)
* Update index.yaml

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Add mmproj files for Qwen3.5 models

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Update file paths for Qwen models in index.yaml

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

---------

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>
2026-03-30 00:48:17 +02:00
Ettore Di Giacinto
59108fbe32 feat: add distributed mode (#9124)
* feat: add distributed mode (experimental)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix data races, mutexes, transactions

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix events and tool stream in agent chat

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* use ginkgo

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(cron): compute correctly time boundaries avoiding re-triggering

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* enhancements, refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* do not flood of healthy checks

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* do not list obvious backends as text backends

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* tests fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop redundant healthcheck

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* enhancements, refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-30 00:47:27 +02:00
LocalAI [bot]
4c870288d9 chore: ⬆️ Update ggml-org/llama.cpp to 59d840209a5195c2f6e2e81b5f8339a0637b59d9 (#9144)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-28 18:18:06 +01:00
Ettore Di Giacinto
8da7212763 fix(ci): checkout submodules
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-28 00:33:31 +01:00
Ettore Di Giacinto
6e76052f9d ci: set gh-pages
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-27 21:26:55 +00:00
Richard Palethorpe
cf84db36ec fix(voxcpm): Force using a recent voxcpm version to kick the dependency solver (#9150)
fix(voxcpm): Allow packages to be fetched from all indexes

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-27 15:38:51 +01:00
Richard Palethorpe
d3f629f183 feat: Merge repeated log lines in the terminal (#9141)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-26 22:16:13 +01:00
Richard Palethorpe
b1aa707a92 fix(coqui,nemo,voxcpm): Add dependencies to allow CI to progress (#9142)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-26 18:03:56 +01:00
LocalAI [bot]
731176ce3a feat(swagger): update swagger (#9136)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-26 07:58:11 +01:00
LocalAI [bot]
b86fa63f70 chore: ⬆️ Update ggml-org/llama.cpp to a970515bdb0b1d09519106847660b0d0c84d2472 (#9137)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-26 07:56:41 +01:00
walcz-de
00fcf6936c fix: implement encoding_format=base64 for embeddings endpoint (#9135)
The OpenAI Node.js SDK v4+ sends encoding_format=base64 by default.
LocalAI previously ignored this parameter and always returned a float
JSON array, causing a silent data corruption bug in any Node.js client
(AnythingLLM Desktop, LangChain.js, LlamaIndex.TS, …):

  // What the client does when it expects base64 but receives a float array:
  Buffer.from(floatArray, 'base64')

Node.js treats a non-string first argument as a byte array — each
float32 value is truncated to a single byte — and Float32Array then
reads those bytes as floats, yielding dims/4 values.  Vector databases
(Qdrant, pgvector, …) then create collections with the wrong dimension,
causing all similarity searches to fail silently.

  e.g. granite-embedding-107m (384 dims) → 96 stored in Qdrant
       jina-embeddings-v3      (1024 dims) → 256 stored in Qdrant

Changes:
- core/schema/prediction.go: add EncodingFormat string field to
  PredictionOptions so the request parameter is parsed and available
  throughout the request pipeline
- core/schema/openai.go: add EmbeddingBase64 string field to Item;
  add MarshalJSON so the "embedding" JSON key emits either []float32
  or a base64 string depending on which field is populated — all other
  Item consumers (image, video endpoints) are unaffected
- core/http/endpoints/openai/embeddings.go: add floatsToBase64()
  which packs a float32 slice as little-endian bytes and base64-encodes
  it; add embeddingItem() helper; both InputToken and InputStrings loops
  now honour encoding_format=base64

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 17:38:07 +01:00
Richard Palethorpe
26384c5c70 fix(docs): Use notice instead of alert (#9134)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-25 13:55:48 +01:00
LocalAI [bot]
7209457f53 chore: ⬆️ Update ace-step/acestep.cpp to 6f35c874ee11e86d511b860019b84976f5b52d3a (#9128)
⬆️ Update ace-step/acestep.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-25 07:52:31 +01:00
LocalAI [bot]
9bc68b2721 chore: ⬆️ Update ggml-org/llama.cpp to 9f102a1407ed5d73b8c954f32edab50f8dfa3f58 (#9127)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-25 07:52:14 +01:00
Richard Palethorpe
7bdd198fd3 fix(downloader): Rewrite full https HF URI with HF_ENDPOINT (#9107)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-24 18:32:52 +01:00
dependabot[bot]
b296e3d94b chore(deps): bump github.com/mudler/skillserver from 0.0.5 to 0.0.6 (#9116)
Bumps [github.com/mudler/skillserver](https://github.com/mudler/skillserver) from 0.0.5 to 0.0.6.
- [Release notes](https://github.com/mudler/skillserver/releases)
- [Commits](https://github.com/mudler/skillserver/compare/v0.0.5...v0.0.6)

---
updated-dependencies:
- dependency-name: github.com/mudler/skillserver
  dependency-version: 0.0.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-24 08:51:02 +01:00
dependabot[bot]
c91855a9b2 chore(deps): bump peter-evans/create-pull-request from 7 to 8 (#9114)
Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 7 to 8.
- [Release notes](https://github.com/peter-evans/create-pull-request/releases)
- [Commits](https://github.com/peter-evans/create-pull-request/compare/v7...v8)

---
updated-dependencies:
- dependency-name: peter-evans/create-pull-request
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-24 08:50:50 +01:00
dependabot[bot]
e8e445cd43 chore(deps): bump actions/checkout from 4 to 6 (#9110)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-24 08:50:36 +01:00
dependabot[bot]
735c426072 chore(deps): bump github.com/modelcontextprotocol/go-sdk from 1.4.0 to 1.4.1 (#9118)
chore(deps): bump github.com/modelcontextprotocol/go-sdk

Bumps [github.com/modelcontextprotocol/go-sdk](https://github.com/modelcontextprotocol/go-sdk) from 1.4.0 to 1.4.1.
- [Release notes](https://github.com/modelcontextprotocol/go-sdk/releases)
- [Commits](https://github.com/modelcontextprotocol/go-sdk/compare/v1.4.0...v1.4.1)

---
updated-dependencies:
- dependency-name: github.com/modelcontextprotocol/go-sdk
  dependency-version: 1.4.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-24 00:36:23 +01:00
dependabot[bot]
0976b8a17b chore(deps): bump github.com/google/go-containerregistry from 0.21.2 to 0.21.3 (#9121)
chore(deps): bump github.com/google/go-containerregistry

Bumps [github.com/google/go-containerregistry](https://github.com/google/go-containerregistry) from 0.21.2 to 0.21.3.
- [Release notes](https://github.com/google/go-containerregistry/releases)
- [Commits](https://github.com/google/go-containerregistry/compare/v0.21.2...v0.21.3)

---
updated-dependencies:
- dependency-name: github.com/google/go-containerregistry
  dependency-version: 0.21.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-24 00:35:55 +01:00
LocalAI [bot]
2ad8c149e0 chore: ⬆️ Update ggml-org/llama.cpp to 1772701f99dd3fc13f5783b282c2361eda8ca47c (#9123)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-24 00:35:40 +01:00
LocalAI [bot]
31fcb1425d chore: ⬆️ Update ggml-org/llama.cpp to 49bfddeca18e62fa3d39114a23e9fcbdf8a22388 (#9102)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-23 01:11:18 +01:00
LocalAI [bot]
470d5e506f feat(swagger): update swagger (#9103)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-22 21:28:20 +01:00
Ettore Di Giacinto
0ee49cf42e Fix formatting in LocalAI description
Updated the description formatting for LocalAI.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-22 21:28:07 +01:00
Ettore Di Giacinto
cecd8d6aa5 chore(docs): simplify
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 20:24:44 +00:00
Ettore Di Giacinto
15935e9d5f fix(auth): do not allow to register in invite mode (#9101)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 20:44:03 +01:00
Ettore Di Giacinto
5d410e5a03 fix(download): do not remove dst dir until we try all fallbacks (#9100)
This actually caused fallbacks to be compeletely no-op as we were
removing the destination dir before calling containerd.Apply

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 10:29:57 +01:00
Ettore Di Giacinto
5df77d7e8c Remove how-tos section link from README
Removed outdated link to community curated how-tos section.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-22 10:05:13 +01:00
Ettore Di Giacinto
f891d60d26 fix(llama.cpp): bundle libdl, librt, libpthread in llama-cpp backend (#9099)
chore(llama.cpp): bundle libdl, librt, libpthread in llama-cpp backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 00:58:14 +01:00
Ettore Di Giacinto
be25217955 chore(transformers): bump to >5.0 and generically load models (#9097)
* chore(transformers): bump to >5.0

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: refactor to use generic model loading

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 00:57:54 +01:00
LocalAI [bot]
b74111feed chore: ⬆️ Update ggml-org/llama.cpp to 990e4d96980d0b016a2b07049cc9031642fb9903 (#9095)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-22 00:57:39 +01:00
LocalAI [bot]
bf92117259 chore: ⬆️ Update ggml-org/whisper.cpp to 76684141a5d059be71cbe23dc2f0ed552213ba2d (#9094)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-22 00:57:28 +01:00
Ettore Di Giacinto
031a36c995 feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092)
* feat: wire min_p

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: inferencing defaults

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(refactor): re-use iterative parser

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: generate automatically inference defaults from unsloth

Instead of trying to re-invent the wheel and maintain here the inference
defaults, prefer to consume unsloth ones, and contribute there as
necessary.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: apply defaults also to models installed via gallery

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: be consistent and apply fallback to all endpoint

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 00:57:15 +01:00
LocalAI [bot]
8036d22ec6 chore: ⬆️ Update ace-step/acestep.cpp to 7326a7bea0c2037982ec924f7364e998df70450c (#9086)
⬆️ Update ace-step/acestep.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-22 00:56:52 +01:00
Ettore Di Giacinto
f7e8d9e791 feat(quantization): add quantization backend (#9096)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 00:56:34 +01:00
Ettore Di Giacinto
4b183b7bb6 feat: add quota system (#9090)
* feat: add quota system

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fix tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-21 10:09:49 +01:00
Ettore Di Giacinto
f38e91d80b feat(ui): add predictor for usage, user-breakdown statistics (#9091)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-21 10:09:36 +01:00
LocalAI [bot]
aa3e82976e chore: ⬆️ Update ggml-org/llama.cpp to 4cb7e0bd61e7e1101e8ab10db5dee70c5717a386 (#9087)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-21 09:41:11 +01:00
Ettore Di Giacinto
d9c1db2b87 feat: add (experimental) fine-tuning support with TRL (#9088)
* feat: add fine-tuning endpoint

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(experimental): add fine-tuning endpoint and TRL support

This changeset defines new GRPC signatues for Fine tuning backends, and
add TRL backend as initial fine-tuning engine. This implementation also
supports exporting to GGUF and automatically importing it to LocalAI
after fine-tuning.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* commit TRL backend, stop by killing process

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* move fine-tune to generic features

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* add evals, reorder menu

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fix tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-21 02:08:02 +01:00
LocalAI [bot]
f7e3aab4fc feat(swagger): update swagger (#9085)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-20 21:45:03 +01:00
Richard Palethorpe
73bdc3b50d fix(realtime): Set the alias for opus so the development backend can be selected (#9083)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-20 15:08:07 +01:00
Richard Palethorpe
cb63bdb9e4 feat(ui): Add model pipeline editor (#9070)
This creates a new model config page. Presently just allows configuring
pipelines, but can be extending the future to other types of models.
However pipelines are quite easy to create a form for and require
editing to create.

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-20 15:07:34 +01:00
Richard Palethorpe
8cd3f9fc47 feat(ui, openai): Structured errors and link to traces in error toast (#9068)
First when sending errors over SSE we now clearly identify them as such
instead of just sending the error string as a chat completion message.

We use this in the UI to identify errors and link to them to the traces.

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-20 15:06:07 +01:00
lif
e0ab1a8b43 fix: use exact tag matching for model gallery tag filtering (#9041)
The Search() method uses strings.Contains() on comma-joined tags,
causing substring false positives (e.g., "asr" matching "image-diffusers").

Add FilterByTag() method that checks each tag with strings.EqualFold()
for exact, case-insensitive matching. Add 'tag' query parameter to
/api/models and /api/backends endpoints. Update the React frontend to
send filter selections as 'tag' instead of 'term'.

Closes #8775

Signed-off-by: majiayu000 <1835304752@qq.com>
2026-03-20 08:37:45 +01:00
Ettore Di Giacinto
c3174f9543 chore(deps): bump llama-cpp to 'a0bbcdd9b6b83eeeda6f1216088f42c33d464e38' (#9079)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-20 08:12:21 +01:00
LocalAI [bot]
2b12875302 fix: Add tracing settings loading from runtime_settings.json (#9081)
Tracing settings (EnableTracing and TracingMaxItems) were not being
loaded from runtime_settings.json on startup, causing tracing settings
configured via WebUI to be lost after service restart.

This fix adds proper loading of tracing settings in
loadRuntimeSettingsFromFile function in core/application/startup.go.

Fixes #9072

Co-authored-by: localai-bot <localai-bot@localai.io>
2026-03-20 00:58:52 +01:00
Ettore Di Giacinto
9cdbd89c1f chore(agents.md): update with auth/feature gating instructions
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-19 22:52:28 +00:00
LocalAI [bot]
7d81bf0aa3 chore: ⬆️ Update ggml-org/whisper.cpp to 9386f239401074690479731c1e41683fbbeac557 (#9077)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-19 23:27:35 +01:00
Tv
a6d0e29eba fix(openresponses): do not omit required field ORItemParam.Arguments (#9074)
See #9047
2026-03-19 22:04:45 +01:00
LocalAI [bot]
6054d2a91b feat(swagger): update swagger (#9075)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-19 21:49:40 +01:00
Ettore Di Giacinto
aea21951a2 feat: add users and authentication support (#9061)
* feat(ui): add users and authentication support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: allow the admin user to impersonificate users

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: ui improvements, disable 'Users' button in navbar when no auth is configured

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: add OIDC support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: gate models

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: cache requests to optimize speed

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* small UI enhancements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ui): style improvements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: cover other paths by auth

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: separate local auth, refactor

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* security hardening, approval mode

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: fix tests and expectations

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: update localagi/localrecall

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-19 21:40:51 +01:00
LocalAI [bot]
bbe9067227 docs: Add troubleshooting guide for embedding models (fixes #9064) (#9065)
docs: Add troubleshooting guide for embedding models (#9064)

- Add section on using gallery models for embeddings
- Document common issues with embedding model configuration
- Add troubleshooting guide for Qwen3 embedding models
- Include correct configuration examples for Qwen3-Embedding-4B
- Document context size limits and dimension parameters
- Add table of Qwen3 embedding model specifications

Fixes #9064

Signed-off-by: localai-bot <localai-bot@localai.io>
Co-authored-by: localai-bot <localai-bot@localai.io>
2026-03-19 19:41:12 +01:00
LocalAI [bot]
9a9da062e1 chore: ⬆️ Update ggml-org/llama.cpp to 5744d7ec430e2f875a393770195fda530560773f (#9063)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-19 07:58:30 +01:00
LocalAI [bot]
dd1a8b174f chore: ⬆️ Update ggml-org/whisper.cpp to ef3463bb29ef90d25dfabfd1e75993111c52412d (#9062)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-19 07:58:11 +01:00
Richard Palethorpe
cfb7641eea feat(ui, gallery): Show model backends and add searchable model/backend selector (#9060)
* feat(ui, gallery): Display and filter by the backend models use

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(ui): Add searchable model backend/model selector and prevent delete models being selected

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-18 21:14:41 +01:00
Richard Palethorpe
e832efeb9e fix(ui): Refresh model list on deletion (#9059)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-18 14:07:45 +01:00
dependabot[bot]
a42548e9d1 chore(deps): bump playwright from 1.52.0 to 1.58.2 in /core/http/react-ui in the npm_and_yarn group across 1 directory (#9055)
chore(deps): bump playwright

Bumps the npm_and_yarn group with 1 update in the /core/http/react-ui directory: [playwright](https://github.com/microsoft/playwright).


Updates `playwright` from 1.52.0 to 1.58.2
- [Release notes](https://github.com/microsoft/playwright/releases)
- [Commits](https://github.com/microsoft/playwright/compare/v1.52.0...v1.58.2)

---
updated-dependencies:
- dependency-name: playwright
  dependency-version: 1.58.2
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-18 14:05:59 +01:00
LocalAI [bot]
8615ce28a8 feat: Add standalone agent run mode inspired by LocalAGI (#9056)
- Add 'agent' subcommand with 'run' and 'list' sub-commands
- Support running agents by name from pool.json registry
- Support running agents from JSON config files
- Implement foreground mode with --prompt flag for single-turn interactions
- Reuse AgentPoolService for consistent agent initialization
- Add comprehensive unit tests for config loading and overrides

Fixes #8960

Signed-off-by: localai-bot <localai-bot@users.noreply.github.com>
Co-authored-by: localai-bot <localai-bot@noreply.github.com>
2026-03-18 14:04:20 +01:00
Ettore Di Giacinto
8336efec41 fix(ui): correctly display backend if specified in the model config, re-order MCP buttons (#9053)
fix(ui): correctly display backend if specified in the model config

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-18 09:58:25 +01:00
LocalAI [bot]
8560a1e571 chore: ⬆️ Update ace-step/acestep.cpp to ab020a9aefcd364423e0665da12babc6b0c7b507 (#9046)
⬆️ Update ace-step/acestep.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-18 08:54:15 +01:00
LocalAI [bot]
29c33e6a6a chore: ⬆️ Update ggml-org/whisper.cpp to dc9611662265870df22a7230b7586176a99c1955 (#9045)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-18 08:46:35 +01:00
LocalAI [bot]
a58475dbef chore: ⬆️ Update ggml-org/llama.cpp to ee4801e5a6ee7ee4063144ab44ab4e127f76fba8 (#9044)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-18 08:46:12 +01:00
Tv
8a0edd0809 Always populate ORItemParam.Summary (#9049)
* fix(openresponses): do not omit required fields summary and id

* fix(openresponses): ensure ORItemParam.Summary is never null

Normalize Summary to an empty slice at serialization chokepoints
(sendSSEEvent, bufferEvent, buildORResponse) so it always serializes
as [] instead of null.

Closes #9047
2026-03-18 08:45:46 +01:00
Richard Palethorpe
35d509d8e7 feat(ui): Per model backend logs and various fixes (#9028)
* feat(gallery): Switch to expandable box instead of pop-over and display model files

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(ui, backends): Add individual backend logging

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(ui): Set the context settings from the model config

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-18 08:31:26 +01:00
Ettore Di Giacinto
eef808d921 fix: call .String() on AllowedTools
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-17 23:16:19 +00:00
Ettore Di Giacinto
9d9ea5c1a0 chore(deps): bump skillserver
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-17 21:04:52 +00:00
LocalAI [bot]
e21ad5cfaa chore: ⬆️ Update leejet/stable-diffusion.cpp to 545fac4f3fb0117a4e962b1a04cf933a7e635933 (#9036)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-17 18:07:30 +01:00
dependabot[bot]
05ab0c0aa2 chore(deps): bump github.com/google/go-containerregistry from 0.21.1 to 0.21.2 (#9033)
chore(deps): bump github.com/google/go-containerregistry

Bumps [github.com/google/go-containerregistry](https://github.com/google/go-containerregistry) from 0.21.1 to 0.21.2.
- [Release notes](https://github.com/google/go-containerregistry/releases)
- [Commits](https://github.com/google/go-containerregistry/compare/v0.21.1...v0.21.2)

---
updated-dependencies:
- dependency-name: github.com/google/go-containerregistry
  dependency-version: 0.21.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-17 11:43:25 +01:00
dependabot[bot]
e2b6233570 chore(deps): bump actions/upload-artifact from 4 to 7 (#9030)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-17 11:42:49 +01:00
dependabot[bot]
19f995f38f chore(deps): bump github.com/ebitengine/purego from 0.9.1 to 0.10.0 (#9034)
Bumps [github.com/ebitengine/purego](https://github.com/ebitengine/purego) from 0.9.1 to 0.10.0.
- [Release notes](https://github.com/ebitengine/purego/releases)
- [Commits](https://github.com/ebitengine/purego/compare/v0.9.1...v0.10.0)

---
updated-dependencies:
- dependency-name: github.com/ebitengine/purego
  dependency-version: 0.10.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-17 11:42:06 +01:00
dependabot[bot]
ac168bbc60 chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.26.0 to 1.27.0 (#9035)
chore(deps): bump github.com/anthropics/anthropic-sdk-go

Bumps [github.com/anthropics/anthropic-sdk-go](https://github.com/anthropics/anthropic-sdk-go) from 1.26.0 to 1.27.0.
- [Release notes](https://github.com/anthropics/anthropic-sdk-go/releases)
- [Changelog](https://github.com/anthropics/anthropic-sdk-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/anthropics/anthropic-sdk-go/compare/v1.26.0...v1.27.0)

---
updated-dependencies:
- dependency-name: github.com/anthropics/anthropic-sdk-go
  dependency-version: 1.27.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-17 11:41:36 +01:00
LocalAI [bot]
5c5e537b31 chore: ⬆️ Update ace-step/acestep.cpp to 15740f4301b3ec3020875f1fb975a6cfdb2f6767 (#9038)
⬆️ Update ace-step/acestep.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-17 10:22:53 +01:00
LocalAI [bot]
118bcee196 chore: ⬆️ Update ggml-org/llama.cpp to 9b342d0a9f2f4892daec065491583ec2be129685 (#9039)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-17 10:22:42 +01:00
LocalAI [bot]
3eabd6d1d0 chore: ⬆️ Update ggml-org/whisper.cpp to 79218f51d02ffe70575ef7fba3496dfc7adda027 (#9037)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-17 08:25:31 +01:00
Ettore Di Giacinto
ee96e5e08d chore: refactor endpoints to use same inferencing path, add automatic retrial mechanism in case of errors (#9029)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-16 21:31:02 +01:00
Richard Palethorpe
3d9ccd1ddc fix(ui): Add tracing inline settings back and create UI tests (#9027)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-16 17:51:06 +01:00
Ettore Di Giacinto
d8161bfe57 fix(api): unescape model names (#9024)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-16 00:56:37 +01:00
Ettore Di Giacinto
5fd42399d4 feat: support streaming mode for tool calls in agent mode, fix interleaved thinking stream (#9023)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-16 00:50:19 +01:00
LocalAI [bot]
b2030255ca chore: ⬆️ Update ggml-org/llama.cpp to 88915cb55c14769738fcab7f1c6eaa6dcc9c2b0c (#9020)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-16 00:10:11 +01:00
LocalAI [bot]
9f903ec06e chore: ⬆️ Update leejet/stable-diffusion.cpp to 862a6586cb6fcec037c14f9ed902329ecec7d990 (#9019)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-16 00:09:59 +01:00
Ettore Di Giacinto
4ea461c330 fix(ui): correctly map watchdog fields (#9022)
Fixes: https://github.com/mudler/LocalAI/issues/9018

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-15 22:12:24 +01:00
Ettore Di Giacinto
042a9b8ef6 Remove Table of Contents from README
Removed the Table of Contents section from the README.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-15 21:49:13 +01:00
Ettore Di Giacinto
65f1a4154a Revise README structure and content
Updated the README to include new sections and remove outdated content.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-15 21:47:08 +01:00
LocalAI [bot]
c6a51289b0 fix: Automatically disable mmap for Intel SYCL backends (#9012) (#9015)
* fix: Automatically disable mmap for Intel SYCL backends

Fixes issue #9012 where Qwen3.5 models fail to load on Intel Arc GPU
with RPC EOF error.

The Intel SYCL backend has a known issue where mmap enabled causes
the backend to hang. This change automatically disables mmap when
detecting Intel or SYCL backends.

References:
- https://github.com/mudler/LocalAI/issues/9012
- Documentation mentions: SYCL hangs when mmap: true is set

* feat: Add logging for mmap auto-disable on Intel SYCL backends

As requested in PR review, add xlog.Info call to log when mmap
is automatically disabled for Intel SYCL backends. This helps
with debugging and confirms the auto-disable logic is working.

---------

Co-authored-by: localai-bot <localai-bot@users.noreply.github.com>
2026-03-15 21:06:35 +01:00
LocalAI [bot]
87525109f1 chore: ⬆️ Update ggml-org/llama.cpp to 3a6f059909ed5dab8587df5df4120315053d57a4 (#9009)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-15 09:46:45 +01:00
Ettore Di Giacinto
c596d8a5d9 fix: Change baseDir assignment to use ModelPath (#9010)
Fixes: https://github.com/mudler/LocalAI/issues/9005

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-15 09:45:58 +01:00
LocalAI [bot]
d79ad76e48 docs: ⬆️ update docs version mudler/LocalAI (#9008)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-03-15 08:39:49 +01:00
Ettore Di Giacinto
dde0353432 chore(api): add path to expose collection raw files
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-14 22:31:45 +00:00
798 changed files with 337391 additions and 250205 deletions

View File

@@ -28,7 +28,7 @@ Add build matrix entries for each platform/GPU type you want to support. Look at
- CUDA 13 builds: Add after other CUDA 13 builds (e.g., after `gpu-nvidia-cuda-13-chatterbox`)
**Additional build types you may need:**
- ROCm/HIP: Use `build-type: 'hipblas'` with `base-image: "rocm/dev-ubuntu-24.04:6.4.4"`
- ROCm/HIP: Use `build-type: 'hipblas'` with `base-image: "rocm/dev-ubuntu-24.04:7.2.1"`
- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"`
- L4T (ARM): Use `build-type: 'l4t'` with `platforms: 'linux/arm64'` and `runs-on: 'ubuntu-24.04-arm'`

View File

@@ -0,0 +1,111 @@
# Adding GGUF Models from HuggingFace to the Gallery
When adding a GGUF model from HuggingFace to the LocalAI model gallery, follow this guide.
## Gallery file
All models are defined in `gallery/index.yaml`. Find the appropriate section (embedding models near other embeddings, chat models near similar chat models) and add a new entry.
## Getting the SHA256
GGUF files on HuggingFace expose their SHA256 via the `x-linked-etag` HTTP header. Fetch it with:
```bash
curl -sI "https://huggingface.co/<org>/<repo>/resolve/main/<filename>.gguf" | grep -i x-linked-etag
```
The value (without quotes) is the SHA256 hash. Example:
```bash
curl -sI "https://huggingface.co/ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/resolve/main/embeddinggemma-300m-qat-Q8_0.gguf" | grep -i x-linked-etag
# x-linked-etag: "6fa0c02a9c302be6f977521d399b4de3a46310a4f2621ee0063747881b673f67"
```
**Important**: Pay attention to exact filename casing — HuggingFace filenames are case-sensitive (e.g., `Q8_0` vs `q8_0`). Check the repo's file listing to get the exact name.
## Entry format — Embedding models
Embedding models use `gallery/virtual.yaml` as the base config and set `embeddings: true`:
```yaml
- name: "model-name"
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/<original-model-org>/<original-model-name>
- https://huggingface.co/<gguf-org>/<gguf-repo-name>
description: |
Short description of the model, its size, and capabilities.
tags:
- embeddings
overrides:
backend: llama-cpp
embeddings: true
parameters:
model: <filename>.gguf
files:
- filename: <filename>.gguf
uri: huggingface://<gguf-org>/<gguf-repo-name>/<filename>.gguf
sha256: <sha256-hash>
```
## Entry format — Chat/LLM models
Chat models typically reference a template config (e.g., `gallery/gemma.yaml`, `gallery/chatml.yaml`) that defines the prompt format. Use YAML anchors (`&name` / `*name`) if adding multiple quantization variants of the same model:
```yaml
- &model-anchor
url: "github:mudler/LocalAI/gallery/<template>.yaml@master"
name: "model-name"
icon: https://example.com/icon.png
license: <license>
urls:
- https://huggingface.co/<org>/<model>
- https://huggingface.co/<gguf-org>/<gguf-repo>
description: |
Model description.
tags:
- llm
- gguf
- gpu
- cpu
overrides:
parameters:
model: <filename>-Q4_K_M.gguf
files:
- filename: <filename>-Q4_K_M.gguf
sha256: <sha256>
uri: huggingface://<gguf-org>/<gguf-repo>/<filename>-Q4_K_M.gguf
```
To add a variant (e.g., different quantization), use YAML merge:
```yaml
- !!merge <<: *model-anchor
name: "model-name-q8"
overrides:
parameters:
model: <filename>-Q8_0.gguf
files:
- filename: <filename>-Q8_0.gguf
sha256: <sha256>
uri: huggingface://<gguf-org>/<gguf-repo>/<filename>-Q8_0.gguf
```
## Available template configs
Look at existing `.yaml` files in `gallery/` to find the right prompt template for your model architecture:
- `gemma.yaml` — Gemma-family models (gemma, embeddinggemma, etc.)
- `chatml.yaml` — ChatML format (many Mistral/OpenHermes models)
- `deepseek.yaml` — DeepSeek models
- `virtual.yaml` — Minimal base (good for embedding models that don't need chat templates)
## Checklist
1. **Find the GGUF file** on HuggingFace — note exact filename (case-sensitive)
2. **Get the SHA256** using the `curl -sI` + `x-linked-etag` method above
3. **Choose the right template** config from `gallery/` based on model architecture
4. **Add the entry** to `gallery/index.yaml` near similar models
5. **Set `embeddings: true`** if it's an embedding model
6. **Include both URLs** — the original model page and the GGUF repo
7. **Write a description** — mention model size, capabilities, and quantization type

View File

@@ -0,0 +1,259 @@
# API Endpoints and Authentication
This guide covers how to add new API endpoints and properly integrate them with the auth/permissions system.
## Architecture overview
Authentication and authorization flow through three layers:
1. **Global auth middleware** (`core/http/auth/middleware.go``auth.Middleware`) — applied to every request in `core/http/app.go`. Handles session cookies, Bearer tokens, API keys, and legacy API keys. Populates `auth_user` and `auth_role` in the Echo context.
2. **Feature middleware** (`auth.RequireFeature`) — per-feature access control applied to route groups or individual routes. Checks if the authenticated user has the specific feature enabled.
3. **Admin middleware** (`auth.RequireAdmin`) — restricts endpoints to admin users only.
When auth is disabled (no auth DB, no legacy API keys), all middleware becomes pass-through (`auth.NoopMiddleware`).
## Adding a new API endpoint
### Step 1: Create the handler
Write the endpoint handler in the appropriate package under `core/http/endpoints/`. Follow existing patterns:
```go
// core/http/endpoints/localai/my_feature.go
func MyFeatureEndpoint(app *application.Application) echo.HandlerFunc {
return func(c echo.Context) error {
// Use auth.GetUser(c) to get the authenticated user (may be nil if auth is disabled)
user := auth.GetUser(c)
// Your logic here
return c.JSON(http.StatusOK, result)
}
}
```
### Step 2: Register routes
Add routes in the appropriate file under `core/http/routes/`. The file you use depends on the endpoint category:
| File | Category |
|------|----------|
| `routes/openai.go` | OpenAI-compatible API endpoints (`/v1/...`) |
| `routes/localai.go` | LocalAI-specific endpoints (`/api/...`, `/models/...`, `/backends/...`) |
| `routes/agents.go` | Agent pool endpoints (`/api/agents/...`) |
| `routes/auth.go` | Auth endpoints (`/api/auth/...`) |
| `routes/ui_api.go` | UI backend API endpoints |
### Step 3: Apply the right middleware
Choose the appropriate protection level:
#### No auth required (public)
Exempt paths bypass auth entirely. Add to `isExemptPath()` in `middleware.go` or use the `/api/auth/` prefix (always exempt). Use sparingly — most endpoints should require auth.
#### Standard auth (any authenticated user)
The global middleware already handles this. API paths (`/api/`, `/v1/`, etc.) automatically require authentication when auth is enabled. You don't need to add any extra middleware.
```go
router.GET("/v1/my-endpoint", myHandler) // auth enforced by global middleware
```
#### Admin only
Pass `adminMiddleware` to the route. This is set up in `app.go` and passed to `Register*Routes` functions:
```go
// In the Register function signature, accept the middleware:
func RegisterMyRoutes(router *echo.Echo, app *application.Application, adminMiddleware echo.MiddlewareFunc) {
router.POST("/models/apply", myHandler, adminMiddleware)
}
```
#### Feature-gated
For endpoints that should be toggleable per-user, use feature middleware. There are two approaches:
**Approach A: Route-level middleware** (preferred for groups of related endpoints)
```go
// In app.go, create the feature middleware:
myFeatureMw := auth.RequireFeature(application.AuthDB(), auth.FeatureMyFeature)
// Pass it to the route registration function:
routes.RegisterMyRoutes(e, app, myFeatureMw)
// In the routes file, apply to a group:
g := e.Group("/api/my-feature", myFeatureMw)
g.GET("", listHandler)
g.POST("", createHandler)
```
**Approach B: RouteFeatureRegistry** (preferred for individual OpenAI-compatible endpoints)
Add an entry to `RouteFeatureRegistry` in `core/http/auth/features.go`. The `RequireRouteFeature` global middleware will automatically enforce it:
```go
var RouteFeatureRegistry = []RouteFeature{
// ... existing entries ...
{"POST", "/v1/my-endpoint", FeatureMyFeature},
}
```
## Adding a new feature
When you need a new toggleable feature (not just a new endpoint under an existing feature):
### 1. Define the feature constant
Add to `core/http/auth/permissions.go`:
```go
const (
// Add to the appropriate group:
// Agent features (default OFF for new users)
FeatureMyFeature = "my_feature"
// OR API features (default ON for new users)
FeatureMyFeature = "my_feature"
)
```
Then add it to the appropriate slice:
```go
// Default OFF — user must be explicitly granted access:
var AgentFeatures = []string{..., FeatureMyFeature}
// Default ON — user has access unless explicitly revoked:
var APIFeatures = []string{..., FeatureMyFeature}
```
### 2. Add feature metadata
In `core/http/auth/features.go`, add to the appropriate `FeatureMetas` function so the admin UI can display it:
```go
func AgentFeatureMetas() []FeatureMeta {
return []FeatureMeta{
// ... existing ...
{FeatureMyFeature, "My Feature", false}, // false = default OFF
}
}
```
### 3. Wire up the middleware
In `core/http/app.go`:
```go
myFeatureMw := auth.RequireFeature(application.AuthDB(), auth.FeatureMyFeature)
```
Then pass it to the route registration function.
### 4. Register route-feature mappings (if applicable)
If your feature gates standard API endpoints (like `/v1/...`), add entries to `RouteFeatureRegistry` in `features.go` instead of using per-route middleware.
## Accessing the authenticated user in handlers
```go
import "github.com/mudler/LocalAI/core/http/auth"
func MyHandler(c echo.Context) error {
// Get the user (nil when auth is disabled or unauthenticated)
user := auth.GetUser(c)
if user == nil {
// Handle unauthenticated — or let middleware handle it
}
// Check role
if user.Role == auth.RoleAdmin {
// admin-specific logic
}
// Check feature access programmatically (when you need conditional behavior, not full blocking)
if auth.HasFeatureAccess(db, user, auth.FeatureMyFeature) {
// feature-specific logic
}
// Check model access
if !auth.IsModelAllowed(db, user, modelName) {
return c.JSON(http.StatusForbidden, ...)
}
}
```
## Middleware composition patterns
Middleware can be composed at different levels. Here are the patterns used in the codebase:
### Group-level middleware (agents pattern)
```go
// All routes in the group share the middleware
g := e.Group("/api/agents", poolReadyMw, agentsMw)
g.GET("", listHandler)
g.POST("", createHandler)
```
### Per-route middleware (localai pattern)
```go
// Individual routes get middleware as extra arguments
router.POST("/models/apply", applyHandler, adminMiddleware)
router.GET("/metrics", metricsHandler, adminMiddleware)
```
### Middleware slice (openai pattern)
```go
// Build a middleware chain for a handler
chatMiddleware := []echo.MiddlewareFunc{
usageMiddleware,
traceMiddleware,
modelFilterMiddleware,
}
app.POST("/v1/chat/completions", chatHandler, chatMiddleware...)
```
## Error response format
Always use `schema.ErrorResponse` for auth/permission errors to stay consistent with the OpenAI-compatible API:
```go
return c.JSON(http.StatusForbidden, schema.ErrorResponse{
Error: &schema.APIError{
Message: "feature not enabled for your account",
Code: http.StatusForbidden,
Type: "authorization_error",
},
})
```
Use these HTTP status codes:
- `401 Unauthorized` — no valid credentials provided
- `403 Forbidden` — authenticated but lacking permission
- `429 Too Many Requests` — rate limited (auth endpoints)
## Usage tracking
If your endpoint should be tracked for usage (token counts, request counts), add the `usageMiddleware` to its middleware chain. See `core/http/middleware/usage.go` and how it's applied in `routes/openai.go`.
## Path protection rules
The global auth middleware classifies paths as API paths or non-API paths:
- **API paths** (always require auth when auth is enabled): `/api/`, `/v1/`, `/models/`, `/backends/`, `/backend/`, `/tts`, `/vad`, `/video`, `/stores/`, `/system`, `/ws/`, `/metrics`
- **Exempt paths** (never require auth): `/api/auth/` prefix, anything in `appConfig.PathWithoutAuth`
- **Non-API paths** (UI, static assets): pass through without auth — the React UI handles login redirects client-side
If you add endpoints under a new top-level path prefix, add it to `isAPIPath()` in `middleware.go` to ensure it requires authentication.
## Checklist
When adding a new endpoint:
- [ ] Handler in `core/http/endpoints/`
- [ ] Route registered in appropriate `core/http/routes/` file
- [ ] Auth level chosen: public / standard / admin / feature-gated
- [ ] If feature-gated: constant in `permissions.go`, metadata in `features.go`, middleware in `app.go`
- [ ] If new path prefix: added to `isAPIPath()` in `middleware.go`
- [ ] If OpenAI-compatible: entry in `RouteFeatureRegistry`
- [ ] If token-counting: `usageMiddleware` added to middleware chain
- [ ] Error responses use `schema.ErrorResponse` format
- [ ] Tests cover both authenticated and unauthenticated access

View File

@@ -10,7 +10,7 @@ Let's say the user wants to build a particular backend for a given platform. For
- At a minimum we need to set the BUILD_TYPE, BASE_IMAGE build-args
- Use .github/workflows/backend.yml as a reference it lists the needed args in the `include` job strategy matrix
- l4t and cublas also requires the CUDA major and minor version
- You can pretty print a command like `DOCKER_MAKEFLAGS=-j$(nproc --ignore=1) BUILD_TYPE=hipblas BASE_IMAGE=rocm/dev-ubuntu-24.04:6.4.4 make docker-build-coqui`
- You can pretty print a command like `DOCKER_MAKEFLAGS=-j$(nproc --ignore=1) BUILD_TYPE=hipblas BASE_IMAGE=rocm/dev-ubuntu-24.04:7.2.1 make docker-build-coqui`
- Unless the user specifies that they want you to run the command, then just print it because not all agent frontends handle long running jobs well and the output may overflow your context
- The user may say they want to build AMD or ROCM instead of hipblas, or Intel instead of SYCL or NVIDIA insted of l4t or cublas. Ask for confirmation if there is ambiguity.
- Sometimes the user may need extra parameters to be added to `docker build` (e.g. `--platform` for cross-platform builds or `--progress` to view the full logs), in which case you can generate the `docker build` command directly.

View File

@@ -49,3 +49,4 @@ The project documentation is located in `docs/content`. When adding new features
- **Feature Documentation**: If you add a new feature (like a new backend or API endpoint), create a new markdown file in `docs/content/features/` explaining what it is, how to configure it, and how to use it.
- **Configuration**: If you modify configuration options, update the relevant sections in `docs/content/`.
- **Examples**: providing concrete examples (like YAML configuration blocks) is highly encouraged to help users get started quickly.
- **Shortcodes**: Use `{{% notice note %}}`, `{{% notice tip %}}`, or `{{% notice warning %}}` for callout boxes. Do **not** use `{{% alert %}}` — that shortcode does not exist in this project's Hugo theme and will break the docs build.

View File

@@ -0,0 +1,141 @@
# Debugging and Rebuilding Backends
When a backend fails at runtime (e.g. a gRPC method error, a Python import error, or a dependency conflict), use this guide to diagnose, fix, and rebuild.
## Architecture Overview
- **Source directory**: `backend/python/<name>/` (or `backend/go/<name>/`, `backend/cpp/<name>/`)
- **Installed directory**: `backends/<name>/` — this is what LocalAI actually runs. It is populated by `make backends/<name>` which builds a Docker image, exports it, and installs it via `local-ai backends install`.
- **Virtual environment**: `backends/<name>/venv/` — the installed Python venv (for Python backends). The Python binary is at `backends/<name>/venv/bin/python`.
Editing files in `backend/python/<name>/` does **not** affect the running backend until you rebuild with `make backends/<name>`.
## Diagnosing Failures
### 1. Check the logs
Backend gRPC processes log to LocalAI's stdout/stderr. Look for lines tagged with the backend's model ID:
```
GRPC stderr id="trl-finetune-127.0.0.1:37335" line="..."
```
Common error patterns:
- **"Method not implemented"** — the backend is missing a gRPC method that the Go side calls. The model loader (`pkg/model/initializers.go`) always calls `LoadModel` after `Health`; fine-tuning backends must implement it even as a no-op stub.
- **Python import errors / `AttributeError`** — usually a dependency version mismatch (e.g. `pyarrow` removing `PyExtensionType`).
- **"failed to load backend"** — the gRPC process crashed or never started. Check stderr lines for the traceback.
### 2. Test the Python environment directly
You can run the installed venv's Python to check imports without starting the full server:
```bash
backends/<name>/venv/bin/python -c "import datasets; print(datasets.__version__)"
```
If `pip` is missing from the venv, bootstrap it:
```bash
backends/<name>/venv/bin/python -m ensurepip
```
Then use `backends/<name>/venv/bin/python -m pip install ...` to test fixes in the installed venv before committing them to the source requirements.
### 3. Check upstream dependency constraints
When you hit a dependency conflict, check what the main library expects. For example, TRL's upstream `requirements.txt`:
```
https://github.com/huggingface/trl/blob/main/requirements.txt
```
Pin minimum versions in the backend's requirements files to match upstream.
## Common Fixes
### Missing gRPC methods
If the Go side calls a method the backend doesn't implement (e.g. `LoadModel`), add a no-op stub in `backend.py`:
```python
def LoadModel(self, request, context):
"""No-op — actual loading happens elsewhere."""
return backend_pb2.Result(success=True, message="OK")
```
The gRPC contract requires `LoadModel` to succeed for the model loader to return a usable client, even if the backend doesn't need upfront model loading.
### Dependency version conflicts
Python backends often break when a transitive dependency releases a breaking change (e.g. `pyarrow` removing `PyExtensionType`). Steps:
1. Identify the broken import in the logs
2. Test in the installed venv: `backends/<name>/venv/bin/python -c "import <module>"`
3. Check upstream requirements for version constraints
4. Update **all** requirements files in `backend/python/<name>/`:
- `requirements.txt` — base deps (grpcio, protobuf)
- `requirements-cpu.txt` — CPU-specific (includes PyTorch CPU index)
- `requirements-cublas12.txt` — CUDA 12
- `requirements-cublas13.txt` — CUDA 13
5. Rebuild: `make backends/<name>`
### PyTorch index conflicts (uv resolver)
The Docker build uses `uv` for pip installs. When `--extra-index-url` points to the PyTorch wheel index, `uv` may refuse to fetch packages like `requests` from PyPI if it finds a different version on the PyTorch index first. Fix this by adding `--index-strategy=unsafe-first-match` to `install.sh`:
```bash
EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
installRequirements
```
Most Python backends already do this — check `backend/python/transformers/install.sh` or similar for reference.
## Rebuilding
### Rebuild a single backend
```bash
make backends/<name>
```
This runs the Docker build (`Dockerfile.python`), exports the image to `backend-images/<name>.tar`, and installs it into `backends/<name>/`. It also rebuilds the `local-ai` Go binary (without extra tags).
**Important**: If you were previously running with `GO_TAGS=auth`, the `make backends/<name>` step will overwrite your binary without that tag. Rebuild the Go binary afterward:
```bash
GO_TAGS=auth make build
```
### Rebuild and restart
After rebuilding a backend, you must restart LocalAI for it to pick up the new backend files. The backend gRPC process is spawned on demand when the model is first loaded.
```bash
# Kill existing process
kill <pid>
# Restart
./local-ai run --debug [your flags]
```
### Quick iteration (skip Docker rebuild)
For fast iteration on a Python backend's `backend.py` without a full Docker rebuild, you can edit the installed copy directly:
```bash
# Edit the installed copy
vim backends/<name>/backend.py
# Restart LocalAI to respawn the gRPC process
```
This is useful for testing but **does not persist** — the next `make backends/<name>` will overwrite it. Always commit fixes to the source in `backend/python/<name>/`.
## Verification
After fixing and rebuilding:
1. Start LocalAI and confirm the backend registers: look for `Registering backend name="<name>"` in the logs
2. Trigger the operation that failed (e.g. start a fine-tuning job)
3. Watch the GRPC stderr/stdout lines for the backend's model ID
4. Confirm no errors in the traceback

View File

@@ -133,6 +133,7 @@ func getRealReadme(ctx context.Context, repository string) (string, error) {
result, err := cogito.ExecuteTools(llm, fragment,
cogito.WithIterations(3),
cogito.WithMaxAttempts(3),
cogito.DisableSinkState,
cogito.WithTools(&HFReadmeTool{client: hfapi.NewClient()}))
if err != nil {
return "", err
@@ -406,7 +407,7 @@ func getHuggingFaceAvatarURL(author string) string {
}
// Parse the response to get avatar URL
var userInfo map[string]interface{}
var userInfo map[string]any
body, err := io.ReadAll(resp.Body)
if err != nil {
return ""

View File

@@ -79,7 +79,20 @@ func generateYAMLEntry(model ProcessedModel, quantization string) string {
description = cleanTextContent(description)
formattedDescription := formatTextContent(description)
configFile := formatTextContent(modelConfig.ConfigFile)
// Strip name and description from config file since they are
// already present at the gallery entry level and should not
// appear under overrides.
configFileContent := modelConfig.ConfigFile
var cfgMap map[string]any
if err := yaml.Unmarshal([]byte(configFileContent), &cfgMap); err == nil {
delete(cfgMap, "name")
delete(cfgMap, "description")
if cleaned, err := yaml.Marshal(cfgMap); err == nil {
configFileContent = string(cleaned)
}
}
configFile := formatTextContent(configFileContent)
filesYAML, _ := yaml.Marshal(modelConfig.Files)

View File

@@ -3,7 +3,7 @@ package main
import (
"context"
"fmt"
"math/rand"
"math/rand/v2"
"strings"
"time"
)
@@ -13,11 +13,11 @@ func runSyntheticMode() error {
generator := NewSyntheticDataGenerator()
// Generate a random number of synthetic models (1-3)
numModels := generator.rand.Intn(3) + 1
numModels := generator.rand.IntN(3) + 1
fmt.Printf("Generating %d synthetic models for testing...\n", numModels)
var models []ProcessedModel
for i := 0; i < numModels; i++ {
for range numModels {
model := generator.GenerateProcessedModel()
models = append(models, model)
fmt.Printf("Generated synthetic model: %s\n", model.ModelID)
@@ -42,14 +42,14 @@ type SyntheticDataGenerator struct {
// NewSyntheticDataGenerator creates a new synthetic data generator
func NewSyntheticDataGenerator() *SyntheticDataGenerator {
return &SyntheticDataGenerator{
rand: rand.New(rand.NewSource(time.Now().UnixNano())),
rand: rand.New(rand.NewPCG(uint64(time.Now().UnixNano()), 0)),
}
}
// GenerateProcessedModelFile creates a synthetic ProcessedModelFile
func (g *SyntheticDataGenerator) GenerateProcessedModelFile() ProcessedModelFile {
fileTypes := []string{"model", "readme", "other"}
fileType := fileTypes[g.rand.Intn(len(fileTypes))]
fileType := fileTypes[g.rand.IntN(len(fileTypes))]
var path string
var isReadme bool
@@ -68,7 +68,7 @@ func (g *SyntheticDataGenerator) GenerateProcessedModelFile() ProcessedModelFile
return ProcessedModelFile{
Path: path,
Size: int64(g.rand.Intn(1000000000) + 1000000), // 1MB to 1GB
Size: int64(g.rand.IntN(1000000000) + 1000000), // 1MB to 1GB
SHA256: g.randomSHA256(),
IsReadme: isReadme,
FileType: fileType,
@@ -80,19 +80,19 @@ func (g *SyntheticDataGenerator) GenerateProcessedModel() ProcessedModel {
authors := []string{"microsoft", "meta", "google", "openai", "anthropic", "mistralai", "huggingface"}
modelNames := []string{"llama", "gpt", "claude", "mistral", "gemma", "phi", "qwen", "codellama"}
author := authors[g.rand.Intn(len(authors))]
modelName := modelNames[g.rand.Intn(len(modelNames))]
author := authors[g.rand.IntN(len(authors))]
modelName := modelNames[g.rand.IntN(len(modelNames))]
modelID := fmt.Sprintf("%s/%s-%s", author, modelName, g.randomString(6))
// Generate files
numFiles := g.rand.Intn(5) + 2 // 2-6 files
numFiles := g.rand.IntN(5) + 2 // 2-6 files
files := make([]ProcessedModelFile, numFiles)
// Ensure at least one model file and one readme
hasModelFile := false
hasReadme := false
for i := 0; i < numFiles; i++ {
for i := range numFiles {
files[i] = g.GenerateProcessedModelFile()
if files[i].FileType == "model" {
hasModelFile = true
@@ -140,27 +140,27 @@ func (g *SyntheticDataGenerator) GenerateProcessedModel() ProcessedModel {
// Generate sample metadata
licenses := []string{"apache-2.0", "mit", "llama2", "gpl-3.0", "bsd", ""}
license := licenses[g.rand.Intn(len(licenses))]
license := licenses[g.rand.IntN(len(licenses))]
sampleTags := []string{"llm", "gguf", "gpu", "cpu", "text-to-text", "chat", "instruction-tuned"}
numTags := g.rand.Intn(4) + 3 // 3-6 tags
numTags := g.rand.IntN(4) + 3 // 3-6 tags
tags := make([]string, numTags)
for i := 0; i < numTags; i++ {
tags[i] = sampleTags[g.rand.Intn(len(sampleTags))]
for i := range numTags {
tags[i] = sampleTags[g.rand.IntN(len(sampleTags))]
}
// Remove duplicates
tags = g.removeDuplicates(tags)
// Optionally include icon (50% chance)
icon := ""
if g.rand.Intn(2) == 0 {
if g.rand.IntN(2) == 0 {
icon = fmt.Sprintf("https://cdn-avatars.huggingface.co/v1/production/uploads/%s.png", g.randomString(24))
}
return ProcessedModel{
ModelID: modelID,
Author: author,
Downloads: g.rand.Intn(1000000) + 1000,
Downloads: g.rand.IntN(1000000) + 1000,
LastModified: g.randomDate(),
Files: files,
PreferredModelFile: preferredModelFile,
@@ -180,7 +180,7 @@ func (g *SyntheticDataGenerator) randomString(length int) string {
const charset = "abcdefghijklmnopqrstuvwxyz0123456789"
b := make([]byte, length)
for i := range b {
b[i] = charset[g.rand.Intn(len(charset))]
b[i] = charset[g.rand.IntN(len(charset))]
}
return string(b)
}
@@ -189,14 +189,14 @@ func (g *SyntheticDataGenerator) randomSHA256() string {
const charset = "0123456789abcdef"
b := make([]byte, 64)
for i := range b {
b[i] = charset[g.rand.Intn(len(charset))]
b[i] = charset[g.rand.IntN(len(charset))]
}
return string(b)
}
func (g *SyntheticDataGenerator) randomDate() string {
now := time.Now()
daysAgo := g.rand.Intn(365) // Random date within last year
daysAgo := g.rand.IntN(365) // Random date within last year
pastDate := now.AddDate(0, 0, -daysAgo)
return pastDate.Format("2006-01-02T15:04:05.000Z")
}
@@ -220,5 +220,5 @@ func (g *SyntheticDataGenerator) generateReadmeContent(modelName, author string)
fmt.Sprintf("# %s Language Model\n\nDeveloped by %s, this model represents state-of-the-art performance in natural language understanding and generation.\n\n## Key Features\n\n- Multilingual support\n- Context-aware responses\n- Efficient memory usage\n- Fast inference speed\n\n## Applications\n\n- Chatbots and virtual assistants\n- Content generation\n- Code completion\n- Educational tools", strings.Title(modelName), author),
}
return templates[g.rand.Intn(len(templates))]
return templates[g.rand.IntN(len(templates))]
}

View File

@@ -53,6 +53,19 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2204'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-vllm'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'true'
backend: "vllm"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -105,6 +118,19 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-faster-whisper'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'true'
backend: "faster-whisper"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -118,6 +144,32 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-trl'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'true'
backend: "trl"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-llama-cpp-quantization'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'true'
backend: "llama-cpp-quantization"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -366,6 +418,19 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-trl'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "trl"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -522,6 +587,19 @@ jobs:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-sam3-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sam3-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -548,6 +626,19 @@ jobs:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-qwen3-tts-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -757,6 +848,19 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-trl'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "trl"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'l4t'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -913,6 +1017,32 @@ jobs:
backend: "mlx-distributed"
dockerfile: "./backend/Dockerfile.python"
context: "./"
- build-type: 'l4t'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-whisperx'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
ubuntu-version: '2404'
backend: "whisperx"
dockerfile: "./backend/Dockerfile.python"
context: "./"
- build-type: 'l4t'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-faster-whisper'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
ubuntu-version: '2404'
backend: "faster-whisper"
dockerfile: "./backend/Dockerfile.python"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1056,6 +1186,32 @@ jobs:
backend: "stablediffusion-ggml"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-sam3-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sam3-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-sam3-cpp'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "sam3-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1095,6 +1251,19 @@ jobs:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-qwen3-tts-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1108,6 +1277,19 @@ jobs:
backend: "acestep-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-qwen3-tts-cpp'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1129,7 +1311,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-rerankers'
runs-on: 'ubuntu-latest'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "rerankers"
dockerfile: "./backend/Dockerfile.python"
@@ -1142,7 +1324,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-llama-cpp'
runs-on: 'ubuntu-latest'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "llama-cpp"
dockerfile: "./backend/Dockerfile.llama-cpp"
@@ -1155,7 +1337,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-vllm'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "vllm"
dockerfile: "./backend/Dockerfile.python"
@@ -1168,7 +1350,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-vllm-omni'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "vllm-omni"
dockerfile: "./backend/Dockerfile.python"
@@ -1181,7 +1363,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-transformers'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "transformers"
dockerfile: "./backend/Dockerfile.python"
@@ -1194,7 +1376,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-diffusers'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "diffusers"
dockerfile: "./backend/Dockerfile.python"
@@ -1207,7 +1389,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-ace-step'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "ace-step"
dockerfile: "./backend/Dockerfile.python"
@@ -1221,7 +1403,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-kokoro'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "kokoro"
dockerfile: "./backend/Dockerfile.python"
@@ -1234,7 +1416,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-vibevoice'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "vibevoice"
dockerfile: "./backend/Dockerfile.python"
@@ -1247,7 +1429,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-qwen-asr'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "qwen-asr"
dockerfile: "./backend/Dockerfile.python"
@@ -1260,7 +1442,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-nemo'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "nemo"
dockerfile: "./backend/Dockerfile.python"
@@ -1273,7 +1455,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-qwen-tts'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "qwen-tts"
dockerfile: "./backend/Dockerfile.python"
@@ -1286,7 +1468,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-fish-speech'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "fish-speech"
dockerfile: "./backend/Dockerfile.python"
@@ -1299,7 +1481,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-voxcpm'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "voxcpm"
dockerfile: "./backend/Dockerfile.python"
@@ -1312,7 +1494,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-pocket-tts'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "pocket-tts"
dockerfile: "./backend/Dockerfile.python"
@@ -1325,7 +1507,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-faster-whisper'
runs-on: 'bigger-runner'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "faster-whisper"
dockerfile: "./backend/Dockerfile.python"
@@ -1338,7 +1520,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-whisperx'
runs-on: 'bigger-runner'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "whisperx"
dockerfile: "./backend/Dockerfile.python"
@@ -1351,7 +1533,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-coqui'
runs-on: 'bigger-runner'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "coqui"
dockerfile: "./backend/Dockerfile.python"
@@ -1592,6 +1774,32 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2204'
- build-type: 'l4t'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-whisperx'
runs-on: 'ubuntu-24.04-arm'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
skip-drivers: 'true'
backend: "whisperx"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2204'
- build-type: 'l4t'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-faster-whisper'
runs-on: 'ubuntu-24.04-arm'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
skip-drivers: 'true'
backend: "faster-whisper"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2204'
# SYCL additional backends
- build-type: 'intel'
cuda-major-version: ""
@@ -1750,6 +1958,19 @@ jobs:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-ik-llama-cpp'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "ik-llama-cpp"
dockerfile: "./backend/Dockerfile.ik-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
@@ -1790,6 +2011,59 @@ jobs:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# sam3-cpp
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-sam3-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sam3-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-sam3-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "sam3-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-sam3-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "sam3-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-sam3-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sam3-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
@@ -1842,6 +2116,19 @@ jobs:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-sam3-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "sam3-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
# whisper
- build-type: ''
cuda-major-version: ""
@@ -1914,7 +2201,7 @@ jobs:
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-whisper'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
runs-on: 'ubuntu-latest'
skip-drivers: 'false'
backend: "whisper"
@@ -1993,13 +2280,92 @@ jobs:
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-acestep-cpp'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
runs-on: 'ubuntu-latest'
skip-drivers: 'false'
backend: "acestep-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# qwen3-tts-cpp
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-qwen3-tts-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-qwen3-tts-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-qwen3-tts-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-qwen3-tts-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-qwen3-tts-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-qwen3-tts-cpp'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
runs-on: 'ubuntu-latest'
skip-drivers: 'false'
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# voxtral
- build-type: ''
cuda-major-version: ""
@@ -2116,7 +2482,7 @@ jobs:
# platforms: 'linux/amd64'
# tag-latest: 'auto'
# tag-suffix: '-gpu-hipblas-rfdetr'
# base-image: "rocm/dev-ubuntu-24.04:6.4.4"
# base-image: "rocm/dev-ubuntu-24.04:7.2.1"
# runs-on: 'ubuntu-latest'
# skip-drivers: 'false'
# backend: "rfdetr"
@@ -2157,7 +2523,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-neutts'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "neutts"
dockerfile: "./backend/Dockerfile.python"
@@ -2305,6 +2671,10 @@ jobs:
tag-suffix: "-metal-darwin-arm64-acestep-cpp"
build-type: "metal"
lang: "go"
- backend: "qwen3-tts-cpp"
tag-suffix: "-metal-darwin-arm64-qwen3-tts-cpp"
build-type: "metal"
lang: "go"
- backend: "voxtral"
tag-suffix: "-metal-darwin-arm64-voxtral"
build-type: "metal"
@@ -2373,6 +2743,9 @@ jobs:
tag-suffix: "-metal-darwin-arm64-local-store"
build-type: "metal"
lang: "go"
- backend: "llama-cpp-quantization"
tag-suffix: "-metal-darwin-arm64-llama-cpp-quantization"
build-type: "mps"
with:
backend: ${{ matrix.backend }}
build-type: ${{ matrix.build-type }}

View File

@@ -0,0 +1,48 @@
name: Bump inference defaults
on:
schedule:
# Run daily at 06:00 UTC
- cron: '0 6 * * *'
workflow_dispatch: # Allow manual trigger
permissions:
contents: write
pull-requests: write
jobs:
bump:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- uses: actions/setup-go@v5
with:
go-version-file: go.mod
- name: Re-fetch inference defaults
run: make generate-force
- name: Check for changes
id: diff
run: |
if git diff --quiet core/config/inference_defaults.json; then
echo "changed=false" >> "$GITHUB_OUTPUT"
else
echo "changed=true" >> "$GITHUB_OUTPUT"
fi
- name: Create Pull Request
if: steps.diff.outputs.changed == 'true'
uses: peter-evans/create-pull-request@v8
with:
commit-message: "chore: bump inference defaults from unsloth"
title: "chore: bump inference defaults from unsloth"
body: |
Auto-generated update of `core/config/inference_defaults.json` from
[unsloth's inference_defaults.json](https://github.com/unslothai/unsloth/blob/main/studio/backend/assets/configs/inference_defaults.json).
This PR was created automatically by the `bump-inference-defaults` workflow.
branch: chore/bump-inference-defaults
delete-branch: true
labels: automated

View File

@@ -14,6 +14,10 @@ jobs:
variable: "LLAMA_VERSION"
branch: "master"
file: "backend/cpp/llama-cpp/Makefile"
- repository: "ikawrakow/ik_llama.cpp"
variable: "IK_LLAMA_VERSION"
branch: "main"
file: "backend/cpp/ik-llama-cpp/Makefile"
- repository: "ggml-org/whisper.cpp"
variable: "WHISPER_CPP_VERSION"
branch: "master"
@@ -34,6 +38,14 @@ jobs:
variable: "ACESTEP_CPP_VERSION"
branch: "master"
file: "backend/go/acestep-cpp/Makefile"
- repository: "PABannier/sam3.cpp"
variable: "SAM3_VERSION"
branch: "main"
file: "backend/go/sam3-cpp/Makefile"
- repository: "predict-woo/qwen3-tts.cpp"
variable: "QWEN3TTS_CPP_VERSION"
branch: "main"
file: "backend/go/qwen3-tts-cpp/Makefile"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6

View File

@@ -55,7 +55,7 @@ jobs:
- name: Run gallery agent
env:
#OPENAI_MODEL: ${{ secrets.OPENAI_MODEL }}
OPENAI_MODE: Qwen3.5-2B-GGUF
OPENAI_MODEL: Qwen3.5-2B-GGUF
OPENAI_BASE_URL: "http://localhost:8080"
OPENAI_KEY: ${{ secrets.OPENAI_KEY }}
#OPENAI_BASE_URL: ${{ secrets.OPENAI_BASE_URL }}

75
.github/workflows/gh-pages.yml vendored Normal file
View File

@@ -0,0 +1,75 @@
name: Deploy docs to GitHub Pages
on:
push:
branches:
- master
paths:
- 'docs/**'
- 'gallery/**'
- 'images/**'
- '.github/ci/modelslist.go'
- '.github/workflows/gh-pages.yml'
workflow_dispatch:
permissions:
contents: read
pages: write
id-token: write
concurrency:
group: pages
cancel-in-progress: false
jobs:
build:
runs-on: ubuntu-latest
env:
HUGO_VERSION: "0.146.3"
steps:
- name: Checkout
uses: actions/checkout@v6
with:
fetch-depth: 0 # needed for enableGitInfo
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.22'
cache: false
- name: Setup Hugo
uses: peaceiris/actions-hugo@v3
with:
hugo-version: ${{ env.HUGO_VERSION }}
extended: true
- name: Setup Pages
id: pages
uses: actions/configure-pages@v6
- name: Generate gallery
run: go run ./.github/ci/modelslist.go ./gallery/index.yaml > docs/static/gallery.html
- name: Build site
working-directory: docs
run: |
mkdir -p layouts/_default
hugo --minify --baseURL "${{ steps.pages.outputs.base_url }}/"
- name: Upload artifact
uses: actions/upload-pages-artifact@v4
with:
path: docs/public
deploy:
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
runs-on: ubuntu-latest
needs: build
steps:
- name: Deploy to GitHub Pages
id: deployment
uses: actions/deploy-pages@v5

View File

@@ -59,7 +59,7 @@
platforms: 'linux/amd64'
tag-latest: 'false'
tag-suffix: '-hipblas'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
grpc-base-image: "ubuntu:24.04"
runs-on: 'ubuntu-latest'
makeflags: "--jobs=3 --output-sync=target"

View File

@@ -41,7 +41,7 @@
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-hipblas'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
grpc-base-image: "ubuntu:24.04"
runs-on: 'ubuntu-latest'
makeflags: "--jobs=3 --output-sync=target"

View File

@@ -14,6 +14,42 @@ concurrency:
cancel-in-progress: true
jobs:
detect-changes:
runs-on: ubuntu-latest
outputs:
run-all: ${{ steps.detect.outputs.run-all }}
transformers: ${{ steps.detect.outputs.transformers }}
rerankers: ${{ steps.detect.outputs.rerankers }}
diffusers: ${{ steps.detect.outputs.diffusers }}
coqui: ${{ steps.detect.outputs.coqui }}
moonshine: ${{ steps.detect.outputs.moonshine }}
pocket-tts: ${{ steps.detect.outputs.pocket-tts }}
qwen-tts: ${{ steps.detect.outputs.qwen-tts }}
qwen-asr: ${{ steps.detect.outputs.qwen-asr }}
nemo: ${{ steps.detect.outputs.nemo }}
voxcpm: ${{ steps.detect.outputs.voxcpm }}
llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
vllm: ${{ steps.detect.outputs.vllm }}
acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
voxtral: ${{ steps.detect.outputs.voxtral }}
kokoros: ${{ steps.detect.outputs.kokoros }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
- name: Setup Bun
uses: oven-sh/setup-bun@v2
- name: Install dependencies
run: bun add js-yaml @octokit/core
- name: Detect changed backends
id: detect
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_EVENT_PATH: ${{ github.event_path }}
run: bun run scripts/changed-backends.js
# Requires CUDA
# tests-chatterbox-tts:
# runs-on: ubuntu-latest
@@ -37,6 +73,8 @@ jobs:
# make --jobs=5 --output-sync=target -C backend/python/chatterbox
# make --jobs=5 --output-sync=target -C backend/python/chatterbox test
tests-transformers:
needs: detect-changes
if: needs.detect-changes.outputs.transformers == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -58,6 +96,8 @@ jobs:
make --jobs=5 --output-sync=target -C backend/python/transformers
make --jobs=5 --output-sync=target -C backend/python/transformers test
tests-rerankers:
needs: detect-changes
if: needs.detect-changes.outputs.rerankers == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -80,6 +120,8 @@ jobs:
make --jobs=5 --output-sync=target -C backend/python/rerankers test
tests-diffusers:
needs: detect-changes
if: needs.detect-changes.outputs.diffusers == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -229,6 +271,8 @@ jobs:
# make --jobs=5 --output-sync=target -C backend/python/vllm test
tests-coqui:
needs: detect-changes
if: needs.detect-changes.outputs.coqui == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -248,6 +292,8 @@ jobs:
make --jobs=5 --output-sync=target -C backend/python/coqui
make --jobs=5 --output-sync=target -C backend/python/coqui test
tests-moonshine:
needs: detect-changes
if: needs.detect-changes.outputs.moonshine == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -267,6 +313,8 @@ jobs:
make --jobs=5 --output-sync=target -C backend/python/moonshine
make --jobs=5 --output-sync=target -C backend/python/moonshine test
tests-pocket-tts:
needs: detect-changes
if: needs.detect-changes.outputs.pocket-tts == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -286,6 +334,8 @@ jobs:
make --jobs=5 --output-sync=target -C backend/python/pocket-tts
make --jobs=5 --output-sync=target -C backend/python/pocket-tts test
tests-qwen-tts:
needs: detect-changes
if: needs.detect-changes.outputs.qwen-tts == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -327,6 +377,8 @@ jobs:
# make --jobs=5 --output-sync=target -C backend/python/fish-speech
# make --jobs=5 --output-sync=target -C backend/python/fish-speech test
tests-qwen-asr:
needs: detect-changes
if: needs.detect-changes.outputs.qwen-asr == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -346,6 +398,8 @@ jobs:
make --jobs=5 --output-sync=target -C backend/python/qwen-asr
make --jobs=5 --output-sync=target -C backend/python/qwen-asr test
tests-nemo:
needs: detect-changes
if: needs.detect-changes.outputs.nemo == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -365,6 +419,8 @@ jobs:
make --jobs=5 --output-sync=target -C backend/python/nemo
make --jobs=5 --output-sync=target -C backend/python/nemo test
tests-voxcpm:
needs: detect-changes
if: needs.detect-changes.outputs.voxcpm == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -383,7 +439,118 @@ jobs:
run: |
make --jobs=5 --output-sync=target -C backend/python/voxcpm
make --jobs=5 --output-sync=target -C backend/python/voxcpm test
tests-llama-cpp-quantization:
needs: detect-changes
if: needs.detect-changes.outputs.llama-cpp-quantization == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 30
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake curl git python3-pip
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh
pip install --user --no-cache-dir grpcio-tools==1.64.1
- name: Build llama-quantize from llama.cpp
run: |
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git /tmp/llama.cpp
cmake -B /tmp/llama.cpp/build -S /tmp/llama.cpp -DGGML_NATIVE=OFF
cmake --build /tmp/llama.cpp/build --target llama-quantize -j$(nproc)
sudo cp /tmp/llama.cpp/build/bin/llama-quantize /usr/local/bin/
- name: Install backend
run: |
make --jobs=5 --output-sync=target -C backend/python/llama-cpp-quantization
- name: Test llama-cpp-quantization
run: |
make --jobs=5 --output-sync=target -C backend/python/llama-cpp-quantization test
tests-llama-cpp-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build llama-cpp backend image and run gRPC e2e tests
run: |
make test-extra-backend-llama-cpp
tests-ik-llama-cpp-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.ik-llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build ik-llama-cpp backend image and run gRPC e2e tests
run: |
make test-extra-backend-ik-llama-cpp
# tests-vllm-grpc is currently disabled in CI.
#
# The prebuilt vllm CPU wheel is compiled with AVX-512 VNNI/BF16
# instructions, and neither ubuntu-latest nor the bigger-runner pool
# offers a stable CPU baseline that supports them — runners come
# back with different hardware between runs and SIGILL on import of
# vllm.model_executor.models.registry. Compiling vllm from source
# via FROM_SOURCE=true works on any CPU but takes 30-50 minutes per
# run, which is too slow for a smoke test.
#
# The test itself (tests/e2e-backends + make test-extra-backend-vllm)
# is fully working and validated locally on a host with the right
# SIMD baseline. Run it manually with:
#
# make test-extra-backend-vllm
#
# Re-enable this job once we have a self-hosted runner label with
# guaranteed AVX-512 VNNI/BF16 support, or once the vllm project
# publishes a CPU wheel with a wider baseline.
#
# tests-vllm-grpc:
# needs: detect-changes
# if: needs.detect-changes.outputs.vllm == 'true' || needs.detect-changes.outputs.run-all == 'true'
# runs-on: bigger-runner
# timeout-minutes: 90
# steps:
# - name: Clone
# uses: actions/checkout@v6
# with:
# submodules: true
# - name: Dependencies
# run: |
# sudo apt-get update
# sudo apt-get install -y --no-install-recommends \
# make build-essential curl unzip ca-certificates git tar
# - name: Setup Go
# uses: actions/setup-go@v5
# with:
# go-version: '1.25.4'
# - name: Free disk space
# run: |
# sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
# df -h
# - name: Build vllm (cpu) backend image and run gRPC e2e tests
# run: |
# make test-extra-backend-vllm
tests-acestep-cpp:
needs: detect-changes
if: needs.detect-changes.outputs.acestep-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -413,7 +580,41 @@ jobs:
- name: Test acestep-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/acestep-cpp test
tests-qwen3-tts-cpp:
needs: detect-changes
if: needs.detect-changes.outputs.qwen3-tts-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake curl libopenblas-dev ffmpeg
- name: Setup Go
uses: actions/setup-go@v5
- name: Display Go version
run: go version
- name: Proto Dependencies
run: |
# Install protoc
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Build qwen3-tts-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp
- name: Test qwen3-tts-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test
tests-voxtral:
needs: detect-changes
if: needs.detect-changes.outputs.voxtral == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
@@ -444,3 +645,25 @@ jobs:
- name: Test voxtral
run: |
make --jobs=5 --output-sync=target -C backend/go/voxtral test
tests-kokoros:
needs: detect-changes
if: needs.detect-changes.outputs.kokoros == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake pkg-config protobuf-compiler clang libclang-dev
sudo apt-get install -y espeak-ng libespeak-ng-dev libsonic-dev libpcaudio-dev libopus-dev libssl-dev
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
echo "$HOME/.cargo/bin" >> $GITHUB_PATH
- name: Build kokoros
run: |
make -C backend/rust/kokoros kokoros-grpc
- name: Test kokoros
run: |
make -C backend/rust/kokoros test

View File

@@ -21,7 +21,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
go-version: ['1.25.x']
go-version: ['1.26.x']
steps:
- name: Free Disk Space (Ubuntu)
uses: jlumbroso/free-disk-space@main
@@ -179,7 +179,7 @@ jobs:
runs-on: macos-latest
strategy:
matrix:
go-version: ['1.25.x']
go-version: ['1.26.x']
steps:
- name: Clone
uses: actions/checkout@v6

72
.github/workflows/tests-ui-e2e.yml vendored Normal file
View File

@@ -0,0 +1,72 @@
---
name: 'UI E2E Tests'
on:
pull_request:
paths:
- 'core/http/**'
- 'tests/e2e-ui/**'
- 'tests/e2e/mock-backend/**'
push:
branches:
- master
concurrency:
group: ci-tests-ui-e2e-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
jobs:
tests-ui-e2e:
runs-on: ubuntu-latest
strategy:
matrix:
go-version: ['1.26.x']
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go ${{ matrix.go-version }}
uses: actions/setup-go@v5
with:
go-version: ${{ matrix.go-version }}
cache: false
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: '22'
- name: Proto Dependencies
run: |
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
- name: System Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential libopus-dev
- name: Build UI test server
run: PATH="$PATH:$HOME/go/bin" make build-ui-test-server
- name: Install Playwright
working-directory: core/http/react-ui
run: |
npm install
npx playwright install --with-deps chromium
- name: Run Playwright tests
working-directory: core/http/react-ui
run: npx playwright test
- name: Upload Playwright report
if: ${{ failure() }}
uses: actions/upload-artifact@v7
with:
name: playwright-report
path: core/http/react-ui/playwright-report/
retention-days: 7
- name: Setup tmate session if tests fail
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3.23
with:
detached: true
connect-timeout-seconds: 180
limit-access-to-actor: true

5
.gitignore vendored
View File

@@ -72,3 +72,8 @@ core/http/react-ui/dist
# Extracted backend binaries for container-based testing
local-backends/
# UI E2E test artifacts
tests/e2e-ui/ui-test-server
core/http/react-ui/playwright-report/
core/http/react-ui/test-results/

3
.gitmodules vendored
View File

@@ -1,3 +1,6 @@
[submodule "docs/themes/hugo-theme-relearn"]
path = docs/themes/hugo-theme-relearn
url = https://github.com/McShelby/hugo-theme-relearn.git
[submodule "backend/rust/kokoros/sources/Kokoros"]
path = backend/rust/kokoros/sources/Kokoros
url = https://github.com/lucasjinreal/Kokoros

View File

@@ -11,6 +11,9 @@ This file is an index to detailed topic guides in the `.agents/` directory. Read
| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
| [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
## Quick Reference

View File

@@ -176,7 +176,7 @@ ENV PATH=/opt/rocm/bin:${PATH}
# The requirements-core target is common to all images. It should not be placed in requirements-core unless every single build will use it.
FROM requirements-drivers AS build-requirements
ARG GO_VERSION=1.25.4
ARG GO_VERSION=1.26.0
ARG CMAKE_VERSION=3.31.10
ARG CMAKE_FROM_SOURCE=false
ARG TARGETARCH
@@ -256,7 +256,7 @@ RUN apt-get update && \
FROM build-requirements AS builder-base
ARG GO_TAGS=""
ARG GO_TAGS="auth"
ARG GRPC_BACKENDS
ARG MAKEFLAGS
ARG LD_FLAGS="-s -w"
@@ -319,7 +319,6 @@ COPY ./.git ./.git
# Some of the Go backends use libs from the main src, we could further optimize the caching by building the CPP backends before here
COPY ./pkg/grpc ./pkg/grpc
COPY ./pkg/utils ./pkg/utils
COPY ./pkg/langchain ./pkg/langchain
RUN ls -l ./
RUN make protogen-go

121
Makefile
View File

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp
GOCMD=go
GOTEST=$(GOCMD) test
@@ -107,7 +107,7 @@ core/http/react-ui/dist: react-ui
## Build:
build: protogen-go install-go-tools core/http/react-ui/dist ## Build the project
build: protogen-go generate install-go-tools core/http/react-ui/dist ## Build the project
$(info ${GREEN}I local-ai build info:${RESET})
$(info ${GREEN}I BUILD_TYPE: ${YELLOW}$(BUILD_TYPE)${RESET})
$(info ${GREEN}I GO_TAGS: ${YELLOW}$(GO_TAGS)${RESET})
@@ -148,7 +148,6 @@ test-models/testmodel.ggml:
mkdir -p test-dir
wget -q https://huggingface.co/mradermacher/gpt2-alpaca-gpt4-GGUF/resolve/main/gpt2-alpaca-gpt4.Q4_K_M.gguf -O test-models/testmodel.ggml
wget -q https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin -O test-models/whisper-en
wget -q https://huggingface.co/mudler/all-MiniLM-L6-v2/resolve/main/ggml-model-q4_0.bin -O test-models/bert
wget -q https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav -O test-dir/audio.wav
cp tests/models_fixtures/* test-models
@@ -398,6 +397,16 @@ protogen-go: protoc install-go-tools
./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
backend/backend.proto
core/config/inference_defaults.json: ## Fetch inference defaults from unsloth (only if missing)
$(GOCMD) generate ./core/config/...
.PHONY: generate
generate: core/config/inference_defaults.json ## Ensure inference defaults exist
.PHONY: generate-force
generate-force: ## Re-fetch inference defaults from unsloth (always)
$(GOCMD) generate ./core/config/...
.PHONY: protogen-go-clean
protogen-go-clean:
$(RM) pkg/grpc/proto/backend.pb.go pkg/grpc/proto/backend_grpc.pb.go
@@ -419,8 +428,11 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/qwen-asr
$(MAKE) -C backend/python/nemo
$(MAKE) -C backend/python/voxcpm
$(MAKE) -C backend/python/faster-whisper
$(MAKE) -C backend/python/whisperx
$(MAKE) -C backend/python/ace-step
$(MAKE) -C backend/python/trl
$(MAKE) -C backend/rust/kokoros kokoros-grpc
test-extra: prepare-test-extra
$(MAKE) -C backend/python/transformers test
@@ -438,8 +450,74 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/qwen-asr test
$(MAKE) -C backend/python/nemo test
$(MAKE) -C backend/python/voxcpm test
$(MAKE) -C backend/python/faster-whisper test
$(MAKE) -C backend/python/whisperx test
$(MAKE) -C backend/python/ace-step test
$(MAKE) -C backend/python/trl test
$(MAKE) -C backend/rust/kokoros test
##
## End-to-end gRPC tests that exercise a built backend container image.
##
## The test suite in tests/e2e-backends is backend-agnostic. You drive it via env
## vars (see tests/e2e-backends/backend_test.go for the full list) and the
## capability-driven harness picks which gRPC RPCs to exercise:
##
## BACKEND_IMAGE Required. Docker image to test, e.g. local-ai-backend:llama-cpp.
## BACKEND_TEST_MODEL_URL URL of a model file to download and load.
## BACKEND_TEST_MODEL_FILE Path to an already-downloaded model (skips download).
## BACKEND_TEST_MODEL_NAME HuggingFace repo id (e.g. Qwen/Qwen2.5-0.5B-Instruct).
## Use this instead of MODEL_URL for backends that
## resolve HF model ids natively (vllm, vllm-omni).
## BACKEND_TEST_CAPS Comma-separated capabilities, default "health,load,predict,stream".
## Adds "tools" to exercise ChatDelta tool call extraction.
## BACKEND_TEST_PROMPT Override the prompt used in predict/stream specs.
## BACKEND_TEST_OPTIONS Comma-separated Options[] entries forwarded to LoadModel,
## e.g. "tool_parser:hermes,reasoning_parser:qwen3".
##
## Direct usage (image already built, no docker-build-* dependency):
##
## make test-extra-backend BACKEND_IMAGE=local-ai-backend:llama-cpp \
## BACKEND_TEST_MODEL_URL=https://.../model.gguf
##
## Convenience wrappers below build a specific backend image first, then run the
## suite against it.
##
BACKEND_TEST_MODEL_URL?=https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf
## Generic target — runs the suite against whatever BACKEND_IMAGE points at.
## Depends on protogen-go so pkg/grpc/proto is generated before `go test`.
test-extra-backend: protogen-go
@test -n "$$BACKEND_IMAGE" || { echo "BACKEND_IMAGE must be set" >&2; exit 1; }
BACKEND_IMAGE="$$BACKEND_IMAGE" \
BACKEND_TEST_MODEL_URL="$${BACKEND_TEST_MODEL_URL:-$(BACKEND_TEST_MODEL_URL)}" \
BACKEND_TEST_MODEL_FILE="$$BACKEND_TEST_MODEL_FILE" \
BACKEND_TEST_MODEL_NAME="$$BACKEND_TEST_MODEL_NAME" \
BACKEND_TEST_CAPS="$$BACKEND_TEST_CAPS" \
BACKEND_TEST_PROMPT="$$BACKEND_TEST_PROMPT" \
BACKEND_TEST_OPTIONS="$$BACKEND_TEST_OPTIONS" \
BACKEND_TEST_TOOL_PROMPT="$$BACKEND_TEST_TOOL_PROMPT" \
BACKEND_TEST_TOOL_NAME="$$BACKEND_TEST_TOOL_NAME" \
go test -v -timeout 30m ./tests/e2e-backends/...
## Convenience wrappers: build the image, then exercise it.
test-extra-backend-llama-cpp: docker-build-llama-cpp
BACKEND_IMAGE=local-ai-backend:llama-cpp $(MAKE) test-extra-backend
test-extra-backend-ik-llama-cpp: docker-build-ik-llama-cpp
BACKEND_IMAGE=local-ai-backend:ik-llama-cpp $(MAKE) test-extra-backend
## vllm is resolved from a HuggingFace model id (no file download) and
## exercises Predict + streaming + tool-call extraction via the hermes parser.
## Requires a host CPU with the SIMD instructions the prebuilt vllm CPU
## wheel was compiled against (AVX-512 VNNI/BF16); older CPUs will SIGILL
## on import — on CI this means using the bigger-runner label.
test-extra-backend-vllm: docker-build-vllm
BACKEND_IMAGE=local-ai-backend:vllm \
BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_OPTIONS=tool_parser:hermes \
$(MAKE) test-extra-backend
DOCKER_IMAGE?=local-ai
IMAGE_TYPE?=core
@@ -534,6 +612,8 @@ backend-images:
# Backend metadata: BACKEND_NAME | DOCKERFILE_TYPE | BUILD_CONTEXT | PROGRESS_FLAG | NEEDS_BACKEND_ARG
# llama-cpp is special - uses llama-cpp Dockerfile and doesn't need BACKEND arg
BACKEND_LLAMA_CPP = llama-cpp|llama-cpp|.|false|false
# ik-llama-cpp is a fork of llama.cpp with superior CPU performance
BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
# Golang backends
BACKEND_PIPER = piper|golang|.|false|true
@@ -544,6 +624,7 @@ BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|tr
BACKEND_WHISPER = whisper|golang|.|false|true
BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
BACKEND_OPUS = opus|golang|.|false|true
# Python backends with root context
@@ -572,6 +653,14 @@ BACKEND_VOXCPM = voxcpm|python|.|false|true
BACKEND_WHISPERX = whisperx|python|.|false|true
BACKEND_ACE_STEP = ace-step|python|.|false|true
BACKEND_MLX_DISTRIBUTED = mlx-distributed|python|./|false|true
BACKEND_TRL = trl|python|.|false|true
BACKEND_LLAMA_CPP_QUANTIZATION = llama-cpp-quantization|python|.|false|true
# Rust backends
BACKEND_KOKOROS = kokoros|rust|.|false|true
# C++ backends (Go wrapper with purego)
BACKEND_SAM3_CPP = sam3-cpp|golang|.|false|true
# Helper function to build docker image for a backend
# Usage: $(call docker-build-backend,BACKEND_NAME,DOCKERFILE_TYPE,BUILD_CONTEXT,PROGRESS_FLAG,NEEDS_BACKEND_ARG)
@@ -583,6 +672,7 @@ define docker-build-backend
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
$(if $(FROM_SOURCE),--build-arg FROM_SOURCE=$(FROM_SOURCE)) \
$(if $(filter true,$(5)),--build-arg BACKEND=$(1)) \
-t local-ai-backend:$(1) -f backend/Dockerfile.$(2) $(3)
endef
@@ -595,6 +685,7 @@ endef
# Generate all docker-build targets
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
$(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -628,13 +719,18 @@ $(eval $(call generate-docker-build-target,$(BACKEND_VOXCPM)))
$(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_DISTRIBUTED)))
$(eval $(call generate-docker-build-target,$(BACKEND_TRL)))
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
$(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
# Pattern rule for docker-save targets
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp
########################################################
### Mock Backend for E2E Tests
@@ -646,6 +742,23 @@ build-mock-backend: protogen-go
clean-mock-backend:
rm -f tests/e2e/mock-backend/mock-backend
########################################################
### UI E2E Test Server
########################################################
build-ui-test-server: build-mock-backend react-ui protogen-go
$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui
test-ui-e2e: build-ui-test-server
cd core/http/react-ui && npm install && npx playwright install --with-deps chromium && npx playwright test
test-ui-e2e-docker:
docker build -t localai-ui-e2e -f tests/e2e-ui/Dockerfile .
docker run --rm localai-ui-e2e
clean-ui-test-server:
rm -f tests/e2e-ui/ui-test-server
########################################################
### END Backends
########################################################

421
README.md
View File

@@ -5,35 +5,17 @@
</h1>
<p align="center">
<a href="https://github.com/go-skynet/LocalAI/fork" target="blank">
<img src="https://img.shields.io/github/forks/go-skynet/LocalAI?style=for-the-badge" alt="LocalAI forks"/>
</a>
<a href="https://github.com/go-skynet/LocalAI/stargazers" target="blank">
<img src="https://img.shields.io/github/stars/go-skynet/LocalAI?style=for-the-badge" alt="LocalAI stars"/>
</a>
<a href="https://github.com/go-skynet/LocalAI/pulls" target="blank">
<img src="https://img.shields.io/github/issues-pr/go-skynet/LocalAI?style=for-the-badge" alt="LocalAI pull-requests"/>
</a>
<a href='https://github.com/go-skynet/LocalAI/releases'>
<img src='https://img.shields.io/github/release/go-skynet/LocalAI?&label=Latest&style=for-the-badge'>
</a>
</p>
<p align="center">
<a href="LICENSE" target="blank">
<img src="https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge" alt="LocalAI License"/>
</a>
</p>
<p align="center">
<a href="https://hub.docker.com/r/localai/localai" target="blank">
<img src="https://img.shields.io/badge/dockerhub-images-important.svg?logo=Docker" alt="LocalAI Docker hub"/>
</a>
<a href="https://quay.io/repository/go-skynet/local-ai?tab=tags&tag=latest" target="blank">
<img src="https://img.shields.io/badge/quay.io-images-important.svg?" alt="LocalAI Quay.io"/>
</a>
</p>
<p align="center">
<a href="https://twitter.com/LocalAI_API" target="blank">
<img src="https://img.shields.io/badge/X-%23000000.svg?style=for-the-badge&logo=X&logoColor=white&label=LocalAI_API" alt="Follow LocalAI_API"/>
@@ -47,363 +29,184 @@
<a href="https://trendshift.io/repositories/5539" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5539" alt="mudler%2FLocalAI | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>
> :bulb: Get help - [❓FAQ](https://localai.io/faq/) [💭Discussions](https://github.com/go-skynet/LocalAI/discussions) [:speech_balloon: Discord](https://discord.gg/uJAeKSAGDy) [:book: Documentation website](https://localai.io/)
>
> [💻 Quickstart](https://localai.io/basics/getting_started/) [🖼️ Models](https://models.localai.io/) [🚀 Roadmap](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap) [🛫 Examples](https://github.com/mudler/LocalAI-examples) Try on
[![Telegram](https://img.shields.io/badge/Telegram-2CA5E0?style=for-the-badge&logo=telegram&logoColor=white)](https://t.me/localaiofficial_bot)
**LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
[![tests](https://github.com/go-skynet/LocalAI/actions/workflows/test.yml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/test.yml)[![Build and Release](https://github.com/go-skynet/LocalAI/actions/workflows/release.yaml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/release.yaml)[![build container images](https://github.com/go-skynet/LocalAI/actions/workflows/image.yml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/image.yml)[![Bump dependencies](https://github.com/go-skynet/LocalAI/actions/workflows/bump_deps.yaml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/bump_deps.yaml)[![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/localai)](https://artifacthub.io/packages/search?repo=localai)
- **Drop-in API compatibility** — OpenAI, Anthropic, ElevenLabs APIs
- **36+ backends** — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
- **Any hardware** — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
- **Multi-user ready** — API key auth, user quotas, role-based access
- **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
- **Privacy-first** — your data never leaves your infrastructure
<p align="center">
<a href="https://github.com/mudler/LocalAI-examples" target="blank">
<img src="https://img.shields.io/badge/📦_Examples_Repository-Browse_Ready--to--Run_Examples-blue?style=for-the-badge" alt="LocalAI Examples Repository"/>
</a>
</p>
Created and maintained by [Ettore Di Giacinto](https://github.com/mudler).
**LocalAI** is the free, Open Source OpenAI alternative. LocalAI act as a drop-in replacement REST API that's compatible with OpenAI (Elevenlabs, Anthropic... ) API specifications for local AI inferencing. It allows you to run LLMs, generate images, audio (and not only) locally or on-prem with consumer grade hardware, supporting multiple model families. Does not require GPU. It is created and maintained by [Ettore Di Giacinto](https://github.com/mudler).
> [:book: Documentation](https://localai.io/) | [:speech_balloon: Discord](https://discord.gg/uJAeKSAGDy) | [💻 Quickstart](https://localai.io/basics/getting_started/) | [🖼️ Models](https://models.localai.io/) | [❓FAQ](https://localai.io/faq/)
## Guided tour
https://github.com/user-attachments/assets/08cbb692-57da-48f7-963d-2e7b43883c18
<details>
<summary><strong>Table of Contents</strong></summary>
- [Local Stack Family](#local-stack-family)
- [Screenshots / Video](#screenshots--video)
- [Quickstart](#-quickstart)
- [macOS Download](#macos-download)
- [Containers (Docker, podman, ...)](#containers-docker-podman-)
- [Latest project news](#-latest-project-news)
- [Features](#-features)
- [Supported Backends & Acceleration](#-supported-backends--acceleration)
- [Text Generation & Language Models](#text-generation--language-models)
- [Audio & Speech Processing](#audio--speech-processing)
- [Image & Video Generation](#image--video-generation)
- [Specialized AI Tasks](#specialized-ai-tasks)
- [Hardware Acceleration Matrix](#hardware-acceleration-matrix)
- [Community and integrations](#-community-and-integrations)
- [Resources](#-resources)
- [Media, Blogs, Social](#book--media-blogs-social)
- [Autonomous Development Team](#-autonomous-development-team)
- [Citation](#citation)
- [Sponsors](#-sponsors)
- [Individual sponsors](#individual-sponsors)
- [Star history](#-star-history)
- [License](#-license)
- [Acknowledgements](#-acknowledgements)
- [Contributors](#-contributors)
<summary>
Click to see more!
</summary>
#### User and auth
https://github.com/user-attachments/assets/228fa9ad-81a3-4d43-bfb9-31557e14a36c
#### Agents
https://github.com/user-attachments/assets/6270b331-e21d-4087-a540-6290006b381a
#### Usage metrics per user
https://github.com/user-attachments/assets/cbb03379-23b4-4e3d-bd26-d152f057007f
#### Fine-tuning and Quantization
https://github.com/user-attachments/assets/5ba4ace9-d3df-4795-b7d4-b0b404ea71ee
#### WebRTC
https://github.com/user-attachments/assets/ed88e34c-fed3-4b83-8a67-4716a9feeb7b
</details>
## Local Stack Family
## Quickstart
Liking LocalAI? LocalAI is part of an integrated suite of AI infrastructure tools, you might also like:
- **[LocalAGI](https://github.com/mudler/LocalAGI)** - AI agent orchestration platform with OpenAI Responses API compatibility and advanced agentic capabilities
- **[LocalRecall](https://github.com/mudler/LocalRecall)** - MCP/REST API knowledge base system providing persistent memory and storage for AI agents
- 🆕 **[Cogito](https://github.com/mudler/cogito)** - Go library for building intelligent, co-operative agentic software and LLM-powered workflows, focusing on improving results for small, open source language models that scales to any LLM. Powers LocalAGI and LocalAI MCP/Agentic capabilities
- 🆕 **[Wiz](https://github.com/mudler/wiz)** - Terminal-based AI agent accessible via Ctrl+Space keybinding. Portable, local-LLM friendly shell assistant with TUI/CLI modes, tool execution with approval, MCP protocol support, and multi-shell compatibility (zsh, bash, fish)
- 🆕 **[SkillServer](https://github.com/mudler/skillserver)** - Simple, centralized skills database for AI agents via MCP. Manages skills as Markdown files with MCP server integration, web UI for editing, Git synchronization, and full-text search capabilities
## Screenshots / Video
### Youtube video
<h1 align="center">
<br>
<a href="https://www.youtube.com/watch?v=PDqYhB9nNHA" target="_blank"> <img width="300" src="https://img.youtube.com/vi/PDqYhB9nNHA/0.jpg"> </a><br>
<br>
</h1>
### Screenshots
| Talk Interface | Generate Audio |
| --- | --- |
| ![LocalAI - Talk](./docs/assets/images/screenshots/screenshot_talk.png) | ![LocalAI - Generate Audio](./docs/assets/images/screenshots/screenshot_tts.png) |
| Models Overview | Generate Images |
| --- | --- |
| ![LocalAI - Models](./docs/assets/images/screenshots/screenshot_gallery.png) | ![LocalAI - Generate Images](./docs/assets/images/screenshots/screenshot_image.png) |
| Chat Interface | Home |
| --- | --- |
| ![LocalAI - Chat](./docs/assets/images/screenshots/screenshot_chat.png) | ![LocalAI - Home](./docs/assets/images/screenshots/screenshot_home.png) |
| Login | Swarm |
| --- | --- |
| ![LocalAI - Login](./docs/assets/images/screenshots/screenshot_login.png) | ![LocalAI - P2P Dashboard](./docs/assets/images/screenshots/screenshot_p2p.png) |
## 💻 Quickstart
### macOS Download:
### macOS
<a href="https://github.com/mudler/LocalAI/releases/latest/download/LocalAI.dmg">
<img src="https://img.shields.io/badge/Download-macOS-blue?style=for-the-badge&logo=apple&logoColor=white" alt="Download LocalAI for macOS"/>
</a>
> Note: the DMGs are not signed by Apple as quarantined. See https://github.com/mudler/LocalAI/issues/6268 for a workaround, fix is tracked here: https://github.com/mudler/LocalAI/issues/6244
> **Note:** The DMG is not signed by Apple. After installing, run: `sudo xattr -d com.apple.quarantine /Applications/LocalAI.app`. See [#6268](https://github.com/mudler/LocalAI/issues/6268) for details.
### Containers (Docker, podman, ...)
> **💡 Docker Run vs Docker Start**
>
> - `docker run` creates and starts a new container. If a container with the same name already exists, this command will fail.
> - `docker start` starts an existing container that was previously created with `docker run`.
>
> If you've already run LocalAI before and want to start it again, use: `docker start -i local-ai`
> Already ran LocalAI before? Use `docker start -i local-ai` to restart an existing container.
#### CPU only image:
#### CPU only:
```bash
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest
```
#### NVIDIA GPU Images:
#### NVIDIA GPU:
```bash
# CUDA 13.0
# CUDA 13
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-13
# CUDA 12.0
# CUDA 12
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12
# NVIDIA Jetson (L4T) ARM64
# CUDA 12 (for Nvidia AGX Orin and similar platforms)
# NVIDIA Jetson ARM64 (CUDA 12, for AGX Orin and similar)
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64
# CUDA 13 (for Nvidia DGX Spark)
# NVIDIA Jetson ARM64 (CUDA 13, for DGX Spark)
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64-cuda-13
```
#### AMD GPU Images (ROCm):
#### AMD GPU (ROCm):
```bash
docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-gpu-hipblas
```
#### Intel GPU Images (oneAPI):
#### Intel GPU (oneAPI):
```bash
docker run -ti --name local-ai -p 8080:8080 --device=/dev/dri/card1 --device=/dev/dri/renderD128 localai/localai:latest-gpu-intel
```
#### Vulkan GPU Images:
#### Vulkan GPU:
```bash
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-vulkan
```
To load models:
### Loading models
```bash
# From the model gallery (see available models with `local-ai models list`, in the WebUI from the model tab, or visiting https://models.localai.io)
# From the model gallery (see available models with `local-ai models list` or at https://models.localai.io)
local-ai run llama-3.2-1b-instruct:q4_k_m
# Start LocalAI with the phi-2 model directly from huggingface
# From Huggingface
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
# Install and run a model from the Ollama OCI registry
# From the Ollama OCI registry
local-ai run ollama://gemma:2b
# Run a model from a configuration file
# From a YAML config
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
# Install and run a model from a standard OCI registry (e.g., Docker Hub)
# From a standard OCI registry (e.g., Docker Hub)
local-ai run oci://localai/phi-2:latest
```
> **Automatic Backend Detection**: When you install models from the gallery or YAML files, LocalAI automatically detects your system's GPU capabilities (NVIDIA, AMD, Intel) and downloads the appropriate backend. For advanced configuration options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/#automatic-backend-detection).
> **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).
For more information, see [💻 Getting started](https://localai.io/basics/getting_started/index.html), if you are interested in our roadmap items and future enhancements, you can see the [Issues labeled as Roadmap here](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap)
For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).
## 📰 Latest project news
- March 2026: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790),[MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
- February 2026: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
- January 2026: **LocalAI 3.10.0** - Major release with Anthropic API support, Open Responses API for stateful agents, video & image generation suite (LTX-2), unified GPU backends, tool streaming & XML parsing, system-aware backend gallery, crash fixes for AVX-only CPUs and AMD VRAM reporting, request tracing, and new backends: **Moonshine** (ultra-fast transcription), **Pocket-TTS** (lightweight TTS). Vulkan arm64 builds now available. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0).
- December 2025: [Dynamic Memory Resource reclaimer](https://github.com/mudler/LocalAI/pull/7583), [Automatic fitting of models to multiple GPUS(llama.cpp)](https://github.com/mudler/LocalAI/pull/7584), [Added Vibevoice backend](https://github.com/mudler/LocalAI/pull/7494)
- November 2025: Major improvements to the UX. Among these: [Import models via URL](https://github.com/mudler/LocalAI/pull/7245) and [Multiple chats and history](https://github.com/mudler/LocalAI/pull/7325)
- October 2025: 🔌 [Model Context Protocol (MCP)](https://localai.io/docs/features/mcp/) support added for agentic capabilities with external tools
- September 2025: New Launcher application for MacOS and Linux, extended support to many backends for Mac and Nvidia L4T devices. Models: Added MLX-Audio, WAN 2.2. WebUI improvements and Python-based backends now ships portable python environments.
- August 2025: MLX, MLX-VLM, Diffusers and llama.cpp are now supported on Mac M1/M2/M3+ chips ( with `development` suffix in the gallery ): https://github.com/mudler/LocalAI/pull/6049 https://github.com/mudler/LocalAI/pull/6119 https://github.com/mudler/LocalAI/pull/6121 https://github.com/mudler/LocalAI/pull/6060
- July/August 2025: 🔍 [Object Detection](https://localai.io/features/object-detection/) added to the API featuring [rf-detr](https://github.com/roboflow/rf-detr)
- July 2025: All backends migrated outside of the main binary. LocalAI is now more lightweight, small, and automatically downloads the required backend to run the model. [Read the release notes](https://github.com/mudler/LocalAI/releases/tag/v3.2.0)
- June 2025: [Backend management](https://github.com/mudler/LocalAI/pull/5607) has been added. Attention: extras images are going to be deprecated from the next release! Read [the backend management PR](https://github.com/mudler/LocalAI/pull/5607).
- May 2025: [Audio input](https://github.com/mudler/LocalAI/pull/5466) and [Reranking](https://github.com/mudler/LocalAI/pull/5396) in llama.cpp backend, [Realtime API](https://github.com/mudler/LocalAI/pull/5392), Support to Gemma, SmollVLM, and more multimodal models (available in the gallery).
- May 2025: Important: image name changes [See release](https://github.com/mudler/LocalAI/releases/tag/v2.29.0)
- Apr 2025: Rebrand, WebUI enhancements
- Apr 2025: [LocalAGI](https://github.com/mudler/LocalAGI) and [LocalRecall](https://github.com/mudler/LocalRecall) join the LocalAI family stack.
- Apr 2025: WebUI overhaul
- Feb 2025: Backend cleanup, Breaking changes, new backends (kokoro, OutelTTS, faster-whisper), Nvidia L4T images
- Jan 2025: LocalAI model release: https://huggingface.co/mudler/LocalAI-functioncall-phi-4-v0.3, SANA support in diffusers: https://github.com/mudler/LocalAI/pull/4603
- Dec 2024: stablediffusion.cpp backend (ggml) added ( https://github.com/mudler/LocalAI/pull/4289 )
- Nov 2024: Bark.cpp backend added ( https://github.com/mudler/LocalAI/pull/4287 )
- Nov 2024: Voice activity detection models (**VAD**) added to the API: https://github.com/mudler/LocalAI/pull/4204
- Oct 2024: examples moved to [LocalAI-examples](https://github.com/mudler/LocalAI-examples)
- Aug 2024: 🆕 FLUX-1, [P2P Explorer](https://explorer.localai.io)
- July 2024: 🔥🔥 🆕 P2P Dashboard, LocalAI Federated mode and AI Swarms: https://github.com/mudler/LocalAI/pull/2723. P2P Global community pools: https://github.com/mudler/LocalAI/issues/3113
- May 2024: 🔥🔥 Decentralized P2P llama.cpp: https://github.com/mudler/LocalAI/pull/2343 (peer2peer llama.cpp!) 👉 Docs https://localai.io/features/distribute/
- May 2024: 🔥🔥 Distributed inferencing: https://github.com/mudler/LocalAI/pull/2324
- April 2024: Reranker API: https://github.com/mudler/LocalAI/pull/2121
## Latest News
Roadmap items: [List of issues](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap)
- **March 2026**: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
- **February 2026**: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
- **January 2026**: **LocalAI 3.10.0** — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0)
- **December 2025**: [Dynamic Memory Resource reclaimer](https://github.com/mudler/LocalAI/pull/7583), [Automatic multi-GPU model fitting (llama.cpp)](https://github.com/mudler/LocalAI/pull/7584), [Vibevoice backend](https://github.com/mudler/LocalAI/pull/7494)
- **November 2025**: [Import models via URL](https://github.com/mudler/LocalAI/pull/7245), [Multiple chats and history](https://github.com/mudler/LocalAI/pull/7325)
- **October 2025**: [Model Context Protocol (MCP)](https://localai.io/docs/features/mcp/) support for agentic capabilities
- **September 2025**: New Launcher for macOS and Linux, extended backend support for Mac and Nvidia L4T, MLX-Audio, WAN 2.2
- **August 2025**: MLX, MLX-VLM, Diffusers, llama.cpp now supported on Apple Silicon
- **July 2025**: All backends migrated outside the main binary — [lightweight, modular architecture](https://github.com/mudler/LocalAI/releases/tag/v3.2.0)
## 🚀 [Features](https://localai.io/features/)
For older news and full release notes, see [GitHub Releases](https://github.com/mudler/LocalAI/releases) and the [News page](https://localai.io/basics/news/).
- 🧩 [Backend Gallery](https://localai.io/backends/): Install/remove backends on the fly, powered by OCI images — fully customizable and API-driven.
- 📖 [Text generation with GPTs](https://localai.io/features/text-generation/) (`llama.cpp`, `transformers`, `vllm` ... [:book: and more](https://localai.io/model-compatibility/index.html#model-compatibility-table))
- 🗣 [Text to Audio](https://localai.io/features/text-to-audio/)
- 🔈 [Audio to Text](https://localai.io/features/audio-to-text/)
- 🎨 [Image generation](https://localai.io/features/image-generation)
- 🔥 [OpenAI-alike tools API](https://localai.io/features/openai-functions/)
- ⚡ [Realtime API](https://localai.io/features/openai-realtime/) (Speech-to-speech)
- 🧠 [Embeddings generation for vector databases](https://localai.io/features/embeddings/)
- ✍️ [Constrained grammars](https://localai.io/features/constrained_grammars/)
- 🖼️ [Download Models directly from Huggingface ](https://localai.io/models/)
- 🥽 [Vision API](https://localai.io/features/gpt-vision/)
- 🔍 [Object Detection](https://localai.io/features/object-detection/)
- 📈 [Reranker API](https://localai.io/features/reranker/)
- 🆕🖧 [P2P Inferencing](https://localai.io/features/distribute/)
- 🆕🔌 [Model Context Protocol (MCP)](https://localai.io/docs/features/mcp/) - Agentic capabilities with external tools and [LocalAGI's Agentic capabilities](https://github.com/mudler/LocalAGI)
- 🆕🤖 [Built-in Agents](https://localai.io/features/agents/) - Autonomous AI agents with tool use, knowledge base (RAG), skills, SSE streaming, import/export, and [Agent Hub](https://agenthub.localai.io) — powered by [LocalAGI](https://github.com/mudler/LocalAGI)
- 🔊 Voice activity detection (Silero-VAD support)
- 🌍 Integrated WebUI!
## Features
## 🧩 Supported Backends & Acceleration
- [Text generation](https://localai.io/features/text-generation/) (`llama.cpp`, `transformers`, `vllm` ... [and more](https://localai.io/model-compatibility/))
- [Text to Audio](https://localai.io/features/text-to-audio/)
- [Audio to Text](https://localai.io/features/audio-to-text/)
- [Image generation](https://localai.io/features/image-generation)
- [OpenAI-compatible tools API](https://localai.io/features/openai-functions/)
- [Realtime API](https://localai.io/features/openai-realtime/) (Speech-to-speech)
- [Embeddings generation](https://localai.io/features/embeddings/)
- [Constrained grammars](https://localai.io/features/constrained_grammars/)
- [Download models from Huggingface](https://localai.io/models/)
- [Vision API](https://localai.io/features/gpt-vision/)
- [Object Detection](https://localai.io/features/object-detection/)
- [Reranker API](https://localai.io/features/reranker/)
- [P2P Inferencing](https://localai.io/features/distribute/)
- [Distributed Mode](https://localai.io/features/distributed-mode/) — Horizontal scaling with PostgreSQL + NATS
- [Model Context Protocol (MCP)](https://localai.io/docs/features/mcp/)
- [Built-in Agents](https://localai.io/features/agents/) — Autonomous AI agents with tool use, RAG, skills, SSE streaming, and [Agent Hub](https://agenthub.localai.io)
- [Backend Gallery](https://localai.io/backends/) — Install/remove backends on the fly via OCI images
- Voice Activity Detection (Silero-VAD)
- Integrated WebUI
LocalAI supports a comprehensive range of AI backends with multiple acceleration options:
## Supported Backends & Acceleration
### Text Generation & Language Models
| Backend | Description | Acceleration Support |
|---------|-------------|---------------------|
| **llama.cpp** | LLM inference in C/C++ | CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, CPU |
| **vLLM** | Fast LLM inference with PagedAttention | CUDA 12/13, ROCm, Intel |
| **transformers** | HuggingFace transformers framework | CUDA 12/13, ROCm, Intel, CPU |
| **MLX** | Apple Silicon LLM inference | Metal (M1/M2/M3+) |
| **MLX-VLM** | Apple Silicon Vision-Language Models | Metal (M1/M2/M3+) |
| **vLLM Omni** | Multimodal vLLM with vision and audio | CUDA 12/13, ROCm, Intel |
LocalAI supports **36+ backends** including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
### Audio & Speech Processing
| Backend | Description | Acceleration Support |
|---------|-------------|---------------------|
| **whisper.cpp** | OpenAI Whisper in C/C++ | CUDA 12/13, ROCm, Intel SYCL, Vulkan, CPU |
| **faster-whisper** | Fast Whisper with CTranslate2 | CUDA 12/13, ROCm, Intel, CPU |
| **moonshine** | Ultra-fast transcription engine for low-end devices | CUDA 12/13, Metal, CPU |
| **coqui** | Advanced TTS with 1100+ languages | CUDA 12/13, ROCm, Intel, CPU |
| **kokoro** | Lightweight TTS model | CUDA 12/13, ROCm, Intel, CPU |
| **chatterbox** | Production-grade TTS | CUDA 12/13, CPU |
| **piper** | Fast neural TTS system | CPU |
| **kitten-tts** | Kitten TTS models | CPU |
| **silero-vad** | Voice Activity Detection | CPU |
| **neutts** | Text-to-speech with voice cloning | CUDA 12/13, ROCm, CPU |
| **vibevoice** | Real-time TTS with voice cloning | CUDA 12/13, ROCm, Intel, CPU |
| **pocket-tts** | Lightweight CPU-based TTS | CUDA 12/13, ROCm, Intel, CPU |
| **qwen-tts** | High-quality TTS with custom voice, voice design, and voice cloning | CUDA 12/13, ROCm, Intel, CPU |
| **nemo** | NVIDIA NeMo framework for speech models | CUDA 12/13, ROCm, Intel, CPU |
| **outetts** | OuteTTS with voice cloning | CUDA 12/13, CPU |
| **faster-qwen3-tts** | Faster Qwen3 TTS | CUDA 12/13, ROCm, Intel, CPU |
| **qwen-asr** | Qwen ASR speech recognition | CUDA 12/13, ROCm, Intel, CPU |
| **voxcpm** | VoxCPM speech understanding | CUDA 12/13, Metal, CPU |
| **whisperx** | Enhanced Whisper transcription | CUDA 12/13, ROCm, Intel, CPU |
| **ace-step** | Music generation from text descriptions, lyrics, or audio samples | CUDA 12/13, ROCm, Intel, Metal, CPU |
See the full [Backend & Model Compatibility Table](https://localai.io/model-compatibility/) and [GPU Acceleration guide](https://localai.io/features/gpu-acceleration/).
### Image & Video Generation
| Backend | Description | Acceleration Support |
|---------|-------------|---------------------|
| **stablediffusion.cpp** | Stable Diffusion in C/C++ | CUDA 12/13, Intel SYCL, Vulkan, CPU |
| **diffusers** | HuggingFace diffusion models | CUDA 12/13, ROCm, Intel, Metal, CPU |
## Resources
### Specialized AI Tasks
| Backend | Description | Acceleration Support |
|---------|-------------|---------------------|
| **rfdetr** | Real-time object detection | CUDA 12/13, Intel, CPU |
| **rerankers** | Document reranking API | CUDA 12/13, ROCm, Intel, CPU |
| **local-store** | Vector database | CPU |
| **huggingface** | HuggingFace API integration | API-based |
- [Documentation](https://localai.io/)
- [LLM fine-tuning guide](https://localai.io/docs/advanced/fine-tuning/)
- [Build from source](https://localai.io/basics/build/)
- [Kubernetes installation](https://localai.io/basics/getting_started/#run-localai-in-kubernetes)
- [Integrations & community projects](https://localai.io/docs/integrations/)
- [Installation video walkthrough](https://www.youtube.com/watch?v=cMVNnlqwfw4)
- [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
- [Examples](https://github.com/mudler/LocalAI-examples)
### Hardware Acceleration Matrix
## Autonomous Development Team
| Acceleration Type | Supported Backends | Hardware Support |
|-------------------|-------------------|------------------|
| **NVIDIA CUDA 12** | All CUDA-compatible backends | Nvidia hardware |
| **NVIDIA CUDA 13** | All CUDA-compatible backends | Nvidia hardware |
| **AMD ROCm** | llama.cpp, whisper, vllm, transformers, diffusers, rerankers, coqui, kokoro, neutts, vibevoice, pocket-tts, qwen-tts, ace-step | AMD Graphics |
| **Intel oneAPI** | llama.cpp, whisper, stablediffusion, vllm, transformers, diffusers, rfdetr, rerankers, coqui, kokoro, vibevoice, pocket-tts, qwen-tts, ace-step | Intel Arc, Intel iGPUs |
| **Apple Metal** | llama.cpp, whisper, diffusers, MLX, MLX-VLM, moonshine, ace-step | Apple M1/M2/M3+ |
| **Vulkan** | llama.cpp, whisper, stablediffusion | Cross-platform GPUs |
| **NVIDIA Jetson (CUDA 12)** | llama.cpp, whisper, stablediffusion, diffusers, rfdetr, ace-step | ARM64 embedded AI (AGX Orin, etc.) |
| **NVIDIA Jetson (CUDA 13)** | llama.cpp, whisper, stablediffusion, diffusers, rfdetr | ARM64 embedded AI (DGX Spark) |
| **CPU Optimized** | All backends | AVX/AVX2/AVX512, quantization support |
LocalAI is helped being maintained by a team of autonomous AI agents led by an AI Scrum Master.
### 🔗 Community and integrations
Build and deploy custom containers:
- https://github.com/sozercan/aikit
WebUIs:
- https://github.com/Jirubizu/localai-admin
- https://github.com/go-skynet/LocalAI-frontend
- QA-Pilot(An interactive chat project that leverages LocalAI LLMs for rapid understanding and navigation of GitHub code repository) https://github.com/reid41/QA-Pilot
Agentic Libraries:
- https://github.com/mudler/cogito
MCPs:
- https://github.com/mudler/MCPs
OS Assistant:
- https://github.com/mudler/Keygeist - Keygeist is an AI-powered keyboard operator that listens for key combinations and responds with AI-generated text typed directly into your Linux box.
Model galleries
- https://github.com/go-skynet/model-gallery
Voice:
- https://github.com/richiejp/VoxInput
Other:
- Helm chart https://github.com/go-skynet/helm-charts
- VSCode extension https://github.com/badgooooor/localai-vscode-plugin
- Langchain: https://python.langchain.com/docs/integrations/providers/localai/
- Terminal utility https://github.com/djcopley/ShellOracle
- Local Smart assistant https://github.com/mudler/LocalAGI
- Home Assistant https://github.com/drndos/hass-openai-custom-conversation / https://github.com/valentinfrlch/ha-llmvision / https://github.com/loryanstrant/HA-LocalAI-Monitor
- Discord bot https://github.com/mudler/LocalAGI/tree/main/examples/discord
- Slack bot https://github.com/mudler/LocalAGI/tree/main/examples/slack
- Shell-Pilot(Interact with LLM using LocalAI models via pure shell scripts on your Linux or MacOS system) https://github.com/reid41/shell-pilot
- Telegram bot https://github.com/mudler/LocalAI/tree/master/examples/telegram-bot
- Another Telegram Bot https://github.com/JackBekket/Hellper
- Auto-documentation https://github.com/JackBekket/Reflexia
- Github bot which answer on issues, with code and documentation as context https://github.com/JackBekket/GitHelper
- Github Actions: https://github.com/marketplace/actions/start-localai
- Examples: https://github.com/mudler/LocalAI/tree/master/examples/
### 🔗 Resources
- [LLM finetuning guide](https://localai.io/docs/advanced/fine-tuning/)
- [How to build locally](https://localai.io/basics/build/index.html)
- [How to install in Kubernetes](https://localai.io/basics/getting_started/index.html#run-localai-in-kubernetes)
- [Projects integrating LocalAI](https://localai.io/docs/integrations/)
- [How tos section](https://io.midori-ai.xyz/howtos/) (curated by our community)
## :book: 🎥 [Media, Blogs, Social](https://localai.io/basics/news/#media-blogs-social)
- 🆕 [LocalAI Autonomous Dev Team Blog Post](https://mudler.pm/posts/2026/02/28/a-call-to-open-source-maintainers-stop-babysitting-ai-how-i-built-a-100-local-autonomous-dev-team-to-maintain-localai-and-why-you-should-too/)
- [Run Visual studio code with LocalAI (SUSE)](https://www.suse.com/c/running-ai-locally/)
- 🆕 [Run LocalAI on Jetson Nano Devkit](https://mudler.pm/posts/local-ai-jetson-nano-devkit/)
- [Run LocalAI on AWS EKS with Pulumi](https://www.pulumi.com/blog/low-code-llm-apps-with-local-ai-flowise-and-pulumi/)
- [Run LocalAI on AWS](https://staleks.hashnode.dev/installing-localai-on-aws-ec2-instance)
- [Create a slackbot for teams and OSS projects that answer to documentation](https://mudler.pm/posts/smart-slackbot-for-teams/)
- [LocalAI meets k8sgpt](https://www.youtube.com/watch?v=PKrDNuJ_dfE)
- [Question Answering on Documents locally with LangChain, LocalAI, Chroma, and GPT4All](https://mudler.pm/posts/localai-question-answering/)
- [Tutorial to use k8sgpt with LocalAI](https://medium.com/@tyler_97636/k8sgpt-localai-unlock-kubernetes-superpowers-for-free-584790de9b65)
## 🤖 Autonomous Development Team
LocalAI is now helped being maintained (for small tasks!) by a full team of autonomous AI agents led by an AI Scrum Master! This experiment demonstrates how open source projects can leverage AI agents for sustainable, long-term maintenance.
- **📊 Live Reports**: [Automatically generated reports](http://reports.localai.io)
- **📋 Project Board**: [Agent task tracking](https://github.com/users/mudler/projects/6)
- **📝 Blog Post**: [Learn about the autonomous dev team experiment](https://mudler.pm/posts/2026/02/28/a-call-to-open-source-maintainers-stop-babysitting-ai-how-i-built-a-100-local-autonomous-dev-team-to-maintain-localai-and-why-you-should-too/)
- **Live Reports**: [reports.localai.io](http://reports.localai.io)
- **Project Board**: [Agent task tracking](https://github.com/users/mudler/projects/6)
- **Blog Post**: [Learn about the experiment](https://mudler.pm/posts/2026/02/28/a-call-to-open-source-maintainers-stop-babysitting-ai-how-i-built-a-100-local-autonomous-dev-team-to-maintain-localai-and-why-you-should-too/)
## Citation
@@ -419,7 +222,7 @@ If you utilize this repository, data in a downstream project, please consider ci
howpublished = {\url{https://github.com/go-skynet/LocalAI}},
```
## ❤️ Sponsors
## Sponsors
> Do you find LocalAI useful?
@@ -438,19 +241,19 @@ A huge thank you to our generous sponsors who support this project covering CI e
### Individual sponsors
A special thanks to individual sponsors that contributed to the project, a full list is in [Github](https://github.com/sponsors/mudler) and [buymeacoffee](https://buymeacoffee.com/mudler), a special shout out goes to [drikster80](https://github.com/drikster80) for being generous. Thank you everyone!
A special thanks to individual sponsors, a full list is on [GitHub](https://github.com/sponsors/mudler) and [buymeacoffee](https://buymeacoffee.com/mudler). Special shout out to [drikster80](https://github.com/drikster80) for being generous. Thank you everyone!
## 🌟 Star history
## Star history
[![LocalAI Star history Chart](https://api.star-history.com/svg?repos=go-skynet/LocalAI&type=Date)](https://star-history.com/#go-skynet/LocalAI&Date)
## 📖 License
## License
LocalAI is a community-driven project created by [Ettore Di Giacinto](https://github.com/mudler/).
MIT - Author Ettore Di Giacinto <mudler@localai.io>
## 🙇 Acknowledgements
## Acknowledgements
LocalAI couldn't have been built without the help of great software already available from the community. Thank you!
@@ -463,9 +266,9 @@ LocalAI couldn't have been built without the help of great software already avai
- https://github.com/rhasspy/piper
- [exo](https://github.com/exo-explore/exo) for the MLX distributed auto-parallel sharding implementation
## 🤗 Contributors
## Contributors
This is a community project, a special thanks to our contributors! 🤗
This is a community project, a special thanks to our contributors!
<a href="https://github.com/go-skynet/LocalAI/graphs/contributors">
<img src="https://contrib.rocks/image?repo=go-skynet/LocalAI" />
</a>

View File

@@ -0,0 +1,281 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
# The grpc target does one thing, it builds and installs GRPC. This is in it's own layer so that it can be effectively cached by CI.
# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
FROM ${GRPC_BASE_IMAGE} AS grpc
# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG GRPC_VERSION=v1.65.0
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
WORKDIR /build
RUN apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
build-essential curl libssl-dev \
git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# We install GRPC to a different prefix here so that we can copy in only the build artifacts later
# saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
# and running make install in the target container
RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
mkdir -p /build/grpc/cmake/build && \
cd /build/grpc/cmake/build && \
sed -i "216i\ TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
make && \
make install && \
rm -rf /build
FROM ${BASE_IMAGE} AS builder
ARG CMAKE_FROM_SOURCE=false
ARG CMAKE_VERSION=3.31.10
# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
ARG BACKEND=rerankers
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG SKIP_DRIVERS=false
ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
ca-certificates \
make \
pkg-config libcurl4-openssl-dev \
curl unzip \
libssl-dev wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Cuda
ENV PATH=/usr/local/cuda/bin:${PATH}
# HipBLAS requirements
ENV PATH=/opt/rocm/bin:${PATH}
# Vulkan requirements
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils wget gpg-agent && \
apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
if [ "amd64" = "$TARGETARCH" ]; then
wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
mkdir -p /opt/vulkan-sdk && \
mv 1.4.335.0 /opt/vulkan-sdk/ && \
cd /opt/vulkan-sdk/1.4.335.0 && \
./vulkansdk --no-deps --maxjobs \
vulkan-loader \
vulkan-validationlayers \
vulkan-extensionlayer \
vulkan-tools \
shaderc && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
rm -rf /opt/vulkan-sdk
fi
if [ "arm64" = "$TARGETARCH" ]; then
mkdir vulkan && cd vulkan && \
curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
tar -xvf vulkan-sdk.tar.xz && \
rm vulkan-sdk.tar.xz && \
cd 1.4.335.0 && \
cp -rfv aarch64/bin/* /usr/bin/ && \
cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
cp -rfv aarch64/include/* /usr/include/ && \
cp -rfv aarch64/share/* /usr/share/ && \
cd ../.. && \
rm -rf vulkan
fi
ldconfig && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# CuBLAS requirements
RUN <<EOT bash
if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils
if [ "amd64" = "$TARGETARCH" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
fi
if [ "arm64" = "$TARGETARCH" ]; then
if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
else
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
fi
fi
dpkg -i cuda-keyring_1.1-1_all.deb && \
rm -f cuda-keyring_1.1-1_all.deb && \
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
apt-get install -y --no-install-recommends \
libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
fi
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# https://github.com/NVIDIA/Isaac-GR00T/issues/343
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get install -y nvpl
fi
EOT
# If we are building with clblas support, we need the libraries for the builds
RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
libclblast-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* \
; fi
RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
# I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
# to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
ldconfig \
; fi
RUN echo "TARGETARCH: $TARGETARCH"
# We need protoc installed, and the version in 22.04 is too old. We will create one as part installing the GRPC build below
# but that will also being in a newer version of absl which stablediffusion cannot compile with. This version of protoc is only
# here so that we can generate the grpc code for the stablediffusion build
RUN <<EOT bash
if [ "amd64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
if [ "arm64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
EOT
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
COPY --from=grpc /opt/grpc /usr/local
COPY . /LocalAI
RUN <<'EOT' bash
set -euxo pipefail
if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/ik-llama-cpp-*-build
fi
cd /LocalAI/backend/cpp/ik-llama-cpp
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
# ARM64 / ROCm: build without x86 SIMD
make ik-llama-cpp-fallback
else
# ik_llama.cpp's IQK kernels require at least AVX2
make ik-llama-cpp-avx2
fi
EOT
# Copy libraries using a script to handle architecture differences
RUN make -BC /LocalAI/backend/cpp/ik-llama-cpp package
FROM scratch
# Copy all available binaries (the build process only creates the appropriate ones for the target architecture)
COPY --from=builder /LocalAI/backend/cpp/ik-llama-cpp/package/. ./

View File

@@ -209,7 +209,11 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
rm -rf /var/lib/apt/lists/* && \
# I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
# to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
ldconfig \
ldconfig && \
# Log which GPU architectures have rocBLAS kernel support
echo "rocBLAS library data architectures:" && \
(ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
echo "WARNING: No rocBLAS kernel data found" \
; fi
RUN echo "TARGETARCH: $TARGETARCH"

View File

@@ -29,6 +29,7 @@ RUN apt-get update && \
curl python3-pip \
python-is-python3 \
python3-dev llvm \
libnuma1 libgomp1 \
python3-venv make cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
@@ -195,6 +196,12 @@ COPY backend/backend.proto /${BACKEND}/backend.proto
COPY backend/python/common/ /${BACKEND}/common
COPY scripts/build/package-gpu-libs.sh /package-gpu-libs.sh
# Optional per-backend source build toggle (e.g. vllm on CPU can set
# FROM_SOURCE=true to compile against the build host SIMD instead of
# pulling a prebuilt wheel). Default empty — most backends ignore it.
ARG FROM_SOURCE=""
ENV FROM_SOURCE=${FROM_SOURCE}
RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
# Package GPU libraries into the backend's lib directory

39
backend/Dockerfile.rust Normal file
View File

@@ -0,0 +1,39 @@
ARG BASE_IMAGE=ubuntu:24.04
FROM ${BASE_IMAGE} AS builder
ARG BACKEND=kokoros
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
git ccache \
ca-certificates \
make cmake wget \
curl unzip \
clang \
pkg-config \
libssl-dev \
espeak-ng libespeak-ng-dev \
libsonic-dev libpcaudio-dev \
libopus-dev \
protobuf-compiler && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
COPY . /LocalAI
RUN git config --global --add safe.directory /LocalAI
RUN make -C /LocalAI/backend/rust/${BACKEND} build
FROM scratch
ARG BACKEND=kokoros
COPY --from=builder /LocalAI/backend/rust/${BACKEND}/package/. ./

View File

@@ -39,6 +39,19 @@ service Backend {
rpc AudioDecode(AudioDecodeRequest) returns (AudioDecodeResult) {}
rpc ModelMetadata(ModelOptions) returns (ModelMetadataResponse) {}
// Fine-tuning RPCs
rpc StartFineTune(FineTuneRequest) returns (FineTuneJobResult) {}
rpc FineTuneProgress(FineTuneProgressRequest) returns (stream FineTuneProgressUpdate) {}
rpc StopFineTune(FineTuneStopRequest) returns (Result) {}
rpc ListCheckpoints(ListCheckpointsRequest) returns (ListCheckpointsResponse) {}
rpc ExportModel(ExportModelRequest) returns (Result) {}
// Quantization RPCs
rpc StartQuantization(QuantizationRequest) returns (QuantizationJobResult) {}
rpc QuantizationProgress(QuantizationProgressRequest) returns (stream QuantizationProgressUpdate) {}
rpc StopQuantization(QuantizationStopRequest) returns (Result) {}
}
// Define the empty request
@@ -166,6 +179,7 @@ message PredictOptions {
int32 Logprobs = 50; // Number of top logprobs to return (maps to OpenAI logprobs parameter)
int32 TopLogprobs = 51; // Number of top logprobs to return per token (maps to OpenAI top_logprobs parameter)
map<string, string> Metadata = 52; // Generic per-request metadata (e.g., enable_thinking)
float MinP = 53; // Minimum probability sampling threshold (0.0 = disabled)
}
// ToolCallDelta represents an incremental tool call update from the C++ parser.
@@ -430,6 +444,10 @@ message Message {
message DetectOptions {
string src = 1;
string prompt = 2; // Text prompt (for SAM 3 PCS mode)
repeated float points = 3; // Point coordinates as [x1, y1, label1, x2, y2, label2, ...] (label: 1=pos, 0=neg)
repeated float boxes = 4; // Box coordinates as [x1, y1, x2, y2, ...]
float threshold = 5; // Detection confidence threshold
}
message Detection {
@@ -439,6 +457,7 @@ message Detection {
float height = 4;
float confidence = 5;
string class_name = 6;
bytes mask = 7; // PNG-encoded binary segmentation mask
}
message DetectResponse {
@@ -472,7 +491,7 @@ message ToolFormatMarkers {
string id_field = 16; // e.g., "id"
bool fun_name_is_key = 17;
bool tools_array_wrapped = 18;
bool uses_python_dicts = 19;
reserved 19;
// Reasoning markers
string reasoning_start = 20; // e.g., "<think>"
@@ -528,3 +547,139 @@ message ModelMetadataResponse {
string rendered_template = 2; // The rendered chat template with enable_thinking=true (empty if not applicable)
ToolFormatMarkers tool_format = 3; // Auto-detected tool format markers from differential template analysis
}
// Fine-tuning messages
message FineTuneRequest {
// Model identification
string model = 1; // HF model name or local path
string training_type = 2; // "lora", "loha", "lokr", "full" — what parameters to train
string training_method = 3; // "sft", "dpo", "grpo", "rloo", "reward", "kto", "orpo", "network_training"
// Adapter config (universal across LoRA/LoHa/LoKr for LLM + diffusion)
int32 adapter_rank = 10; // LoRA rank (r), default 16
int32 adapter_alpha = 11; // scaling factor, default 16
float adapter_dropout = 12; // default 0.0
repeated string target_modules = 13; // layer names to adapt
// Universal training hyperparameters
float learning_rate = 20; // default 2e-4
int32 num_epochs = 21; // default 3
int32 batch_size = 22; // default 2
int32 gradient_accumulation_steps = 23; // default 4
int32 warmup_steps = 24; // default 5
int32 max_steps = 25; // 0 = use epochs
int32 save_steps = 26; // 0 = only save final
float weight_decay = 27; // default 0.01
bool gradient_checkpointing = 28;
string optimizer = 29; // adamw_8bit, adamw, sgd, adafactor, prodigy
int32 seed = 30; // default 3407
string mixed_precision = 31; // fp16, bf16, fp8, no
// Dataset
string dataset_source = 40; // HF dataset ID, local file/dir path
string dataset_split = 41; // train, test, etc.
// Output
string output_dir = 50;
string job_id = 51; // client-assigned or auto-generated
// Resume training from a checkpoint
string resume_from_checkpoint = 55; // path to checkpoint dir to resume from
// Backend-specific AND method-specific extensibility
map<string, string> extra_options = 60;
}
message FineTuneJobResult {
string job_id = 1;
bool success = 2;
string message = 3;
}
message FineTuneProgressRequest {
string job_id = 1;
}
message FineTuneProgressUpdate {
string job_id = 1;
int32 current_step = 2;
int32 total_steps = 3;
float current_epoch = 4;
float total_epochs = 5;
float loss = 6;
float learning_rate = 7;
float grad_norm = 8;
float eval_loss = 9;
float eta_seconds = 10;
float progress_percent = 11;
string status = 12; // queued, caching, loading_model, loading_dataset, training, saving, completed, failed, stopped
string message = 13;
string checkpoint_path = 14; // set when a checkpoint is saved
string sample_path = 15; // set when a sample is generated (video/image backends)
map<string, float> extra_metrics = 16; // method-specific metrics
}
message FineTuneStopRequest {
string job_id = 1;
bool save_checkpoint = 2;
}
message ListCheckpointsRequest {
string output_dir = 1;
}
message ListCheckpointsResponse {
repeated CheckpointInfo checkpoints = 1;
}
message CheckpointInfo {
string path = 1;
int32 step = 2;
float epoch = 3;
float loss = 4;
string created_at = 5;
}
message ExportModelRequest {
string checkpoint_path = 1;
string output_path = 2;
string export_format = 3; // lora, loha, lokr, merged_16bit, merged_4bit, gguf, diffusers
string quantization_method = 4; // for GGUF: q4_k_m, q5_k_m, q8_0, f16, etc.
string model = 5; // base model name (for merge operations)
map<string, string> extra_options = 6;
}
// Quantization messages
message QuantizationRequest {
string model = 1; // HF model name or local path
string quantization_type = 2; // q4_k_m, q5_k_m, q8_0, f16, etc.
string output_dir = 3; // where to write output files
string job_id = 4; // client-assigned job ID
map<string, string> extra_options = 5; // hf_token, custom flags, etc.
}
message QuantizationJobResult {
string job_id = 1;
bool success = 2;
string message = 3;
}
message QuantizationProgressRequest {
string job_id = 1;
}
message QuantizationProgressUpdate {
string job_id = 1;
float progress_percent = 2;
string status = 3; // queued, downloading, converting, quantizing, completed, failed, stopped
string message = 4;
string output_file = 5; // set when completed — path to the output GGUF file
map<string, float> extra_metrics = 6; // e.g. file_size_mb, compression_ratio
}
message QuantizationStopRequest {
string job_id = 1;
}

View File

@@ -0,0 +1,78 @@
## Clip/LLaVA library for multimodal support — built locally from copied sources
set(TARGET myclip)
add_library(${TARGET} clip.cpp clip.h llava.cpp llava.h)
install(TARGETS ${TARGET} LIBRARY)
target_include_directories(myclip PUBLIC .)
target_include_directories(myclip PUBLIC ../..)
target_include_directories(myclip PUBLIC ../../common)
target_link_libraries(${TARGET} PRIVATE common ggml llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
if (NOT MSVC)
target_compile_options(${TARGET} PRIVATE -Wno-cast-qual)
endif()
set(TARGET grpc-server)
set(CMAKE_CXX_STANDARD 17)
cmake_minimum_required(VERSION 3.15)
set(TARGET grpc-server)
set(_PROTOBUF_LIBPROTOBUF libprotobuf)
set(_REFLECTION grpc++_reflection)
if (${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
if (CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "arm64")
set(HOMEBREW_DEFAULT_PREFIX "/opt/homebrew")
else()
set(HOMEBREW_DEFAULT_PREFIX "/usr/local")
endif()
link_directories("${HOMEBREW_DEFAULT_PREFIX}/lib")
include_directories("${HOMEBREW_DEFAULT_PREFIX}/include")
endif()
find_package(absl CONFIG REQUIRED)
find_package(Protobuf CONFIG REQUIRED)
find_package(gRPC CONFIG REQUIRED)
find_program(_PROTOBUF_PROTOC protoc)
set(_GRPC_GRPCPP grpc++)
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
include_directories(${CMAKE_CURRENT_BINARY_DIR})
include_directories(${Protobuf_INCLUDE_DIRS})
message(STATUS "Using protobuf version ${Protobuf_VERSION} | Protobuf_INCLUDE_DIRS: ${Protobuf_INCLUDE_DIRS} | CMAKE_CURRENT_BINARY_DIR: ${CMAKE_CURRENT_BINARY_DIR}")
# Proto file
get_filename_component(hw_proto "../../../../../../backend/backend.proto" ABSOLUTE)
get_filename_component(hw_proto_path "${hw_proto}" PATH)
set(hw_proto_srcs "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
set(hw_proto_hdrs "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
set(hw_grpc_srcs "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
set(hw_grpc_hdrs "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
add_custom_command(
OUTPUT "${hw_proto_srcs}" "${hw_proto_hdrs}" "${hw_grpc_srcs}" "${hw_grpc_hdrs}"
COMMAND ${_PROTOBUF_PROTOC}
ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
--cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
-I "${hw_proto_path}"
--plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN_EXECUTABLE}"
"${hw_proto}"
DEPENDS "${hw_proto}")
add_library(hw_grpc_proto
${hw_grpc_srcs}
${hw_grpc_hdrs}
${hw_proto_srcs}
${hw_proto_hdrs} )
add_executable(${TARGET} grpc-server.cpp json.hpp)
target_link_libraries(${TARGET} PRIVATE common llama myclip ${CMAKE_THREAD_LIBS_INIT} absl::flags hw_grpc_proto
absl::flags_parse
gRPC::${_REFLECTION}
gRPC::${_GRPC_GRPCPP}
protobuf::${_PROTOBUF_LIBPROTOBUF})
target_compile_features(${TARGET} PRIVATE cxx_std_11)
if(TARGET BUILD_INFO)
add_dependencies(${TARGET} BUILD_INFO)
endif()

View File

@@ -0,0 +1,167 @@
IK_LLAMA_VERSION?=08ae48c667e3dcd3025821a8585190b4a46c2f7c
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh
TARGET?=--target grpc-server
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
ARCH?=$(shell uname -m)
# Disable Shared libs as we are linking on static gRPC and we can't mix shared and static
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=OFF
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF -DLLAMA_OPENSSL=OFF
endif
# If build type is cublas, then we set -DGGML_CUDA=ON to CMAKE_ARGS automatically
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON
# If build type is openblas then we set -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
# to CMAKE_ARGS automatically
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
# If build type is clblas (openCL) we set -DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
# If it's hipblas we do have also to set CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++
else ifeq ($(BUILD_TYPE),hipblas)
ROCM_HOME ?= /opt/rocm
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS?=gfx803,gfx900,gfx906,gfx908,gfx90a,gfx942,gfx1010,gfx1030,gfx1032,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIP=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=1
else ifeq ($(OS),Darwin)
ifeq ($(BUILD_TYPE),)
BUILD_TYPE=metal
endif
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
CMAKE_ARGS+=-DGGML_METAL_USE_BF16=ON
CMAKE_ARGS+=-DGGML_OPENMP=OFF
endif
TARGET+=--target ggml-metal
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DCMAKE_CXX_FLAGS="-fsycl" \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DCMAKE_CXX_FLAGS="-fsycl"
endif
INSTALLED_PACKAGES=$(CURDIR)/../grpc/installed_packages
INSTALLED_LIB_CMAKE=$(INSTALLED_PACKAGES)/lib/cmake
ADDED_CMAKE_ARGS=-Dabsl_DIR=${INSTALLED_LIB_CMAKE}/absl \
-DProtobuf_DIR=${INSTALLED_LIB_CMAKE}/protobuf \
-Dutf8_range_DIR=${INSTALLED_LIB_CMAKE}/utf8_range \
-DgRPC_DIR=${INSTALLED_LIB_CMAKE}/grpc \
-DCMAKE_CXX_STANDARD_INCLUDE_DIRECTORIES=${INSTALLED_PACKAGES}/include
build-ik-llama-cpp-grpc-server:
# Conditionally build grpc for the backend to use if needed
ifdef BUILD_GRPC_FOR_BACKEND_LLAMA
$(MAKE) -C ../../grpc build
_PROTOBUF_PROTOC=${INSTALLED_PACKAGES}/bin/proto \
_GRPC_CPP_PLUGIN_EXECUTABLE=${INSTALLED_PACKAGES}/bin/grpc_cpp_plugin \
PATH="${INSTALLED_PACKAGES}/bin:${PATH}" \
CMAKE_ARGS="${CMAKE_ARGS} ${ADDED_CMAKE_ARGS}" \
IK_LLAMA_VERSION=$(IK_LLAMA_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../$(VARIANT) grpc-server
else
echo "BUILD_GRPC_FOR_BACKEND_LLAMA is not defined."
IK_LLAMA_VERSION=$(IK_LLAMA_VERSION) $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../$(VARIANT) grpc-server
endif
ik-llama-cpp-avx2: llama.cpp
cp -rf $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-avx2-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-avx2-build purge
$(info ${GREEN}I ik-llama-cpp build info:avx2${RESET})
CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on" $(MAKE) VARIANT="ik-llama-cpp-avx2-build" build-ik-llama-cpp-grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-avx2-build/grpc-server ik-llama-cpp-avx2
ik-llama-cpp-avx512: llama.cpp
cp -rf $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-avx512-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-avx512-build purge
$(info ${GREEN}I ik-llama-cpp build info:avx512${RESET})
CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on" $(MAKE) VARIANT="ik-llama-cpp-avx512-build" build-ik-llama-cpp-grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-avx512-build/grpc-server ik-llama-cpp-avx512
ik-llama-cpp-avx: llama.cpp
cp -rf $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-avx-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-avx-build purge
$(info ${GREEN}I ik-llama-cpp build info:avx${RESET})
CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) VARIANT="ik-llama-cpp-avx-build" build-ik-llama-cpp-grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-avx-build/grpc-server ik-llama-cpp-avx
ik-llama-cpp-fallback: llama.cpp
cp -rf $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-fallback-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-fallback-build purge
$(info ${GREEN}I ik-llama-cpp build info:fallback${RESET})
CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) VARIANT="ik-llama-cpp-fallback-build" build-ik-llama-cpp-grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-fallback-build/grpc-server ik-llama-cpp-fallback
ik-llama-cpp-grpc: llama.cpp
cp -rf $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-grpc-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-grpc-build purge
$(info ${GREEN}I ik-llama-cpp build info:grpc${RESET})
CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="ik-llama-cpp-grpc-build" build-ik-llama-cpp-grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-grpc-build/grpc-server ik-llama-cpp-grpc
ik-llama-cpp-rpc-server: ik-llama-cpp-grpc
cp -rf $(CURRENT_MAKEFILE_DIR)/../ik-llama-cpp-grpc-build/llama.cpp/build/bin/rpc-server ik-llama-cpp-rpc-server
llama.cpp:
mkdir -p llama.cpp
cd llama.cpp && \
git init && \
git remote add origin $(LLAMA_REPO) && \
git fetch origin && \
git checkout -b build $(IK_LLAMA_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
llama.cpp/examples/grpc-server: llama.cpp
mkdir -p llama.cpp/examples/grpc-server
bash prepare.sh
rebuild:
bash prepare.sh
rm -rf grpc-server
$(MAKE) grpc-server
package:
bash package.sh
purge:
rm -rf llama.cpp/build
rm -rf llama.cpp/examples/grpc-server
rm -rf grpc-server
clean: purge
rm -rf llama.cpp
grpc-server: llama.cpp llama.cpp/examples/grpc-server
@echo "Building grpc-server with $(BUILD_TYPE) build type and $(CMAKE_ARGS)"
ifneq (,$(findstring sycl,$(BUILD_TYPE)))
+bash -c "source $(ONEAPI_VARS); \
cd llama.cpp && mkdir -p build && cd build && cmake .. $(CMAKE_ARGS) && cmake --build . --config Release -j $(JOBS) $(TARGET)"
else
+cd llama.cpp && mkdir -p build && cd build && cmake .. $(CMAKE_ARGS) && cmake --build . --config Release -j $(JOBS) $(TARGET)
endif
cp llama.cpp/build/bin/grpc-server .

View File

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,58 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
# This script is used in the final stage of the Dockerfile
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avrf $CURDIR/ik-llama-cpp-* $CURDIR/package/
cp -rfv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
# The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

View File

@@ -0,0 +1,10 @@
--- a/ggml/src/iqk/iqk_common.h
+++ b/ggml/src/iqk/iqk_common.h
@@ -9,6 +9,7 @@
#pragma once
#include "iqk_config.h"
+#include <cstdint>
#if defined IQK_IMPLEMENT

View File

@@ -0,0 +1,49 @@
#!/bin/bash
## Patches
## Apply patches from the `patches` directory
if [ -d "patches" ]; then
for patch in $(ls patches); do
echo "Applying patch $patch"
patch -d llama.cpp/ -p1 < patches/$patch
done
fi
set -e
cp -r CMakeLists.txt llama.cpp/examples/grpc-server/
cp -r grpc-server.cpp llama.cpp/examples/grpc-server/
cp -r utils.hpp llama.cpp/examples/grpc-server/
cp -rfv llama.cpp/vendor/nlohmann/json.hpp llama.cpp/examples/grpc-server/
## Copy clip/llava files for multimodal support (built as myclip library)
cp -rfv llama.cpp/examples/llava/clip.h llama.cpp/examples/grpc-server/clip.h
cp -rfv llama.cpp/examples/llava/clip.cpp llama.cpp/examples/grpc-server/clip.cpp
cp -rfv llama.cpp/examples/llava/llava.cpp llama.cpp/examples/grpc-server/llava.cpp
# Prepend llama.h include to llava.h
echo '#include "llama.h"' > llama.cpp/examples/grpc-server/llava.h
cat llama.cpp/examples/llava/llava.h >> llama.cpp/examples/grpc-server/llava.h
# Copy clip-impl.h if it exists
if [ -f llama.cpp/examples/llava/clip-impl.h ]; then
cp -rfv llama.cpp/examples/llava/clip-impl.h llama.cpp/examples/grpc-server/clip-impl.h
fi
# Copy stb_image.h
if [ -f llama.cpp/vendor/stb/stb_image.h ]; then
cp -rfv llama.cpp/vendor/stb/stb_image.h llama.cpp/examples/grpc-server/stb_image.h
elif [ -f llama.cpp/common/stb_image.h ]; then
cp -rfv llama.cpp/common/stb_image.h llama.cpp/examples/grpc-server/stb_image.h
fi
## Fix API compatibility in llava.cpp (llama_n_embd -> llama_model_n_embd)
if [ -f llama.cpp/examples/grpc-server/llava.cpp ]; then
sed -i 's/llama_n_embd(/llama_model_n_embd(/g' llama.cpp/examples/grpc-server/llava.cpp
fi
set +e
if grep -q "grpc-server" llama.cpp/examples/CMakeLists.txt; then
echo "grpc-server already added"
else
echo "add_subdirectory(grpc-server)" >> llama.cpp/examples/CMakeLists.txt
fi
set -e

View File

@@ -0,0 +1,40 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
# ik_llama.cpp requires AVX2 — default to avx2 binary
BINARY=ik-llama-cpp-avx2
if [ -e $CURDIR/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 NOT found, using fallback"
BINARY=ik-llama-cpp-fallback
fi
# Extend ld library path with the dir where this script is located/lib
if [ "$(uname)" == "Darwin" ]; then
export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
#export DYLD_FALLBACK_LIBRARY_PATH=$CURDIR/lib:$DYLD_FALLBACK_LIBRARY_PATH
else
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
fi
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using binary: $BINARY"
exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
fi
echo "Using binary: $BINARY"
exec $CURDIR/$BINARY "$@"
# We should never reach this point, however just in case we do, run fallback
exec $CURDIR/ik-llama-cpp-fallback "$@"

View File

@@ -0,0 +1,483 @@
// https://github.com/ggerganov/llama.cpp/blob/master/examples/server/utils.hpp
#pragma once
#include <string>
#include <vector>
#include <set>
#include <mutex>
#include <condition_variable>
#include <unordered_map>
#include "json.hpp"
#include "clip.h"
using json = nlohmann::json;
extern bool server_verbose;
#ifndef SERVER_VERBOSE
#define SERVER_VERBOSE 1
#endif
#if SERVER_VERBOSE != 1
#define LOG_VERBOSE(MSG, ...)
#else
#define LOG_VERBOSE(MSG, ...) \
do \
{ \
if (server_verbose) \
{ \
server_log("VERBOSE", __func__, __LINE__, MSG, __VA_ARGS__); \
} \
} while (0)
#endif
#define LOG_ERROR( MSG, ...) server_log("ERROR", __func__, __LINE__, MSG, __VA_ARGS__)
#define LOG_WARNING(MSG, ...) server_log("WARNING", __func__, __LINE__, MSG, __VA_ARGS__)
#define LOG_INFO( MSG, ...) server_log("INFO", __func__, __LINE__, MSG, __VA_ARGS__)
//
// parallel
//
enum server_state {
SERVER_STATE_LOADING_MODEL, // Server is starting up, model not fully loaded yet
SERVER_STATE_READY, // Server is ready and model is loaded
SERVER_STATE_ERROR // An error occurred, load_model failed
};
enum task_type {
TASK_TYPE_COMPLETION,
TASK_TYPE_CANCEL,
TASK_TYPE_NEXT_RESPONSE
};
struct task_server {
int id = -1; // to be filled by llama_server_queue
int target_id;
task_type type;
json data;
bool infill_mode = false;
bool embedding_mode = false;
int multitask_id = -1;
};
struct task_result {
int id;
int multitask_id = -1;
bool stop;
bool error;
json result_json;
};
struct task_multi {
int id;
std::set<int> subtasks_remaining{};
std::vector<task_result> results{};
};
// TODO: can become bool if we can't find use of more states
enum slot_state
{
IDLE,
PROCESSING,
};
enum slot_command
{
NONE,
LOAD_PROMPT,
RELEASE,
};
struct slot_params
{
bool stream = true;
bool cache_prompt = false; // remember the prompt to avoid reprocessing all prompt
uint32_t seed = -1; // RNG seed
int32_t n_keep = 0; // number of tokens to keep from initial prompt
int32_t n_predict = -1; // new tokens to predict
std::vector<std::string> antiprompt;
json input_prefix;
json input_suffix;
};
struct slot_image
{
int32_t id;
bool request_encode_image = false;
float * image_embedding = nullptr;
int32_t image_tokens = 0;
clip_image_u8 * img_data;
std::string prefix_prompt; // before of this image
};
// completion token output with probabilities
struct completion_token_output
{
struct token_prob
{
llama_token tok;
float prob;
};
std::vector<token_prob> probs;
llama_token tok;
std::string text_to_send;
};
static inline void server_log(const char *level, const char *function, int line,
const char *message, const nlohmann::ordered_json &extra)
{
nlohmann::ordered_json log
{
{"timestamp", time(nullptr)},
{"level", level},
{"function", function},
{"line", line},
{"message", message},
};
if (!extra.empty())
{
log.merge_patch(extra);
}
const std::string str = log.dump(-1, ' ', false, json::error_handler_t::replace);
printf("%.*s\n", (int)str.size(), str.data());
fflush(stdout);
}
//
// server utils
//
template <typename T>
static T json_value(const json &body, const std::string &key, const T &default_value)
{
// Fallback null to default value
return body.contains(key) && !body.at(key).is_null()
? body.value(key, default_value)
: default_value;
}
inline std::string format_chatml(std::vector<json> messages)
{
std::ostringstream chatml_msgs;
for (auto it = messages.begin(); it != messages.end(); ++it) {
chatml_msgs << "<|im_start|>"
<< json_value(*it, "role", std::string("user")) << '\n';
chatml_msgs << json_value(*it, "content", std::string(""))
<< "<|im_end|>\n";
}
chatml_msgs << "<|im_start|>assistant" << '\n';
return chatml_msgs.str();
}
//
// work queue utils
//
struct llama_server_queue {
int id = 0;
std::mutex mutex_tasks;
// queues
std::vector<task_server> queue_tasks;
std::vector<task_server> queue_tasks_deferred;
std::vector<task_multi> queue_multitasks;
std::condition_variable condition_tasks;
// callback functions
std::function<void(task_server&)> callback_new_task;
std::function<void(task_multi&)> callback_finish_multitask;
std::function<void(void)> callback_all_task_finished;
// Add a new task to the end of the queue
int post(task_server task) {
std::unique_lock<std::mutex> lock(mutex_tasks);
if (task.id == -1) {
task.id = id++;
}
queue_tasks.push_back(std::move(task));
condition_tasks.notify_one();
return task.id;
}
// Add a new task, but defer until one slot is available
void defer(task_server task) {
std::unique_lock<std::mutex> lock(mutex_tasks);
queue_tasks_deferred.push_back(std::move(task));
}
// Get the next id for creating anew task
int get_new_id() {
std::unique_lock<std::mutex> lock(mutex_tasks);
return id++;
}
// Register function to process a new task
void on_new_task(std::function<void(task_server&)> callback) {
callback_new_task = callback;
}
// Register function to process a multitask
void on_finish_multitask(std::function<void(task_multi&)> callback) {
callback_finish_multitask = callback;
}
// Register the function to be called when the batch of tasks is finished
void on_all_tasks_finished(std::function<void(void)> callback) {
callback_all_task_finished = callback;
}
// Call when the state of one slot is changed
void notify_slot_changed() {
// move deferred tasks back to main loop
std::unique_lock<std::mutex> lock(mutex_tasks);
for (auto & task : queue_tasks_deferred) {
queue_tasks.push_back(std::move(task));
}
queue_tasks_deferred.clear();
}
// Start the main loop. This call is blocking
[[noreturn]]
void start_loop() {
while (true) {
// new task arrived
LOG_VERBOSE("have new task", {});
{
while (true)
{
std::unique_lock<std::mutex> lock(mutex_tasks);
if (queue_tasks.empty()) {
lock.unlock();
break;
}
task_server task = queue_tasks.front();
queue_tasks.erase(queue_tasks.begin());
lock.unlock();
LOG_VERBOSE("callback_new_task", {});
callback_new_task(task);
}
LOG_VERBOSE("callback_all_task_finished", {});
// process and update all the multitasks
auto queue_iterator = queue_multitasks.begin();
while (queue_iterator != queue_multitasks.end())
{
if (queue_iterator->subtasks_remaining.empty())
{
// all subtasks done == multitask is done
task_multi current_multitask = *queue_iterator;
callback_finish_multitask(current_multitask);
// remove this multitask
queue_iterator = queue_multitasks.erase(queue_iterator);
}
else
{
++queue_iterator;
}
}
// all tasks in the current loop is finished
callback_all_task_finished();
}
LOG_VERBOSE("wait for new task", {});
// wait for new task
{
std::unique_lock<std::mutex> lock(mutex_tasks);
if (queue_tasks.empty()) {
condition_tasks.wait(lock, [&]{
return !queue_tasks.empty();
});
}
}
}
}
//
// functions to manage multitasks
//
// add a multitask by specifying the id of all subtask (subtask is a task_server)
void add_multitask(int multitask_id, std::vector<int>& sub_ids)
{
std::lock_guard<std::mutex> lock(mutex_tasks);
task_multi multi;
multi.id = multitask_id;
std::copy(sub_ids.begin(), sub_ids.end(), std::inserter(multi.subtasks_remaining, multi.subtasks_remaining.end()));
queue_multitasks.push_back(multi);
}
// updatethe remaining subtasks, while appending results to multitask
void update_multitask(int multitask_id, int subtask_id, task_result& result)
{
std::lock_guard<std::mutex> lock(mutex_tasks);
for (auto& multitask : queue_multitasks)
{
if (multitask.id == multitask_id)
{
multitask.subtasks_remaining.erase(subtask_id);
multitask.results.push_back(result);
}
}
}
};
struct llama_server_response {
typedef std::function<void(int, int, task_result&)> callback_multitask_t;
callback_multitask_t callback_update_multitask;
// for keeping track of all tasks waiting for the result
std::set<int> waiting_task_ids;
// the main result queue
std::vector<task_result> queue_results;
std::mutex mutex_results;
std::condition_variable condition_results;
void add_waiting_task_id(int task_id) {
std::unique_lock<std::mutex> lock(mutex_results);
waiting_task_ids.insert(task_id);
}
void remove_waiting_task_id(int task_id) {
std::unique_lock<std::mutex> lock(mutex_results);
waiting_task_ids.erase(task_id);
}
// This function blocks the thread until there is a response for this task_id
task_result recv(int task_id) {
while (true)
{
std::unique_lock<std::mutex> lock(mutex_results);
condition_results.wait(lock, [&]{
return !queue_results.empty();
});
LOG_VERBOSE("condition_results unblock", {});
for (int i = 0; i < (int) queue_results.size(); i++)
{
if (queue_results[i].id == task_id)
{
assert(queue_results[i].multitask_id == -1);
task_result res = queue_results[i];
queue_results.erase(queue_results.begin() + i);
return res;
}
}
}
// should never reach here
}
// Register the function to update multitask
void on_multitask_update(callback_multitask_t callback) {
callback_update_multitask = callback;
}
// Send a new result to a waiting task_id
void send(task_result result) {
std::unique_lock<std::mutex> lock(mutex_results);
LOG_VERBOSE("send new result", {});
for (auto& task_id : waiting_task_ids) {
// LOG_TEE("waiting task id %i \n", task_id);
// for now, tasks that have associated parent multitasks just get erased once multitask picks up the result
if (result.multitask_id == task_id)
{
LOG_VERBOSE("callback_update_multitask", {});
callback_update_multitask(task_id, result.id, result);
continue;
}
if (result.id == task_id)
{
LOG_VERBOSE("queue_results.push_back", {});
queue_results.push_back(result);
condition_results.notify_one();
return;
}
}
}
};
//
// base64 utils (TODO: move to common in the future)
//
static const std::string base64_chars =
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"abcdefghijklmnopqrstuvwxyz"
"0123456789+/";
static inline bool is_base64(uint8_t c)
{
return (isalnum(c) || (c == '+') || (c == '/'));
}
static inline std::vector<uint8_t> base64_decode(const std::string & encoded_string)
{
int i = 0;
int j = 0;
int in_ = 0;
int in_len = encoded_string.size();
uint8_t char_array_4[4];
uint8_t char_array_3[3];
std::vector<uint8_t> ret;
while (in_len-- && (encoded_string[in_] != '=') && is_base64(encoded_string[in_]))
{
char_array_4[i++] = encoded_string[in_]; in_++;
if (i == 4)
{
for (i = 0; i <4; i++)
{
char_array_4[i] = base64_chars.find(char_array_4[i]);
}
char_array_3[0] = ((char_array_4[0] ) << 2) + ((char_array_4[1] & 0x30) >> 4);
char_array_3[1] = ((char_array_4[1] & 0xf) << 4) + ((char_array_4[2] & 0x3c) >> 2);
char_array_3[2] = ((char_array_4[2] & 0x3) << 6) + char_array_4[3];
for (i = 0; (i < 3); i++)
{
ret.push_back(char_array_3[i]);
}
i = 0;
}
}
if (i)
{
for (j = i; j <4; j++)
{
char_array_4[j] = 0;
}
for (j = 0; j <4; j++)
{
char_array_4[j] = base64_chars.find(char_array_4[j]);
}
char_array_3[0] = ((char_array_4[0] ) << 2) + ((char_array_4[1] & 0x30) >> 4);
char_array_3[1] = ((char_array_4[1] & 0xf) << 4) + ((char_array_4[2] & 0x3c) >> 2);
char_array_3[2] = ((char_array_4[2] & 0x3) << 6) + char_array_4[3];
for (j = 0; (j < i - 1); j++)
{
ret.push_back(char_array_3[j]);
}
}
return ret;
}

View File

@@ -1,5 +1,5 @@
LLAMA_VERSION?=e30f1fdf74ea9238ff562901aa974c75aab6619b
LLAMA_VERSION?=ff5ef8278615a2462b79b50abdf3cc95cfb31c6f
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=
@@ -33,7 +33,7 @@ else ifeq ($(BUILD_TYPE),hipblas)
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS?=gfx803,gfx900,gfx906,gfx908,gfx90a,gfx942,gfx1010,gfx1030,gfx1032,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIP=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=1

View File

@@ -22,8 +22,10 @@
#include <grpcpp/ext/proto_server_reflection_plugin.h>
#include <grpcpp/grpcpp.h>
#include <grpcpp/health_check_service_interface.h>
#include <grpcpp/security/server_credentials.h>
#include <regex>
#include <atomic>
#include <cstdlib>
#include <mutex>
#include <signal.h>
#include <thread>
@@ -37,6 +39,43 @@ using grpc::Server;
using grpc::ServerBuilder;
using grpc::ServerContext;
using grpc::Status;
// gRPC bearer token auth for distributed mode.
// Reads LOCALAI_GRPC_AUTH_TOKEN from the environment. When set, rejects
// requests without a matching "authorization: Bearer <token>" metadata header.
// Cached auth token — empty means auth is disabled.
static std::string g_grpc_auth_token;
// Minimal constant-time comparison (avoids OpenSSL dependency)
static int ct_memcmp(const void* a, const void* b, size_t n) {
const unsigned char* pa = static_cast<const unsigned char*>(a);
const unsigned char* pb = static_cast<const unsigned char*>(b);
unsigned char result = 0;
for (size_t i = 0; i < n; i++) {
result |= pa[i] ^ pb[i];
}
return result;
}
// Returns OK when auth is disabled or the token matches.
static grpc::Status checkAuth(grpc::ServerContext* context) {
if (g_grpc_auth_token.empty()) {
return grpc::Status::OK;
}
auto metadata = context->client_metadata();
auto it = metadata.find("authorization");
if (it != metadata.end()) {
std::string expected = "Bearer " + g_grpc_auth_token;
std::string got(it->second.data(), it->second.size());
if (expected.size() == got.size() &&
ct_memcmp(expected.data(), got.data(), expected.size()) == 0) {
return grpc::Status::OK;
}
}
return grpc::Status(grpc::StatusCode::UNAUTHENTICATED, "invalid token");
}
// END LocalAI
@@ -136,6 +175,7 @@ json parse_options(bool streaming, const backend::PredictOptions* predict, const
data["mirostat_eta"] = predict->mirostateta();
data["n_keep"] = predict->nkeep();
data["seed"] = predict->seed();
data["min_p"] = predict->minp();
std::string grammar_str = predict->grammar();
@@ -244,6 +284,12 @@ json parse_options(bool streaming, const backend::PredictOptions* predict, const
data["ignore_eos"] = predict->ignoreeos();
data["embeddings"] = predict->embeddings();
// Speculative decoding per-request overrides
// NDraft maps to speculative.n_max (maximum draft tokens per speculation step)
if (predict->ndraft() > 0) {
data["speculative.n_max"] = predict->ndraft();
}
// Add the correlationid to json data
data["correlation_id"] = predict->correlationid();
@@ -362,6 +408,16 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
if (!request->mmproj().empty()) {
params.mmproj.path = request->mmproj();
}
// Draft model for speculative decoding
if (!request->draftmodel().empty()) {
params.speculative.mparams_dft.path = request->draftmodel();
// Default to draft type if a draft model is set but no explicit type
if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
}
}
// params.model_alias ??
params.model_alias.insert(request->modelfile());
if (!request->cachetypekey().empty()) {
@@ -569,6 +625,48 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// If conversion fails, keep default value (8)
}
}
// Speculative decoding options
} else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
auto type = common_speculative_type_from_name(optval_str);
if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
params.speculative.type = type;
}
} else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
if (optval != NULL) {
try { params.speculative.n_max = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_n_min") || !strcmp(optname, "draft_min")) {
if (optval != NULL) {
try { params.speculative.n_min = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_p_min") || !strcmp(optname, "draft_p_min")) {
if (optval != NULL) {
try { params.speculative.p_min = std::stof(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_p_split")) {
if (optval != NULL) {
try { params.speculative.p_split = std::stof(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_size_n") || !strcmp(optname, "ngram_size_n")) {
if (optval != NULL) {
try { params.speculative.ngram_size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_size_m") || !strcmp(optname, "ngram_size_m")) {
if (optval != NULL) {
try { params.speculative.ngram_size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_min_hits") || !strcmp(optname, "ngram_min_hits")) {
if (optval != NULL) {
try { params.speculative.ngram_min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "draft_gpu_layers")) {
if (optval != NULL) {
try { params.speculative.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "draft_ctx_size")) {
if (optval != NULL) {
try { params.speculative.n_ctx = std::stoi(optval_str); } catch (...) {}
}
}
}
@@ -713,13 +811,17 @@ private:
public:
BackendServiceImpl(server_context& ctx) : ctx_server(ctx) {}
grpc::Status Health(ServerContext* /*context*/, const backend::HealthMessage* /*request*/, backend::Reply* reply) override {
grpc::Status Health(ServerContext* context, const backend::HealthMessage* /*request*/, backend::Reply* reply) override {
auto auth = checkAuth(context);
if (!auth.ok()) return auth;
// Implement Health RPC
reply->set_message("OK");
return Status::OK;
}
grpc::Status LoadModel(ServerContext* /*context*/, const backend::ModelOptions* request, backend::Result* result) override {
grpc::Status LoadModel(ServerContext* context, const backend::ModelOptions* request, backend::Result* result) override {
auto auth = checkAuth(context);
if (!auth.ok()) return auth;
// Implement LoadModel RPC
common_params params;
params_parse(ctx_server, request, params);
@@ -918,6 +1020,8 @@ public:
}
grpc::Status PredictStream(grpc::ServerContext* context, const backend::PredictOptions* request, grpc::ServerWriter<backend::Reply>* writer) override {
auto auth = checkAuth(context);
if (!auth.ok()) return auth;
if (params_base.model.path.empty()) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
@@ -1205,6 +1309,7 @@ public:
body_json["messages"] = messages_json;
body_json["stream"] = true; // PredictStream is always streaming
body_json["stream_options"] = {{"include_usage", true}}; // Ensure token counts in final chunk
// Check if grammar is provided from Go layer (NoGrammar=false)
// If grammar is provided, we must use it and NOT let template generate grammar from tools
@@ -1509,11 +1614,15 @@ public:
ctx_server.impl->vocab,
params_base,
ctx_server.get_meta().slot_n_ctx,
ctx_server.get_meta().logit_bias_eog,
data);
task.id_slot = json_value(data, "id_slot", -1);
// OAI-compat
task.params.res_type = TASK_RESPONSE_TYPE_NONE;
// OAI-compat: enable autoparser (PEG-based chat parsing) so that
// reasoning, tool calls, and content are classified into ChatDeltas.
// Without this, the PEG parser never produces diffs and the Go side
// cannot detect tool calls or separate reasoning from content.
task.params.res_type = TASK_RESPONSE_TYPE_OAI_CHAT;
task.params.oaicompat_cmpl_id = completion_id;
// oaicompat_model is already populated by params_from_json_cmpl
@@ -1538,19 +1647,47 @@ public:
return grpc::Status(grpc::StatusCode::INTERNAL, error_json.value("message", "Error occurred"));
}
// Lambda to build a Reply from JSON + attach chat deltas from a result
// Lambda to build a Reply from JSON + attach chat deltas from a result.
// Handles both native format ({"content": "..."}) and OAI chat format
// ({"choices": [{"delta": {"content": "...", "reasoning": "..."}}]}).
auto build_reply_from_json = [](const json & res_json, server_task_result * raw_result) -> backend::Reply {
backend::Reply reply;
std::string completion_text = res_json.value("content", "");
reply.set_message(completion_text);
reply.set_tokens(res_json.value("tokens_predicted", 0));
reply.set_prompt_tokens(res_json.value("tokens_evaluated", 0));
std::string completion_text;
if (res_json.contains("choices")) {
// OAI chat format — extract content from choices[0].delta
const auto & choices = res_json.at("choices");
if (!choices.empty()) {
const auto & delta = choices[0].value("delta", json::object());
if (delta.contains("content") && !delta.at("content").is_null()) {
completion_text = delta.at("content").get<std::string>();
}
}
} else {
// Native llama.cpp format
completion_text = res_json.value("content", "");
}
reply.set_message(completion_text);
// Token counts: native format has top-level fields,
// OAI format has them in "usage" (final chunk only)
if (res_json.contains("usage")) {
const auto & usage = res_json.at("usage");
reply.set_tokens(usage.value("completion_tokens", 0));
reply.set_prompt_tokens(usage.value("prompt_tokens", 0));
} else {
reply.set_tokens(res_json.value("tokens_predicted", 0));
reply.set_prompt_tokens(res_json.value("tokens_evaluated", 0));
}
// Timings: present as top-level "timings" in both formats
if (res_json.contains("timings")) {
reply.set_timing_prompt_processing(res_json.at("timings").value("prompt_ms", 0.0));
reply.set_timing_token_generation(res_json.at("timings").value("predicted_ms", 0.0));
}
// Logprobs: extract_logprobs_from_json handles both formats
json logprobs_json = extract_logprobs_from_json(res_json);
if (!logprobs_json.empty() && !logprobs_json.is_null()) {
reply.set_logprobs(logprobs_json.dump());
@@ -1559,6 +1696,12 @@ public:
return reply;
};
// Attach chat deltas from the autoparser to a Reply.
// When diffs are available, populate ChatDeltas on the reply.
// The raw message is always preserved so the Go side can use it
// for reasoning extraction and tool call parsing as a fallback
// (important in distributed mode where ChatDeltas may not be
// the primary parsing path).
auto attach_chat_deltas = [](backend::Reply & reply, server_task_result * raw_result) {
// Try streaming partial result first
auto* partial = dynamic_cast<server_task_result_cmpl_partial*>(raw_result);
@@ -1573,12 +1716,23 @@ public:
}
};
// Process first result
// Process first result.
// When TASK_RESPONSE_TYPE_OAI_CHAT is used, the first token may
// produce a JSON array with a role-init element followed by the
// actual content element. We must only attach chat deltas to the
// content element — attaching to both would duplicate the first
// token since oaicompat_msg_diffs is the same for both.
json first_res_json = first_result->to_json();
if (first_res_json.is_array()) {
for (const auto & res : first_res_json) {
auto reply = build_reply_from_json(res, first_result.get());
attach_chat_deltas(reply, first_result.get());
// Skip chat deltas for role-init elements (have "role" in
// delta but no content/reasoning diffs of their own).
bool is_role_init = res.contains("choices") && !res["choices"].empty() &&
res["choices"][0].value("delta", json::object()).contains("role");
if (!is_role_init) {
attach_chat_deltas(reply, first_result.get());
}
writer->Write(reply);
}
} else {
@@ -1602,7 +1756,11 @@ public:
if (res_json.is_array()) {
for (const auto & res : res_json) {
auto reply = build_reply_from_json(res, result.get());
attach_chat_deltas(reply, result.get());
bool is_role_init = res.contains("choices") && !res["choices"].empty() &&
res["choices"][0].value("delta", json::object()).contains("role");
if (!is_role_init) {
attach_chat_deltas(reply, result.get());
}
writer->Write(reply);
}
} else {
@@ -1621,6 +1779,8 @@ public:
}
grpc::Status Predict(ServerContext* context, const backend::PredictOptions* request, backend::Reply* reply) override {
auto auth = checkAuth(context);
if (!auth.ok()) return auth;
if (params_base.model.path.empty()) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
@@ -2238,11 +2398,13 @@ public:
ctx_server.impl->vocab,
params_base,
ctx_server.get_meta().slot_n_ctx,
ctx_server.get_meta().logit_bias_eog,
data);
task.id_slot = json_value(data, "id_slot", -1);
// OAI-compat
task.params.res_type = TASK_RESPONSE_TYPE_NONE;
// OAI-compat: enable autoparser (PEG-based chat parsing) so that
// reasoning, tool calls, and content are classified into ChatDeltas.
task.params.res_type = TASK_RESPONSE_TYPE_OAI_CHAT;
task.params.oaicompat_cmpl_id = completion_id;
// oaicompat_model is already populated by params_from_json_cmpl
@@ -2273,25 +2435,48 @@ public:
auto* final_res = dynamic_cast<server_task_result_cmpl_final*>(all_results.results[0].get());
GGML_ASSERT(final_res != nullptr);
json result_json = all_results.results[0]->to_json();
reply->set_message(result_json.value("content", ""));
int32_t tokens_predicted = result_json.value("tokens_predicted", 0);
// Handle both native format ({"content": "...", "tokens_predicted": N})
// and OAI chat format ({"choices": [{"message": {"content": "..."}}],
// "usage": {"completion_tokens": N, "prompt_tokens": N}}).
std::string completion_text;
int32_t tokens_predicted = 0;
int32_t tokens_evaluated = 0;
if (result_json.contains("choices")) {
// OAI chat format
const auto & choices = result_json.at("choices");
if (!choices.empty()) {
const auto & msg = choices[0].value("message", json::object());
if (msg.contains("content") && !msg.at("content").is_null()) {
completion_text = msg.at("content").get<std::string>();
}
}
if (result_json.contains("usage")) {
const auto & usage = result_json.at("usage");
tokens_predicted = usage.value("completion_tokens", 0);
tokens_evaluated = usage.value("prompt_tokens", 0);
}
} else {
// Native llama.cpp format
completion_text = result_json.value("content", "");
tokens_predicted = result_json.value("tokens_predicted", 0);
tokens_evaluated = result_json.value("tokens_evaluated", 0);
}
reply->set_message(completion_text);
reply->set_tokens(tokens_predicted);
int32_t tokens_evaluated = result_json.value("tokens_evaluated", 0);
reply->set_prompt_tokens(tokens_evaluated);
// Timings: present in both formats as a top-level "timings" object
if (result_json.contains("timings")) {
double timing_prompt_processing = result_json.at("timings").value("prompt_ms", 0.0);
reply->set_timing_prompt_processing(timing_prompt_processing);
double timing_token_generation = result_json.at("timings").value("predicted_ms", 0.0);
reply->set_timing_token_generation(timing_token_generation);
reply->set_timing_prompt_processing(result_json.at("timings").value("prompt_ms", 0.0));
reply->set_timing_token_generation(result_json.at("timings").value("predicted_ms", 0.0));
}
// Extract and set logprobs if present
// Logprobs: extract_logprobs_from_json handles both formats
json logprobs_json = extract_logprobs_from_json(result_json);
if (!logprobs_json.empty() && !logprobs_json.is_null()) {
std::string logprobs_str = logprobs_json.dump();
reply->set_logprobs(logprobs_str);
reply->set_logprobs(logprobs_json.dump());
}
// Populate chat deltas from the autoparser's final parsed message
@@ -2307,7 +2492,20 @@ public:
for (auto & res : all_results.results) {
GGML_ASSERT(dynamic_cast<server_task_result_cmpl_final*>(res.get()) != nullptr);
json res_json = res->to_json();
arr.push_back(res_json.value("content", ""));
// Handle both native and OAI chat formats
std::string result_content;
if (res_json.contains("choices")) {
const auto & choices = res_json.at("choices");
if (!choices.empty()) {
const auto & msg = choices[0].value("message", json::object());
if (msg.contains("content") && !msg.at("content").is_null()) {
result_content = msg.at("content").get<std::string>();
}
}
} else {
result_content = res_json.value("content", "");
}
arr.push_back(result_content);
// Extract logprobs for each result
json logprobs_json = extract_logprobs_from_json(res_json);
@@ -2339,6 +2537,8 @@ public:
}
grpc::Status Embedding(ServerContext* context, const backend::PredictOptions* request, backend::EmbeddingResult* embeddingResult) override {
auto auth = checkAuth(context);
if (!auth.ok()) return auth;
if (params_base.model.path.empty()) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
@@ -2519,7 +2719,9 @@ public:
return grpc::Status::OK;
}
grpc::Status TokenizeString(ServerContext* /*context*/, const backend::PredictOptions* request, backend::TokenizationResponse* response) override {
grpc::Status TokenizeString(ServerContext* context, const backend::PredictOptions* request, backend::TokenizationResponse* response) override {
auto auth = checkAuth(context);
if (!auth.ok()) return auth;
if (params_base.model.path.empty()) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
@@ -2687,7 +2889,6 @@ public:
tf->set_id_field(ap.tools.format.id_field);
tf->set_fun_name_is_key(ap.tools.format.fun_name_is_key);
tf->set_tools_array_wrapped(ap.tools.format.tools_array_wrapped);
tf->set_uses_python_dicts(ap.tools.format.uses_python_dicts);
tf->set_function_field(ap.tools.format.function_field);
tf->set_gen_id_field(ap.tools.format.gen_id_field);
@@ -2761,10 +2962,18 @@ int main(int argc, char** argv) {
ServerBuilder builder;
builder.AddListeningPort(server_address, grpc::InsecureServerCredentials());
// Initialize bearer token auth if LOCALAI_GRPC_AUTH_TOKEN is set
const char* auth_token = std::getenv("LOCALAI_GRPC_AUTH_TOKEN");
if (auth_token != nullptr && auth_token[0] != '\0') {
g_grpc_auth_token = auth_token;
std::cout << "gRPC auth enabled via LOCALAI_GRPC_AUTH_TOKEN" << std::endl;
}
builder.RegisterService(&service);
builder.SetMaxMessageSize(50 * 1024 * 1024); // 50MB
builder.SetMaxSendMessageSize(50 * 1024 * 1024); // 50MB
builder.SetMaxReceiveMessageSize(50 * 1024 * 1024); // 50MB
std::unique_ptr<Server> server(builder.BuildAndStart());
// run the HTTP server in a thread - see comment below
std::thread t([&]()

View File

@@ -24,6 +24,9 @@ if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
@@ -33,6 +36,9 @@ elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
else
echo "Error: Could not detect architecture"
exit 1

View File

@@ -46,6 +46,10 @@ if [ "$(uname)" == "Darwin" ]; then
#export DYLD_FALLBACK_LIBRARY_PATH=$CURDIR/lib:$DYLD_FALLBACK_LIBRARY_PATH
else
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
if [ -d "$CURDIR/lib/rocblas/library" ]; then
export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
fi
fi
# If there is a lib/ld.so, use it

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# acestep.cpp version
ACESTEP_REPO?=https://github.com/ace-step/acestep.cpp
ACESTEP_CPP_VERSION?=5aa065445541094cba934299cd498bbb9fa5c434
ACESTEP_CPP_VERSION?=e0c8d75a672fca5684c88c68dbf6d12f58754258
SO_TARGET?=libgoacestepcpp.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -106,12 +106,13 @@ func TestLoadModel(t *testing.T) {
defer conn.Close()
client := pb.NewBackendClient(conn)
// Get base directory from main model file for relative paths
mainModelPath := filepath.Join(modelDir, "acestep-5Hz-lm-0.6B-Q8_0.gguf")
resp, err := client.LoadModel(context.Background(), &pb.ModelOptions{
ModelFile: mainModelPath,
ModelPath: modelDir,
Options: []string{
"text_encoder_model:Qwen3-Embedding-0.6B-Q8_0.gguf",
"dit_model:acestep-v15-turbo-Q8_0.gguf",
@@ -133,7 +134,7 @@ func TestSoundGeneration(t *testing.T) {
if err != nil {
t.Fatal(err)
}
defer os.RemoveAll(tmpDir)
t.Cleanup(func() { os.RemoveAll(tmpDir) })
outputFile := filepath.Join(tmpDir, "output.wav")
@@ -151,6 +152,7 @@ func TestSoundGeneration(t *testing.T) {
// Load models
loadResp, err := client.LoadModel(context.Background(), &pb.ModelOptions{
ModelFile: mainModelPath,
ModelPath: modelDir,
Options: []string{
"text_encoder_model:Qwen3-Embedding-0.6B-Q8_0.gguf",
"dit_model:acestep-v15-turbo-Q8_0.gguf",

View File

@@ -11,7 +11,7 @@ import (
)
var (
CppLoadModel func(lmModelPath, textEncoderPath, ditModelPath, vaeModelPath string) int
CppLoadModel func(lmModelPath, textEncoderPath, ditModelPath, vaeModelPath string) int
CppGenerateMusic func(caption, lyrics string, bpm int, keyscale, timesignature string, duration, temperature float32, instrumental bool, seed int, dst string, threads int) int
)
@@ -24,23 +24,23 @@ func (a *AceStepCpp) Load(opts *pb.ModelOptions) error {
lmModel := opts.ModelFile
// Get the base directory from ModelFile for resolving relative paths
baseDir := filepath.Dir(lmModel)
baseDir := opts.ModelPath
var textEncoderModel, ditModel, vaeModel string
for _, oo := range opts.Options {
parts := strings.SplitN(oo, ":", 2)
if len(parts) != 2 {
key, value, found := strings.Cut(oo, ":")
if !found {
fmt.Fprintf(os.Stderr, "Unrecognized option: %v\n", oo)
continue
}
switch parts[0] {
switch key {
case "text_encoder_model":
textEncoderModel = parts[1]
textEncoderModel = value
case "dit_model":
ditModel = parts[1]
ditModel = value
case "vae_model":
vaeModel = parts[1]
vaeModel = value
default:
fmt.Fprintf(os.Stderr, "Unrecognized option: %v\n", oo)
}

View File

@@ -18,7 +18,6 @@ type LLM struct {
draftModel *llama.LLama
}
// Free releases GPU resources and frees the llama model
// This should be called when the model is being unloaded to properly release VRAM
func (llm *LLM) Free() error {

View File

@@ -1,5 +1,4 @@
//go:build debug
// +build debug
package main

View File

@@ -1,5 +1,4 @@
//go:build !debug
// +build !debug
package main

View File

@@ -332,7 +332,7 @@ func normalizedCosineSimilarity(k1, k2 []float32) float32 {
assert(len(k1) == len(k2), fmt.Sprintf("normalizedCosineSimilarity: len(k1) = %d, len(k2) = %d", len(k1), len(k2)))
var dot float32
for i := 0; i < len(k1); i++ {
for i := range len(k1) {
dot += k1[i] * k2[i]
}
@@ -419,7 +419,7 @@ func cosineSimilarity(k1, k2 []float32, mag1 float64) float32 {
assert(len(k1) == len(k2), fmt.Sprintf("cosineSimilarity: len(k1) = %d, len(k2) = %d", len(k1), len(k2)))
var dot, mag2 float64
for i := 0; i < len(k1); i++ {
for i := range len(k1) {
dot += float64(k1[i] * k2[i])
mag2 += float64(k2[i] * k2[i])
}

View File

@@ -701,7 +701,7 @@ var _ = Describe("Opus", func() {
// to one-shot (only difference is resampler batch boundaries).
var maxDiff float64
var sumDiffSq float64
for i := 0; i < minLen; i++ {
for i := range minLen {
diff := math.Abs(float64(oneShotTail[i]) - float64(batchedTail[i]))
if diff > maxDiff {
maxDiff = diff
@@ -774,7 +774,7 @@ var _ = Describe("Opus", func() {
minLen := min(len(refTail), min(len(persistentTail), len(freshTail)))
var persistentMaxDiff, freshMaxDiff float64
for i := 0; i < minLen; i++ {
for i := range minLen {
pd := math.Abs(float64(refTail[i]) - float64(persistentTail[i]))
fd := math.Abs(float64(refTail[i]) - float64(freshTail[i]))
if pd > persistentMaxDiff {
@@ -932,7 +932,7 @@ var _ = Describe("Opus", func() {
GinkgoWriter.Printf("Zero-crossing intervals: mean=%.2f stddev=%.2f CV=%.3f (expected period ~%.1f)\n",
mean, stddev, stddev/mean, 16000.0/440.0/2.0)
Expect(stddev / mean).To(BeNumerically("<", 0.15),
Expect(stddev/mean).To(BeNumerically("<", 0.15),
fmt.Sprintf("irregular zero crossings suggest discontinuity: CV=%.3f", stddev/mean))
// Also check frequency is correct
@@ -978,7 +978,7 @@ var _ = Describe("Opus", func() {
// Every sample must be identical — the resampler is deterministic
var maxDiff float64
for i := 0; i < len(oneShot); i++ {
for i := range len(oneShot) {
diff := math.Abs(float64(oneShot[i]) - float64(batched[i]))
if diff > maxDiff {
maxDiff = diff
@@ -1037,13 +1037,13 @@ var _ = Describe("Opus", func() {
binary.LittleEndian.PutUint32(hdr[4:8], uint32(36+dataLen))
copy(hdr[8:12], "WAVE")
copy(hdr[12:16], "fmt ")
binary.LittleEndian.PutUint32(hdr[16:20], 16) // chunk size
binary.LittleEndian.PutUint16(hdr[20:22], 1) // PCM
binary.LittleEndian.PutUint16(hdr[22:24], 1) // mono
binary.LittleEndian.PutUint32(hdr[24:28], uint32(sampleRate)) // sample rate
binary.LittleEndian.PutUint32(hdr[28:32], uint32(sampleRate*2)) // byte rate
binary.LittleEndian.PutUint16(hdr[32:34], 2) // block align
binary.LittleEndian.PutUint16(hdr[34:36], 16) // bits per sample
binary.LittleEndian.PutUint32(hdr[16:20], 16) // chunk size
binary.LittleEndian.PutUint16(hdr[20:22], 1) // PCM
binary.LittleEndian.PutUint16(hdr[22:24], 1) // mono
binary.LittleEndian.PutUint32(hdr[24:28], uint32(sampleRate)) // sample rate
binary.LittleEndian.PutUint32(hdr[28:32], uint32(sampleRate*2)) // byte rate
binary.LittleEndian.PutUint16(hdr[32:34], 2) // block align
binary.LittleEndian.PutUint16(hdr[34:36], 16) // bits per sample
copy(hdr[36:40], "data")
binary.LittleEndian.PutUint32(hdr[40:44], uint32(dataLen))
@@ -1126,7 +1126,7 @@ var _ = Describe("Opus", func() {
)
pcm := make([]byte, toneNumSamples*2)
for i := 0; i < toneNumSamples; i++ {
for i := range toneNumSamples {
sample := int16(toneAmplitude * math.Sin(2*math.Pi*toneFreq*float64(i)/float64(toneSampleRate)))
binary.LittleEndian.PutUint16(pcm[i*2:], uint16(sample))
}

View File

@@ -0,0 +1,56 @@
cmake_minimum_required(VERSION 3.14)
project(goqwen3ttscpp LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(QWEN3TTS_DIR ${CMAKE_CURRENT_SOURCE_DIR}/sources/qwen3-tts.cpp)
# Override upstream's CMAKE_CUDA_ARCHITECTURES before add_subdirectory.
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
set(CMAKE_CUDA_ARCHITECTURES "75-virtual;80-virtual;86-real;89-real")
endif()
# Build ggml from the upstream's submodule FIRST, so that ggml/ggml-base/ggml-cpu
# CMake targets exist when the upstream project references them by name.
# The upstream CMakeLists.txt uses target_link_libraries(... ggml ggml-base ggml-cpu)
# with target_link_directories pointing at a pre-built ggml/build/. By adding ggml
# as a subdirectory here, CMake resolves those names as targets instead.
add_subdirectory(${QWEN3TTS_DIR}/ggml ggml EXCLUDE_FROM_ALL)
# Now add the upstream project
add_subdirectory(${QWEN3TTS_DIR} qwen3tts EXCLUDE_FROM_ALL)
add_library(goqwen3ttscpp MODULE cpp/goqwen3ttscpp.cpp)
target_link_libraries(goqwen3ttscpp PRIVATE qwen3_tts)
target_include_directories(goqwen3ttscpp PRIVATE ${QWEN3TTS_DIR}/src)
target_include_directories(goqwen3ttscpp SYSTEM PRIVATE ${QWEN3TTS_DIR}/ggml/include)
# Link GPU backends if available
foreach(backend blas cuda metal vulkan)
if(TARGET ggml-${backend})
target_link_libraries(goqwen3ttscpp PRIVATE ggml-${backend})
string(TOUPPER ${backend} BACKEND_UPPER)
target_compile_definitions(goqwen3ttscpp PRIVATE QWEN3TTS_HAVE_${BACKEND_UPPER})
if(backend STREQUAL "cuda")
find_package(CUDAToolkit QUIET)
if(CUDAToolkit_FOUND)
target_link_libraries(goqwen3ttscpp PRIVATE CUDA::cudart)
endif()
endif()
endif()
endforeach()
if(MSVC)
target_compile_options(goqwen3ttscpp PRIVATE /W4 /wd4100 /wd4505)
else()
target_compile_options(goqwen3ttscpp PRIVATE -Wall -Wextra -Wshadow -Wconversion
-Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion)
endif()
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
target_link_libraries(goqwen3ttscpp PRIVATE stdc++fs)
endif()
set_property(TARGET goqwen3ttscpp PROPERTY CXX_STANDARD 17)
set_target_properties(goqwen3ttscpp PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})

View File

@@ -0,0 +1,126 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# qwen3-tts.cpp version
QWEN3TTS_REPO?=https://github.com/predict-woo/qwen3-tts.cpp
QWEN3TTS_CPP_VERSION?=7a762e2ad4bacc6fdda81d81bf10a09ffb546f29
SO_TARGET?=libgoqwen3ttscpp.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DGGML_HIPBLAS=ON
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON
else ifeq ($(OS),Darwin)
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
endif
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
endif
sources/qwen3-tts.cpp:
mkdir -p sources/qwen3-tts.cpp
cd sources/qwen3-tts.cpp && \
git init && \
git remote add origin $(QWEN3TTS_REPO) && \
git fetch origin && \
git checkout $(QWEN3TTS_CPP_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# Detect OS
UNAME_S := $(shell uname -s)
# Only build CPU variants on Linux
ifeq ($(UNAME_S),Linux)
VARIANT_TARGETS = libgoqwen3ttscpp-avx.so libgoqwen3ttscpp-avx2.so libgoqwen3ttscpp-avx512.so libgoqwen3ttscpp-fallback.so
else
# On non-Linux (e.g., Darwin), build only fallback variant
VARIANT_TARGETS = libgoqwen3ttscpp-fallback.so
endif
qwen3-tts-cpp: main.go goqwen3ttscpp.go $(VARIANT_TARGETS)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o qwen3-tts-cpp ./
package: qwen3-tts-cpp
bash package.sh
build: package
clean: purge
rm -rf libgoqwen3ttscpp*.so package sources/qwen3-tts.cpp qwen3-tts-cpp
purge:
rm -rf build*
# Variants must build sequentially
.NOTPARALLEL:
# Build all variants (Linux only)
ifeq ($(UNAME_S),Linux)
libgoqwen3ttscpp-avx.so: sources/qwen3-tts.cpp
$(info ${GREEN}I qwen3-tts-cpp build info:avx${RESET})
SO_TARGET=libgoqwen3ttscpp-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgoqwen3ttscpp-custom
rm -rf build-libgoqwen3ttscpp-avx.so
libgoqwen3ttscpp-avx2.so: sources/qwen3-tts.cpp
$(info ${GREEN}I qwen3-tts-cpp build info:avx2${RESET})
SO_TARGET=libgoqwen3ttscpp-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgoqwen3ttscpp-custom
rm -rf build-libgoqwen3ttscpp-avx2.so
libgoqwen3ttscpp-avx512.so: sources/qwen3-tts.cpp
$(info ${GREEN}I qwen3-tts-cpp build info:avx512${RESET})
SO_TARGET=libgoqwen3ttscpp-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgoqwen3ttscpp-custom
rm -rf build-libgoqwen3ttscpp-avx512.so
endif
# Build fallback variant (all platforms)
libgoqwen3ttscpp-fallback.so: sources/qwen3-tts.cpp
$(info ${GREEN}I qwen3-tts-cpp build info:fallback${RESET})
SO_TARGET=libgoqwen3ttscpp-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgoqwen3ttscpp-custom
rm -rf build-libgoqwen3ttscpp-fallback.so
libgoqwen3ttscpp-custom: CMakeLists.txt cpp/goqwen3ttscpp.cpp cpp/goqwen3ttscpp.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \
cmake --build . --config Release -j$(JOBS) --target goqwen3ttscpp && \
cd .. && \
mv build-$(SO_TARGET)/libgoqwen3ttscpp.so ./$(SO_TARGET)
test: qwen3-tts-cpp
@echo "Running qwen3-tts-cpp tests..."
bash test.sh
@echo "qwen3-tts-cpp tests completed."
all: qwen3-tts-cpp package

View File

@@ -0,0 +1,161 @@
#include "goqwen3ttscpp.h"
#include "ggml-backend.h"
#include "qwen3_tts.h"
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
using namespace qwen3_tts;
// Global engine (loaded once, reused across requests)
static Qwen3TTS *g_engine = nullptr;
static bool g_loaded = false;
static int g_threads = 4;
static void ggml_log_cb(enum ggml_log_level level, const char *log, void *data) {
const char *level_str;
if (!log)
return;
switch (level) {
case GGML_LOG_LEVEL_DEBUG:
level_str = "DEBUG";
break;
case GGML_LOG_LEVEL_INFO:
level_str = "INFO";
break;
case GGML_LOG_LEVEL_WARN:
level_str = "WARN";
break;
case GGML_LOG_LEVEL_ERROR:
level_str = "ERROR";
break;
default:
level_str = "?????";
break;
}
fprintf(stderr, "[%-5s] ", level_str);
fputs(log, stderr);
fflush(stderr);
}
// Map language string to language_id token used by the model
static int language_to_id(const char *lang) {
if (!lang || lang[0] == '\0')
return 2050; // default: English
std::string l(lang);
if (l == "en")
return 2050;
if (l == "ru")
return 2069;
if (l == "zh")
return 2055;
if (l == "ja")
return 2058;
if (l == "ko")
return 2064;
if (l == "de")
return 2053;
if (l == "fr")
return 2061;
if (l == "es")
return 2054;
if (l == "it")
return 2056;
if (l == "pt")
return 2057;
fprintf(stderr, "[qwen3-tts-cpp] Unknown language '%s', defaulting to English\n",
lang);
return 2050;
}
int load_model(const char *model_dir, int n_threads) {
ggml_log_set(ggml_log_cb, nullptr);
ggml_backend_load_all();
if (n_threads <= 0)
n_threads = 4;
g_threads = n_threads;
fprintf(stderr, "[qwen3-tts-cpp] Loading models from %s (threads=%d)\n",
model_dir, n_threads);
g_engine = new Qwen3TTS();
if (!g_engine->load_models(model_dir)) {
fprintf(stderr, "[qwen3-tts-cpp] FATAL: failed to load models from %s\n",
model_dir);
delete g_engine;
g_engine = nullptr;
return 1;
}
g_loaded = true;
fprintf(stderr, "[qwen3-tts-cpp] Models loaded successfully\n");
return 0;
}
int synthesize(const char *text, const char *ref_audio_path, const char *dst,
const char *language, float temperature, float top_p,
int top_k, float repetition_penalty, int max_audio_tokens,
int n_threads) {
if (!g_loaded || !g_engine) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: models not loaded\n");
return 1;
}
if (!text || !dst) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: text and dst are required\n");
return 2;
}
tts_params params;
params.max_audio_tokens = max_audio_tokens > 0 ? max_audio_tokens : 4096;
params.temperature = temperature;
params.top_p = top_p;
params.top_k = top_k;
params.repetition_penalty = repetition_penalty;
params.n_threads = n_threads > 0 ? n_threads : g_threads;
params.language_id = language_to_id(language);
fprintf(stderr, "[qwen3-tts-cpp] Synthesizing: text='%.50s%s', lang_id=%d, "
"temp=%.2f, threads=%d\n",
text, (strlen(text) > 50 ? "..." : ""), params.language_id,
temperature, params.n_threads);
tts_result result;
bool has_ref = ref_audio_path && ref_audio_path[0] != '\0';
if (has_ref) {
fprintf(stderr, "[qwen3-tts-cpp] Voice cloning with ref: %s\n",
ref_audio_path);
result = g_engine->synthesize_with_voice(text, ref_audio_path, params);
} else {
result = g_engine->synthesize(text, params);
}
if (!result.success) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: synthesis failed: %s\n",
result.error_msg.c_str());
return 3;
}
int n_samples = (int)result.audio.size();
if (n_samples == 0) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: synthesis produced no samples\n");
return 4;
}
fprintf(stderr,
"[qwen3-tts-cpp] Synthesis done: %d samples (%.2fs @ 24kHz)\n",
n_samples, (float)n_samples / 24000.0f);
if (!save_audio_file(dst, result.audio, result.sample_rate)) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: failed to write %s\n", dst);
return 5;
}
fprintf(stderr, "[qwen3-tts-cpp] Wrote %s\n", dst);
return 0;
}

View File

@@ -0,0 +1,12 @@
#pragma once
#include <cstddef>
#include <cstdint>
extern "C" {
int load_model(const char *model_dir, int n_threads);
int synthesize(const char *text, const char *ref_audio_path, const char *dst,
const char *language, float temperature, float top_p,
int top_k, float repetition_penalty, int max_audio_tokens,
int n_threads);
}

View File

@@ -0,0 +1,74 @@
package main
import (
"fmt"
"os"
"path/filepath"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)
var (
CppLoadModel func(modelDir string, nThreads int) int
CppSynthesize func(text, refAudioPath, dst, language string,
temperature, topP float32, topK int,
repetitionPenalty float32, maxAudioTokens, nThreads int) int
)
type Qwen3TtsCpp struct {
base.SingleThread
threads int
}
func (q *Qwen3TtsCpp) Load(opts *pb.ModelOptions) error {
// ModelFile is the model directory path (containing GGUF files)
modelDir := opts.ModelFile
if modelDir == "" {
modelDir = opts.ModelPath
}
// Resolve relative paths
if !filepath.IsAbs(modelDir) && opts.ModelPath != "" {
modelDir = filepath.Join(opts.ModelPath, modelDir)
}
threads := int(opts.Threads)
if threads <= 0 {
threads = 4
}
q.threads = threads
fmt.Fprintf(os.Stderr, "[qwen3-tts-cpp] Loading models from: %s (threads=%d)\n", modelDir, threads)
if ret := CppLoadModel(modelDir, threads); ret != 0 {
return fmt.Errorf("failed to load qwen3-tts model (error code: %d)", ret)
}
return nil
}
func (q *Qwen3TtsCpp) TTS(req *pb.TTSRequest) error {
text := req.Text
voice := req.Voice // reference audio path for voice cloning (empty = no cloning)
dst := req.Dst
language := ""
if req.Language != nil {
language = *req.Language
}
// Synthesis parameters with sensible defaults
temperature := float32(0.9)
topP := float32(0.8)
topK := 50
repetitionPenalty := float32(1.05)
maxAudioTokens := 4096
if ret := CppSynthesize(text, voice, dst, language,
temperature, topP, topK, repetitionPenalty,
maxAudioTokens, q.threads); ret != 0 {
return fmt.Errorf("failed to synthesize audio (error code: %d)", ret)
}
return nil
}

View File

@@ -0,0 +1,47 @@
package main
// Note: this is started internally by LocalAI and a server is allocated for each model
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
// Get library name from environment variable, default to fallback
libName := os.Getenv("QWEN3TTS_LIBRARY")
if libName == "" {
libName = "./libgoqwen3ttscpp-fallback.so"
}
gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CppLoadModel, "load_model"},
{&CppSynthesize, "synthesize"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, gosd, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &Qwen3TtsCpp{}); err != nil {
panic(err)
}
}

View File

@@ -0,0 +1,64 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
# This script is used in the final stage of the Dockerfile
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/qwen3-tts-cpp $CURDIR/package/
cp -fv $CURDIR/libgoqwen3ttscpp-*.so $CURDIR/package/
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

View File

@@ -0,0 +1,173 @@
package main
import (
"context"
"os"
"os/exec"
"path/filepath"
"testing"
"time"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
)
const (
testAddr = "localhost:50051"
startupWait = 5 * time.Second
)
func skipIfNoModel(t *testing.T) string {
t.Helper()
modelDir := os.Getenv("QWEN3TTS_MODEL_DIR")
if modelDir == "" {
t.Skip("QWEN3TTS_MODEL_DIR not set, skipping test (set to directory with GGUF models)")
}
if _, err := os.Stat(filepath.Join(modelDir, "qwen3-tts-0.6b-f16.gguf")); os.IsNotExist(err) {
t.Skipf("TTS model file not found in %s, skipping", modelDir)
}
if _, err := os.Stat(filepath.Join(modelDir, "qwen3-tts-tokenizer-f16.gguf")); os.IsNotExist(err) {
t.Skipf("Tokenizer model file not found in %s, skipping", modelDir)
}
return modelDir
}
func startServer(t *testing.T) *exec.Cmd {
t.Helper()
binary := os.Getenv("QWEN3TTS_BINARY")
if binary == "" {
binary = "./qwen3-tts-cpp"
}
if _, err := os.Stat(binary); os.IsNotExist(err) {
t.Skipf("Backend binary not found at %s, skipping", binary)
}
cmd := exec.Command(binary, "--addr", testAddr)
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
if err := cmd.Start(); err != nil {
t.Fatalf("Failed to start server: %v", err)
}
time.Sleep(startupWait)
return cmd
}
func stopServer(cmd *exec.Cmd) {
if cmd != nil && cmd.Process != nil {
cmd.Process.Kill()
cmd.Wait()
}
}
func dialGRPC(t *testing.T) *grpc.ClientConn {
t.Helper()
conn, err := grpc.Dial(testAddr,
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithDefaultCallOptions(
grpc.MaxCallRecvMsgSize(50*1024*1024),
grpc.MaxCallSendMsgSize(50*1024*1024),
),
)
if err != nil {
t.Fatalf("Failed to dial gRPC: %v", err)
}
return conn
}
func TestServerHealth(t *testing.T) {
cmd := startServer(t)
defer stopServer(cmd)
conn := dialGRPC(t)
defer conn.Close()
client := pb.NewBackendClient(conn)
resp, err := client.Health(context.Background(), &pb.HealthMessage{})
if err != nil {
t.Fatalf("Health check failed: %v", err)
}
if string(resp.Message) != "OK" {
t.Fatalf("Expected OK, got %s", string(resp.Message))
}
}
func TestLoadModel(t *testing.T) {
modelDir := skipIfNoModel(t)
cmd := startServer(t)
defer stopServer(cmd)
conn := dialGRPC(t)
defer conn.Close()
client := pb.NewBackendClient(conn)
resp, err := client.LoadModel(context.Background(), &pb.ModelOptions{
ModelFile: modelDir,
Threads: 4,
})
if err != nil {
t.Fatalf("LoadModel failed: %v", err)
}
if !resp.Success {
t.Fatalf("LoadModel returned failure: %s", resp.Message)
}
}
func TestTTS(t *testing.T) {
modelDir := skipIfNoModel(t)
tmpDir, err := os.MkdirTemp("", "qwen3tts-test")
if err != nil {
t.Fatal(err)
}
t.Cleanup(func() { os.RemoveAll(tmpDir) })
outputFile := filepath.Join(tmpDir, "output.wav")
cmd := startServer(t)
defer stopServer(cmd)
conn := dialGRPC(t)
defer conn.Close()
client := pb.NewBackendClient(conn)
// Load models
loadResp, err := client.LoadModel(context.Background(), &pb.ModelOptions{
ModelFile: modelDir,
Threads: 4,
})
if err != nil {
t.Fatalf("LoadModel failed: %v", err)
}
if !loadResp.Success {
t.Fatalf("LoadModel returned failure: %s", loadResp.Message)
}
// Synthesize speech
language := "en"
_, err = client.TTS(context.Background(), &pb.TTSRequest{
Text: "Hello, this is a test of the Qwen3 text to speech system.",
Dst: outputFile,
Language: &language,
})
if err != nil {
t.Fatalf("TTS failed: %v", err)
}
// Verify output file exists and has content
info, err := os.Stat(outputFile)
if os.IsNotExist(err) {
t.Fatal("Output audio file was not created")
}
if err != nil {
t.Fatalf("Failed to stat output file: %v", err)
}
t.Logf("Output file size: %d bytes", info.Size())
// WAV header is 44 bytes minimum; any real audio should be much larger
if info.Size() < 1000 {
t.Errorf("Output file too small (%d bytes), expected real audio data", info.Size())
}
}

52
backend/go/qwen3-tts-cpp/run.sh Executable file
View File

@@ -0,0 +1,52 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
if [ "$(uname)" != "Darwin" ]; then
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
fi
LIBRARY="$CURDIR/libgoqwen3ttscpp-fallback.so"
if [ "$(uname)" != "Darwin" ]; then
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/libgoqwen3ttscpp-avx.so ]; then
LIBRARY="$CURDIR/libgoqwen3ttscpp-avx.so"
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/libgoqwen3ttscpp-avx2.so ]; then
LIBRARY="$CURDIR/libgoqwen3ttscpp-avx2.so"
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/libgoqwen3ttscpp-avx512.so ]; then
LIBRARY="$CURDIR/libgoqwen3ttscpp-avx512.so"
fi
fi
fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export QWEN3TTS_LIBRARY=$LIBRARY
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using library: $LIBRARY"
exec $CURDIR/lib/ld.so $CURDIR/qwen3-tts-cpp "$@"
fi
echo "Using library: $LIBRARY"
exec $CURDIR/qwen3-tts-cpp "$@"

View File

@@ -0,0 +1,52 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath $0)")
echo "Running qwen3-tts-cpp backend tests..."
# The test requires:
# - QWEN3TTS_MODEL_DIR: path to directory containing GGUF model files
# - QWEN3TTS_BINARY: path to the qwen3-tts-cpp binary (defaults to ./qwen3-tts-cpp)
#
# Tests that require the model will be skipped if QWEN3TTS_MODEL_DIR is not set
# or the directory does not contain the required model files.
cd "$CURDIR"
# Only auto-download models when QWEN3TTS_MODEL_DIR is not explicitly set
if [ -z "$QWEN3TTS_MODEL_DIR" ]; then
export QWEN3TTS_MODEL_DIR="./qwen3-tts-models"
if [ ! -d "$QWEN3TTS_MODEL_DIR" ]; then
echo "Creating qwen3-tts-models directory for tests..."
mkdir -p "$QWEN3TTS_MODEL_DIR"
REPO_ID="endo5501/qwen3-tts.cpp"
echo "Repository: ${REPO_ID}"
echo ""
# Files to download (smallest model for testing)
FILES=(
"qwen3-tts-0.6b-f16.gguf"
"qwen3-tts-tokenizer-f16.gguf"
)
BASE_URL="https://huggingface.co/${REPO_ID}/resolve/main"
for file in "${FILES[@]}"; do
dest="${QWEN3TTS_MODEL_DIR}/${file}"
if [ -f "${dest}" ]; then
echo " [skip] ${file} (already exists)"
else
echo " [download] ${file}..."
curl -L -o "${dest}" "${BASE_URL}/${file}" --progress-bar
echo " [done] ${file}"
fi
done
fi
fi
# Run Go tests
go test -v -timeout 600s .
echo "All qwen3-tts-cpp tests passed."

7
backend/go/sam3-cpp/.gitignore vendored Normal file
View File

@@ -0,0 +1,7 @@
sources/
build*/
package/
libgosam3*.so
sam3-cpp
test-models/
test-data/

View File

@@ -0,0 +1,26 @@
cmake_minimum_required(VERSION 3.14)
project(gosam3 LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
# Build ggml as static libraries to avoid runtime .so dependencies
set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build static libraries" FORCE)
set(SAM3_BUILD_EXAMPLES OFF CACHE BOOL "Disable sam3.cpp examples" FORCE)
set(SAM3_BUILD_TESTS OFF CACHE BOOL "Disable sam3.cpp tests" FORCE)
add_subdirectory(./sources/sam3.cpp)
add_library(gosam3 MODULE gosam3.cpp)
target_link_libraries(gosam3 PRIVATE sam3 ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
target_link_libraries(gosam3 PRIVATE stdc++fs)
endif()
target_include_directories(gosam3 PUBLIC
sources/sam3.cpp
sources/sam3.cpp/ggml/include
)
set_property(TARGET gosam3 PROPERTY CXX_STANDARD 14)
set_target_properties(gosam3 PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})

View File

@@ -0,0 +1,122 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# sam3.cpp
SAM3_REPO?=https://github.com/PABannier/sam3.cpp
SAM3_VERSION?=01832ef85fcc8eb6488f1d01cd247f07e96ff5a9
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
# If build type is cublas, then we set -DGGML_CUDA=ON to CMAKE_ARGS automatically
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON
else ifeq ($(BUILD_TYPE),hipblas)
ROCM_HOME ?= /opt/rocm
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON
else ifeq ($(OS),Darwin)
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
endif
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
endif
sources/sam3.cpp:
git clone --recursive $(SAM3_REPO) sources/sam3.cpp && \
cd sources/sam3.cpp && \
git checkout $(SAM3_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# Detect OS
UNAME_S := $(shell uname -s)
# Only build CPU variants on Linux
ifeq ($(UNAME_S),Linux)
VARIANT_TARGETS = libgosam3-avx.so libgosam3-avx2.so libgosam3-avx512.so libgosam3-fallback.so
else
# On non-Linux (e.g., Darwin), build only fallback variant
VARIANT_TARGETS = libgosam3-fallback.so
endif
sam3-cpp: main.go gosam3.go $(VARIANT_TARGETS)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o sam3-cpp ./
package: sam3-cpp
bash package.sh
build: package
clean: purge
rm -rf libgosam3*.so sam3-cpp package sources
purge:
rm -rf build*
# Build all variants (Linux only)
ifeq ($(UNAME_S),Linux)
libgosam3-avx.so: sources/sam3.cpp
$(MAKE) purge
$(info ${GREEN}I sam3-cpp build info:avx${RESET})
SO_TARGET=libgosam3-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgosam3-custom
rm -rfv build*
libgosam3-avx2.so: sources/sam3.cpp
$(MAKE) purge
$(info ${GREEN}I sam3-cpp build info:avx2${RESET})
SO_TARGET=libgosam3-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgosam3-custom
rm -rfv build*
libgosam3-avx512.so: sources/sam3.cpp
$(MAKE) purge
$(info ${GREEN}I sam3-cpp build info:avx512${RESET})
SO_TARGET=libgosam3-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgosam3-custom
rm -rfv build*
endif
# Build fallback variant (all platforms)
libgosam3-fallback.so: sources/sam3.cpp
$(MAKE) purge
$(info ${GREEN}I sam3-cpp build info:fallback${RESET})
SO_TARGET=libgosam3-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgosam3-custom
rm -rfv build*
libgosam3-custom: CMakeLists.txt gosam3.cpp gosam3.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \
cmake --build . --config Release -j$(JOBS) && \
cd .. && \
mv build-$(SO_TARGET)/libgosam3.so ./$(SO_TARGET)
all: sam3-cpp package

View File

@@ -0,0 +1,193 @@
#include "sam3.h"
#include "gosam3.h"
#include <cstdio>
#include <cstring>
#include <memory>
#include <vector>
#define STB_IMAGE_WRITE_IMPLEMENTATION
#define STB_IMAGE_WRITE_STATIC
#include "stb_image_write.h"
// Static state
static std::shared_ptr<sam3_model> g_model;
static sam3_state_ptr g_state;
static sam3_result g_result;
static std::vector<std::vector<unsigned char>> g_mask_pngs;
// Callback for stbi_write_png_to_mem via stbi_write_png_to_func
static void png_write_callback(void *context, void *data, int size) {
auto *buf = static_cast<std::vector<unsigned char>*>(context);
auto *bytes = static_cast<unsigned char*>(data);
buf->insert(buf->end(), bytes, bytes + size);
}
// Encode all masks as PNGs after segmentation
static void encode_masks_as_png() {
g_mask_pngs.clear();
g_mask_pngs.resize(g_result.detections.size());
for (size_t i = 0; i < g_result.detections.size(); i++) {
const auto &mask = g_result.detections[i].mask;
if (mask.width > 0 && mask.height > 0 && !mask.data.empty()) {
stbi_write_png_to_func(png_write_callback, &g_mask_pngs[i],
mask.width, mask.height, 1,
mask.data.data(), mask.width);
}
}
}
extern "C" {
int sam3_cpp_load_model(const char *model_path, int threads) {
sam3_params params;
params.model_path = model_path;
params.n_threads = threads;
params.use_gpu = true;
g_model = sam3_load_model(params);
if (!g_model) {
fprintf(stderr, "[sam3-cpp] Failed to load model: %s\n", model_path);
return 1;
}
g_state = sam3_create_state(*g_model, params);
if (!g_state) {
fprintf(stderr, "[sam3-cpp] Failed to create state\n");
g_model.reset();
return 2;
}
fprintf(stderr, "[sam3-cpp] Model loaded: %s (threads=%d)\n", model_path, threads);
return 0;
}
int sam3_cpp_encode_image(const char *image_path) {
if (!g_model || !g_state) {
fprintf(stderr, "[sam3-cpp] Model not loaded\n");
return 1;
}
sam3_image img = sam3_load_image(image_path);
if (img.data.empty()) {
fprintf(stderr, "[sam3-cpp] Failed to load image: %s\n", image_path);
return 2;
}
if (!sam3_encode_image(*g_state, *g_model, img)) {
fprintf(stderr, "[sam3-cpp] Failed to encode image\n");
return 3;
}
return 0;
}
int sam3_cpp_segment_pvs(float *points, int n_point_triples,
float *boxes, int n_box_quads,
float threshold) {
if (!g_model || !g_state) {
return -1;
}
sam3_pvs_params pvs_params;
// Parse points: each triple is [x, y, label]
for (int i = 0; i < n_point_triples; i++) {
float x = points[i * 3];
float y = points[i * 3 + 1];
float label = points[i * 3 + 2];
sam3_point pt = {x, y};
if (label > 0.5f) {
pvs_params.pos_points.push_back(pt);
} else {
pvs_params.neg_points.push_back(pt);
}
}
// Parse boxes: each quad is [x1, y1, x2, y2], use only first box
if (n_box_quads > 0) {
pvs_params.box = {boxes[0], boxes[1], boxes[2], boxes[3]};
pvs_params.use_box = true;
}
g_result = sam3_segment_pvs(*g_state, *g_model, pvs_params);
encode_masks_as_png();
return static_cast<int>(g_result.detections.size());
}
int sam3_cpp_segment_pcs(const char *text_prompt, float threshold) {
if (!g_model || !g_state) {
return -1;
}
// PCS mode requires SAM 3 (full model with text encoder)
if (sam3_is_visual_only(*g_model) ||
sam3_get_model_type(*g_model) != SAM3_MODEL_SAM3) {
fprintf(stderr, "[sam3-cpp] PCS mode requires full SAM 3 model\n");
return -1;
}
sam3_pcs_params pcs_params;
pcs_params.text_prompt = text_prompt;
pcs_params.score_threshold = threshold > 0 ? threshold : 0.5f;
g_result = sam3_segment_pcs(*g_state, *g_model, pcs_params);
encode_masks_as_png();
return static_cast<int>(g_result.detections.size());
}
int sam3_cpp_get_n_detections(void) {
return static_cast<int>(g_result.detections.size());
}
float sam3_cpp_get_detection_x(int i) {
if (i < 0 || i >= static_cast<int>(g_result.detections.size())) return 0;
return g_result.detections[i].box.x0;
}
float sam3_cpp_get_detection_y(int i) {
if (i < 0 || i >= static_cast<int>(g_result.detections.size())) return 0;
return g_result.detections[i].box.y0;
}
float sam3_cpp_get_detection_w(int i) {
if (i < 0 || i >= static_cast<int>(g_result.detections.size())) return 0;
const auto &box = g_result.detections[i].box;
return box.x1 - box.x0;
}
float sam3_cpp_get_detection_h(int i) {
if (i < 0 || i >= static_cast<int>(g_result.detections.size())) return 0;
const auto &box = g_result.detections[i].box;
return box.y1 - box.y0;
}
float sam3_cpp_get_detection_score(int i) {
if (i < 0 || i >= static_cast<int>(g_result.detections.size())) return 0;
return g_result.detections[i].score;
}
int sam3_cpp_get_detection_mask_png(int i, unsigned char *buf, int buf_size) {
if (i < 0 || i >= static_cast<int>(g_mask_pngs.size())) return 0;
const auto &png = g_mask_pngs[i];
int size = static_cast<int>(png.size());
if (buf == nullptr) {
return size;
}
int to_copy = size < buf_size ? size : buf_size;
memcpy(buf, png.data(), to_copy);
return to_copy;
}
void sam3_cpp_free_results(void) {
g_result.detections.clear();
g_mask_pngs.clear();
}
} // extern "C"

View File

@@ -0,0 +1,143 @@
package main
import (
"encoding/base64"
"fmt"
"os"
"path/filepath"
"unsafe"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)
type SAM3 struct {
base.SingleThread
}
var (
CppLoadModel func(modelPath string, threads int) int
CppEncodeImage func(imagePath string) int
CppSegmentPVS func(points uintptr, nPointTriples int, boxes uintptr, nBoxQuads int, threshold float32) int
CppSegmentPCS func(textPrompt string, threshold float32) int
CppGetNDetections func() int
CppGetDetectionX func(i int) float32
CppGetDetectionY func(i int) float32
CppGetDetectionW func(i int) float32
CppGetDetectionH func(i int) float32
CppGetDetectionScore func(i int) float32
CppGetDetectionMaskPNG func(i int, buf uintptr, bufSize int) int
CppFreeResults func()
)
func (s *SAM3) Load(opts *pb.ModelOptions) error {
modelFile := opts.ModelFile
if modelFile == "" {
modelFile = opts.Model
}
var modelPath string
if filepath.IsAbs(modelFile) {
modelPath = modelFile
} else {
modelPath = filepath.Join(opts.ModelPath, modelFile)
}
threads := int(opts.Threads)
if threads <= 0 {
threads = 4
}
ret := CppLoadModel(modelPath, threads)
if ret != 0 {
return fmt.Errorf("failed to load SAM3 model (error %d): %s", ret, modelPath)
}
return nil
}
func (s *SAM3) Detect(opts *pb.DetectOptions) (pb.DetectResponse, error) {
// Decode base64 image and write to temp file
imgData, err := base64.StdEncoding.DecodeString(opts.Src)
if err != nil {
return pb.DetectResponse{}, fmt.Errorf("failed to decode image: %w", err)
}
tmpFile, err := os.CreateTemp("", "sam3-*.png")
if err != nil {
return pb.DetectResponse{}, fmt.Errorf("failed to create temp file: %w", err)
}
defer os.Remove(tmpFile.Name())
if _, err := tmpFile.Write(imgData); err != nil {
tmpFile.Close()
return pb.DetectResponse{}, fmt.Errorf("failed to write temp file: %w", err)
}
tmpFile.Close()
// Encode image
ret := CppEncodeImage(tmpFile.Name())
if ret != 0 {
return pb.DetectResponse{}, fmt.Errorf("failed to encode image (error %d)", ret)
}
threshold := opts.Threshold
if threshold <= 0 {
threshold = 0.5
}
// Determine segmentation mode
var nDetections int
if opts.Prompt != "" {
// Text-prompted segmentation (PCS mode, SAM 3 only)
nDetections = CppSegmentPCS(opts.Prompt, threshold)
} else {
// Point/box-prompted segmentation (PVS mode)
var pointsPtr uintptr
var boxesPtr uintptr
nPointTriples := len(opts.Points) / 3
nBoxQuads := len(opts.Boxes) / 4
if nPointTriples > 0 {
pointsPtr = uintptr(unsafe.Pointer(&opts.Points[0]))
}
if nBoxQuads > 0 {
boxesPtr = uintptr(unsafe.Pointer(&opts.Boxes[0]))
}
nDetections = CppSegmentPVS(pointsPtr, nPointTriples, boxesPtr, nBoxQuads, threshold)
}
if nDetections < 0 {
return pb.DetectResponse{}, fmt.Errorf("segmentation failed")
}
defer CppFreeResults()
// Build response
detections := make([]*pb.Detection, nDetections)
for i := 0; i < nDetections; i++ {
det := &pb.Detection{
X: CppGetDetectionX(i),
Y: CppGetDetectionY(i),
Width: CppGetDetectionW(i),
Height: CppGetDetectionH(i),
Confidence: CppGetDetectionScore(i),
ClassName: "segment",
}
// Get mask PNG
maskSize := CppGetDetectionMaskPNG(i, 0, 0)
if maskSize > 0 {
maskBuf := make([]byte, maskSize)
CppGetDetectionMaskPNG(i, uintptr(unsafe.Pointer(&maskBuf[0])), maskSize)
det.Mask = maskBuf
}
detections[i] = det
}
return pb.DetectResponse{
Detections: detections,
}, nil
}

View File

@@ -0,0 +1,51 @@
#ifndef GOSAM3_H
#define GOSAM3_H
#ifdef __cplusplus
extern "C" {
#endif
// Load model from file. Returns 0 on success, non-zero on failure.
int sam3_cpp_load_model(const char *model_path, int threads);
// Encode an image from file path. Must be called before segmentation.
// Returns 0 on success.
int sam3_cpp_encode_image(const char *image_path);
// Segment with point/box prompts (PVS mode).
// points: flat array of [x, y, label] triples (label: 1=positive, 0=negative)
// boxes: flat array of [x1, y1, x2, y2] quads
// Returns number of detections, or -1 on error.
int sam3_cpp_segment_pvs(float *points, int n_point_triples,
float *boxes, int n_box_quads,
float threshold);
// Segment with text prompt (PCS mode, SAM 3 only).
// Returns number of detections, or -1 on error.
int sam3_cpp_segment_pcs(const char *text_prompt, float threshold);
// Access detection results (valid after a segment call).
int sam3_cpp_get_n_detections(void);
// Get bounding box for detection i (as x, y, width, height).
float sam3_cpp_get_detection_x(int i);
float sam3_cpp_get_detection_y(int i);
float sam3_cpp_get_detection_w(int i);
float sam3_cpp_get_detection_h(int i);
// Get confidence score for detection i.
float sam3_cpp_get_detection_score(int i);
// Get mask as PNG-encoded bytes.
// If buf is NULL, returns the required buffer size.
// Otherwise writes up to buf_size bytes and returns bytes written.
int sam3_cpp_get_detection_mask_png(int i, unsigned char *buf, int buf_size);
// Free current detection results.
void sam3_cpp_free_results(void);
#ifdef __cplusplus
}
#endif
#endif // GOSAM3_H

View File

@@ -0,0 +1,56 @@
package main
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
// Get library name from environment variable, default to fallback
libName := os.Getenv("SAM3_LIBRARY")
if libName == "" {
libName = "./libgosam3-fallback.so"
}
gosamLib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CppLoadModel, "sam3_cpp_load_model"},
{&CppEncodeImage, "sam3_cpp_encode_image"},
{&CppSegmentPVS, "sam3_cpp_segment_pvs"},
{&CppSegmentPCS, "sam3_cpp_segment_pcs"},
{&CppGetNDetections, "sam3_cpp_get_n_detections"},
{&CppGetDetectionX, "sam3_cpp_get_detection_x"},
{&CppGetDetectionY, "sam3_cpp_get_detection_y"},
{&CppGetDetectionW, "sam3_cpp_get_detection_w"},
{&CppGetDetectionH, "sam3_cpp_get_detection_h"},
{&CppGetDetectionScore, "sam3_cpp_get_detection_score"},
{&CppGetDetectionMaskPNG, "sam3_cpp_get_detection_mask_png"},
{&CppFreeResults, "sam3_cpp_free_results"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, gosamLib, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &SAM3{}); err != nil {
panic(err)
}
}

59
backend/go/sam3-cpp/package.sh Executable file
View File

@@ -0,0 +1,59 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/libgosam3-*.so $CURDIR/package/
cp -avf $CURDIR/sam3-cpp $CURDIR/package/
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

52
backend/go/sam3-cpp/run.sh Executable file
View File

@@ -0,0 +1,52 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
if [ "$(uname)" != "Darwin" ]; then
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
fi
LIBRARY="$CURDIR/libgosam3-fallback.so"
if [ "$(uname)" != "Darwin" ]; then
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/libgosam3-avx.so ]; then
LIBRARY="$CURDIR/libgosam3-avx.so"
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/libgosam3-avx2.so ]; then
LIBRARY="$CURDIR/libgosam3-avx2.so"
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/libgosam3-avx512.so ]; then
LIBRARY="$CURDIR/libgosam3-avx512.so"
fi
fi
fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export SAM3_LIBRARY=$LIBRARY
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using library: $LIBRARY"
exec $CURDIR/lib/ld.so $CURDIR/sam3-cpp "$@"
fi
echo "Using library: $LIBRARY"
exec $CURDIR/sam3-cpp "$@"

50
backend/go/sam3-cpp/test.sh Executable file
View File

@@ -0,0 +1,50 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath $0)")
echo "Running sam3-cpp backend tests..."
# The test requires a SAM model in GGML format.
# Uses EdgeTAM Q4_0 (~15MB) for fast CI testing.
SAM3_MODEL_DIR="${SAM3_MODEL_DIR:-$CURDIR/test-models}"
SAM3_MODEL_FILE="${SAM3_MODEL_FILE:-edgetam_q4_0.ggml}"
SAM3_MODEL_URL="${SAM3_MODEL_URL:-https://huggingface.co/PABannier/sam3.cpp/resolve/main/edgetam_q4_0.ggml}"
# Download model if not present
if [ ! -f "$SAM3_MODEL_DIR/$SAM3_MODEL_FILE" ]; then
echo "Downloading EdgeTAM Q4_0 model for testing..."
mkdir -p "$SAM3_MODEL_DIR"
curl -L -o "$SAM3_MODEL_DIR/$SAM3_MODEL_FILE" "$SAM3_MODEL_URL" --progress-bar
echo "Model downloaded."
fi
# Create a test image (4x4 red pixel PNG) using base64
# This is a minimal valid PNG for testing the pipeline
TEST_IMAGE_DIR="$CURDIR/test-data"
mkdir -p "$TEST_IMAGE_DIR"
# Generate a simple test image using Python if available, otherwise use a pre-encoded one
if command -v python3 &> /dev/null; then
python3 -c "
import struct, zlib, base64
def create_png(width, height, r, g, b):
raw = b''
for y in range(height):
raw += b'\x00' # filter byte
for x in range(width):
raw += bytes([r, g, b])
def chunk(ctype, data):
c = ctype + data
return struct.pack('>I', len(data)) + c + struct.pack('>I', zlib.crc32(c) & 0xffffffff)
ihdr = struct.pack('>IIBBBBB', width, height, 8, 2, 0, 0, 0)
return b'\x89PNG\r\n\x1a\n' + chunk(b'IHDR', ihdr) + chunk(b'IDAT', zlib.compress(raw)) + chunk(b'IEND', b'')
with open('$TEST_IMAGE_DIR/test.png', 'wb') as f:
f.write(create_png(64, 64, 255, 0, 0))
"
echo "Test image created."
fi
echo "sam3-cpp test setup complete."
echo "Model: $SAM3_MODEL_DIR/$SAM3_MODEL_FILE"
echo "Note: Full integration tests run via the LocalAI test-extra target."

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# stablediffusion.cpp (ggml)
STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
STABLEDIFFUSION_GGML_VERSION?=d6dd6d7b555c233bb9bc9f20b4751eb8c9269743
STABLEDIFFUSION_GGML_VERSION?=6b675a5ede9b0edf0a0f44191e8b79d7ef27615a
CMAKE_ARGS+=-DGGML_MAX_NAME=128
@@ -32,7 +32,7 @@ else ifeq ($(BUILD_TYPE),hipblas)
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS?=gfx803,gfx900,gfx906,gfx908,gfx90a,gfx942,gfx1010,gfx1030,gfx1032,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
CMAKE_ARGS+=-DSD_HIPBLAS=ON -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DSD_VULKAN=ON -DGGML_VULKAN=ON

View File

@@ -27,107 +27,7 @@
#include <stdlib.h>
#include <regex>
// Names of the sampler method, same order as enum sample_method in stable-diffusion.h
const char* sample_method_str[] = {
"euler",
"euler_a",
"heun",
"dpm2",
"dpm++2s_a",
"dpm++2m",
"dpm++2mv2",
"ipndm",
"ipndm_v",
"lcm",
"ddim_trailing",
"tcd",
"res_multistep",
"res_2s",
};
static_assert(std::size(sample_method_str) == SAMPLE_METHOD_COUNT, "sample method mismatch");
// Names of the sigma schedule overrides, same order as sample_schedule in stable-diffusion.h
const char* schedulers[] = {
"discrete",
"karras",
"exponential",
"ays",
"gits",
"sgm_uniform",
"simple",
"smoothstep",
"kl_optimal",
"lcm",
"bong_tangent",
};
static_assert(std::size(schedulers) == SCHEDULER_COUNT, "schedulers mismatch");
// New enum string arrays
const char* rng_type_str[] = {
"std_default",
"cuda",
"cpu",
};
static_assert(std::size(rng_type_str) == RNG_TYPE_COUNT, "rng type mismatch");
const char* prediction_str[] = {
"epsilon",
"v",
"edm_v",
"flow",
"flux_flow",
"flux2_flow",
};
static_assert(std::size(prediction_str) == PREDICTION_COUNT, "prediction mismatch");
const char* lora_apply_mode_str[] = {
"auto",
"immediately",
"at_runtime",
};
static_assert(std::size(lora_apply_mode_str) == LORA_APPLY_MODE_COUNT, "lora apply mode mismatch");
constexpr const char* sd_type_str[] = {
"f32", // 0
"f16", // 1
"q4_0", // 2
"q4_1", // 3
nullptr, // 4
nullptr, // 5
"q5_0", // 6
"q5_1", // 7
"q8_0", // 8
"q8_1", // 9
"q2_k", // 10
"q3_k", // 11
"q4_k", // 12
"q5_k", // 13
"q6_k", // 14
"q8_k", // 15
"iq2_xxs", // 16
"iq2_xs", // 17
"iq3_xxs", // 18
"iq1_s", // 19
"iq4_nl", // 20
"iq3_s", // 21
"iq2_s", // 22
"iq4_xs", // 23
"i8", // 24
"i16", // 25
"i32", // 26
"i64", // 27
"f64", // 28
"iq1_m", // 29
"bf16", // 30
nullptr, nullptr, nullptr, // 31-33
"tq1_0", // 34
"tq2_0", // 35
nullptr, nullptr, nullptr, // 36-38
"mxfp4" // 39
};
static_assert(std::size(sd_type_str) == SD_TYPE_COUNT, "sd type mismatch");
sd_ctx_params_t ctx_params;
sd_ctx_t* sd_c;
@@ -596,75 +496,45 @@ int load_model(const char *model, char *model_path, char* options[], int threads
if (!strcmp(optname, "flow_shift")) flow_shift = atof(optval);
if (!strcmp(optname, "rng_type")) {
int found = -1;
for (int m = 0; m < RNG_TYPE_COUNT; m++) {
if (!strcmp(optval, rng_type_str[m])) {
found = m;
break;
}
}
if (found != -1) {
rng_type = (rng_type_t)found;
rng_type_t parsed = str_to_rng_type(optval);
if (parsed != RNG_TYPE_COUNT) {
rng_type = parsed;
fprintf(stderr, "Found rng_type: %s\n", optval);
} else {
fprintf(stderr, "Invalid rng_type: %s, using default\n", optval);
}
}
if (!strcmp(optname, "sampler_rng_type")) {
int found = -1;
for (int m = 0; m < RNG_TYPE_COUNT; m++) {
if (!strcmp(optval, rng_type_str[m])) {
found = m;
break;
}
}
if (found != -1) {
sampler_rng_type = (rng_type_t)found;
rng_type_t parsed = str_to_rng_type(optval);
if (parsed != RNG_TYPE_COUNT) {
sampler_rng_type = parsed;
fprintf(stderr, "Found sampler_rng_type: %s\n", optval);
} else {
fprintf(stderr, "Invalid sampler_rng_type: %s, using default\n", optval);
}
}
if (!strcmp(optname, "prediction")) {
int found = -1;
for (int m = 0; m < PREDICTION_COUNT; m++) {
if (!strcmp(optval, prediction_str[m])) {
found = m;
break;
}
}
if (found != -1) {
prediction = (prediction_t)found;
prediction_t parsed = str_to_prediction(optval);
if (parsed != PREDICTION_COUNT) {
prediction = parsed;
fprintf(stderr, "Found prediction: %s\n", optval);
} else {
fprintf(stderr, "Invalid prediction: %s, using default\n", optval);
}
}
if (!strcmp(optname, "lora_apply_mode")) {
int found = -1;
for (int m = 0; m < LORA_APPLY_MODE_COUNT; m++) {
if (!strcmp(optval, lora_apply_mode_str[m])) {
found = m;
break;
}
}
if (found != -1) {
lora_apply_mode = (lora_apply_mode_t)found;
lora_apply_mode_t parsed = str_to_lora_apply_mode(optval);
if (parsed != LORA_APPLY_MODE_COUNT) {
lora_apply_mode = parsed;
fprintf(stderr, "Found lora_apply_mode: %s\n", optval);
} else {
fprintf(stderr, "Invalid lora_apply_mode: %s, using default\n", optval);
}
}
if (!strcmp(optname, "wtype")) {
int found = -1;
for (int m = 0; m < SD_TYPE_COUNT; m++) {
if (sd_type_str[m] && !strcmp(optval, sd_type_str[m])) {
found = m;
break;
}
}
if (found != -1) {
wtype = (sd_type_t)found;
sd_type_t parsed = str_to_sd_type(optval);
if (parsed != SD_TYPE_COUNT) {
wtype = parsed;
fprintf(stderr, "Found wtype: %s\n", optval);
} else {
fprintf(stderr, "Invalid wtype: %s, using default\n", optval);
@@ -735,27 +605,25 @@ int load_model(const char *model, char *model_path, char* options[], int threads
fprintf (stderr, "Created context: OK\n");
int sample_method_found = -1;
for (int m = 0; m < SAMPLE_METHOD_COUNT; m++) {
if (!strcmp(sampler, sample_method_str[m])) {
sample_method_found = m;
fprintf(stderr, "Found sampler: %s\n", sampler);
}
sample_method_t sm = str_to_sample_method(sampler);
if (sm != SAMPLE_METHOD_COUNT) {
sample_method_found = (int)sm;
fprintf(stderr, "Found sampler: %s\n", sampler);
}
if (sample_method_found == -1) {
sample_method_found = sd_get_default_sample_method(sd_ctx);
fprintf(stderr, "Invalid sample method, using default: %s\n", sample_method_str[sample_method_found]);
fprintf(stderr, "Invalid sample method, using default: %s\n", sd_sample_method_name((sample_method_t)sample_method_found));
}
sample_method = (sample_method_t)sample_method_found;
for (int d = 0; d < SCHEDULER_COUNT; d++) {
if (!strcmp(scheduler_str, schedulers[d])) {
scheduler = (scheduler_t)d;
fprintf (stderr, "Found scheduler: %s\n", scheduler_str);
}
scheduler_t sched = str_to_scheduler(scheduler_str);
if (sched != SCHEDULER_COUNT) {
scheduler = sched;
fprintf(stderr, "Found scheduler: %s\n", scheduler_str);
}
if (scheduler == SCHEDULER_COUNT) {
scheduler = sd_get_default_scheduler(sd_ctx, sample_method);
fprintf(stderr, "Invalid scheduler, using default: %s\n", schedulers[scheduler]);
scheduler = sd_get_default_scheduler(sd_ctx, sample_method);
fprintf(stderr, "Invalid scheduler, using default: %s\n", sd_scheduler_name(scheduler));
}
sd_c = sd_ctx;

View File

@@ -138,7 +138,7 @@ func TestAudioTranscription(t *testing.T) {
if err != nil {
t.Fatal(err)
}
defer os.RemoveAll(tmpDir)
t.Cleanup(func() { os.RemoveAll(tmpDir) })
// Download sample audio — JFK "ask not what your country can do for you" clip
audioFile := filepath.Join(tmpDir, "sample.wav")

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# whisper.cpp version
WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
WHISPER_CPP_VERSION?=30c5194c9691e4e9a98b3dea9f19727397d3f46e
WHISPER_CPP_VERSION?=95ea8f9bfb03a15db08a8989966fd1ae3361e20d
SO_TARGET?=libgowhisper.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -29,6 +29,20 @@
nvidia-cuda-12: "cuda12-llama-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp"
- &ikllamacpp
name: "ik-llama-cpp"
alias: "ik-llama-cpp"
license: mit
description: |
Fork of llama.cpp optimized for CPU performance by ikawrakow
urls:
- https://github.com/ikawrakow/ik_llama.cpp
tags:
- text-to-text
- LLM
- CPU
capabilities:
default: "cpu-ik-llama-cpp"
- &whispercpp
name: "whisper"
alias: "whisper"
@@ -125,6 +139,31 @@
nvidia-cuda-13: "cuda13-rfdetr"
nvidia-cuda-12: "cuda12-rfdetr"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-rfdetr"
- &sam3cpp
name: "sam3-cpp"
alias: "sam3-cpp"
license: mit
description: |
Segment Anything Model (SAM 3/2/EdgeTAM) in C/C++ using GGML.
Supports text-prompted and point/box-prompted image segmentation.
urls:
- https://github.com/PABannier/sam3.cpp
tags:
- image-segmentation
- object-detection
- sam3
- gpu
- cpu
capabilities:
default: "cpu-sam3-cpp"
nvidia: "cuda12-sam3-cpp"
nvidia-cuda-12: "cuda12-sam3-cpp"
nvidia-cuda-13: "cuda13-sam3-cpp"
nvidia-l4t: "nvidia-l4t-arm64-sam3-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-sam3-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sam3-cpp"
intel: "intel-sycl-f32-sam3-cpp"
vulkan: "vulkan-sam3-cpp"
- &vllm
name: "vllm"
license: apache-2.0
@@ -158,6 +197,7 @@
amd: "rocm-vllm"
intel: "intel-vllm"
nvidia-cuda-12: "cuda12-vllm"
cpu: "cpu-vllm"
- &vllm-omni
name: "vllm-omni"
license: apache-2.0
@@ -387,6 +427,30 @@
nvidia-l4t: "nvidia-l4t-arm64-acestep-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-acestep-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-acestep-cpp"
- &qwen3ttscpp
name: "qwen3-tts-cpp"
description: |
Qwen3-TTS C++ backend using GGML. Native C++ text-to-speech with voice cloning support.
Generates 24kHz mono audio from text with optional reference audio for voice cloning via ECAPA-TDNN speaker embeddings.
urls:
- https://github.com/predict-woo/qwen3-tts.cpp
tags:
- text-to-speech
- tts
- voice-cloning
alias: "qwen3-tts-cpp"
capabilities:
default: "cpu-qwen3-tts-cpp"
nvidia: "cuda12-qwen3-tts-cpp"
nvidia-cuda-13: "cuda13-qwen3-tts-cpp"
nvidia-cuda-12: "cuda12-qwen3-tts-cpp"
intel: "intel-sycl-f16-qwen3-tts-cpp"
metal: "metal-qwen3-tts-cpp"
amd: "rocm-qwen3-tts-cpp"
vulkan: "vulkan-qwen3-tts-cpp"
nvidia-l4t: "nvidia-l4t-arm64-qwen3-tts-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-qwen3-tts-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-qwen3-tts-cpp"
- &faster-whisper
icon: https://avatars.githubusercontent.com/u/1520500?s=200&v=4
description: |
@@ -400,12 +464,15 @@
license: MIT
name: "faster-whisper"
capabilities:
default: "cpu-faster-whisper"
nvidia: "cuda12-faster-whisper"
intel: "intel-faster-whisper"
amd: "rocm-faster-whisper"
metal: "metal-faster-whisper"
nvidia-cuda-13: "cuda13-faster-whisper"
nvidia-cuda-12: "cuda12-faster-whisper"
nvidia-l4t: "nvidia-l4t-arm64-faster-whisper"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-faster-whisper"
- &moonshine
description: |
Moonshine is a fast, accurate, and efficient speech-to-text transcription model using ONNX Runtime.
@@ -438,6 +505,7 @@
- whisperx
license: BSD-4-Clause
name: "whisperx"
alias: "whisperx"
capabilities:
nvidia: "cuda12-whisperx"
amd: "rocm-whisperx"
@@ -445,6 +513,8 @@
default: "cpu-whisperx"
nvidia-cuda-13: "cuda13-whisperx"
nvidia-cuda-12: "cuda12-whisperx"
nvidia-l4t: "nvidia-l4t-arm64-whisperx"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-whisperx"
- &kokoro
icon: https://avatars.githubusercontent.com/u/166769057?v=4
description: |
@@ -468,6 +538,26 @@
nvidia-cuda-13: "cuda13-kokoro"
nvidia-cuda-12: "cuda12-kokoro"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-kokoro"
- &kokoros
icon: https://avatars.githubusercontent.com/u/166769057?v=4
description: |
Kokoros is a pure Rust TTS backend using the Kokoro ONNX model (82M parameters).
It provides fast, high-quality text-to-speech with streaming support, built on
ONNX Runtime for efficient CPU inference. Supports English, Japanese, Mandarin
Chinese, and German.
urls:
- https://huggingface.co/hexgrad/Kokoro-82M
- https://github.com/lucasjinreal/Kokoros
tags:
- text-to-speech
- TTS
- Rust
- ONNX
license: apache-2.0
alias: "kokoros"
name: "kokoros"
capabilities:
default: "cpu-kokoros"
- &coqui
urls:
- https://github.com/idiap/coqui-ai-TTS
@@ -726,6 +816,7 @@
- TTS
- &opus
name: "opus"
alias: "opus"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-opus"
urls:
- https://opus-codec.org/
@@ -821,6 +912,10 @@
nvidia-cuda-12: "cuda12-llama-cpp-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-development"
- !!merge <<: *ikllamacpp
name: "ik-llama-cpp-development"
capabilities:
default: "cpu-ik-llama-cpp-development"
- !!merge <<: *neutts
name: "cpu-neutts"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-neutts"
@@ -1251,6 +1346,17 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-llama-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-llama-cpp
## ik-llama-cpp
- !!merge <<: *ikllamacpp
name: "cpu-ik-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ik-llama-cpp"
mirrors:
- localai/localai-backends:latest-cpu-ik-llama-cpp
- !!merge <<: *ikllamacpp
name: "cpu-ik-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ik-llama-cpp"
mirrors:
- localai/localai-backends:master-cpu-ik-llama-cpp
## whisper
- !!merge <<: *whispercpp
name: "nvidia-l4t-arm64-whisper"
@@ -1458,6 +1564,7 @@
nvidia: "cuda12-vllm-development"
amd: "rocm-vllm-development"
intel: "intel-vllm-development"
cpu: "cpu-vllm-development"
- !!merge <<: *vllm
name: "cuda12-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm"
@@ -1473,6 +1580,11 @@
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-vllm"
mirrors:
- localai/localai-backends:latest-gpu-intel-vllm
- !!merge <<: *vllm
name: "cpu-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-vllm"
mirrors:
- localai/localai-backends:latest-cpu-vllm
- !!merge <<: *vllm
name: "cuda12-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vllm"
@@ -1488,6 +1600,11 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-vllm"
mirrors:
- localai/localai-backends:master-gpu-intel-vllm
- !!merge <<: *vllm
name: "cpu-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-vllm"
mirrors:
- localai/localai-backends:master-cpu-vllm
# vllm-omni
- !!merge <<: *vllm-omni
name: "vllm-omni-development"
@@ -1601,6 +1718,89 @@
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-rfdetr"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-rfdetr
## sam3-cpp
- !!merge <<: *sam3cpp
name: "sam3-cpp-development"
capabilities:
default: "cpu-sam3-cpp-development"
nvidia: "cuda12-sam3-cpp-development"
nvidia-cuda-12: "cuda12-sam3-cpp-development"
nvidia-cuda-13: "cuda13-sam3-cpp-development"
nvidia-l4t: "nvidia-l4t-arm64-sam3-cpp-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-sam3-cpp-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sam3-cpp-development"
intel: "intel-sycl-f32-sam3-cpp-development"
vulkan: "vulkan-sam3-cpp-development"
- !!merge <<: *sam3cpp
name: "cpu-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sam3-cpp"
mirrors:
- localai/localai-backends:latest-cpu-sam3-cpp
- !!merge <<: *sam3cpp
name: "cpu-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-sam3-cpp"
mirrors:
- localai/localai-backends:master-cpu-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda12-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sam3-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda12-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sam3-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda13-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-sam3-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda13-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-sam3-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-sam3-cpp
- !!merge <<: *sam3cpp
name: "nvidia-l4t-arm64-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-sam3-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-sam3-cpp
- !!merge <<: *sam3cpp
name: "nvidia-l4t-arm64-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-sam3-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda13-nvidia-l4t-arm64-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-sam3-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda13-nvidia-l4t-arm64-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-sam3-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-sam3-cpp
- !!merge <<: *sam3cpp
name: "intel-sycl-f32-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-sam3-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-sam3-cpp
- !!merge <<: *sam3cpp
name: "intel-sycl-f32-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-sam3-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-sam3-cpp
- !!merge <<: *sam3cpp
name: "vulkan-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-sam3-cpp"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-sam3-cpp
- !!merge <<: *sam3cpp
name: "vulkan-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-sam3-cpp"
mirrors:
- localai/localai-backends:master-gpu-vulkan-sam3-cpp
## Rerankers
- !!merge <<: *rerankers
name: "rerankers-development"
@@ -1972,6 +2172,107 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-acestep-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-acestep-cpp
## qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "nvidia-l4t-arm64-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "nvidia-l4t-arm64-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda13-nvidia-l4t-arm64-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda13-nvidia-l4t-arm64-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cpu-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-cpu-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "metal-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "metal-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cpu-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-cpu-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda12-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "rocm-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "intel-sycl-f32-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "intel-sycl-f16-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "vulkan-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "vulkan-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-vulkan-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda12-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "rocm-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "intel-sycl-f32-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "intel-sycl-f16-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda13-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda13-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp
## kokoro
- !!merge <<: *kokoro
name: "kokoro-development"
@@ -2041,15 +2342,32 @@
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-kokoro"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-kokoro
## kokoros (Rust)
- !!merge <<: *kokoros
name: "kokoros-development"
capabilities:
default: "cpu-kokoros-development"
- !!merge <<: *kokoros
name: "cpu-kokoros"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-kokoros"
mirrors:
- localai/localai-backends:latest-cpu-kokoros
- !!merge <<: *kokoros
name: "cpu-kokoros-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-kokoros"
mirrors:
- localai/localai-backends:master-cpu-kokoros
## faster-whisper
- !!merge <<: *faster-whisper
name: "faster-whisper-development"
capabilities:
default: "cpu-faster-whisper-development"
nvidia: "cuda12-faster-whisper-development"
intel: "intel-faster-whisper-development"
amd: "rocm-faster-whisper-development"
metal: "metal-faster-whisper-development"
nvidia-cuda-13: "cuda13-faster-whisper-development"
nvidia-l4t: "nvidia-l4t-arm64-faster-whisper-development"
- !!merge <<: *faster-whisper
name: "cuda12-faster-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-faster-whisper"
@@ -2090,6 +2408,36 @@
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-faster-whisper"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-faster-whisper
- !!merge <<: *faster-whisper
name: "cuda12-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-faster-whisper"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-faster-whisper
- !!merge <<: *faster-whisper
name: "rocm-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-faster-whisper"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-faster-whisper
- !!merge <<: *faster-whisper
name: "cpu-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-faster-whisper"
mirrors:
- localai/localai-backends:latest-cpu-faster-whisper
- !!merge <<: *faster-whisper
name: "cpu-faster-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-faster-whisper"
mirrors:
- localai/localai-backends:master-cpu-faster-whisper
- !!merge <<: *faster-whisper
name: "nvidia-l4t-arm64-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-faster-whisper"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-faster-whisper
- !!merge <<: *faster-whisper
name: "nvidia-l4t-arm64-faster-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-faster-whisper"
mirrors:
- localai/localai-backends:master-nvidia-l4t-faster-whisper
## moonshine
- !!merge <<: *moonshine
name: "moonshine-development"
@@ -2148,6 +2496,7 @@
default: "cpu-whisperx-development"
nvidia-cuda-13: "cuda13-whisperx-development"
nvidia-cuda-12: "cuda12-whisperx-development"
nvidia-l4t: "nvidia-l4t-arm64-whisperx-development"
- !!merge <<: *whisperx
name: "cpu-whisperx"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-whisperx"
@@ -2198,6 +2547,16 @@
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-whisperx"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-whisperx
- !!merge <<: *whisperx
name: "nvidia-l4t-arm64-whisperx"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-whisperx"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-whisperx
- !!merge <<: *whisperx
name: "nvidia-l4t-arm64-whisperx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-whisperx"
mirrors:
- localai/localai-backends:master-nvidia-l4t-whisperx
## coqui
- !!merge <<: *coqui
@@ -3029,3 +3388,82 @@
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-voxtral"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-voxtral
- &trl
name: "trl"
alias: "trl"
license: apache-2.0
description: |
HuggingFace TRL fine-tuning backend. Supports SFT, DPO, GRPO, RLOO, Reward, KTO, ORPO training methods.
Works on CPU and GPU.
urls:
- https://github.com/huggingface/trl
tags:
- fine-tuning
- LLM
- CPU
- GPU
- CUDA
capabilities:
default: "cpu-trl"
nvidia: "cuda12-trl"
nvidia-cuda-12: "cuda12-trl"
nvidia-cuda-13: "cuda13-trl"
## TRL backend images
- !!merge <<: *trl
name: "cpu-trl"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-trl"
mirrors:
- localai/localai-backends:latest-cpu-trl
- !!merge <<: *trl
name: "cpu-trl-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-trl"
mirrors:
- localai/localai-backends:master-cpu-trl
- !!merge <<: *trl
name: "cuda12-trl"
uri: "quay.io/go-skynet/local-ai-backends:latest-cublas-cuda12-trl"
mirrors:
- localai/localai-backends:latest-cublas-cuda12-trl
- !!merge <<: *trl
name: "cuda12-trl-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cublas-cuda12-trl"
mirrors:
- localai/localai-backends:master-cublas-cuda12-trl
- !!merge <<: *trl
name: "cuda13-trl"
uri: "quay.io/go-skynet/local-ai-backends:latest-cublas-cuda13-trl"
mirrors:
- localai/localai-backends:latest-cublas-cuda13-trl
- !!merge <<: *trl
name: "cuda13-trl-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cublas-cuda13-trl"
mirrors:
- localai/localai-backends:master-cublas-cuda13-trl
## llama.cpp quantization backend
- &llama-cpp-quantization
name: "llama-cpp-quantization"
alias: "llama-cpp-quantization"
license: mit
icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
description: |
Model quantization backend using llama.cpp. Downloads HuggingFace models, converts them to GGUF format,
and quantizes them to various formats (q4_k_m, q5_k_m, q8_0, f16, etc.).
urls:
- https://github.com/ggml-org/llama.cpp
tags:
- quantization
- GGUF
- CPU
capabilities:
default: "cpu-llama-cpp-quantization"
metal: "metal-darwin-arm64-llama-cpp-quantization"
- !!merge <<: *llama-cpp-quantization
name: "cpu-llama-cpp-quantization"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-quantization"
mirrors:
- localai/localai-backends:latest-cpu-llama-cpp-quantization
- !!merge <<: *llama-cpp-quantization
name: "metal-darwin-arm64-llama-cpp-quantization"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-llama-cpp-quantization"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-llama-cpp-quantization

View File

@@ -19,6 +19,10 @@ import tempfile
import backend_pb2
import backend_pb2_grpc
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
from acestep.inference import (
GenerationParams,
GenerationConfig,
@@ -444,6 +448,8 @@ def serve(address):
("grpc.max_send_message_length", 50 * 1024 * 1024),
("grpc.max_receive_message_length", 50 * 1024 * 1024),
],
interceptors=get_auth_interceptors(),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)

View File

@@ -1,5 +1,5 @@
--extra-index-url https://download.pytorch.org/whl/rocm6.4
torch==2.8.0+rocm6.4
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch==2.10.0+rocm7.0
torchaudio
torchvision

View File

@@ -16,6 +16,10 @@ import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
import tempfile
def is_float(s):
@@ -225,7 +229,9 @@ def serve(address):
('grpc.max_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_send_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_receive_message_length', 50 * 1024 * 1024), # 50MB
])
],
interceptors=get_auth_interceptors(),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()

View File

@@ -1,6 +1,6 @@
--extra-index-url https://download.pytorch.org/whl/rocm6.4
torch==2.9.1+rocm6.4
torchaudio==2.9.1+rocm6.4
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch==2.10.0+rocm7.0
torchaudio==2.10.0+rocm7.0
transformers
numpy>=1.24.0,<1.26.0
# https://github.com/mudler/LocalAI/pull/6240#issuecomment-3329518289

View File

@@ -0,0 +1,78 @@
"""Shared gRPC bearer token authentication interceptor for LocalAI Python backends.
When the environment variable LOCALAI_GRPC_AUTH_TOKEN is set, requests without
a valid Bearer token in the 'authorization' metadata header are rejected with
UNAUTHENTICATED. When the variable is empty or unset, no authentication is
performed (backward compatible).
"""
import hmac
import os
import grpc
class _AbortHandler(grpc.RpcMethodHandler):
"""A method handler that immediately aborts with UNAUTHENTICATED."""
def __init__(self):
self.request_streaming = False
self.response_streaming = False
self.request_deserializer = None
self.response_serializer = None
self.unary_unary = self._abort
self.unary_stream = None
self.stream_unary = None
self.stream_stream = None
@staticmethod
def _abort(request, context):
context.abort(grpc.StatusCode.UNAUTHENTICATED, "invalid token")
class TokenAuthInterceptor(grpc.ServerInterceptor):
"""Sync gRPC server interceptor that validates a bearer token."""
def __init__(self, token: str):
self._token = token
self._abort_handler = _AbortHandler()
def intercept_service(self, continuation, handler_call_details):
metadata = dict(handler_call_details.invocation_metadata)
auth = metadata.get("authorization", "")
expected = "Bearer " + self._token
if not hmac.compare_digest(auth, expected):
return self._abort_handler
return continuation(handler_call_details)
class AsyncTokenAuthInterceptor(grpc.aio.ServerInterceptor):
"""Async gRPC server interceptor that validates a bearer token."""
def __init__(self, token: str):
self._token = token
async def intercept_service(self, continuation, handler_call_details):
metadata = dict(handler_call_details.invocation_metadata)
auth = metadata.get("authorization", "")
expected = "Bearer " + self._token
if not hmac.compare_digest(auth, expected):
return _AbortHandler()
return await continuation(handler_call_details)
def get_auth_interceptors(*, aio: bool = False):
"""Return a list of gRPC interceptors for bearer token auth.
Args:
aio: If True, return async-compatible interceptors for grpc.aio.server().
If False (default), return sync interceptors for grpc.server().
Returns an empty list when LOCALAI_GRPC_AUTH_TOKEN is not set.
"""
token = os.environ.get("LOCALAI_GRPC_AUTH_TOKEN", "")
if not token:
return []
if aio:
return [AsyncTokenAuthInterceptor(token)]
return [TokenAuthInterceptor(token)]

View File

@@ -1,2 +1,2 @@
--extra-index-url https://download.pytorch.org/whl/rocm6.4
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch

View File

@@ -1,3 +1,3 @@
grpcio==1.78.1
grpcio==1.80.0
protobuf
grpcio-tools

View File

@@ -0,0 +1,84 @@
"""Shared utilities for vLLM-based backends."""
import json
import sys
def parse_options(options_list):
"""Parse Options[] list of 'key:value' strings into a dict.
Supports type inference for common cases (bool, int, float).
Used by LoadModel to extract backend-specific options.
"""
opts = {}
for opt in options_list:
if ":" not in opt:
continue
key, value = opt.split(":", 1)
key = key.strip()
value = value.strip()
# Try type conversion
if value.lower() in ("true", "false"):
opts[key] = value.lower() == "true"
else:
try:
opts[key] = int(value)
except ValueError:
try:
opts[key] = float(value)
except ValueError:
opts[key] = value
return opts
def messages_to_dicts(proto_messages):
"""Convert proto Message objects to list of dicts for apply_chat_template().
Handles: role, content, name, tool_call_id, reasoning_content, tool_calls (JSON string -> list).
"""
result = []
for msg in proto_messages:
d = {"role": msg.role, "content": msg.content or ""}
if msg.name:
d["name"] = msg.name
if msg.tool_call_id:
d["tool_call_id"] = msg.tool_call_id
if msg.reasoning_content:
d["reasoning_content"] = msg.reasoning_content
if msg.tool_calls:
try:
d["tool_calls"] = json.loads(msg.tool_calls)
except json.JSONDecodeError:
pass
result.append(d)
return result
def setup_parsers(opts):
"""Return (tool_parser_cls, reasoning_parser_cls) tuple from opts dict.
Uses vLLM's native ToolParserManager and ReasoningParserManager.
Returns (None, None) if vLLM is not installed or parsers not available.
"""
tool_parser_cls = None
reasoning_parser_cls = None
tool_parser_name = opts.get("tool_parser")
reasoning_parser_name = opts.get("reasoning_parser")
if tool_parser_name:
try:
from vllm.tool_parsers import ToolParserManager
tool_parser_cls = ToolParserManager.get_tool_parser(tool_parser_name)
print(f"[vllm_utils] Loaded tool_parser: {tool_parser_name}", file=sys.stderr)
except Exception as e:
print(f"[vllm_utils] Failed to load tool_parser {tool_parser_name}: {e}", file=sys.stderr)
if reasoning_parser_name:
try:
from vllm.reasoning import ReasoningParserManager
reasoning_parser_cls = ReasoningParserManager.get_reasoning_parser(reasoning_parser_name)
print(f"[vllm_utils] Loaded reasoning_parser: {reasoning_parser_name}", file=sys.stderr)
except Exception as e:
print(f"[vllm_utils] Failed to load reasoning_parser {reasoning_parser_name}: {e}", file=sys.stderr)
return tool_parser_cls, reasoning_parser_cls

View File

@@ -15,6 +15,10 @@ import torch
from TTS.api import TTS
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
@@ -93,7 +97,9 @@ def serve(address):
('grpc.max_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_send_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_receive_message_length', 50 * 1024 * 1024), # 50MB
])
],
interceptors=get_auth_interceptors(),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()

View File

@@ -1,4 +1,6 @@
--extra-index-url https://download.pytorch.org/whl/cpu
transformers==4.48.3
accelerate
torch==2.4.1
torchaudio==2.4.1
coqui-tts

View File

@@ -1,6 +1,6 @@
--extra-index-url https://download.pytorch.org/whl/rocm6.4
torch==2.8.0+rocm6.4
torchaudio==2.8.0+rocm6.4
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch==2.10.0+rocm7.0
torchaudio==2.10.0+rocm7.0
transformers==4.48.3
accelerate
coqui-tts

View File

@@ -1,4 +1,4 @@
grpcio==1.78.1
grpcio==1.80.0
protobuf
certifi
packaging==24.1

View File

@@ -22,6 +22,10 @@ import backend_pb2
import backend_pb2_grpc
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
# Import dynamic loader for pipeline discovery
from diffusers_dynamic_loader import (
@@ -1042,7 +1046,9 @@ def serve(address):
('grpc.max_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_send_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_receive_message_length', 50 * 1024 * 1024), # 50MB
])
],
interceptors=get_auth_interceptors(),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()

View File

@@ -1,6 +1,6 @@
--extra-index-url https://download.pytorch.org/whl/rocm6.4
torch==2.8.0+rocm6.4
torchvision==0.23.0+rocm6.4
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch==2.10.0+rocm7.0
torchvision==0.25.0+rocm7.0
git+https://github.com/huggingface/diffusers
opencv-python
transformers

View File

@@ -15,6 +15,10 @@ import torch
import soundfile as sf
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
def is_float(s):
@@ -165,6 +169,8 @@ def serve(address):
('grpc.max_send_message_length', 50 * 1024 * 1024),
('grpc.max_receive_message_length', 50 * 1024 * 1024),
]
,
interceptors=get_auth_interceptors(),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)

View File

@@ -14,6 +14,10 @@ import torch
from faster_whisper import WhisperModel
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
@@ -70,7 +74,9 @@ def serve(address):
('grpc.max_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_send_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_receive_message_length', 50 * 1024 * 1024), # 50MB
])
],
interceptors=get_auth_interceptors(),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()

View File

@@ -16,4 +16,14 @@ if [ "x${BUILD_PROFILE}" == "xintel" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
fi
if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
PYTHON_VERSION="3.12"
PYTHON_PATCH="12"
PY_STANDALONE_TAG="20251120"
fi
if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
USE_PIP=true
fi
installRequirements

View File

@@ -1,3 +1,3 @@
--extra-index-url https://download.pytorch.org/whl/rocm6.4
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch
faster-whisper

View File

@@ -0,0 +1,3 @@
--extra-index-url https://pypi.jetson-ai-lab.io/jp6/cu129/
torch
faster-whisper

View File

@@ -0,0 +1,3 @@
--extra-index-url https://download.pytorch.org/whl/cu130
torch
faster-whisper

View File

@@ -19,6 +19,10 @@ import numpy as np
import json
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
def is_float(s):
@@ -424,6 +428,8 @@ def serve(address):
("grpc.max_send_message_length", 50 * 1024 * 1024), # 50MB
("grpc.max_receive_message_length", 50 * 1024 * 1024), # 50MB
],
interceptors=get_auth_interceptors(),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)

View File

@@ -1,3 +1,3 @@
--extra-index-url https://download.pytorch.org/whl/rocm6.3
torch==2.7.1+rocm6.3
torchaudio==2.7.1+rocm6.3
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch==2.10.0+rocm7.0
torchaudio==2.10.0+rocm7.0

View File

@@ -16,6 +16,10 @@ from kittentts import KittenTTS
import soundfile as sf
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
@@ -77,7 +81,9 @@ def serve(address):
('grpc.max_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_send_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_receive_message_length', 50 * 1024 * 1024), # 50MB
])
],
interceptors=get_auth_interceptors(),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()

View File

@@ -16,6 +16,10 @@ from kokoro import KPipeline
import soundfile as sf
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
@@ -84,7 +88,9 @@ def serve(address):
('grpc.max_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_send_message_length', 50 * 1024 * 1024), # 50MB
('grpc.max_receive_message_length', 50 * 1024 * 1024), # 50MB
])
],
interceptors=get_auth_interceptors(),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()

View File

@@ -21,3 +21,8 @@ if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
fi
installRequirements
# spaCy is a dependency of misaki (used by kokoro for English phonemization).
# Pre-download the model here because at runtime the portable Python environment
# has no pip/uv, so spacy's auto-download would fail.
python -m spacy download en_core_web_sm

Some files were not shown because too many files have changed in this diff Show More