mirror of
https://github.com/mudler/LocalAI.git
synced 2026-04-30 03:55:58 -04:00
703b4fcae85dd6e19c6cf4707af270eecc2ef1d4
82 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
13734ae9fa |
feat: Add Sherpa ONNX backend for ASR and TTS (#8523)
feat(backend): Add Sherpa ONNX backend and Omnilingual ASR
Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same
approach as opus/stablediffusion-ggml/whisper, where a thin C shim
(csrc/shim.c + shim.h → libsherpa-shim.so) wraps the bits purego can't
reach directly: nested struct config writes, result-struct field reads,
and the streaming TTS callback trampoline. The Go side uses opaque
uintptr handles and purego.NewCallback for the TTS callback.
Supports:
- VAD via sherpa-onnx's Silero VAD
- Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC
- Online/streaming ASR: zipformer transducer with endpoint detection
(AudioTranscriptionStream emits delta events during decode)
- Offline TTS: VITS (LJS, etc.)
- Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel,
prefixed by a streaming WAV header
Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline
ASR), streaming-zipformer-en-sherpa (low-latency streaming ASR),
silero-vad-sherpa, vits-ljs-sherpa.
E2E coverage: tests/e2e-backends for offline + streaming ASR, tests/e2e
for the full realtime pipeline (VAD + STT + TTS).
Assisted-by: claude-opus-4-7-1M [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
f5eb13d3c2 |
feat(insightface): add antispoofing (liveness) detection (#9515)
* feat(insightface): add antispoofing (liveness) detection
Light up the anti_spoofing flag that was parked during the first pass.
Both FaceVerify and FaceAnalyze now run the Silent-Face MiniFASNetV2 +
MiniFASNetV1SE ensemble (~4 MB, Apache 2.0, CPU <10ms) when the flag is
set. Failed liveness on either image vetoes FaceVerify regardless of
embedding similarity. Every insightface* gallery entry now ships the
MiniFASNet ONNX weights so existing packs light up after reinstall.
Setting the flag against a model without the MiniFASNet files returns
FAILED_PRECONDITION (HTTP 412) with a clear install message — no
silent is_real=false.
FaceVerifyResponse gained per-image img{1,2}_is_real and
img{1,2}_antispoof_score (proto 9-12); FaceAnalysis's existing
is_real/antispoof_score fields are now populated. Schema fields are
pointers so they are fully absent from the JSON response when
anti_spoofing was not requested — avoids collapsing "not checked" with
"checked and fake" under Go's omitempty on bool.
Validated end-to-end over HTTP against a local install:
- verify + anti_spoofing, both real -> verified=true, score ~0.76
- verify + anti_spoofing, img2 spoof -> verified=false, img2_is_real=false
- analyze + anti_spoofing -> is_real and score per face
- flag against model without MiniFASNet -> HTTP 412 fail-loud
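The veto and the "absent when not requested" behaviour described above can be sketched as follows. This is a minimal illustration, not the backend's code: `verify`, `to_json_dict`, the `distance <= threshold` rule, and the threshold value are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerifyResult:
    verified: bool
    distance: float
    # None means "liveness was not checked", which is distinct from False.
    img1_is_real: Optional[bool] = None
    img2_is_real: Optional[bool] = None


def verify(distance: float, threshold: float,
           img1_is_real: Optional[bool] = None,
           img2_is_real: Optional[bool] = None) -> VerifyResult:
    """Hypothetical decision rule: embedding match AND no failed liveness check."""
    match = distance <= threshold
    checked = [r for r in (img1_is_real, img2_is_real) if r is not None]
    # all([]) is True, so the veto only applies when anti-spoofing actually ran.
    return VerifyResult(match and all(checked), distance, img1_is_real, img2_is_real)


def to_json_dict(result: VerifyResult) -> dict:
    """Drop unset fields entirely instead of serializing them as false."""
    return {k: v for k, v in vars(result).items() if v is not None}
```

With this shape, a spoofed second image yields `verified=False` even when the embeddings match, and a request without anti-spoofing serializes with no `img*_is_real` keys at all.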
Assisted-by: Claude:claude-opus-4-7 go vet
* test(insightface): wire test target into test-extra
The root Makefile's `test-extra` already runs
`$(MAKE) -C backend/python/insightface test`, but the backend's
Makefile never defined the target — so the command silently errored
and the suite was never executed in CI. Adding the two-line target
(matching ace-step/Makefile) hooks `test.sh` → `runUnittests` →
`python -m unittest test.py`, which discovers both the pre-existing
engine classes (InsightFaceEngineTest, OnnxDirectEngineTest) and the
new AntispoofingTest. Each class skips gracefully when its weights
can't be downloaded from a network-restricted runner.
Assisted-by: Claude:claude-opus-4-7
* test(insightface): exercise antispoofing in e2e-backends (both paths)
Add a `face_antispoof` capability to the Ginkgo e2e suite and extend
the existing FaceVerify + FaceAnalyze specs with liveness assertions
covering BOTH paths:
real fixture -> is_real=true, score>0, verified stays true
spoof fixture -> is_real=false, verified vetoed to false
The spoof fixture is upstream's own `image_F2.jpg` (via the yakhyo
mirror) — verified locally against the MiniFASNetV2+V1SE ensemble to
classify as is_real=false with score ~0.013. That makes the assertion
deterministic across CI runs; synthetic/derived spoofs fool the model
unpredictably and would be flaky.
Makefile wires it up end-to-end:
- New INSIGHTFACE_ANTISPOOF_* cache dir + two ONNX downloads with
pinned SHAs, matching the gallery entries.
- insightface-antispoof-models target shared by both backend configs.
- FACE_SPOOF_IMAGE_URL passed via BACKEND_TEST_FACE_SPOOF_IMAGE_URL.
- Both e2e targets (buffalo-sc + opencv) now:
* depend on insightface-antispoof-models
* pass antispoof_v2_onnx / antispoof_v1se_onnx in BACKEND_TEST_OPTIONS
* include face_antispoof in BACKEND_TEST_CAPS
backend_test.go adds the new capability constant and a faceSpoofFile
fixture resolved the same way as faceFile1/2/3. Spoof assertions are
gated on both capFaceAntispoof AND faceSpoofFile being set, so a test
config that omits the spoof fixture degrades gracefully to "real path
only" instead of failing.
Assisted-by: Claude:claude-opus-4-7 go vet
|
||
|
|
181ebb6df4 |
feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend
Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.
The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by
|
||
|
|
20baec77ab |
feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis
Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.
New gRPC RPCs (backend/backend.proto):
* FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
* FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse
Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.
New HTTP endpoints under /v1/face/:
* verify — 1:1 image pair same-person decision
* analyze — per-face age + gender (emotion/race reserved)
* register — 1:N enrollment; stores embedding in vector store
* identify — 1:N recognition; detect → embed → StoresFind
* forget — remove a registered face by opaque ID
Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).
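As a rough picture of what a registry behind such an interface does, here is a minimal in-memory sketch in Python. The real service is Go and delegates storage to the local-store backend; the class name, the cosine-similarity scoring, and the 0.5 threshold below are assumptions for illustration only.

```python
import math
import uuid
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class InMemoryRegistry:
    """Hypothetical stand-in for a register/identify/forget registry."""

    def __init__(self) -> None:
        self._faces: Dict[str, Tuple[str, List[float]]] = {}

    def register(self, name: str, embedding: List[float]) -> str:
        face_id = uuid.uuid4().hex  # opaque ID handed back to the caller
        self._faces[face_id] = (name, list(embedding))
        return face_id

    def identify(self, embedding: List[float],
                 threshold: float = 0.5) -> Optional[Tuple[str, str, float]]:
        best: Optional[Tuple[str, str, float]] = None
        for face_id, (name, stored) in self._faces.items():
            score = cosine_similarity(embedding, stored)
            if best is None or score > best[2]:
                best = (face_id, name, score)
        # Only report a match when the best candidate clears the threshold.
        return best if best is not None and best[2] >= threshold else None

    def forget(self, face_id: str) -> bool:
        return self._faces.pop(face_id, None) is not None
```

Because callers only see `register`, `identify`, and `forget`, a persistent implementation could replace this one without touching the handlers.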
New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.
Gallery (backend/index.yaml) ships three entries:
* insightface-buffalo-l — SCRFD-10GF + ArcFace R50 + genderage
(~326MB pre-baked; non-commercial research use only)
* insightface-opencv — YuNet + SFace (~40MB pre-baked; Apache 2.0)
* insightface-buffalo-s — SCRFD-500MF + MBF (runtime download; non-commercial)
Python backend (backend/python/insightface/):
* engines.py — FaceEngine protocol with InsightFaceEngine and
OnnxDirectEngine; resolves model paths relative to the backend
directory so the same gallery config works in docker-scratch and
in the e2e-backends rootfs-extraction harness.
* backend.py — gRPC servicer implementing Health, LoadModel, Status,
Embedding, Detect, FaceVerify, FaceAnalyze.
* install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
backend directory so first-run is offline-clean (the final scratch
image only preserves files under /<backend>/).
* test.py — parametrized unit tests over both engines.
Tests:
* Registry unit tests (go test -race ./core/services/facerecognition/...)
— in-memory fake grpc.Backend, table-driven, covers register/
identify/forget/error paths + concurrent access.
* tests/e2e-backends/backend_test.go extended with face caps
(face_detect, face_embed, face_verify, face_analyze); relative
ordering + configurable verifyCeiling per engine.
* Makefile targets: test-extra-backend-insightface-buffalo-l,
-opencv, and the -all aggregate.
* CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
auto-triggered by changes under backend/python/insightface/.
Docs:
* docs/content/features/face-recognition.md — feature page with
license table, quickstart (defaults to the commercial-safe model),
models matrix, API reference, 1:N workflow, storage caveats.
* Cross-refs in object-detection.md, stores.md, embeddings.md, and
whats-new.md.
* Contributor README at backend/python/insightface/README.md.
Verified end-to-end:
* buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
face_verify, face_analyze).
* opencv: 5/5 specs (same minus face_analyze — SFace has no
demographic head; correctly skipped via BACKEND_TEST_CAPS).
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): move engine selection to model gallery, collapse backend entries
The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.
Correct split:
* `backend/index.yaml` now has ONE `insightface` backend entry
shipping the CPU + CUDA 12 container images. The Python backend
bundles both the non-commercial insightface model packs
(buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
weights (YuNet + SFace); the active engine is selected at
LoadModel time via `options: ["engine:..."]`.
* `gallery/index.yaml` gains three model entries —
`insightface-buffalo-l`, `insightface-opencv`,
`insightface-buffalo-s` — each setting the appropriate
`overrides.backend` + `overrides.options` so installing one
actually gives the user the intended engine. This matches how
`rfdetr-base` lives in the model gallery against the `rfdetr`
backend.
The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): cover all supported models in the gallery + drop weight baking
Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.
Gallery now has seven `insightface-*` model entries (gallery/index.yaml):
insightface (family) — non-commercial research use
• buffalo-l (326MB) — SCRFD-10GF + ResNet50 + genderage, default
• buffalo-m (313MB) — SCRFD-2.5GF + ResNet50 + genderage
• buffalo-s (159MB) — SCRFD-500MF + MBF + genderage
• buffalo-sc (16MB) — SCRFD-500MF + MBF, recognition only
(no landmarks, no demographics — analyze
returns empty attributes)
• antelopev2 (407MB) — SCRFD-10GF + ResNet100@Glint360K + genderage
OpenCV Zoo family — Apache 2.0 commercial-safe
• opencv — YuNet + SFace fp32 (~40MB)
• opencv-int8 — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)
Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:
* OpenCV variants list `files:` with ONNX URIs + SHA-256, so
`local-ai models install insightface-opencv` pulls them into the
models directory exactly like any other gallery-managed model.
* insightface packs (upstream distributes .zip archives only, not
individual ONNX files) auto-download on first LoadModel via
FaceAnalysis' built-in machinery, rooted at the LocalAI models
directory so they live alongside everything else — same pattern
`rfdetr` uses with `inference.get_model()`.
Backend changes (backend/python/insightface/):
* backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
LocalAI models directory) to engines via a `_model_dir` hint.
This replaces the earlier ModelFile-dirname approach; ModelPath
is the canonical "models directory" variable set by the Go loader
(pkg/model/initializers.go:144) and is always populated.
* engines.py::_resolve_model_path — picks up `model_dir` and searches
it (plus basename-in-model-dir) before falling back to the dev
script-dir. This is how OnnxDirectEngine finds gallery-downloaded
YuNet/SFace files by filename only.
* engines.py::_flatten_insightface_pack — new helper that works
around an upstream packaging inconsistency: buffalo_l/s/sc zips
expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
files in a redundant `<name>/` directory. insightface's own
loader looks one level too shallow and fails. We call
`ensure_available()` explicitly, flatten if nested, then hand to
FaceAnalysis.
* engines.py::InsightFaceEngine.prepare — root-resolution order now
includes the `_model_dir` hint so packs download into the LocalAI
models directory by default.
* install.sh — no longer pre-downloads any weights. Everything is
gallery-managed now.
* smoke.py (new) — parametrized smoke test that iterates over every
gallery configuration, simulating the LocalAI install flow
(creates a models dir, fetches OpenCV files with checksum
verification, lets insightface auto-download its packs), then
runs detect + embed + verify (+ analyze where supported) through
the in-process BackendServicer.
* test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
paths; downloads ONNX files to a temp dir at setUpClass time and
passes ModelPath accordingly.
Registry change (core/services/facerecognition/store_registry.go):
* `dim=0` in NewStoreRegistry now means "accept whatever dimension
arrives" — needed because the backend supports 512-d ArcFace/MBF
and 128-d SFace via the same Registry. A non-zero dim still fails
fast with ErrDimensionMismatch.
* core/application plumbs `faceEmbeddingDim = 0`, explaining the
rationale in the comment.
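The `dim=0` rule reads as a small guard. A sketch in Python, with a hypothetical exception standing in for `ErrDimensionMismatch`:

```python
from typing import Sequence


class DimensionMismatchError(ValueError):
    """Hypothetical Python counterpart of ErrDimensionMismatch."""


def check_dimension(expected_dim: int, embedding: Sequence[float]) -> None:
    # expected_dim == 0 means "accept whatever dimension arrives".
    if expected_dim != 0 and len(embedding) != expected_dim:
        raise DimensionMismatchError(
            f"expected {expected_dim}-d embedding, got {len(embedding)}-d"
        )
```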
Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.
Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:
PASS: insightface-buffalo-l faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-sc faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-s faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-m faces=6 dim=512 same-dist=0.000
PASS: insightface-antelopev2 faces=6 dim=512 same-dist=0.000
PASS: insightface-opencv faces=6 dim=128 same-dist=0.000
PASS: insightface-opencv-int8 faces=6 dim=128 same-dist=0.000
7/7 passed
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim
CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:
LoadModel failed: Failed to load face engine:
OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx
Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.
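The fetch, verify, and skip-on-re-run behaviour of such a prerequisite target can be sketched in Python. The actual target is a Makefile rule; `ensure_cached` and its injected `fetch` callable are hypothetical names used so the example does not need network access.

```python
import hashlib
from pathlib import Path
from typing import Callable


def ensure_cached(path: Path, expected_sha256: str,
                  fetch: Callable[[], bytes]) -> bool:
    """Return True if the file was fetched, False if the cache was already valid."""
    if path.is_file() and hashlib.sha256(path.read_bytes()).hexdigest() == expected_sha256:
        return False  # cache hit: skip the download on re-runs
    data = fetch()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        # Fail loudly instead of caching a corrupt or tampered file.
        raise ValueError(f"SHA-256 mismatch for {path.name}: got {digest}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return True
```

Passing the resulting absolute path to the backend avoids the relative-path resolution problem described above.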
Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).
Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs
The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.
Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).
Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).
Response:
{
"embedding": [<dim> floats, L2-normed],
"dim": int, // 512 for ArcFace R50 / MBF, 128 for SFace
"model": "<name>"
}
Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings
is text-only and point image users at /v1/face/embed instead.
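For reference, the `sum(x^2) = 1` property follows directly from L2 normalization. A minimal sketch of building a response of that shape; the helper names are hypothetical and the real handler is Go:

```python
import math
from typing import Dict, List, Union


def l2_normalize(vec: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave the zero vector untouched
    return [x / norm for x in vec]


def build_embed_response(vec: List[float],
                         model: str) -> Dict[str, Union[List[float], int, str]]:
    embedding = l2_normalize(vec)
    return {"embedding": embedding, "dim": len(embedding), "model": model}
```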
Assisted-by: Claude:claude-opus-4-7
* fix(http): map malformed image input + gRPC status codes to proper 4xx
Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.
Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:
* decodeImageInput(s)
Wraps utils.GetContentURIAsBase64 and turns any failure
(invalid URL, not a data-URI, download error, etc.) into
echo.NewHTTPError(400, "invalid image input: ...").
* mapBackendError(err)
Inspects the gRPC status on a backend call error and maps:
INVALID_ARGUMENT → 400 Bad Request
NOT_FOUND → 404 Not Found
FAILED_PRECONDITION → 412 Precondition Failed
Unimplemented → 501 Not Implemented
All other codes fall through unchanged (still 500).
Before, my 1×1 PNG error-path test returned:
HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
HTTP 400 "failed to decode one or both images"
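The mapping table and the before/after messages can be reproduced with a short Python sketch. Note the assumption: the Go helper inspects the gRPC status on the error, whereas this stand-in parses the `rpc error: code = ... desc = ...` string, and `map_backend_error` here is only an illustrative name.

```python
import re
from typing import Tuple

# gRPC code name (as rendered in the error string) -> HTTP status
GRPC_TO_HTTP = {
    "InvalidArgument": 400,
    "NotFound": 404,
    "FailedPrecondition": 412,
    "Unimplemented": 501,
}

_RPC_ERROR_RE = re.compile(r"^rpc error: code = (\w+) desc = (.*)$", re.DOTALL)


def map_backend_error(message: str) -> Tuple[int, str]:
    """Return (http_status, client_message) for a backend error string."""
    m = _RPC_ERROR_RE.match(message)
    if m and m.group(1) in GRPC_TO_HTTP:
        # Known client-side failure: surface only the description.
        return GRPC_TO_HTTP[m.group(1)], m.group(2)
    # Everything else falls through unchanged as a server error.
    return 500, message
```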
Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.
Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.
Assisted-by: Claude:claude-opus-4-7
* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis
Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.
Two changes:
1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
`files:` list with the pack zip's URI + SHA-256. `local-ai models
install insightface-buffalo-l` now fetches the zip, verifies the
hash, and extracts it into the models directory. No more reliance
on insightface's library-internal `ensure_available()` auto-download
or its hardcoded `BASE_REPO_URL`.
2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
the FaceAnalysis wrapper and drives insightface's `model_zoo`
directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
route each through `model_zoo.get_model()`, build a
`{taskname: model}` dict, loop per-face at inference — are
reimplemented in `InsightFaceEngine`. The actual inference classes
(RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
insightface's — we only replicate the glue, so drift risk against
upstream is minimal.
Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
layout that doesn't match what LocalAI's zip extraction produces.
LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
`<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
`_locate_insightface_pack` helper searches both locations plus
legacy paths and returns whichever has ONNX files. Replaces the
earlier `_flatten_insightface_pack` helper (which tried to fight
FaceAnalysis's layout expectations; now we just find the files
wherever they are).
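A minimal sketch of the "find the files wherever they are" idea. The candidate order and the function name are assumptions, and the real `_locate_insightface_pack` may check more locations:

```python
from pathlib import Path
from typing import Optional


def locate_pack(models_dir: Path, pack_name: str) -> Optional[Path]:
    """Return the first candidate directory that contains ONNX files."""
    candidates = [
        models_dir / pack_name,             # zip wrapped its files in <name>/
        models_dir / "models" / pack_name,  # legacy <root>/models/<name>/ layout
        models_dir,                         # zip expanded flat into the models dir
    ]
    for candidate in candidates:
        if candidate.is_dir() and any(candidate.glob("*.onnx")):
            return candidate
    return None
```

Checking the nested directories before the flat models directory is a choice made for this sketch, so that unrelated ONNX files sitting at the top level do not shadow a nested pack.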
Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched
The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".
Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.
Face analyze cap dropped since buffalo_sc has no age/gender head.
Assisted-by: Claude:claude-opus-4-7[1m]
* feat(face-recognition): surface face-recognition in advertised feature maps
The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:
* api_instructions — the machine-readable capability index at
GET /api/instructions. Added `face-recognition` as a dedicated
instruction area with an intro that calls out the in-memory
registry caveat and the /v1/face/embed vs /v1/embeddings split.
* auth/permissions — added FeatureFaceRecognition constant, routed
all six face endpoints through it so admins can gate them per-user
like any other API feature. Default ON (matches the other API
features).
* React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
follow-up (noted in the plan).
Instruction count bumped 9 → 10; test updated.
Assisted-by: Claude:claude-opus-4-7[1m]
* docs(agents): capture advertising-surface steps in the endpoint guide
Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.
Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags,
/api/instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.
Assisted-by: Claude:claude-opus-4-7[1m]
|
||
|
|
c54897ad44 |
fix(tests): update InstallBackend call sites for new URI/Name/Alias params (#9467)
Commit
|
||
|
|
6f0051301b |
feat(backend): add tinygrad multimodal backend (experimental) (#9364)
* feat(backend): add tinygrad multimodal backend
Wire tinygrad as a new Python backend covering LLM text generation with
native tool-call extraction, embeddings, Stable Diffusion 1.x image
generation, and Whisper speech-to-text from a single self-contained
container.
Backend (`backend/python/tinygrad/`):
- `backend.py` gRPC servicer with LLM Predict/PredictStream (auto-detects
Llama / Qwen2 / Mistral architecture from `config.json`, supports
safetensors and GGUF), Embedding via mean-pooled last hidden state,
GenerateImage via the vendored SD1.x pipeline, AudioTranscription +
AudioTranscriptionStream via the vendored Whisper inference loop, plus
Tokenize / ModelMetadata / Status / Free.
- Vendored upstream model code under `vendor/` (MIT, headers preserved):
llama.py with an added `qkv_bias` flag for Qwen2-family bias support
and an `embed()` method that returns the last hidden state, plus
clip.py, unet.py, stable_diffusion.py (trimmed to drop the MLPerf
training branch that pulls `mlperf.initializers`), audio_helpers.py
and whisper.py (trimmed to drop the pyaudio listener).
- Pluggable tool-call parsers under `tool_parsers/`: hermes (Qwen2.5 /
Hermes), llama3_json (Llama 3.1+), qwen3_xml (Qwen 3), mistral
(Mistral / Mixtral). Auto-selected from model architecture or `Options`.
- `install.sh` pins Python 3.11.14 (tinygrad >=0.12 needs >=3.11; the
default portable python is 3.10).
- `package.sh` bundles libLLVM.so.1 + libedit/libtinfo/libgomp/libsndfile
into the scratch image. `run.sh` sets `CPU_LLVM=1` and `LLVM_PATH` so
tinygrad's CPU device uses the in-process libLLVM JIT instead of
shelling out to the missing `clang` binary.
- Local unit tests for Health and the four parsers in `test.py`.
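To illustrate what one of the tool-call parsers above has to do, here is a small sketch of Hermes-style extraction, where the model wraps a JSON object in `<tool_call>` tags. It is not the vendored parser: the function name and the returned shape are assumptions.

```python
import json
import re
from typing import Any, Dict, List, Tuple

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> Tuple[str, List[Dict[str, Any]]]:
    """Split model output into plain content and a list of tool calls."""
    calls: List[Dict[str, Any]] = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of failing the whole reply
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    # Whatever is left after removing the tagged blocks is ordinary content.
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls
```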
Build wiring:
- Root `Makefile`: `.NOTPARALLEL`, `prepare-test-extra`, `test-extra`,
`BACKEND_TINYGRAD = tinygrad|python|.|false|true`,
docker-build-target eval, and `docker-build-backends` aggregator.
- `.github/workflows/backend.yml`: cpu / cuda12 / cuda13 build matrix
entries (mirrors the transformers backend placement).
- `backend/index.yaml`: `&tinygrad` meta + cpu/cuda12/cuda13 image
entries (latest + development).
E2E test wiring:
- `tests/e2e-backends/backend_test.go` gains an `image` capability that
exercises GenerateImage and asserts a non-empty PNG is written to
`dst`. New `BACKEND_TEST_IMAGE_PROMPT` / `BACKEND_TEST_IMAGE_STEPS`
knobs.
- Five new make targets next to `test-extra-backend-vllm`:
- `test-extra-backend-tinygrad` — Qwen2.5-0.5B-Instruct + hermes,
mirrors the vllm target 1:1 (5/9 specs in ~57s).
- `test-extra-backend-tinygrad-embeddings` — same model, embeddings
via LLM hidden state (3/9 in ~10s).
- `test-extra-backend-tinygrad-sd` — stable-diffusion-v1-5 mirror,
health/load/image (3/9 in ~10min, 4 diffusion steps on CPU).
- `test-extra-backend-tinygrad-whisper` — openai/whisper-tiny.en
against jfk.wav from whisper.cpp samples (4/9 in ~49s).
- `test-extra-backend-tinygrad-all` aggregate.
All four targets land green on the first MVP pass: 15 specs total, 0
failures across LLM+tools, embeddings, image generation, and speech
transcription.
* refactor(tinygrad): collapse to a single backend image
tinygrad generates its own GPU kernels (PTX renderer for CUDA, the
autogen ctypes wrappers for HIP / Metal / WebGPU) and never links
against cuDNN, cuBLAS, or any toolkit-version-tied library. The only
runtime dependency that varies across hosts is the driver's libcuda.so.1
/ libamdhip64.so, which are injected into the container at run time by
the nvidia-container / rocm runtimes. So unlike torch- or vLLM-based
backends, there is no reason to ship per-CUDA-version images.
- Drop the cuda12-tinygrad and cuda13-tinygrad build-matrix entries
from .github/workflows/backend.yml. The sole remaining entry is
renamed to -tinygrad (from -cpu-tinygrad) since it is no longer
CPU-only.
- Collapse backend/index.yaml to a single meta + development pair.
The meta anchor carries the latest uri directly; the development
entry points at the master tag.
- run.sh picks the tinygrad device at launch time by probing
/usr/lib/... for libcuda.so.1 / libamdhip64.so. When libcuda is
visible we set CUDA=1 + CUDA_PTX=1 so tinygrad uses its own PTX
renderer (avoids any nvrtc/toolkit dependency); otherwise we fall
back to HIP or CLANG. CPU_LLVM=1 + LLVM_PATH keep the in-process
libLLVM JIT for the CLANG path.
- backend.py's _select_tinygrad_device() is trimmed to a CLANG-only
fallback since production device selection happens in run.sh.
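The launch-time probing can be pictured with a short Python sketch. The real logic lives in `run.sh`; the function name, the `HIP=1` and `CLANG=1` variable names, and the injected `exists` callable are assumptions, and `LLVM_PATH` is left out because its value is not given here.

```python
import os
from typing import Callable, Dict, Iterable


def select_device_env(lib_dirs: Iterable[str],
                      exists: Callable[[str], bool] = os.path.exists) -> Dict[str, str]:
    """Pick environment variables based on which driver libraries are visible."""
    dirs = list(lib_dirs)

    def found(name: str) -> bool:
        return any(exists(os.path.join(d, name)) for d in dirs)

    if found("libcuda.so.1"):
        # Driver library injected by the container runtime: use the PTX renderer.
        return {"CUDA": "1", "CUDA_PTX": "1"}
    if found("libamdhip64.so"):
        return {"HIP": "1"}  # assumed variable name for the HIP path
    # CPU fallback using the in-process LLVM JIT.
    return {"CLANG": "1", "CPU_LLVM": "1"}
```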
Re-ran test-extra-backend-tinygrad after the change:
Ran 5 of 9 Specs in 56.541 seconds — 5 Passed, 0 Failed
|
||
|
|
95efb8a562 |
feat(backend): add turboquant llama.cpp-fork backend (#9355)
* feat(backend): add turboquant llama.cpp-fork backend
turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.
Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.
scripts/changed-backends.js gets a special-case so edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.
* feat(turboquant): carry upstream patches against fork API drift
turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.
Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. Simplifies the build flow too: instead of hopping through
llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now
drives the copied Makefile directly (clone -> patch -> build).
Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.
* docs: add turboquant backend section + clarify cache_type_k/v
Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.
Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.
* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion
The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.
* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e
The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types — turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time, meaning
nothing the LocalAI gRPC layer accepted was actually fork-specific.
Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.
Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.
Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.
* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3
The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which ships in the base image and is already available everywhere
the rest of the wrapper Makefile runs.
* Apply suggestion from @mudler
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
|
||
|
|
87e6de1989 |
feat: wire transcription for llama.cpp, add streaming support (#9353)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
a05c7def59 |
fix(e2e): update to new testcontainers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
d67623230f |
feat(vllm): parity with llama.cpp backend (#9328)
* fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto
The ToProto conversion was dropping tool_call_id and reasoning_content
even though both proto and Go fields existed, breaking multi-turn tool
calling and reasoning passthrough to backends.
* refactor(config): introduce backend hook system and migrate llama-cpp defaults
Adds RegisterBackendHook/runBackendHooks so each backend can register
default-filling functions that run during ModelConfig.SetDefaults().
Migrates the existing GGUF guessing logic into hooks_llamacpp.go,
registered for both 'llama-cpp' and the empty backend (auto-detect).
Removes the old guesser.go shim.
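A minimal sketch of how such a hook registry could look. Only the names RegisterBackendHook, runBackendHooks and ModelConfig.SetDefaults come from the message above; the hook signature, the ModelConfig fields and the 4096 default are assumptions made for illustration, not the actual LocalAI code.

```go
package main

import "fmt"

// ModelConfig is a stand-in with only the fields this sketch needs
// (assumption: the real struct is much larger).
type ModelConfig struct {
	Backend     string
	ContextSize int
}

// BackendHook fills in defaults on a config (hypothetical signature).
type BackendHook func(cfg *ModelConfig)

// hooks maps a backend name to the hooks registered for it.
var hooks = map[string][]BackendHook{}

// RegisterBackendHook registers a hook for a backend name. The empty
// name stands for "backend not set" (auto-detect).
func RegisterBackendHook(backend string, h BackendHook) {
	hooks[backend] = append(hooks[backend], h)
}

// runBackendHooks runs every hook registered for cfg.Backend, in
// registration order.
func runBackendHooks(cfg *ModelConfig) {
	for _, h := range hooks[cfg.Backend] {
		h(cfg)
	}
}

// SetDefaults is the point where the hooks are invoked.
func (cfg *ModelConfig) SetDefaults() {
	runBackendHooks(cfg)
}

// llamaCppDefaults is an illustrative hook: it only fills unset values,
// so anything the user configured explicitly is left alone.
func llamaCppDefaults(cfg *ModelConfig) {
	if cfg.ContextSize == 0 {
		cfg.ContextSize = 4096 // placeholder value, not LocalAI's real default
	}
}

func init() {
	// Registered for both the explicit backend and the empty one.
	RegisterBackendHook("llama-cpp", llamaCppDefaults)
	RegisterBackendHook("", llamaCppDefaults)
}

func main() {
	cfg := &ModelConfig{}
	cfg.SetDefaults()
	fmt.Println(cfg.ContextSize)
}
```

Keeping the hooks in a map keyed by backend name means a new backend only has to call the registration function; SetDefaults itself does not change.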
* feat(config): add vLLM parser defaults hook and importer auto-detection
Introduces parser_defaults.json mapping model families to vLLM
tool_parser/reasoning_parser names, with longest-pattern-first matching.
The vllmDefaults hook auto-fills tool_parser and reasoning_parser
options at load time for known families, while the VLLMImporter writes
the same values into generated YAML so users can review and edit them.
Adds tests covering MatchParserDefaults, hook registration via
SetDefaults, and the user-override behavior.
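Longest-pattern-first matching can be sketched as follows. The table contents, the struct layout, and the choice of case-insensitive substring matching are assumptions; the actual parser_defaults.json and MatchParserDefaults may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ParserDefaults holds the vLLM parser names for one model family.
type ParserDefaults struct {
	ToolParser      string
	ReasoningParser string
}

// entry pairs a substring pattern with its defaults (hypothetical layout).
type entry struct {
	Pattern  string
	Defaults ParserDefaults
}

// table is illustrative only; it does not reproduce parser_defaults.json.
var table = []entry{
	{"qwen", ParserDefaults{ToolParser: "hermes"}},
	{"qwen3", ParserDefaults{ToolParser: "hermes", ReasoningParser: "qwen3"}},
}

// matchParserDefaults returns the defaults of the longest pattern that
// occurs in the model name, so "qwen3" wins over "qwen" when both match.
func matchParserDefaults(model string, entries []entry) (ParserDefaults, bool) {
	sorted := make([]entry, len(entries))
	copy(sorted, entries) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool {
		return len(sorted[i].Pattern) > len(sorted[j].Pattern)
	})
	name := strings.ToLower(model)
	for _, e := range sorted {
		if strings.Contains(name, strings.ToLower(e.Pattern)) {
			return e.Defaults, true
		}
	}
	return ParserDefaults{}, false
}

func main() {
	d, ok := matchParserDefaults("Qwen/Qwen3-8B", table)
	fmt.Println(d.ToolParser, d.ReasoningParser, ok)
}
```

Sorting by pattern length before matching keeps a generic family entry from shadowing a more specific one, regardless of the order in the JSON file.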
* feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs
- Use vLLM's ToolParserManager/ReasoningParserManager to extract structured
output (tool calls, reasoning content) instead of reimplementing parsing
- Convert proto Messages to dicts and pass tools to apply_chat_template
- Emit ChatDelta with content/reasoning_content/tool_calls in Reply
- Extract prompt_tokens, completion_tokens, and logprobs from output
- Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar
- Add TokenizeString and Free RPC methods
- Fix missing `time` import used by load_video()
* feat(vllm): CPU support + shared utils + vllm-omni feature parity
- Split vllm install per acceleration: move generic `vllm` out of
requirements-after.txt into per-profile after files (cublas12, hipblas,
intel) and add CPU wheel URL for cpu-after.txt
- requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index
- backend/index.yaml: register cpu-vllm / cpu-vllm-development variants
- New backend/python/common/vllm_utils.py: shared parse_options,
messages_to_dicts, setup_parsers helpers (used by both vllm backends)
- vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template,
wire native parsers via shared utils, emit ChatDelta with token counts,
add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE
- Add test_cpu_inference.py: standalone script to validate CPU build with
a small model (Qwen2.5-0.5B-Instruct)
* fix(vllm): CPU build compatibility with vllm 0.14.1
Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict,
TokenizeString, Free all working).
- requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from
GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU
wheel whose torch dependency resolves against published PyTorch builds
(torch==2.9.1+cpu). Later vllm CPU wheels currently require
torch==2.10.0+cpu which is only available on the PyTorch test channel
with incompatible torchvision.
- requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio
so uv resolves them consistently from the PyTorch CPU index.
- install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv
can mix the PyTorch index and PyPI for transitive deps (matches the
existing intel profile behaviour).
- backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config
so the old code path errored out with AttributeError on model load.
Switch to the new get_tokenizer()/tokenizer accessor with a fallback
to building the tokenizer directly from request.Model.
* fix(vllm): tool parser constructor compat + e2e tool calling test
Concrete vLLM tool parsers override the abstract base's __init__ and
drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer).
Instantiating with tools= raised TypeError which was silently caught,
leaving chat_deltas.tool_calls empty.
Retry the constructor without the tools kwarg on TypeError — tools
aren't required by these parsers since extract_tool_calls finds tool
syntax in the raw model output directly.
Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU:
the backend correctly returns ToolCallDelta{name='get_weather',
arguments='{"location": "Paris, France"}'} in ChatDelta.
test_tool_calls.py is a standalone smoke test that spawns the gRPC
backend, sends a chat completion with tools, and asserts the response
contains a structured tool call.
* ci(backend): build cpu-vllm container image
Add the cpu-vllm variant to the backend container build matrix so the
image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development)
is actually produced by CI.
Follows the same pattern as the other CPU python backends
(cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA.
backend_pr.yml auto-picks this up via its matrix filter from backend.yml.
* test(e2e-backends): add tools capability + HF model name support
Extends tests/e2e-backends to cover backends that:
- Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of
loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as
ModelOptions.Model with no download/ModelFile.
- Parse tool calls into ChatDelta.tool_calls: new "tools" capability
sends a Predict with a get_weather function definition and asserts
the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate
with OpenAI-style Messages so the backend can wire tools into the
model's chat template.
- Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set
e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time.
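As a rough illustration of the last point, a comma-separated value such as the one above could be split into the Options[] strings like this. The helper names splitOptions and optionValue are hypothetical, and the real harness may handle whitespace or empty entries differently.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOptions turns "k1:v1,k2:v2" into one string per option, trimming
// whitespace and dropping empty entries.
func splitOptions(raw string) []string {
	var opts []string
	for _, part := range strings.Split(raw, ",") {
		part = strings.TrimSpace(part)
		if part != "" {
			opts = append(opts, part)
		}
	}
	return opts
}

// optionValue looks up the value of a "key:value" option.
func optionValue(opts []string, key string) (string, bool) {
	for _, o := range opts {
		k, v, found := strings.Cut(o, ":")
		if found && k == key {
			return v, true
		}
	}
	return "", false
}

func main() {
	opts := splitOptions("tool_parser:hermes,reasoning_parser:qwen3")
	v, ok := optionValue(opts, "tool_parser")
	fmt.Println(len(opts), v, ok)
}
```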
Adds make target test-extra-backend-vllm that:
- docker-build-vllm
- loads Qwen/Qwen2.5-0.5B-Instruct
- runs health,load,predict,stream,tools with tool_parser:hermes
Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those
standalone scripts were scaffolding used while bringing up the Python
backend; the e2e-backends harness now covers the same ground uniformly
alongside llama-cpp and ik-llama-cpp.
* ci(test-extra): run vllm e2e tests on CPU
Adds tests-vllm-grpc to the test-extra workflow, mirroring the
llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under
backend/python/vllm/ change (or on run-all), builds the local-ai
vllm container image, and runs the tests/e2e-backends harness with
BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes,
and the tools capability enabled.
Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm
wheel we pinned in requirements-cpu-after.txt. Frees disk space
before the build since the docker image + torch + vllm wheel is
sizeable.
* fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel
The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with
SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU
supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns
the model_executor.models.registry subprocess for introspection, so
LoadModel never reaches the actual inference path.
- install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide
requirements-cpu-after.txt so installRequirements installs the base
deps + torch CPU without pulling the prebuilt wheel, then clone vllm
and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries
target the host's actual CPU.
- backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose
it as an ENV so install.sh sees it during `make`.
- Makefile docker-build-backend: forward FROM_SOURCE as --build-arg
when set, so backends that need source builds can opt in.
- Makefile test-extra-backend-vllm: call docker-build-vllm via a
recursive $(MAKE) invocation so FROM_SOURCE flows through.
- .github/workflows/test-extra.yml: set FROM_SOURCE=true on the
tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only
works on hosts that share the build-time SIMD baseline.
Answers 'did you test locally?': yes, end-to-end on my local machine
with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU
gap was not covered locally — this commit plugs that gap.
* ci(vllm): use bigger-runner instead of source build
The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512
VNNI/BF16) that stock ubuntu-latest GitHub runners don't support —
vllm.model_executor.models.registry SIGILLs on import during LoadModel.
Source compilation works but takes 30-40 minutes per CI run, which is
too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the
bigger-runner self-hosted label (already used by backend.yml for the
llama-cpp CUDA build) — that hardware has the required SIMD baseline
and the prebuilt wheel runs cleanly.
FROM_SOURCE=true is kept as an opt-in escape hatch:
- install.sh still has the CPU source-build path for hosts that need it
- backend/Dockerfile.python still declares the ARG + ENV
- Makefile docker-build-backend still forwards the build-arg when set
Default CI path uses the fast prebuilt wheel; source build can be
re-enabled by exporting FROM_SOURCE=true in the environment.
* ci(vllm): install make + build deps on bigger-runner
bigger-runner is a bare self-hosted runner used by backend.yml for
docker image builds — it has docker but not the usual ubuntu-latest
toolchain. The make-based test target needs make, build-essential
(cgo in 'go test'), and curl/unzip (the Makefile protoc target
downloads protoc from github releases).
protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the
install-go-tools target, which setup-go makes possible.
* ci(vllm): install libnuma1 + libgomp1 on bigger-runner
The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens
libnuma.so.1 at import time. When the runner host doesn't have it,
the extension silently fails to register its torch ops, so
EngineCore crashes on init_device with:
AttributeError: '_OpNamespace' '_C_utils' object has no attribute
'init_cpu_threads_env'
Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be
safe on stripped-down runners.
* feat(vllm): bundle libnuma/libgomp via package.sh
The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at
import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP).
Without these on the host, vllm._C silently fails to register its
torch ops and EngineCore crashes with:
AttributeError: '_OpNamespace' '_C_utils' object has no attribute
'init_cpu_threads_env'
Rather than asking every user to install libnuma1/libgomp1 on their
host (or every LocalAI base image to ship them), bundle them into
the backend image itself — same pattern fish-speech and the GPU libs
already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at
run time so the bundled copies are picked up automatically.
- backend/python/vllm/package.sh (new): copies libnuma.so.1 and
libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib,
preserving soname symlinks. Runs during Dockerfile.python's
'Run backend-specific packaging' step (which already invokes
package.sh if present).
- backend/Dockerfile.python: install libnuma1 + libgomp1 in the
builder stage so package.sh has something to copy (the Ubuntu
base image otherwise only has libgomp in the gcc dep chain).
- test-extra.yml: drop the workaround that installed these libs on
the runner host — with the backend image self-contained, the
runner no longer needs them, and the test now exercises the
packaging path end-to-end the way a production host would.
* ci(vllm): disable tests-vllm-grpc job (heterogeneous runners)
Both ubuntu-latest and bigger-runner have inconsistent CPU baselines:
some instances support the AVX-512 VNNI/BF16 instructions the prebuilt
vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of
vllm.model_executor.models.registry. The libnuma packaging fix doesn't
help when the wheel itself can't be loaded.
FROM_SOURCE=true compiles vllm against the actual host CPU and works
everywhere, but takes 30-50 minutes per run — too slow for a smoke
test on every PR.
Comment out the job for now. The test itself is intact and passes
locally; run it via 'make test-extra-backend-vllm' on a host with the
required SIMD baseline. Re-enable when:
- we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or
- vllm publishes a CPU wheel with a wider baseline, or
- we set up a docker layer cache that makes FROM_SOURCE acceptable
The detect-changes vllm output, the test harness changes (tests/
e2e-backends + tools cap), the make target (test-extra-backend-vllm),
the package.sh and the Dockerfile/install.sh plumbing all stay in
place.
|
||
|
|
9ca03cf9cc |
feat(backends): add ik-llama-cpp (#9326)
* feat(backends): add ik-llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: add grpc e2e suite, hook to CI, update README Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> |
||
|
|
e1a6010874 |
fix(streaming): deduplicate tool call emissions during streaming (#9292)
The Go-side incremental JSON parser was emitting the same tool call on
every streaming token because it lacked the len > lastEmittedCount guard
that the XML parser had. On top of that, the post-streaming default:
case re-emitted all tool calls from index 0, duplicating everything.
This produced duplicate delta.tool_calls events causing clients to
accumulate arguments as "{args}{args}" — invalid JSON.
Fixes:
- JSON incremental parser: add len(jsonResults) > lastEmittedCount guard
and loop from lastEmittedCount (matching the XML parser pattern)
- Post-streaming default: case: skip i < lastEmittedCount entries that
were already emitted during streaming
- JSON parser: use blocking channel send (matching XML parser behavior)
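The guard described above can be sketched like this. The ToolCall type and the emitNew helper are hypothetical stand-ins; only the lastEmittedCount idea and the blocking send come from the message.

```go
package main

import "fmt"

// ToolCall is a simplified stand-in for a parsed tool call.
type ToolCall struct {
	Name      string
	Arguments string
}

// emitNew sends only the results that have not been emitted yet and
// returns the updated count. The parser is assumed to return the full,
// cumulative list of tool calls on every streaming token.
func emitNew(results []ToolCall, lastEmittedCount int, out chan<- ToolCall) int {
	if len(results) > lastEmittedCount {
		for i := lastEmittedCount; i < len(results); i++ {
			out <- results[i] // blocking send: nothing is dropped
		}
		lastEmittedCount = len(results)
	}
	return lastEmittedCount
}

func main() {
	out := make(chan ToolCall, 4) // buffered so the demo needs no goroutine
	emitted := 0
	parsed := []ToolCall{{Name: "get_weather", Arguments: `{"location":"Paris"}`}}

	// Two consecutive tokens see the same parsed result; only the first emits.
	emitted = emitNew(parsed, emitted, out)
	emitted = emitNew(parsed, emitted, out)

	close(out)
	for tc := range out {
		fmt.Println(tc.Name, tc.Arguments)
	}
}
```

The post-streaming pass can reuse the same count: starting its loop at lastEmittedCount skips everything that was already sent during streaming.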
|
||
|
|
13a6ed709c |
fix: thinking models with tools returning empty content (reasoning-only retry loop) (#9290)
When clients like Nextcloud or Home Assistant send requests with tools to thinking models (e.g. Gemma 4 with <|channel>thought tags), the response was empty despite the backend producing valid content. Root cause: the C++ autoparser puts clean content in both the raw Response and ChatDeltas. The Go-side PrependThinkingTokenIfNeeded then prepends the thinking start token to the already-clean content, causing ExtractReasoning to classify the entire response as unclosed reasoning. This made cbRawResult empty, triggering a retry loop that never succeeds. Two fixes: - inference.go: check ChatDeltas for content/tool_calls regardless of whether Response is empty, so skipCallerRetry fires correctly - chat.go: when ChatDeltas have content but no tool calls, use that content directly instead of falling back to the empty cbRawResult |
||
|
|
85be4ff03c |
feat(api): add ollama compatibility (#9284)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
0f9d516a6c |
fix(anthropic): do not emit empty tokens and fix SSE tool calls (#9258)
This fixes Claude Code compatibility Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
773489eeb1 |
fix(chat): do not retry if we had chatdeltas or tooldeltas from backend (#9244)
* fix(chat): do not retry if we had chatdeltas or tooldeltas from backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: use oai compat for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: apply to non-streaming path too Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * map also other fields Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
6b6c136210 |
fix(inflight): count inflight from load model, but release afterwards (#9194)
This should fix the count of 1 in flight always showing in the node list Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
59108fbe32 |
feat: add distributed mode (#9124)
* feat: add distributed mode (experimental) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix data races, mutexes, transactions Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix events and tool stream in agent chat Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * use ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(cron): compute correctly time boundaries avoiding re-triggering Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not flood of healthy checks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not list obvious backends as text backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * tests fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactoring and consolidation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop redundant healthcheck Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * enhancements, refactorings Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
8cd3f9fc47 |
feat(ui, openai): Structured errors and link to traces in error toast (#9068)
When sending errors over SSE, we now clearly identify them as such instead of just sending the error string as a chat completion message. The UI uses this to identify errors and link them to the traces. Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
cfb7641eea |
feat(ui, gallery): Show model backends and add searchable model/backend selector (#9060)
* feat(ui, gallery): Display and filter by the backend models use Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(ui): Add searchable model/backend selector and prevent deleted models from being selected Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
35d509d8e7 |
feat(ui): Per model backend logs and various fixes (#9028)
* feat(gallery): Switch to expandable box instead of pop-over and display model files Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(ui, backends): Add individual backend logging Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(ui): Set the context settings from the model config Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
3d9ccd1ddc |
fix(ui): Add tracing inline settings back and create UI tests (#9027)
Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
5affb747a9 |
chore: drop AIO images (#9004)
AIO images are lagging behind and take effort to maintain. The wizard and model installation have been simplified massively, so AIO images have lost their purpose. Dropping them lets us focus on the main images and relieves stress on CI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
f9a850c02a |
feat(realtime): WebRTC support (#8790)
* feat(realtime): WebRTC support Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(tracing): Show full LLM opts and deltas Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
8818452d85 |
feat(ui): MCP Apps, mcp streaming and client-side support (#8947)
* Revert "fix: Add timeout-based wait for model deletion completion (#8756)"
This reverts commit
|
||
|
|
96efa4fce0 |
feat: add WebSocket mode support for the response api (#8676)
* feat: add WebSocket mode support for the response api Signed-off-by: bittoby <218712309+bittoby@users.noreply.github.com> * test: add e2e tests for WebSocket Responses API Signed-off-by: bittoby <218712309+bittoby@users.noreply.github.com> --------- Signed-off-by: bittoby <218712309+bittoby@users.noreply.github.com> |
||
|
|
697f6aa71c |
feat(audio): set audio content type (#8416)
* feat(audio): set audio content type Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: add tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
53276d28e7 |
feat(musicgen): add ace-step and UI interface (#8396)
* feat(musicgen): add ace-step and UI interface Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly handle model dir Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop auto-download Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to models, fixup UIs icons Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * l4t13 is incompatible Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * avoid pinning version for cuda12 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop l4t12 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
b6459ddd57 |
feat(api): Add transcribe response format request parameter & adjust STT backends (#8318)
* WIP response format implementation for audio transcriptions (cherry picked from commit e271dd764bbc13846accf3beb8b6522153aa276f) Signed-off-by: Andres Smith <andressmithdev@pm.me> * Rework transcript response_format and add more formats (cherry picked from commit 6a93a8f63e2ee5726bca2980b0c9cf4ef8b7aeb8) Signed-off-by: Andres Smith <andressmithdev@pm.me> * Add test and replace go-openai package with official openai go client (cherry picked from commit f25d1a04e46526429c89db4c739e1e65942ca893) Signed-off-by: Andres Smith <andressmithdev@pm.me> * Fix faster-whisper backend and refactor transcription formatting to also work on CLI Signed-off-by: Andres Smith <andressmithdev@pm.me> (cherry picked from commit 69a93977d5e113eb7172bd85a0f918592d3d2168) Signed-off-by: Andres Smith <andressmithdev@pm.me> --------- Signed-off-by: Andres Smith <andressmithdev@pm.me> Co-authored-by: nanoandrew4 <nanoandrew4@gmail.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com> |
||
|
|
4077aaf978 |
chore: re-enable e2e tests, fixups anthropic API tools support (#8296)
* chore(tests): add mock backend e2e tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup anthropic tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * prepare e2e tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop repetitive tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop specific CI workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup anthropic issues, move all e2e tests to use mocked backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
5ca8f0aea0 |
feat: add tool/function calling support to Anthropic Messages API (#7956)
* Initial plan * Add tool/function calling schema support to Anthropic Messages API Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Add E2E tests for Anthropic tool calling Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Make tool calling tests require model to use tools - First test now expects hasToolUse to be true with clear error message - Third test now expects toolUseID to be non-empty (removed conditional) - Both tests will now fail if model doesn't call the expected tools Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Add E2E test for tool calling with streaming responses - Tests that streaming events are properly emitted (content_block_start/delta/stop) - Verifies tool_use blocks are accumulated correctly in streaming mode - Ensures model calls tools and stop_reason is set to tool_use Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
4cbf9abfef |
feat: Add Anthropic Messages API support (#7948)
* Initial plan * Add Anthropic Messages API support Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Fix code review comments: add error handling for JSON operations Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Fix test suite to use existing schema test runner Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Add Anthropic e2e tests using anthropic-sdk-go for streaming and non-streaming Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
c37785b78c |
chore(refactor): move logging to common package based on slog (#7668)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
fc5b9ebfcc |
feat(loader): enhance single active backend to support LRU eviction (#7535)
* feat(loader): refactor single active backend support to LRU This changeset introduces LRU management of loaded backends. Users can now set a maximum number of models to be loaded concurrently, and when LocalAI runs in single active backend mode the LRU size is set to 1 for backward compatibility. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: add tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
c313b2c671 |
fix(reranker): tests and top_n check fix #7212 (#7284)
reranker tests and top_n check fix #7212 Signed-off-by: Mikhail Khludnev <mkhl@apache.org> |
||
|
|
6410c99bf2 |
fix(llama-cpp): correctly calculate embeddings (#6259)
* chore(tests): check embeddings differs in llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(llama.cpp): use the correct field for embedding Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(llama.cpp): use embedding type none Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(tests): add test-cases in aio-e2e suite Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
f8a8cf3e95 |
feat(launcher): add LocalAI launcher app (#6127)
* Add launcher (WIP) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update gomod Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Cleanup, focus on systray Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Separate launcher from main Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add a way to identify the binary version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Implement save config, and start on boot Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Save installed version as metadata Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Stop LocalAI on quit Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix goreleaser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Check first if binary is there Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * do not show version if we don't have it Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * use fyne package Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to release Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fyne.Do Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * show WEBUI button only if LocalAI is started Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Default to localhost Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Show rel notes Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update logo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small improvements and fix tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix e2e tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
089efe05fd |
feat(backends): add system backend, refactor (#6059)
- Add a system backend path - Refactor and consolidate system information in system state - Use system state in all the components to figure out the system paths to use whenever needed - Refactor BackendConfig -> ModelConfig. This was otherwise misleading, as we now have a backend configuration which is not the model config. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
98e5291afc |
feat: refactor build process, drop embedded backends (#5875)
* feat: split remaining backends and drop embedded backends - Drop silero-vad, huggingface, and stores backend from embedded binaries - Refactor Makefile and Dockerfile to avoid building grpc backends - Drop golang code that was used to embed backends - Simplify building by using goreleaser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(gallery): be specific with llama-cpp backend templates Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(docs): update Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(ci): minor fixes Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: drop all ffmpeg references Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: run protogen-go Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Always enable p2p mode Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update gorelease file Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(stores): do not always load Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix linting issues Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Mac OS fixup Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
294f7022f3 |
feat: do not bundle llama-cpp anymore (#5790)
* Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend 
index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporarily Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
2d64269763 |
feat: Add backend gallery (#5607)
* feat: Add backend gallery This PR adds support for managing backends in a similar way to models. A backend gallery is now available which can be used to install and remove extra backends. The backend gallery can be configured similarly to a model gallery, and API calls allow installing and removing backends at runtime, as well as during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is only available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accelerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> 
* ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> |
||
|
|
40618164b2 |
chore: improve tests (#5646)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
21bdfe5fa4 |
fix: use rice when embedding large binaries (#5309)
* fix(embed): use go-rice for large backend assets Golang embed FS has a hard limit that we might exceed when providing many binary alternatives. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * simplify golang deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(tests): switch to testcontainers and print logs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): do not build a test binary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * small fixup Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
2c425e9c69 |
feat(loader): enhance single active backend by treating as singleton (#5107)
feat(loader): enhance single active backend by treating as singleton Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
3cddf24747 |
feat: Centralized Request Processing middleware (#3847)
* squash past, centralize request middleware PR Signed-off-by: Dave Lee <dave@gray101.com> * migrate bruno request files to examples repo Signed-off-by: Dave Lee <dave@gray101.com> * fix Signed-off-by: Dave Lee <dave@gray101.com> * Update tests/e2e-aio/e2e_test.go Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Dave Lee <dave@gray101.com> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com> |
||
|
|
e8eb0b2c50 |
fix(stores): Stores fixes and testing (#4663)
* fix(stores): Actually check a vector is a unit vector/normalized Instead of just summing the components to see if they equal 1.0, take the actual magnitude/p-norm of the vector and check that is approximately 1.0. Note that this shouldn't change the order of results except in edge cases if I am too lax with the precision of the equality comparison. However it should improve performance for normalized vectors which were being misclassified. Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(stores): Add tests for known results and triangle inequality This adds some more tests to check the cosine similarity function has some expected mathematical properties. Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |
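The stores fix above replaces a "components sum to 1.0" test with a check on the vector's magnitude. A minimal Go sketch of the difference follows. The function names `isNormalized` and `sumsToOne` and the tolerance parameter are hypothetical, chosen for illustration; they are not taken from the LocalAI source.

```go
package main

import (
	"fmt"
	"math"
)

// isNormalized reports whether v has Euclidean (L2) length within tol of 1.
// Hypothetical helper: this is the kind of check the commit message describes,
// not the actual LocalAI implementation.
func isNormalized(v []float32, tol float64) bool {
	var sumSq float64
	for _, x := range v {
		sumSq += float64(x) * float64(x)
	}
	return math.Abs(math.Sqrt(sumSq)-1.0) <= tol
}

// sumsToOne is the flawed test the commit message says was replaced:
// it only checks that the components add up to roughly 1.
func sumsToOne(v []float32, tol float64) bool {
	var sum float64
	for _, x := range v {
		sum += float64(x)
	}
	return math.Abs(sum-1.0) <= tol
}

func main() {
	unit := []float32{0.6, 0.8}    // length 1, but components sum to 1.4
	notUnit := []float32{0.5, 0.5} // components sum to 1, but length is about 0.707

	fmt.Println("unit:   ", isNormalized(unit, 1e-6), sumsToOne(unit, 1e-6))
	fmt.Println("notUnit:", isNormalized(notUnit, 1e-6), sumsToOne(notUnit, 1e-6))
}
```

The two sample vectors show both failure modes of the sum-based test: a true unit vector is rejected, and a non-unit vector is accepted. The tolerance trades off against float32 rounding, which is the "precision of the equality comparison" the commit message mentions.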
||
|
|
e15d29aba2 |
chore(stablediffusion-ncn): drop in favor of ggml implementation (#4652)
* chore(stablediffusion-ncn): drop in favor of ggml implementation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(ci): drop stablediffusion build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(tests): add Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(tests): try to fixup current tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tests improvements Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(tests): use quality to specify step Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(tests): switch to sd-1.5 also increase prep time for downloading models Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
1e9bf19c8d |
feat(transformers): merge sentencetransformers backend (#4624)
* merge sentencetransformers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add alias to silently redirect sentencetransformers to transformers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add alias also for transformers-musicgen Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop from makefile Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move tests from sentencetransformers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Remove sentencetransformers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Remove tests from CI (part of transformers) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Do not always try to load the tokenizer Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix typo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tiny adjustments Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
7b75e9de2d |
fix(rwkv model): add stoptoken (#4283)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
3c3050f68e |
feat(backends): Drop bert.cpp (#4272)
* feat(backends): Drop bert.cpp use llama.cpp 3.2 as a drop-in replacement for bert.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(tests): make test more robust Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |