Mirror of https://github.com/mudler/LocalAI.git (synced 2026-04-29 03:24:49 -04:00)
Branch: master, 1202 commits

4443250756
chore: add golangci-lint with new-from-merge-base baseline (#9603)
* chore: add golangci-lint with new-from-merge-base baseline
Configure golangci-lint v2 with the standard linter set (errcheck, govet,
ineffassign, unused) plus forbidigo, which enforces the Ginkgo/Gomega-only
test convention from .agents/coding-style.md by rejecting stdlib testing
calls (t.Errorf, t.Fatalf, t.Run, ...). staticcheck is disabled — the
codebase has many pre-existing QF-style suggestions not worth gating on.
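As a sketch of the convention forbidigo gates here (hypothetical spec file; the actual rule patterns live in the golangci-lint config, not shown in this commit message):

```go
package gallery_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// The single stdlib-testing touchpoint Ginkgo itself needs.
func TestGallery(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Gallery Suite")
}

var _ = Describe("Gallery", func() {
	It("asserts with Gomega", func() {
		Expect(1 + 1).To(Equal(2)) // OK: Gomega assertion
		// t.Errorf(...), t.Fatalf(...), t.Run(...) here would be
		// flagged by forbidigo under the convention described above.
	})
})
```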
issues.new-from-merge-base = master makes the lint job a gate for new
issues only; the ~1300 pre-existing baseline stays visible via
'make lint-all' for incremental cleanup. CI runs 'make lint'.
Backends needing C/C++ headers we don't install in the lint runner are
excluded via a deny list in the Makefile (backend/go/{piper,silero-vad,
llm}, cmd/launcher). Discovery still flows through 'go list ./...', so
new packages are scanned automatically.
To make backend/go/{sam3-cpp,stablediffusion-ggml,whisper} typecheckable,
move their .cpp/.h sources into cpp/ subdirs (matching qwen3-tts-cpp /
acestep-cpp). Without this 'go list' rejects the package because Go does
not allow .cpp alongside .go without cgo.
Fix two real bugs found by lint in tests/integration/ (run only via
'make test-stores', not default CI): a stale zerolog reference left over
from the slog migration (…)

a0317d9926
refactor(tests): split app_test.go, move real-backend coverage to e2e-backends
core/http/app_test.go had grown to 1495 lines exercising three concerns at
once: HTTP-layer integration, real-backend inference (llama-gguf, tts,
stablediffusion, transformers embeddings, whisper), and service logic that
already has unit-level coverage. Each PR paid for 6 backend builds plus
real-model downloads to satisfy a single suite.
Reorg per layer:
- app_test.go (1495 -> 1003 lines) drives the mock-backend binary only.
Kept: auth, routing, gallery API, file:// import, /system, agent-jobs
HTTP plumbing, config-file model loading. Deleted real-inference specs
(llama-gguf chat, ggml completions/streaming, logprobs, logit_bias,
transcription, embeddings, External-gRPC, Stores duplicate, Model gallery
Context). Lifted Agent Jobs out of the deleted Stores Context.
- tests/e2e-backends/backend_test.go gains logprobs, logit_bias, and
no-first-token-dup specs (the latter folded into PredictStream). Two
new caps gate them so non-LLM backends opt out.
- tests/e2e-aio/e2e_test.go gains a streaming smoke under Context("text")
to catch container-level streaming regressions.
- tests/models_fixtures/ removed; all fixtures referenced testmodel.ggml.
app_test.go now writes per-Context inline mock-model YAMLs.
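A minimal sketch of the per-Context inline mock-model pattern just described (file name, YAML fields, and spec wording are illustrative, not the actual test code):

```go
package http_test

import (
	"os"
	"path/filepath"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("config-file model loading", func() {
	var modelsDir string

	BeforeEach(func() {
		modelsDir = GinkgoT().TempDir()
		// Inline mock-model YAML instead of a shared tests/models_fixtures/ file.
		yml := []byte("name: testmodel\nbackend: mock\nparameters:\n  model: testmodel\n")
		Expect(os.WriteFile(filepath.Join(modelsDir, "testmodel.yaml"), yml, 0o644)).To(Succeed())
	})

	It("serves the model declared in YAML", func() {
		// start the app against modelsDir and assert /v1/models lists "testmodel"
	})
})
```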
CI:
- test.yml + tests-e2e.yml gain paths-ignore (docs/, examples/, *.md,
backend/) so docs and backend-only PRs skip them. test.yml drops the
6-backend Build step plus TRANSFORMER_BACKEND/GO_TAGS=tts; tests-apple
drops the llama-cpp-darwin build.
- New tests-aio.yml runs the AIO container nightly + on workflow_dispatch
+ master/tags. The tests-e2e-container job moved out of test.yml so PRs
no longer pay AIO cost.
- New tests-llama-cpp-smoke job in test-extra.yml runs on every PR with
no detect-changes gate; pulls quay.io/go-skynet/local-ai-backends:
master-cpu-llama-cpp (no build on PR) and exercises predict/stream/
logprobs/logit_bias against Qwen3-0.6B. This is the PR-acceptance
real-backend gate after AIO moved to nightly. The path-gated heavy
test-extra-backend-llama-cpp wrapper appends the same caps so it
exercises the moved specs when the backend actually changes.
Makefile:
- Deleted test-models/testmodel.ggml (the wget chain), test-llama-gguf,
test-tts, test-stablediffusion, test-realtime-models. test target
drops --label-filter, HUGGINGFACE_GRPC, TRANSFORMER_BACKEND, TEST_DIR,
FIXTURES, CONFIG_FILE, MODELS_PATH, BACKENDS_PATH; depends on
build-mock-backend. test-stores keeps a focused entry point and depends
on backends/local-store. clean-tests also clears the mock-backend
binary.
Net for a typical Go-side PR: from ~25min (6 backend builds + tests + AIO)
plus ~8min e2e, down to ~5min mock-backend test + ~8min e2e + ~5-10min
llama-cpp-smoke (image pulled, not built). Docs and backend-only PRs skip the
always-on workflows entirely.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]

e5337039b0
[intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (#9543)
* Use latest oneapi-basekit image for Intel images
The current `localai/localai:master-gpu-intel` images don't work with the Intel Arc Pro B70. Updating the base_image to 2025.3.2 fixes it.
Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com>
* Update github workflow base image
---------
Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com>

13734ae9fa
feat: Add Sherpa ONNX backend for ASR and TTS (#8523)
feat(backend): Add Sherpa ONNX backend and Omnilingual ASR
Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same approach as opus/stablediffusion-ggml/whisper — a thin C shim (csrc/shim.c + shim.h → libsherpa-shim.so) wraps the bits purego can't reach directly: nested struct config writes, result-struct field reads, and the streaming TTS callback trampoline. The Go side uses opaque uintptr handles and purego.NewCallback for the TTS callback.
Supports:
- VAD via sherpa-onnx's Silero VAD
- Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC
- Online/streaming ASR: zipformer transducer with endpoint detection (AudioTranscriptionStream emits delta events during decode)
- Offline TTS: VITS (LJS, etc.)
- Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel, prefixed by a streaming WAV header
Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline ASR), streaming-zipformer-en-sherpa (low-latency streaming ASR), silero-vad-sherpa, vits-ljs-sherpa.
E2E coverage: tests/e2e-backends for offline + streaming ASR, tests/e2e for the full realtime pipeline (VAD + STT + TTS).
Assisted-by: claude-opus-4-7-1M [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
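A minimal sketch of the purego pattern this commit describes. The shim symbol names and config JSON below are hypothetical; only the mechanism (Dlopen, RegisterLibFunc, a NewCallback trampoline, opaque uintptr handles) mirrors the commit. purego.Dlopen is Unix-only:

```go
package main

import "github.com/ebitengine/purego"

var (
	shimCreateTTS   func(configJSON string) uintptr                // hypothetical shim entry point
	shimTTSGenerate func(h uintptr, text string, cb uintptr) int32 // hypothetical shim entry point
)

func main() {
	lib, err := purego.Dlopen("libsherpa-shim.so", purego.RTLD_NOW|purego.RTLD_GLOBAL)
	if err != nil {
		panic(err) // sketch only: the real backend surfaces a LoadModel error
	}
	purego.RegisterLibFunc(&shimCreateTTS, lib, "shim_create_tts")
	purego.RegisterLibFunc(&shimTTSGenerate, lib, "shim_tts_generate")

	pcm := make(chan []float32, 8)
	// Trampoline: the C side calls back with each generated PCM chunk.
	cb := purego.NewCallback(func(samples uintptr, n int32) int32 {
		_ = samples // real code copies n floats out of C memory here
		pcm <- make([]float32, n)
		return 1 // non-zero: keep generating
	})
	h := shimCreateTTS(`{"model":"vits-ljs"}`)
	shimTTSGenerate(h, "hello", cb)
}
```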

4906cbad04
feat: add biometrics UI (#9524)
* feat(react-ui): add Face & Voice Recognition pages
Expose the face and voice biometrics endpoints
(/v1/face/*, /v1/voice/*) through the React UI. Each page has four
tabs driving the six endpoints per modality: Analyze (demographics
with bounding boxes / waveform segments), Compare (verify with a
match gauge and live threshold slider), Enrollment (register /
identify / forget with a top-K matches view), Embedding (raw
vector inspector with sparkline + copy).
MediaInput supports file upload plus live capture: webcam
snap-to-canvas for face, MediaRecorder -> AudioContext ->
16-bit PCM mono WAV transcode for voice (libsndfile on the
backend only handles WAV/FLAC/OGG natively).
Sidebar gets a new Biometrics section feature-gated on
face_recognition / voice_recognition; routes are wrapped in
<RequireFeature>. No new dependencies -- Font Awesome icons
picked from the Free set.
Assisted-by: Claude:Opus 4.7
* fix(localai): accept data URI prefixes with codec/charset params
Browser MediaRecorder produces data URIs like
data:audio/webm;codecs=opus;base64,...
so the pre-';base64,' section can carry multiple parameter
segments. The `^data:([^;]+);base64,` regex in pkg/utils/base64.go
and core/http/endpoints/localai/audio.go only matched exactly one
segment, so recordings straight from the React UI's live-capture
tab failed the strip and then tripped the base64 decoder on the
leading 'data:' literal, surfacing as
"invalid audio base64: illegal base64 data at input byte 4"
Widened both regexes to `^data:[^,]+?;base64,` so any number of
';param=value' segments between the mime type and ';base64,' are
tolerated. Added a regression test covering the MediaRecorder
shape.
Assisted-by: Claude:Opus 4.7
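The widened pattern is easy to check in isolation; a small runnable demonstration (the URI value here is illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

// The widened pattern from this fix: any number of ';param=value'
// segments may sit between the mime type and ';base64,'.
var dataURIPrefix = regexp.MustCompile(`^data:[^,]+?;base64,`)

func main() {
	uri := "data:audio/webm;codecs=opus;base64,AAAA"
	fmt.Println(dataURIPrefix.ReplaceAllString(uri, "")) // "AAAA"
	// The old `^data:([^;]+);base64,` allowed exactly one pre-';base64,'
	// segment, so the codecs parameter meant no match and the raw
	// 'data:' literal reached the base64 decoder.
}
```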
* fix(insightface): scope pack ONNX loading to known manifests
LocalAI's gallery extracts buffalo_* zips flat into the models
directory, which inevitably mixes with ONNX files from other
backends (opencv face engine, MiniFASNet antispoof, WeSpeaker
voice embedding) and older buffalo pack installs. Feeding those
foreign files into insightface's model_zoo.get_model() blows up
inside the router -- it assumes a 4-D NCHW input and indexes
`input_shape[2]` on tensors that aren't shaped like a face model,
raising IndexError mid-load and leaving the backend unusable.
The router's dispatch isn't amenable to per-file try/except alone
(first-file-wins picks det_10g.onnx from buffalo_l even when the
user asked for buffalo_sc -- alphabetical order happens to favour
the wrong pack). Instead, ship an explicit manifest of the
upstream v0.7 pack contents and scope the glob to that when the
requested pack is known. The manifest is small and stable; future
packs can be added alongside or fall through to the tolerance
loop, which also swallows any remaining IndexError / ValueError
from foreign files with a clear `[insightface] skipped` stderr
line for diagnostics.
Assisted-by: Claude:Opus 4.7
* fix(speaker-recognition): extract FBank features for rank-3 ONNX encoders
Pre-exported speaker-encoder ONNX graphs come in two shapes:
- rank-2 [batch, samples] — some 3D-Speaker exports; take the raw waveform directly.
- rank-3 [batch, frames, n_mels] — WeSpeaker and most Kaldi-lineage encoders; expect pre-computed Kaldi FBank.
OnnxDirectEngine unconditionally fed `audio.reshape(1, -1)` --
correct for rank-2, IndexError-on-input_shape[3] on rank-3, which
surfaced to the UI as
"Invalid rank for input: feats Got: 2 Expected: 3"
Detect the input rank at session init and run Kaldi FBank
(80-dim, 25ms/10ms frames, dither=0.0, per-utterance CMN) before
the forward pass when rank>=3. All knobs are configurable via
backend options for encoders that deviate from defaults.
torchaudio.compliance.kaldi is already in the backend's
requirements (SpeechBrain pulls torchaudio in), so no new
dependency.
Assisted-by: Claude:Opus 4.7
* fix(biometrics): isolate face and voice vector stores
Face (ArcFace, 512-D) and voice (ECAPA-TDNN 192-D / WeSpeaker
256-D) biometric embeddings were colliding inside a single
in-memory local-store instance. Enrolling one after the other
failed with
"Try to add key with length N when existing length is M"
because local-store correctly refuses to mix dimensions in one
keyspace.
The registries were constructed with `storeName=""`, which in
StoreBackend() is just a WithModel() call. But ModelLoader's
cache is keyed on `modelID`, not `model` -- so both registries
collapsed to the same `modelID=""` slot and reused the same
backend process despite looking isolated on paper.
Three complementary fixes:
1. application.go -- give each registry a distinct default
namespace ("localai-face-biometrics" /
"localai-voice-biometrics"). The comment claimed
isolation, now it's actually enforced.
2. stores.go -- pass the storeName as both WithModelID and
WithModel so the ModelLoader cache key separates
namespaces and the loader spawns distinct processes.
3. local-store/store.go -- drop the Load() `opts.Model != ""`
guard. It was there to prevent generic model-loading loops
from picking up local-store by accident, but that auto-load
path is being retired; the guard now just blocks legitimate
namespace isolation. opts.Model is treated as a tag; the
per-tuple process isolation upstream handles discrimination.
Assisted-by: Claude:Opus 4.7
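A self-contained toy model of the collision (not LocalAI code; types and names are invented): the cache is keyed on modelID, so two registries differing only in model collapse to one process until modelID is set as well.

```go
package main

import "fmt"

type opts struct{ model, modelID string } // stand-ins for WithModel / WithModelID

type loader struct{ cache map[string]string }

func (l *loader) load(o opts) string {
	if p, ok := l.cache[o.modelID]; ok {
		return p // reused process, even though o.model differs
	}
	p := "process-for-" + o.model
	l.cache[o.modelID] = p
	return p
}

func main() {
	l := &loader{cache: map[string]string{}}
	face := l.load(opts{model: "localai-face-biometrics"})   // modelID=""
	voice := l.load(opts{model: "localai-voice-biometrics"}) // modelID="" -> same slot
	fmt.Println(face == voice)                               // true: the bug
	voice2 := l.load(opts{model: "localai-voice-biometrics", modelID: "localai-voice-biometrics"})
	fmt.Println(face == voice2)                              // false: the fix (distinct cache key)
}
```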
* fix(gallery): stale-file cleanup and upgrade-tmp directory safety
Two related robustness fixes for backend install/upgrade:
pkg/downloader/uri.go
OCI downloads passed through
if filepath.Ext(filePath) != "" ...
filePath = filepath.Dir(filePath)
which was intended to redirect file-shaped download targets
into their parent directory for OCI extraction. The heuristic
misfires on directory-shaped paths with a dot-suffix --
gallery.UpgradeBackend uses
tmpPath = "<backendsPath>/<name>.upgrade-tmp"
and Go's filepath.Ext treats ".upgrade-tmp" as an extension.
The rewrite landed the extraction at "<backendsPath>/", which
then **overwrote the real install** (backends/<name>/) with a
flat-layout file and left a stray run.sh at the top level. The
tmp dir itself stayed empty, so the validation step that
checked "<tmpPath>/run.sh" predictably failed with
"upgrade validation failed: run.sh not found in new backend"
Every manual upgrade silently corrupted the backends tree this
way. Guard the rewrite behind "target isn't already an existing
directory" -- InstallBackend / UpgradeBackend both pre-create
the target as a directory, so they get the correct behaviour;
existing file-path callers with a genuine dot-extension still
get the parent redirect.
core/gallery/backends.go
InstallBackend's MkdirAll returned ENOTDIR when something at
the target path was already a file (legacy dev builds dropped
golang backend binaries directly at `<backendsPath>/<name>`
instead of nesting them under their own subdir). That
permanently blocked reinstall and upgrade for anyone carrying
that state, since every retry hit the same error. Detect a
pre-existing non-directory, warn, and remove it before the
MkdirAll so the fresh install can write the correct nested
layout with metadata.json + run.sh.
Assisted-by: Claude:Opus 4.7
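The filepath.Ext behaviour that triggers the misfire is standard-library fact; the guard below is a sketch of the fix's shape (the real code lives in pkg/downloader/uri.go and is not reproduced here):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// extractionTarget sketches the guarded rewrite: file-shaped targets are
// redirected to their parent for OCI extraction, but a pre-created
// directory keeps its path even when its name carries a dot-suffix.
func extractionTarget(filePath string) string {
	if filepath.Ext(filePath) != "" {
		if fi, err := os.Stat(filePath); err == nil && fi.IsDir() {
			// e.g. "<backendsPath>/<name>.upgrade-tmp", pre-created by UpgradeBackend
			return filePath
		}
		return filepath.Dir(filePath)
	}
	return filePath
}

func main() {
	// Go treats everything after the last dot as an extension:
	fmt.Println(filepath.Ext("backends/llama.upgrade-tmp")) // ".upgrade-tmp"
}
```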
* fix(galleryop): refresh upgrade cache after backend ops
UpgradeChecker caches the last upgrade-check result and only
refreshes on the 6-hour tick or after an auto-upgrade cycle.
Manual upgrades (POST /api/backends/upgrade/:name) go through
the async galleryop worker, which completes the upgrade
correctly but never tells UpgradeChecker to re-check -- so
/api/backends/upgrades continued to list a just-upgraded backend
as upgradeable, indistinguishable from a failed upgrade, for up
to six hours.
Add an optional `OnBackendOpCompleted func()` hook on
GalleryService that fires after every successful install /
upgrade / delete on the backend channel (async, so a slow
callback doesn't stall the queue). startup.go wires it to
UpgradeChecker.TriggerCheck after both services exist. Result:
the upgrade banner clears within milliseconds of the worker
finishing.
Assisted-by: Claude:Opus 4.7
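A sketch of the hook's shape (the field name and the startup.go wiring are from this commit message; the method name and surrounding types are simplified assumptions):

```go
package gallery

type GalleryService struct {
	// OnBackendOpCompleted, when set, fires after every successful
	// install / upgrade / delete handled on the backend channel.
	OnBackendOpCompleted func()
}

// notifyBackendOpCompleted (hypothetical name) is called by the worker
// after a backend op succeeds.
func (g *GalleryService) notifyBackendOpCompleted() {
	if cb := g.OnBackendOpCompleted; cb != nil {
		go cb() // async: a slow callback must not stall the op queue
	}
}

// Per the commit, startup.go wires roughly:
//   galleryService.OnBackendOpCompleted = upgradeChecker.TriggerCheck
```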
* build: prepend GOPATH/bin to PATH for protogen-go
install-go-tools runs `go install` for protoc-gen-go and
protoc-gen-go-grpc, which writes them into `go env GOPATH`/bin.
That directory isn't on every dev's PATH, and protoc resolves
its code-gen plugins via PATH, so the immediately-following
protoc invocation fails with
"protoc-gen-go: program not found"
which in turn blocks `make build` and any
`make backends/%` target that depends on build.
Prepend `go env GOPATH`/bin to PATH for the protoc invocation
so the freshly-installed plugins are found without requiring a
shell-profile change.
Assisted-by: Claude:Opus 4.7
* refactor(ui-api): non-blocking backend upgrade handler with opcache
POST /api/backends/upgrade/:name used to send the ManagementOp
directly onto the unbuffered BackendGalleryChannel, which blocked
the HTTP request whenever the galleryop worker was busy with a
prior operation. The op also didn't show up in /api/operations,
so the Backends UI couldn't reflect upgrade progress on the
affected row.
Register the op in opcache immediately, wrap it in a cancellable
context, store the cancellation function on the GalleryService,
and push onto the channel from a goroutine so the handler
returns right away. Response gains a `jobID` field and a
`message` string so clients have a consistent handle regardless
of whether the op is queued or running.
Pairs with the OnBackendOpCompleted hook added in the galleryop
commit — together the UI sees the upgrade start, watches
progress via /api/operations, and drops the "upgradeable" flag
the moment the worker finishes.
Assisted-by: Claude:Opus 4.7
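A runnable toy of the non-blocking enqueue pattern (all names invented; only the shape, register in an opcache, then push onto the unbuffered channel from a goroutine, follows the commit):

```go
package main

import (
	"context"
	"fmt"
)

type op struct {
	ctx   context.Context
	jobID string
}

func main() {
	ch := make(chan op) // unbuffered, like BackendGalleryChannel
	opcache := map[string]string{}

	// Handler: register the op, then enqueue from a goroutine so the
	// HTTP response returns immediately even while the worker is busy.
	enqueue := func(name string) string {
		jobID := name + "-upgrade" // real code uses a generated operation ID
		ctx, cancel := context.WithCancel(context.Background())
		_ = cancel // stored on the GalleryService in the real code
		opcache[jobID] = "queued"
		go func() { ch <- op{ctx: ctx, jobID: jobID} }()
		return jobID
	}

	jobID := enqueue("llama-cpp")
	fmt.Println("returned immediately:", jobID, opcache[jobID])
	worker := <-ch // the galleryop worker eventually picks it up
	fmt.Println("worker got:", worker.jobID)
}
```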

f5eb13d3c2
feat(insightface): add antispoofing (liveness) detection (#9515)
* feat(insightface): add antispoofing (liveness) detection
Light up the anti_spoofing flag that was parked during the first pass.
Both FaceVerify and FaceAnalyze now run the Silent-Face MiniFASNetV2 +
MiniFASNetV1SE ensemble (~4 MB, Apache 2.0, CPU <10ms) when the flag is
set. Failed liveness on either image vetoes FaceVerify regardless of
embedding similarity. Every insightface* gallery entry now ships the
MiniFASNet ONNX weights so existing packs light up after reinstall.
Setting the flag against a model without the MiniFASNet files returns
FAILED_PRECONDITION (HTTP 412) with a clear install message — no
silent is_real=false.
FaceVerifyResponse gained per-image img{1,2}_is_real and
img{1,2}_antispoof_score (proto 9-12); FaceAnalysis's existing
is_real/antispoof_score fields are now populated. Schema fields are
pointers so they are fully absent from the JSON response when
anti_spoofing was not requested — avoids collapsing "not checked" with
"checked and fake" under Go's omitempty on bool.
Validated end-to-end over HTTP against a local install:
- verify + anti_spoofing, both real -> verified=true, score ~0.76
- verify + anti_spoofing, img2 spoof -> verified=false, img2_is_real=false
- analyze + anti_spoofing -> is_real and score per face
- flag against model without MiniFASNet -> HTTP 412 fail-loud
Assisted-by: Claude:claude-opus-4-7 go vet
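The pointer-vs-omitempty distinction is worth seeing concretely; a runnable demonstration (struct names invented):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// With a plain bool + omitempty, "checked and fake" (false) is dropped
// from the JSON exactly like "not checked", collapsing the two states.
type withBool struct {
	IsReal bool `json:"is_real,omitempty"`
}

type withPtr struct {
	IsReal *bool `json:"is_real,omitempty"`
}

func main() {
	fake := false
	a, _ := json.Marshal(withBool{IsReal: false}) // {}                 -- ambiguous
	b, _ := json.Marshal(withPtr{IsReal: &fake})  // {"is_real":false}  -- checked and fake
	c, _ := json.Marshal(withPtr{})               // {}                 -- genuinely absent
	fmt.Println(string(a), string(b), string(c))
}
```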
* test(insightface): wire test target into test-extra
The root Makefile's `test-extra` already runs
`$(MAKE) -C backend/python/insightface test`, but the backend's
Makefile never defined the target — so the command silently errored
and the suite was never executed in CI. Adding the two-line target
(matching ace-step/Makefile) hooks `test.sh` → `runUnittests` →
`python -m unittest test.py`, which discovers both the pre-existing
engine classes (InsightFaceEngineTest, OnnxDirectEngineTest) and the
new AntispoofingTest. Each class skips gracefully when its weights
can't be downloaded from a network-restricted runner.
Assisted-by: Claude:claude-opus-4-7
* test(insightface): exercise antispoofing in e2e-backends (both paths)
Add a `face_antispoof` capability to the Ginkgo e2e suite and extend
the existing FaceVerify + FaceAnalyze specs with liveness assertions
covering BOTH paths:
real fixture -> is_real=true, score>0, verified stays true
spoof fixture -> is_real=false, verified vetoed to false
The spoof fixture is upstream's own `image_F2.jpg` (via the yakhyo
mirror) — verified locally against the MiniFASNetV2+V1SE ensemble to
classify as is_real=false with score ~0.013. That makes the assertion
deterministic across CI runs; synthetic/derived spoofs fool the model
unpredictably and would be flaky.
Makefile wires it up end-to-end:
- New INSIGHTFACE_ANTISPOOF_* cache dir + two ONNX downloads with
pinned SHAs, matching the gallery entries.
- insightface-antispoof-models target shared by both backend configs.
- FACE_SPOOF_IMAGE_URL passed via BACKEND_TEST_FACE_SPOOF_IMAGE_URL.
- Both e2e targets (buffalo-sc + opencv) now:
* depend on insightface-antispoof-models
* pass antispoof_v2_onnx / antispoof_v1se_onnx in BACKEND_TEST_OPTIONS
* include face_antispoof in BACKEND_TEST_CAPS
backend_test.go adds the new capability constant and a faceSpoofFile
fixture resolved the same way as faceFile1/2/3. Spoof assertions are
gated on both capFaceAntispoof AND faceSpoofFile being set, so a test
config that omits the spoof fixture degrades gracefully to "real path
only" instead of failing.
Assisted-by: Claude:claude-opus-4-7 go vet

181ebb6df4
feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend
Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.
The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by …).

20baec77ab
feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis
Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.
New gRPC RPCs (backend/backend.proto):
* FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
* FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse
Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.
New HTTP endpoints under /v1/face/:
* verify — 1:1 image pair same-person decision
* analyze — per-face age + gender (emotion/race reserved)
* register — 1:N enrollment; stores embedding in vector store
* identify — 1:N recognition; detect → embed → StoresFind
* forget — remove a registered face by opaque ID
Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).
New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.
Gallery (backend/index.yaml) ships three entries:
* insightface-buffalo-l — SCRFD-10GF + ArcFace R50 + genderage
(~326MB pre-baked; non-commercial research use only)
* insightface-opencv — YuNet + SFace (~40MB pre-baked; Apache 2.0)
* insightface-buffalo-s — SCRFD-500MF + MBF (runtime download; non-commercial)
Python backend (backend/python/insightface/):
* engines.py — FaceEngine protocol with InsightFaceEngine and
OnnxDirectEngine; resolves model paths relative to the backend
directory so the same gallery config works in docker-scratch and
in the e2e-backends rootfs-extraction harness.
* backend.py — gRPC servicer implementing Health, LoadModel, Status,
Embedding, Detect, FaceVerify, FaceAnalyze.
* install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
backend directory so first-run is offline-clean (the final scratch
image only preserves files under /<backend>/).
* test.py — parametrized unit tests over both engines.
Tests:
* Registry unit tests (go test -race ./core/services/facerecognition/...)
— in-memory fake grpc.Backend, table-driven, covers register/
identify/forget/error paths + concurrent access.
* tests/e2e-backends/backend_test.go extended with face caps
(face_detect, face_embed, face_verify, face_analyze); relative
ordering + configurable verifyCeiling per engine.
* Makefile targets: test-extra-backend-insightface-buffalo-l,
-opencv, and the -all aggregate.
* CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
auto-triggered by changes under backend/python/insightface/.
Docs:
* docs/content/features/face-recognition.md — feature page with
license table, quickstart (defaults to the commercial-safe model),
models matrix, API reference, 1:N workflow, storage caveats.
* Cross-refs in object-detection.md, stores.md, embeddings.md, and
whats-new.md.
* Contributor README at backend/python/insightface/README.md.
Verified end-to-end:
* buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
face_verify, face_analyze).
* opencv: 5/5 specs (same minus face_analyze — SFace has no
demographic head; correctly skipped via BACKEND_TEST_CAPS).
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): move engine selection to model gallery, collapse backend entries
The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.
Correct split:
* `backend/index.yaml` now has ONE `insightface` backend entry
shipping the CPU + CUDA 12 container images. The Python backend
bundles both the non-commercial insightface model packs
(buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
weights (YuNet + SFace); the active engine is selected at
LoadModel time via `options: ["engine:..."]`.
* `gallery/index.yaml` gains three model entries —
`insightface-buffalo-l`, `insightface-opencv`,
`insightface-buffalo-s` — each setting the appropriate
`overrides.backend` + `overrides.options` so installing one
actually gives the user the intended engine. This matches how
`rfdetr-base` lives in the model gallery against the `rfdetr`
backend.
The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): cover all supported models in the gallery + drop weight baking
Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.
Gallery now has seven `insightface-*` model entries (gallery/index.yaml):
insightface (family) — non-commercial research use
• buffalo-l (326MB) — SCRFD-10GF + ResNet50 + genderage, default
• buffalo-m (313MB) — SCRFD-2.5GF + ResNet50 + genderage
• buffalo-s (159MB) — SCRFD-500MF + MBF + genderage
• buffalo-sc (16MB) — SCRFD-500MF + MBF, recognition only
(no landmarks, no demographics — analyze
returns empty attributes)
• antelopev2 (407MB) — SCRFD-10GF + ResNet100@Glint360K + genderage
OpenCV Zoo family — Apache 2.0 commercial-safe
• opencv — YuNet + SFace fp32 (~40MB)
• opencv-int8 — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)
Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:
* OpenCV variants list `files:` with ONNX URIs + SHA-256, so
`local-ai models install insightface-opencv` pulls them into the
models directory exactly like any other gallery-managed model.
* insightface packs (upstream distributes .zip archives only, not
individual ONNX files) auto-download on first LoadModel via
FaceAnalysis' built-in machinery, rooted at the LocalAI models
directory so they live alongside everything else — same pattern
`rfdetr` uses with `inference.get_model()`.
Backend changes (backend/python/insightface/):
* backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
LocalAI models directory) to engines via a `_model_dir` hint.
This replaces the earlier ModelFile-dirname approach; ModelPath
is the canonical "models directory" variable set by the Go loader
(pkg/model/initializers.go:144) and is always populated.
* engines.py::_resolve_model_path — picks up `model_dir` and searches
it (plus basename-in-model-dir) before falling back to the dev
script-dir. This is how OnnxDirectEngine finds gallery-downloaded
YuNet/SFace files by filename only.
* engines.py::_flatten_insightface_pack — new helper that works
around an upstream packaging inconsistency: buffalo_l/s/sc zips
expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
files in a redundant `<name>/` directory. insightface's own
loader looks one level too shallow and fails. We call
`ensure_available()` explicitly, flatten if nested, then hand to
FaceAnalysis.
* engines.py::InsightFaceEngine.prepare — root-resolution order now
includes the `_model_dir` hint so packs download into the LocalAI
models directory by default.
* install.sh — no longer pre-downloads any weights. Everything is
gallery-managed now.
* smoke.py (new) — parametrized smoke test that iterates over every
gallery configuration, simulating the LocalAI install flow
(creates a models dir, fetches OpenCV files with checksum
verification, lets insightface auto-download its packs), then
runs detect + embed + verify (+ analyze where supported) through
the in-process BackendServicer.
* test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
paths; downloads ONNX files to a temp dir at setUpClass time and
passes ModelPath accordingly.
Registry change (core/services/facerecognition/store_registry.go):
* `dim=0` in NewStoreRegistry now means "accept whatever dimension
arrives" — needed because the backend supports 512-d ArcFace/MBF
and 128-d SFace via the same Registry. A non-zero dim still fails
fast with ErrDimensionMismatch.
* core/application plumbs `faceEmbeddingDim = 0`, explaining the
rationale in the comment.
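A toy version of the dim contract (not the actual store_registry.go code; names simplified):

```go
package main

import (
	"errors"
	"fmt"
)

var ErrDimensionMismatch = errors.New("embedding dimension mismatch")

type registry struct{ dim int } // dim==0 means "accept whatever dimension arrives"

func (r *registry) check(vec []float32) error {
	if r.dim != 0 && len(vec) != r.dim {
		return ErrDimensionMismatch // non-zero dim still fails fast
	}
	return nil
}

func main() {
	flexible := &registry{dim: 0}
	strict := &registry{dim: 512}
	fmt.Println(flexible.check(make([]float32, 128))) // <nil>: 128-d SFace accepted
	fmt.Println(strict.check(make([]float32, 128)))   // error: dimension mismatch
}
```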
Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.
Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:
PASS: insightface-buffalo-l faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-sc faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-s faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-m faces=6 dim=512 same-dist=0.000
PASS: insightface-antelopev2 faces=6 dim=512 same-dist=0.000
PASS: insightface-opencv faces=6 dim=128 same-dist=0.000
PASS: insightface-opencv-int8 faces=6 dim=128 same-dist=0.000
7/7 passed
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim
CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:
LoadModel failed: Failed to load face engine:
OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx
Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.
Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).
Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs
The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.
Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).
Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).
Response:
{
  "embedding": [<dim> floats, L2-normed],
  "dim": int,  // 512 for ArcFace R50 / MBF, 128 for SFace
  "model": "<name>"
}
Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings
is text-only and point image users at /v1/face/embed instead.
Assisted-by: Claude:claude-opus-4-7
* fix(http): map malformed image input + gRPC status codes to proper 4xx
Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.
Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:
* decodeImageInput(s)
Wraps utils.GetContentURIAsBase64 and turns any failure
(invalid URL, not a data-URI, download error, etc.) into
echo.NewHTTPError(400, "invalid image input: ...").
* mapBackendError(err)
Inspects the gRPC status on a backend call error and maps:
INVALID_ARGUMENT → 400 Bad Request
NOT_FOUND → 404 Not Found
FAILED_PRECONDITION → 412 Precondition Failed
Unimplemented → 501 Not Implemented
All other codes fall through unchanged (still 500).
Before, my 1×1 PNG error-path test returned:
HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
HTTP 400 "failed to decode one or both images"
Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.
Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.
Assisted-by: Claude:claude-opus-4-7
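A sketch of mapBackendError's shape under the mapping listed above (the helper name, echo usage, and status codes come from this commit message; the exact body is assumed, not copied from the repo):

```go
package localai

import (
	"net/http"

	"github.com/labstack/echo/v4"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// mapBackendError translates gRPC status codes on backend-call errors
// into client-facing HTTP errors; everything else stays a 500.
func mapBackendError(err error) error {
	st, ok := status.FromError(err)
	if !ok {
		return err // not a gRPC status: fall through unchanged
	}
	switch st.Code() {
	case codes.InvalidArgument:
		return echo.NewHTTPError(http.StatusBadRequest, st.Message())
	case codes.NotFound:
		return echo.NewHTTPError(http.StatusNotFound, st.Message())
	case codes.FailedPrecondition:
		return echo.NewHTTPError(http.StatusPreconditionFailed, st.Message())
	case codes.Unimplemented:
		return echo.NewHTTPError(http.StatusNotImplemented, st.Message())
	}
	return err
}
```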
* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis
Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.
Two changes:
1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
`files:` list with the pack zip's URI + SHA-256. `local-ai models
install insightface-buffalo-l` now fetches the zip, verifies the
hash, and extracts it into the models directory. No more reliance
on insightface's library-internal `ensure_available()` auto-download
or its hardcoded `BASE_REPO_URL`.
2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
the FaceAnalysis wrapper and drives insightface's `model_zoo`
directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
route each through `model_zoo.get_model()`, build a
`{taskname: model}` dict, loop per-face at inference — are
reimplemented in `InsightFaceEngine`. The actual inference classes
(RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
insightface's — we only replicate the glue, so drift risk against
upstream is minimal.
Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
layout that doesn't match what LocalAI's zip extraction produces.
LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
`<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
`_locate_insightface_pack` helper searches both locations plus
legacy paths and returns whichever has ONNX files. Replaces the
earlier `_flatten_insightface_pack` helper (which tried to fight
FaceAnalysis's layout expectations; now we just find the files
wherever they are).
Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched
The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".
Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.
Face analyze cap dropped since buffalo_sc has no age/gender head.
Assisted-by: Claude:claude-opus-4-7[1m]
* feat(face-recognition): surface face-recognition in advertised feature maps
The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:
* api_instructions — the machine-readable capability index at
GET /api/instructions. Added `face-recognition` as a dedicated
instruction area with an intro that calls out the in-memory
registry caveat and the /v1/face/embed vs /v1/embeddings split.
* auth/permissions — added FeatureFaceRecognition constant, routed
all six face endpoints through it so admins can gate them per-user
like any other API feature. Default ON (matches the other API
features).
* React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
follow-up (noted in the plan).
Instruction count bumped 9 → 10; test updated.
Assisted-by: Claude:claude-opus-4-7[1m]
* docs(agents): capture advertising-surface steps in the endpoint guide
Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.
Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags, /api/
instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.
Assisted-by: Claude:claude-opus-4-7[1m]

a0cbc46be9
refactor(tinygrad): reuse tinygrad.apps.llm instead of vendored Transformer (#9380)
Drop the 295-line vendor/llama.py fork in favor of `tinygrad.apps.llm`, which now provides the Transformer blocks, GGUF loader (incl. Q4/Q6/Q8 quantization), KV-cache and generate loop we were maintaining ourselves.
What changed:
- New vendor/appsllm_adapter.py (~90 LOC) — HF -> GGUF-native state-dict keymap, Transformer kwargs builder, `_embed_hidden` helper, and a hard rejection of qkv_bias models (Qwen2 / 2.5 are no longer supported; the apps.llm Transformer ties `bias=False` on Q/K/V projections).
- backend.py routes both safetensors and GGUF paths through apps.llm.Transformer. Generation now delegates to its (greedy-only) `generate()`; Temperature / TopK / TopP / RepetitionPenalty are still accepted on the wire but ignored — documented in the module docstring.
- Jinja chat render now passes `enable_thinking=False` so Qwen3's reasoning preamble doesn't eat the tool-call token budget on small models.
- Embedding path uses `_embed_hidden` (block stack + output_norm) rather than the custom `embed()` method we were carrying on the vendored Transformer.
- test.py gains TestAppsLLMAdapter covering the keymap rename, tied embedding fallback, unknown-key skipping, and qkv_bias rejection.
- Makefile fixtures move from Qwen/Qwen2.5-0.5B-Instruct to Qwen/Qwen3-0.6B (apps.llm-compatible) and tool_parser from qwen3_xml to hermes (the HF chat template emits hermes-style JSON tool calls).
Verified with the docker-backed targets:
test-extra-backend-tinygrad 5/5 PASS
test-extra-backend-tinygrad-embeddings 3/3 PASS
test-extra-backend-tinygrad-whisper 4/4 PASS
test-extra-backend-tinygrad-sd 3/3 PASS

b4e30692a2
feat(backends): add sglang (#9359)
* feat(backends): add sglang
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(sglang): force AVX-512 CXXFLAGS and disable CI e2e job
sgl-kernel's shm.cpp uses __m512 AVX-512 intrinsics unconditionally; -march=native fails on CI runners without AVX-512 in /proc/cpuinfo. Force -march=sapphirerapids so the build always succeeds, matching sglang upstream's docker/xeon.Dockerfile recipe. The resulting binary still requires an AVX-512 capable CPU at runtime, so disable tests-sglang-grpc in test-extra.yml for the same reason tests-vllm-grpc is disabled. Local runs with make test-extra-backend-sglang still work on hosts with the right SIMD baseline.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(sglang): patch CMakeLists.txt instead of CXXFLAGS for AVX-512
CXXFLAGS with -march=sapphirerapids was being overridden by add_compile_options(-march=native) in sglang's CPU CMakeLists.txt, since CMake appends those flags after CXXFLAGS. Sed-patch the CMakeLists.txt directly after cloning to replace -march=native.
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

6f0051301b
feat(backend): add tinygrad multimodal backend (experimental) (#9364)
* feat(backend): add tinygrad multimodal backend
Wire tinygrad as a new Python backend covering LLM text generation with
native tool-call extraction, embeddings, Stable Diffusion 1.x image
generation, and Whisper speech-to-text from a single self-contained
container.
Backend (`backend/python/tinygrad/`):
- `backend.py` gRPC servicer with LLM Predict/PredictStream (auto-detects
Llama / Qwen2 / Mistral architecture from `config.json`, supports
safetensors and GGUF), Embedding via mean-pooled last hidden state,
GenerateImage via the vendored SD1.x pipeline, AudioTranscription +
AudioTranscriptionStream via the vendored Whisper inference loop, plus
Tokenize / ModelMetadata / Status / Free.
- Vendored upstream model code under `vendor/` (MIT, headers preserved):
llama.py with an added `qkv_bias` flag for Qwen2-family bias support
and an `embed()` method that returns the last hidden state, plus
clip.py, unet.py, stable_diffusion.py (trimmed to drop the MLPerf
training branch that pulls `mlperf.initializers`), audio_helpers.py
and whisper.py (trimmed to drop the pyaudio listener).
- Pluggable tool-call parsers under `tool_parsers/`: hermes (Qwen2.5 /
Hermes), llama3_json (Llama 3.1+), qwen3_xml (Qwen 3), mistral
(Mistral / Mixtral). Auto-selected from model architecture or `Options`.
- `install.sh` pins Python 3.11.14 (tinygrad >=0.12 needs >=3.11; the
default portable python is 3.10).
- `package.sh` bundles libLLVM.so.1 + libedit/libtinfo/libgomp/libsndfile
into the scratch image. `run.sh` sets `CPU_LLVM=1` and `LLVM_PATH` so
tinygrad's CPU device uses the in-process libLLVM JIT instead of
shelling out to the missing `clang` binary.
- Local unit tests for Health and the four parsers in `test.py`.
Build wiring:
- Root `Makefile`: `.NOTPARALLEL`, `prepare-test-extra`, `test-extra`,
`BACKEND_TINYGRAD = tinygrad|python|.|false|true`,
docker-build-target eval, and `docker-build-backends` aggregator.
- `.github/workflows/backend.yml`: cpu / cuda12 / cuda13 build matrix
entries (mirrors the transformers backend placement).
- `backend/index.yaml`: `&tinygrad` meta + cpu/cuda12/cuda13 image
entries (latest + development).
E2E test wiring:
- `tests/e2e-backends/backend_test.go` gains an `image` capability that
exercises GenerateImage and asserts a non-empty PNG is written to
`dst`. New `BACKEND_TEST_IMAGE_PROMPT` / `BACKEND_TEST_IMAGE_STEPS`
knobs.
- Five new make targets next to `test-extra-backend-vllm`:
- `test-extra-backend-tinygrad` — Qwen2.5-0.5B-Instruct + hermes,
mirrors the vllm target 1:1 (5/9 specs in ~57s).
- `test-extra-backend-tinygrad-embeddings` — same model, embeddings
via LLM hidden state (3/9 in ~10s).
- `test-extra-backend-tinygrad-sd` — stable-diffusion-v1-5 mirror,
health/load/image (3/9 in ~10min, 4 diffusion steps on CPU).
- `test-extra-backend-tinygrad-whisper` — openai/whisper-tiny.en
against jfk.wav from whisper.cpp samples (4/9 in ~49s).
- `test-extra-backend-tinygrad-all` aggregate.
All four targets land green on the first MVP pass: 15 specs total, 0
failures across LLM+tools, embeddings, image generation, and speech
transcription.
* refactor(tinygrad): collapse to a single backend image
tinygrad generates its own GPU kernels (PTX renderer for CUDA, the
autogen ctypes wrappers for HIP / Metal / WebGPU) and never links
against cuDNN, cuBLAS, or any toolkit-version-tied library. The only
runtime dependency that varies across hosts is the driver's libcuda.so.1
/ libamdhip64.so, which are injected into the container at run time by
the nvidia-container / rocm runtimes. So unlike torch- or vLLM-based
backends, there is no reason to ship per-CUDA-version images.
- Drop the cuda12-tinygrad and cuda13-tinygrad build-matrix entries
from .github/workflows/backend.yml. The sole remaining entry is
renamed to -tinygrad (from -cpu-tinygrad) since it is no longer
CPU-only.
- Collapse backend/index.yaml to a single meta + development pair.
The meta anchor carries the latest uri directly; the development
entry points at the master tag.
- run.sh picks the tinygrad device at launch time by probing
/usr/lib/... for libcuda.so.1 / libamdhip64.so. When libcuda is
visible we set CUDA=1 + CUDA_PTX=1 so tinygrad uses its own PTX
renderer (avoids any nvrtc/toolkit dependency); otherwise we fall
back to HIP or CLANG. CPU_LLVM=1 + LLVM_PATH keep the in-process
libLLVM JIT for the CLANG path.
- backend.py's _select_tinygrad_device() is trimmed to a CLANG-only
fallback since production device selection happens in run.sh.
Re-ran test-extra-backend-tinygrad after the change:
Ran 5 of 9 Specs in 56.541 seconds — 5 Passed, 0 Failed

95efb8a562
feat(backend): add turboquant llama.cpp-fork backend (#9355)
* feat(backend): add turboquant llama.cpp-fork backend
turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.
Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.
scripts/changed-backends.js gets a special-case so edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.
* feat(turboquant): carry upstream patches against fork API drift
turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.
Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. Simplifies the build flow too: instead of hopping through
llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now
drives the copied Makefile directly (clone -> patch -> build).
Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.
* docs: add turboquant backend section + clarify cache_type_k/v
Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.
Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.
* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion
The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.
* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e
The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types — turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time, meaning
nothing the LocalAI gRPC layer accepted was actually fork-specific.
Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.
Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.
Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.
* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3
The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which ships in coreutils and is already available everywhere
the rest of the wrapper Makefile runs.
* Apply suggestion from @mudler
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

87e6de1989
feat: wire transcription for llama.cpp, add streaming support (#9353)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

016da02845
feat: refactor shared helpers and enhance MLX backend functionality (#9335)
* refactor(backends): extract python_utils + add mlx_utils shared helpers
Move parse_options() and messages_to_dicts() out of vllm_utils.py into a
new framework-agnostic python_utils.py, and re-export them from vllm_utils
so existing vllm / vllm-omni imports keep working.
Add mlx_utils.py with split_reasoning() and parse_tool_calls() — ported
from mlx_vlm/server.py's process_tool_calls. These work with any
mlx-lm / mlx-vlm tool module (anything exposing tool_call_start,
tool_call_end, parse_tool_call). Used by the mlx and mlx-vlm backends in
later commits to emit structured ChatDelta.tool_calls without
reimplementing per-model parsing.
Shared smoke tests confirm:
- parse_options round-trips bool/int/float/string
- vllm_utils re-exports are identity-equal to python_utils originals
- mlx_utils parse_tool_calls handles <tool_call>...</tool_call> with a
shim module and produces a correctly-indexed list with JSON arguments
- mlx_utils split_reasoning extracts <think> blocks and leaves clean
content
* feat(mlx): wire native tool parsers + ChatDelta + token usage + logprobs
Bring the MLX backend up to the same structured-output contract as vLLM
and llama.cpp: emit Reply.chat_deltas so the OpenAI HTTP layer sees
tool_calls and reasoning_content, not just raw text.
Key insight: mlx_lm.load() returns a TokenizerWrapper that already auto-
detects the right tool parser from the model's chat template
(_infer_tool_parser in mlx_lm/tokenizer_utils.py). The wrapper exposes
has_tool_calling, has_thinking, tool_parser, tool_call_start,
tool_call_end, think_start, think_end — no user configuration needed,
unlike vLLM.
Changes in backend/python/mlx/backend.py:
- Imports: replace inline parse_options / messages_to_dicts with the
shared helpers from python_utils. Pull split_reasoning / parse_tool_calls
from the new mlx_utils shared module.
- LoadModel: log the auto-detected has_tool_calling / has_thinking /
tool_parser_type for observability. Drop the local is_float / is_int
duplicates.
- _prepare_prompt: run request.Messages through messages_to_dicts so
tool_call_id / tool_calls / reasoning_content survive the conversion,
and pass tools=json.loads(request.Tools) + enable_thinking=True (when
request.Metadata says so) to apply_chat_template. Falls back on
TypeError for tokenizers whose template doesn't accept those kwargs.
- _build_generation_params: return an additional (logits_params,
stop_words) pair. Maps RepetitionPenalty / PresencePenalty /
FrequencyPenalty to mlx_lm.sample_utils.make_logits_processors and
threads StopPrompts through to post-decode truncation.
- New _tool_module_from_tokenizer / _finalize_output / _truncate_at_stop
helpers. _finalize_output runs split_reasoning when has_thinking is
true and parse_tool_calls (using a SimpleNamespace shim around the
wrapper's tool_parser callable) when has_tool_calling is true, then
extracts prompt_tokens, generation_tokens and (best-effort) logprobs
from the last GenerationResponse chunk.
- Predict: use make_logits_processors, accumulate text + last_response,
finalize into a structured Reply carrying chat_deltas,
prompt_tokens, tokens, logprobs. Early-stops on user stop sequences.
- PredictStream: per-chunk Reply still carries raw message bytes for
back-compat but now also emits chat_deltas=[ChatDelta(content=delta)].
On loop exit, emit a terminal Reply with structured
reasoning_content / tool_calls / token counts / logprobs — so the Go
side sees tool calls without needing the regex fallback.
- TokenizeString RPC: uses the TokenizerWrapper's encode(); returns
length + tokens or FAILED_PRECONDITION if the model isn't loaded.
- Free RPC: drops model / tokenizer / lru_cache, runs gc.collect(),
calls mx.metal.clear_cache() when available, and best-effort clears
torch.cuda as a belt-and-suspenders.
* feat(mlx-vlm): mirror MLX parity (tool parsers + ChatDelta + samplers)
Same treatment as the MLX backend: emit structured Reply.chat_deltas,
tool_calls, reasoning_content, token counts and logprobs, and extend
sampling parameter coverage beyond the temp/top_p pair the backend
used to handle.
- Imports: drop the inline is_float/is_int helpers, pull parse_options /
messages_to_dicts from python_utils and split_reasoning /
parse_tool_calls from mlx_utils. Also import make_sampler and
make_logits_processors from mlx_lm.sample_utils — mlx-vlm re-uses them.
- LoadModel: use parse_options; call mlx_vlm.tool_parsers._infer_tool_parser
/ load_tool_module to auto-detect a tool module from the processor's
chat_template. Stash think_start / think_end / has_thinking so later
finalisation can split reasoning blocks without duck-typing on each
call. Logs the detected parser type.
- _prepare_prompt: convert proto Messages via messages_to_dicts (so
tool_call_id / tool_calls survive), pass tools=json.loads(request.Tools)
and enable_thinking=True to apply_chat_template when present, fall
back on TypeError for older mlx-vlm versions. Also handle the
prompt-only + media and empty-prompt + media paths consistently.
- _build_generation_params: return (max_tokens, sampler_params,
logits_params, stop_words). Maps repetition_penalty / presence_penalty /
frequency_penalty and passes them through make_logits_processors (see
the sketch after this block).
- _finalize_output / _truncate_at_stop: common helper used by Predict
and PredictStream to split reasoning, run parse_tool_calls against the
auto-detected tool module, build ToolCallDelta list, and extract token
counts + logprobs from the last GenerationResult.
- Predict / PredictStream: switch from mlx_vlm.generate to mlx_vlm.stream_generate
in both paths, accumulate text + last_response, pass sampler and
logits_processors through, emit content-only ChatDelta per streaming
chunk followed by a terminal Reply carrying reasoning_content,
tool_calls, prompt_tokens, tokens and logprobs. Non-streaming Predict
returns the same structured Reply shape.
- New helper _collect_media extracted from the duplicated base64 image /
audio decode loop.
- New TokenizeString RPC using the processor's tokenizer.encode and
Free RPC that drops model/processor/config, runs gc + Metal cache
clear + best-effort torch.cuda cache clear.
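For reference, a sketch of the sampler wiring both MLX backends now
share; make_sampler and make_logits_processors are real
mlx_lm.sample_utils helpers, but the kwargs they accept differ across
mlx-lm releases, so treat the exact calls as assumptions:

```python
from mlx_lm.sample_utils import make_sampler, make_logits_processors

# Proto sampling options mapped onto mlx-lm's helpers (values are
# placeholders for the request's Temperature / TopP / RepetitionPenalty).
sampler = make_sampler(temp=0.7, top_p=0.9)
logits_processors = make_logits_processors(repetition_penalty=1.1)
```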
* feat(importer/mlx): auto-set tool_parser/reasoning_parser on import
Mirror what core/gallery/importers/vllm.go does: after applying the
shared inference defaults, look up the model URI in parser_defaults.json
and append matching tool_parser:/reasoning_parser: entries to Options.
The MLX backends auto-detect tool parsers from the chat template at
runtime so they don't actually consume these options — but surfacing
them in the generated YAML:
- keeps the import experience consistent with vllm
- gives users a single visible place to override
- documents the intended parser for a given model family
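Illustratively, the longest-pattern-first lookup behaves like this
sketch (the real matcher is Go code in core; the JSON content here is
a hypothetical placeholder):

```python
import json

def match_parser_defaults(uri: str, defaults: dict) -> dict:
    # Every pattern that appears in the model URI is a candidate;
    # the longest one wins, so specific families shadow generic ones.
    matches = [p for p in defaults if p.lower() in uri.lower()]
    return defaults[max(matches, key=len)] if matches else {}

defaults = json.loads('{"qwen2.5": {"tool_parser": "hermes"}}')
print(match_parser_defaults("Qwen/Qwen2.5-0.5B-Instruct", defaults))
# -> {'tool_parser': 'hermes'}
```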
* test(mlx): add helper unit tests + TokenizeString/Free + e2e make targets
- backend/python/mlx/test.py: add TestSharedHelpers with server-less
unit tests for parse_options, messages_to_dicts, split_reasoning and
parse_tool_calls (using a SimpleNamespace shim to fake a tool module
without requiring a model). Plus test_tokenize_string and test_free
RPC tests that load a tiny MLX-quantized Llama and exercise the new
RPCs end-to-end.
- backend/python/mlx-vlm/test.py: same helper unit tests + cleanup of
the duplicated import block at the top of the file.
- Makefile: register BACKEND_MLX and BACKEND_MLX_VLM (they were missing
from the docker-build-target eval list — only mlx-distributed had a
generated target before). Add test-extra-backend-mlx and
test-extra-backend-mlx-vlm convenience targets that build the
respective image and run tests/e2e-backends with the tools capability
against mlx-community/Qwen2.5-0.5B-Instruct-4bit. The MLX backend
auto-detects the tool parser from the chat template so no
BACKEND_TEST_OPTIONS is needed (unlike vllm).
* fix(libbackend): don't pass --copies to venv unless PORTABLE_PYTHON=true
backend/python/common/libbackend.sh:ensureVenv() always invoked
'python -m venv --copies', but macOS system python (and some other
builds) refuses with:
Error: This build of python cannot create venvs without using symlinks
--copies only matters when _makeVenvPortable later relocates the venv,
which only happens when PORTABLE_PYTHON=true. Make --copies conditional
on that flag, falling back to the default (symlinked) venv otherwise.
Caught while bringing up the mlx backend on Apple Silicon — the same
build path is used by every Python backend with USE_PIP=true.
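The fix itself is shell, but Python's stdlib exposes the same knob; a
sketch of the equivalent decision (the PORTABLE_PYTHON semantics are
assumed from libbackend.sh, and the path is illustrative):

```python
import os
import venv

portable = os.environ.get("PORTABLE_PYTHON") == "true"
# --copies corresponds to symlinks=False; only needed when the venv must
# survive relocation (_makeVenvPortable). Otherwise keep the default
# symlinked layout, which macOS system python requires.
builder = venv.EnvBuilder(with_pip=True, symlinks=not portable)
# builder.create("backend/venv")  # illustrative only
```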
* fix(mlx): support mlx-lm 0.29.x tool calling + drop deprecated clear_cache
The released mlx-lm 0.29.x ships a much simpler tool-calling API than
HEAD: TokenizerWrapper detects the <tool_call>...</tool_call> markers
from the tokenizer vocab and exposes has_tool_calling /
tool_call_start / tool_call_end, but does NOT expose a tool_parser
callable on the wrapper and does NOT ship a mlx_lm.tool_parsers
subpackage at all (those only exist on main).
Caught while running the smoke test on Apple Silicon with the
released mlx-lm 0.29.1: tokenizer.tool_parser raised AttributeError
(falling through to the underlying HF tokenizer), so
_tool_module_from_tokenizer always returned None and tool calls slipped
through as raw <tool_call>...</tool_call> text in Reply.message instead
of being parsed into ChatDelta.tool_calls.
Fix: when has_tool_calling is True but tokenizer.tool_parser is missing,
default the parse_tool_call callable to json.loads(body.strip()) — that's
exactly what mlx_lm.tool_parsers.json_tools.parse_tool_call does on HEAD
and covers the only format 0.29 detects (<tool_call>JSON</tool_call>);
a sketch follows at the end of this entry.
Future mlx-lm releases that ship more parsers will be picked up
automatically via the tokenizer.tool_parser attribute when present.
Also tighten the LoadModel logging — the old log line read
init_kwargs.get('tool_parser_type') which doesn't exist on 0.29 and
showed None even when has_tool_calling was True. Log the actual
tool_call_start / tool_call_end markers instead.
While here, switch Free()'s Metal cache clear from the deprecated
mx.metal.clear_cache to mx.clear_cache (mlx >= 0.30), with a
fallback for older releases. Mirrored to the mlx-vlm backend.
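A sketch of the resulting resolution order, using the same
SimpleNamespace shim as the mlx backend (helper name is illustrative):

```python
import json
from types import SimpleNamespace

def tool_module_from_tokenizer(tokenizer):
    # mlx-lm main: the wrapper exposes a tool_parser callable.
    parser = getattr(tokenizer, "tool_parser", None)
    if parser is not None:
        return SimpleNamespace(parse_tool_call=parser)
    # mlx-lm 0.29.x: markers only, the body is plain JSON.
    if getattr(tokenizer, "has_tool_calling", False):
        return SimpleNamespace(
            parse_tool_call=lambda body: json.loads(body.strip()))
    return None
```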
* feat(mlx-distributed): mirror MLX parity (tool calls + ChatDelta + sampler)
Same treatment as the mlx and mlx-vlm backends: emit Reply.chat_deltas
with structured tool_calls / reasoning_content / token counts /
logprobs, expand sampling parameter coverage beyond temp+top_p, and
add the missing TokenizeString and Free RPCs.
Notes specific to mlx-distributed:
- Rank 0 is the only rank that owns a sampler — workers participate in
the pipeline-parallel forward pass via mx.distributed and don't
re-implement sampling. So the new logits_params (repetition_penalty,
presence_penalty, frequency_penalty) and stop_words apply on rank 0
only; we don't need to extend coordinator.broadcast_generation_params,
which still ships only max_tokens / temperature / top_p to workers
(everything else is a rank-0 concern).
- Free() now broadcasts CMD_SHUTDOWN to workers when a coordinator is
active, so they release the model on their end too. The constant is
already defined and handled by the existing worker loop in
backend.py:633 (CMD_SHUTDOWN = -1); see the sketch at the end of this
entry.
- Drop the locally-defined is_float / is_int / parse_options trio in
favor of python_utils.parse_options, re-exported under the module
name for back-compat with anything that imported it directly.
- _prepare_prompt: route through messages_to_dicts so tool_call_id /
tool_calls / reasoning_content survive, pass tools=json.loads(
request.Tools) and enable_thinking=True to apply_chat_template, fall
back on TypeError for templates that don't accept those kwargs.
- New _tool_module_from_tokenizer (with the json.loads fallback for
mlx-lm 0.29.x), _finalize_output, _truncate_at_stop helpers — same
contract as the mlx backend.
- LoadModel logs the auto-detected has_tool_calling / has_thinking /
tool_call_start / tool_call_end so users can see what the wrapper
picked up for the loaded model.
- backend/python/mlx-distributed/test.py: add the same TestSharedHelpers
unit tests (parse_options, messages_to_dicts, split_reasoning,
parse_tool_calls) that exist for mlx and mlx-vlm.
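A sketch of that Free() flow on rank 0 (the coordinator method name and
the state container are assumptions based on this message;
CMD_SHUTDOWN = -1 is the existing worker-loop constant):

```python
import gc
import mlx.core as mx

CMD_SHUTDOWN = -1  # already handled by the worker loop

def free(state):
    if state.coordinator is not None:
        # Hypothetical broadcast helper: tell pipeline-parallel workers
        # to drop the model on their end too.
        state.coordinator.broadcast_command(CMD_SHUTDOWN)
    state.model = state.tokenizer = None
    gc.collect()
    try:
        mx.clear_cache()        # mlx >= 0.30
    except AttributeError:
        mx.metal.clear_cache()  # deprecated fallback for older mlx
```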
|
||
|
|
d67623230f |
feat(vllm): parity with llama.cpp backend (#9328)
* fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto
The ToProto conversion was dropping tool_call_id and reasoning_content
even though both proto and Go fields existed, breaking multi-turn tool
calling and reasoning passthrough to backends.
* refactor(config): introduce backend hook system and migrate llama-cpp defaults
Adds RegisterBackendHook/runBackendHooks so each backend can register
default-filling functions that run during ModelConfig.SetDefaults().
Migrates the existing GGUF guessing logic into hooks_llamacpp.go,
registered for both 'llama-cpp' and the empty backend (auto-detect).
Removes the old guesser.go shim.
* feat(config): add vLLM parser defaults hook and importer auto-detection
Introduces parser_defaults.json mapping model families to vLLM
tool_parser/reasoning_parser names, with longest-pattern-first matching.
The vllmDefaults hook auto-fills tool_parser and reasoning_parser
options at load time for known families, while the VLLMImporter writes
the same values into generated YAML so users can review and edit them.
Adds tests covering MatchParserDefaults, hook registration via
SetDefaults, and the user-override behavior.
* feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs
- Use vLLM's ToolParserManager/ReasoningParserManager to extract structured
output (tool calls, reasoning content) instead of reimplementing parsing
- Convert proto Messages to dicts and pass tools to apply_chat_template
- Emit ChatDelta with content/reasoning_content/tool_calls in Reply
- Extract prompt_tokens, completion_tokens, and logprobs from output
- Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar
- Add TokenizeString and Free RPC methods
- Fix missing `time` import used by load_video()
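A sketch of that parser wiring; vLLM's import paths and the request
argument move between releases, so treat them as assumptions, and the
function arguments are stand-ins for the backend's state:

```python
from vllm.entrypoints.openai.tool_parsers import ToolParserManager

def extract_structured(tokenizer, model_output, request):
    # Look up a registered parser by name instead of re-implementing it;
    # ReasoningParserManager works the same way for reasoning_parser.
    tool_parser = ToolParserManager.get_tool_parser("hermes")(tokenizer)
    info = tool_parser.extract_tool_calls(model_output, request)
    # info.tools_called / info.tool_calls feed ChatDelta.tool_calls;
    # info.content is what remains as plain assistant content.
    return info
```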
* feat(vllm): CPU support + shared utils + vllm-omni feature parity
- Split vllm install per acceleration: move generic `vllm` out of
requirements-after.txt into per-profile after files (cublas12, hipblas,
intel) and add CPU wheel URL for cpu-after.txt
- requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index
- backend/index.yaml: register cpu-vllm / cpu-vllm-development variants
- New backend/python/common/vllm_utils.py: shared parse_options,
messages_to_dicts, setup_parsers helpers (used by both vllm backends)
- vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template,
wire native parsers via shared utils, emit ChatDelta with token counts,
add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE
- Add test_cpu_inference.py: standalone script to validate CPU build with
a small model (Qwen2.5-0.5B-Instruct)
* fix(vllm): CPU build compatibility with vllm 0.14.1
Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict,
TokenizeString, Free all working).
- requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from
GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU
wheel whose torch dependency resolves against published PyTorch builds
(torch==2.9.1+cpu). Later vllm CPU wheels currently require
torch==2.10.0+cpu which is only available on the PyTorch test channel
with incompatible torchvision.
- requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio
so uv resolves them consistently from the PyTorch CPU index.
- install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv
can mix the PyTorch index and PyPI for transitive deps (matches the
existing intel profile behaviour).
- backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config
so the old code path errored out with AttributeError on model load.
Switch to the new get_tokenizer()/tokenizer accessor with a fallback
to building the tokenizer directly from request.Model.
* fix(vllm): tool parser constructor compat + e2e tool calling test
Concrete vLLM tool parsers override the abstract base's __init__ and
drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer).
Instantiating with tools= raised TypeError which was silently caught,
leaving chat_deltas.tool_calls empty.
Retry the constructor without the tools kwarg on TypeError — tools
aren't required by these parsers since extract_tool_calls finds tool
syntax in the raw model output directly (see the sketch below).
Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU:
the backend correctly returns ToolCallDelta{name='get_weather',
arguments='{"location": "Paris, France"}'} in ChatDelta.
test_tool_calls.py is a standalone smoke test that spawns the gRPC
backend, sends a chat completion with tools, and asserts the response
contains a structured tool call.
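A sketch of the retry described above (parser_cls is whichever concrete
parser class ToolParserManager returned; the helper name is
illustrative):

```python
def build_tool_parser(parser_cls, tokenizer, tools):
    try:
        # Some parser bases accept the tool definitions up front.
        return parser_cls(tokenizer, tools=tools)
    except TypeError:
        # Concrete parsers like Hermes2ProToolParser take only the
        # tokenizer; extract_tool_calls finds tool syntax in the raw
        # model output, so the definitions aren't needed here.
        return parser_cls(tokenizer)
```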
* ci(backend): build cpu-vllm container image
Add the cpu-vllm variant to the backend container build matrix so the
image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development)
is actually produced by CI.
Follows the same pattern as the other CPU python backends
(cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA.
backend_pr.yml auto-picks this up via its matrix filter from backend.yml.
* test(e2e-backends): add tools capability + HF model name support
Extends tests/e2e-backends to cover backends that:
- Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of
loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as
ModelOptions.Model with no download/ModelFile.
- Parse tool calls into ChatDelta.tool_calls: new "tools" capability
sends a Predict with a get_weather function definition and asserts
the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate
with OpenAI-style Messages so the backend can wire tools into the
model's chat template.
- Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set
e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time.
Adds make target test-extra-backend-vllm that:
- docker-build-vllm
- loads Qwen/Qwen2.5-0.5B-Instruct
- runs health,load,predict,stream,tools with tool_parser:hermes
Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those
standalone scripts were scaffolding used while bringing up the Python
backend; the e2e-backends harness now covers the same ground uniformly
alongside llama-cpp and ik-llama-cpp.
* ci(test-extra): run vllm e2e tests on CPU
Adds tests-vllm-grpc to the test-extra workflow, mirroring the
llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under
backend/python/vllm/ change (or on run-all), builds the local-ai
vllm container image, and runs the tests/e2e-backends harness with
BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes,
and the tools capability enabled.
Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm
wheel we pinned in requirements-cpu-after.txt. Frees disk space
before the build since the docker image + torch + vllm wheel is
sizeable.
* fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel
The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with
SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU
supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns
the model_executor.models.registry subprocess for introspection, so
LoadModel never reaches the actual inference path.
- install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide
requirements-cpu-after.txt so installRequirements installs the base
deps + torch CPU without pulling the prebuilt wheel, then clone vllm
and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries
target the host's actual CPU.
- backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose
it as an ENV so install.sh sees it during `make`.
- Makefile docker-build-backend: forward FROM_SOURCE as --build-arg
when set, so backends that need source builds can opt in.
- Makefile test-extra-backend-vllm: call docker-build-vllm via a
recursive $(MAKE) invocation so FROM_SOURCE flows through.
- .github/workflows/test-extra.yml: set FROM_SOURCE=true on the
tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only
works on hosts that share the build-time SIMD baseline.
Answers 'did you test locally?': yes, end-to-end on my local machine
with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU
gap was not covered locally — this commit plugs that gap.
* ci(vllm): use bigger-runner instead of source build
The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512
VNNI/BF16) that stock ubuntu-latest GitHub runners don't support —
vllm.model_executor.models.registry SIGILLs on import during LoadModel.
Source compilation works but takes 30-40 minutes per CI run, which is
too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the
bigger-runner self-hosted label (already used by backend.yml for the
llama-cpp CUDA build) — that hardware has the required SIMD baseline
and the prebuilt wheel runs cleanly.
FROM_SOURCE=true is kept as an opt-in escape hatch:
- install.sh still has the CPU source-build path for hosts that need it
- backend/Dockerfile.python still declares the ARG + ENV
- Makefile docker-build-backend still forwards the build-arg when set
Default CI path uses the fast prebuilt wheel; source build can be
re-enabled by exporting FROM_SOURCE=true in the environment.
* ci(vllm): install make + build deps on bigger-runner
bigger-runner is a bare self-hosted runner used by backend.yml for
docker image builds — it has docker but not the usual ubuntu-latest
toolchain. The make-based test target needs make, build-essential
(cgo in 'go test'), and curl/unzip (the Makefile protoc target
downloads protoc from github releases).
protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the
install-go-tools target, which setup-go makes possible.
* ci(vllm): install libnuma1 + libgomp1 on bigger-runner
The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens
libnuma.so.1 at import time. When the runner host doesn't have it,
the extension silently fails to register its torch ops, so
EngineCore crashes on init_device with:
AttributeError: '_OpNamespace' '_C_utils' object has no attribute
'init_cpu_threads_env'
Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be
safe on stripped-down runners.
* feat(vllm): bundle libnuma/libgomp via package.sh
The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at
import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP).
Without these on the host, vllm._C silently fails to register its
torch ops and EngineCore crashes with:
AttributeError: '_OpNamespace' '_C_utils' object has no attribute
'init_cpu_threads_env'
Rather than asking every user to install libnuma1/libgomp1 on their
host (or every LocalAI base image to ship them), bundle them into
the backend image itself — same pattern fish-speech and the GPU libs
already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at
run time so the bundled copies are picked up automatically.
- backend/python/vllm/package.sh (new): copies libnuma.so.1 and
libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib,
preserving soname symlinks. Runs during Dockerfile.python's
'Run backend-specific packaging' step (which already invokes
package.sh if present).
- backend/Dockerfile.python: install libnuma1 + libgomp1 in the
builder stage so package.sh has something to copy (the Ubuntu
base image otherwise only has libgomp in the gcc dep chain).
- test-extra.yml: drop the workaround that installed these libs on
the runner host — with the backend image self-contained, the
runner no longer needs them, and the test now exercises the
packaging path end-to-end the way a production host would.
* ci(vllm): disable tests-vllm-grpc job (heterogeneous runners)
Both ubuntu-latest and bigger-runner have inconsistent CPU baselines:
some instances support the AVX-512 VNNI/BF16 instructions the prebuilt
vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of
vllm.model_executor.models.registry. The libnuma packaging fix doesn't
help when the wheel itself can't be loaded.
FROM_SOURCE=true compiles vllm against the actual host CPU and works
everywhere, but takes 30-50 minutes per run — too slow for a smoke
test on every PR.
Comment out the job for now. The test itself is intact and passes
locally; run it via 'make test-extra-backend-vllm' on a host with the
required SIMD baseline. Re-enable when:
- we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or
- vllm publishes a CPU wheel with a wider baseline, or
- we set up a docker layer cache that makes FROM_SOURCE acceptable
The detect-changes vllm output, the test harness changes (tests/
e2e-backends + tools cap), the make target (test-extra-backend-vllm),
the package.sh and the Dockerfile/install.sh plumbing all stay in
place.
|
||
|
|
9ca03cf9cc |
feat(backends): add ik-llama-cpp (#9326)
* feat(backends): add ik-llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: add grpc e2e suite, hook to CI, update README Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> |
||
|
|
7a0e6ae6d2 |
feat(qwen3tts.cpp): add new backend (#9316)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
706cf5d43c |
feat(sam.cpp): add sam.cpp detection backend (#9288)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
e00ce981f0 |
fix: try to add whisperx and faster-whisper for more variants (#9278)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
ea6e850809 |
feat: Add Kokoros backend (#9212)
Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
0e9d1a6588 |
chore(ci): drop unnecessary test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
031a36c995 |
feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092)
* feat: wire min_p Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: inferencing defaults Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(refactor): re-use iterative parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: automatically generate inference defaults from unsloth Instead of trying to re-invent the wheel and maintain the inference defaults here, prefer to consume the unsloth ones, and contribute there as necessary. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: apply defaults also to models installed via gallery Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: be consistent and apply the fallback to all endpoints Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
f7e8d9e791 |
feat(quantization): add quantization backend (#9096)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
d9c1db2b87 |
feat: add (experimental) fine-tuning support with TRL (#9088)
* feat: add fine-tuning endpoint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(experimental): add fine-tuning endpoint and TRL support This changeset defines new gRPC signatures for fine-tuning backends, and adds the TRL backend as the initial fine-tuning engine. This implementation also supports exporting to GGUF and automatically importing it to LocalAI after fine-tuning. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * commit TRL backend, stop by killing process Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * move fine-tune to generic features Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * add evals, reorder menu Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
3d9ccd1ddc |
fix(ui): Add tracing inline settings back and create UI tests (#9027)
Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
5affb747a9 |
chore: drop AIO images (#9004)
AIO images are behind, and it takes effort to maintain them. The wizard and model installation have been simplified massively, so AIO images lost their purpose. This allows us to be more laser-focused on the main images and relieves stress from CI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
f9a850c02a |
feat(realtime): WebRTC support (#8790)
* feat(realtime): WebRTC support Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(tracing): Show full LLM opts and deltas Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |
||
|
|
a738f8b0e4 |
feat(backends): add ace-step.cpp (#8965)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
7dc691c171 |
feat: add fish-speech backend (#8962)
* feat: add fish-speech backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop portaudio Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
a026277ab9 |
feat(mlx-distributed): add new MLX-distributed backend (#8801)
* feat(mlx-distributed): add new MLX-distributed backend Add new MLX distributed backend with support for both TCP and RDMA for model sharding. This implementation ties in the discovery implementation already in place, and re-uses the same P2P mechanism for the TCP MLX-distributed inferencing. The auto-parallel implementation is inspired by Exo's (whose authors have been added to the acknowledgements for their great work!) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * expose a CLI to facilitate backend starting Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: make manual rank0 configurable via model configs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing features from mlx backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> |
||
|
|
09ddaf94b2 |
feat(ui): move to React for frontend (#8772)
* feat(ui): move to React Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add import model Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * syntax highlight Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Minor fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
dfc6efb88d |
feat(backends): add faster-qwen3-tts (#8664)
* feat(backends): add faster-qwen3-tts Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: this backend is CUDA only Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: add requirements-install.txt with setuptools for build isolation The faster-qwen3-tts backend requires setuptools to build packages like sox that have setuptools as a build dependency. This ensures the build completes successfully in CI. Signed-off-by: LocalAI Bot <localai-bot@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: LocalAI Bot <localai-bot@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
bf5a1dd840 |
feat(voxtral): add voxtral backend (#8451)
* feat(voxtral): add voxtral backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * simplify Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
3370d807c2 |
feat(nemo): add Nemo (only asr for now) backend (#8436)
* feat(nemo): add Nemo (only asr for now) backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nemo): add Nemo backend without Python version pins (#8438) * Initial plan * Remove Python version pins from nemo backend install.sh Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Pin pyarrow to 20.0.0 in nemo requirements Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
53276d28e7 |
feat(musicgen): add ace-step and UI interface (#8396)
* feat(musicgen): add ace-step and UI interface Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly handle model dir Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop auto-download Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to models, fixup UIs icons Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * l4t13 is incompatible Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * avoid pinning version for cuda12 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop l4t12 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
e7fc604dbc |
feat(metal): try to extend support to remaining backends (#8374)
* feat(metal): try to extend support to remaining backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * neutts doesn't work Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * split outetts out of transformers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Remove torch pin to whisperx Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
10a1e6c74d |
feat(whisperx): add whisperx backend for transcription with speaker diarization (#8299)
* feat(proto): add speaker field to TranscriptSegment for diarization
Add speaker field to the gRPC TranscriptSegment message and map it
through the Go schema, enabling backends to return speaker labels.
Signed-off-by: eureka928 <meobius123@gmail.com>
* feat(whisperx): add whisperx backend for transcription with diarization
Add Python gRPC backend using WhisperX for speech-to-text with
word-level timestamps, forced alignment, and speaker diarization
via pyannote-audio when HF_TOKEN is provided.
Signed-off-by: eureka928 <meobius123@gmail.com>
* feat(whisperx): register whisperx backend in Makefile
Signed-off-by: eureka928 <meobius123@gmail.com>
* feat(whisperx): add whisperx meta and image entries to index.yaml
Signed-off-by: eureka928 <meobius123@gmail.com>
* ci(whisperx): add build matrix entries for CPU, CUDA 12/13, and ROCm
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(whisperx): unpin torch versions and use CPU index for cpu requirements
Address review feedback:
- Use --extra-index-url for CPU torch wheels to reduce size
- Remove torch version pins, let uv resolve compatible versions
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(whisperx): pin torch ROCm variant to fix CI build failure
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(whisperx): pin torch CPU variant to fix uv resolution failure
Pin torch==2.8.0+cpu so uv resolves the CPU wheel from the extra
index instead of picking torch==2.8.0+cu128 from PyPI, which pulls
unresolvable CUDA dependencies.
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(whisperx): use unsafe-best-match index strategy to fix uv resolution failure
uv's default first-match strategy finds torch on PyPI before checking
the extra index, causing it to pick torch==2.8.0+cu128 instead of the
CPU variant. This makes whisperx's transitive torch dependency
unresolvable. Using unsafe-best-match lets uv consider all indexes.
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(whisperx): drop +cpu local version suffix to fix uv resolution failure
PEP 440 ==2.8.0 matches 2.8.0+cpu from the extra index, avoiding the
issue where uv cannot locate an explicit +cpu local version specifier.
This aligns with the pattern used by all other CPU backends.
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(backends): drop +rocm local version suffixes from hipblas requirements to fix uv resolution
uv cannot resolve PEP 440 local version specifiers (e.g. +rocm6.4,
+rocm6.3) in pinned requirements. The --extra-index-url already points
to the correct ROCm wheel index and --index-strategy unsafe-best-match
(set in libbackend.sh) ensures the ROCm variant is preferred.
Applies the same fix as
|
||
|
|
4ca5b737bf |
chore(cuda): target 12.8 for 12 to increase compatibility (#8297)
Some datacenter setups might be stuck with the 5.x kernel, which doesn't play well with CUDA >=12.9. To increase compatibility with the CUDA 12.x branch, downgrade to 12.8. For newer systems, it is still suggested to use CUDA 13.x wherever compatible. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
4077aaf978 |
chore: re-enable e2e tests, fixups anthropic API tools support (#8296)
* chore(tests): add mock backend e2e tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup anthropic tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * prepare e2e tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop repetitive tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop specific CI workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup anthropic issues, move all e2e tests to use mocked backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
1e08e02598 |
feat(qwen-asr): add support to qwen-asr (#8281)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
9b973b79f6 |
feat: add VoxCPM tts backend (#8109)
* feat: add VoxCPM tts backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable voxcpm on arm64 cpu Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
26a374b717 |
chore: drop bark which is unmaintained (#8207)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
b2a8a63899 |
feat(vllm-omni): add new backend (#8188)
* feat(vllm-omni): add new backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * default to py3.12 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
05904c77f5 |
chore(exllama): drop backend now almost deprecated (#8186)
exllama2 development has stalled and only old architectures are supported. exllamav3 is still in development; in the meantime, exllama2 is being cleaned up from the gallery. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
923ebbb344 |
feat(qwen-tts): add Qwen-tts backend (#8163)
* feat(qwen-tts): add Qwen-tts backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update intel deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop flash-attn for cuda13 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
a6ff354c86 |
feat(tts): add pocket-tts backend (#8018)
* feat(pocket-tts): add new backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the gallery Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
b964b3d53e |
feat(backends): add moonshine backend for faster transcription (#7833)
* feat(backends): add moonshine backend for faster transcription Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backend to CI, update AGENTS.md from this exercise Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
e6ba26c3e7 |
chore: Update to Ubuntu24.04 (cont #7423) (#7769)
* ci(workflows): bump GitHub Actions images to Ubuntu 24.04 Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * ci(workflows): remove CUDA 11.x support from GitHub Actions (incompatible with ubuntu:24.04) Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * ci(workflows): bump GitHub Actions CUDA support to 12.9 Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * build(docker): bump base image to ubuntu:24.04 and adjust Vulkan SDK/packages Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * fix(backend): correct context paths for Python backends in workflows, Makefile and Dockerfile Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * chore(make): disable parallel backend builds to avoid race conditions Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * chore(make): export CUDA_MAJOR_VERSION and CUDA_MINOR_VERSION for override Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * build(backend): update backend Dockerfiles to Ubuntu 24.04 Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * chore(backend): add ROCm env vars and default AMDGPU_TARGETS for hipBLAS builds Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * chore(chatterbox): bump ROCm PyTorch to 2.9.1+rocm6.4 and update index URL; align hipblas requirements Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * chore: add local-ai-launcher to .gitignore Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * ci(workflows): fix backends GitHub Actions workflows after rebase Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * build(docker): use build-time UBUNTU_VERSION variable Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * chore(docker): remove libquadmath0 from requirements-stage base image Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * chore(make): add backends/vllm to .NOTPARALLEL to prevent parallel builds Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * fix(docker): correct CUDA installation steps in backend Dockerfiles Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * chore(backend): update ROCm to 6.4 and align Python hipblas requirements Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * ci(workflows): switch GitHub Actions runners to Ubuntu-24.04 for CUDA on arm64 builds Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * build(docker): update base image and backend Dockerfiles for Ubuntu 24.04 compatibility on arm64 Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * build(backend): increase timeout for uv installs behind slow networks on backend/Dockerfile.python Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * ci(workflows): switch GitHub Actions runners to Ubuntu-24.04 for vibevoice backend Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * ci(workflows): fix failing GitHub Actions runners Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> * fix: Allow FROM_SOURCE to be unset, use upstream Intel images etc. 
Signed-off-by: Richard Palethorpe <io@richiejp.com> * chore(build): rm all traces of CUDA 11 Signed-off-by: Richard Palethorpe <io@richiejp.com> * chore(build): Add Ubuntu codename as an argument Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> Signed-off-by: Richard Palethorpe <io@richiejp.com> Co-authored-by: Alessandro Sturniolo <alessandro.sturniolo@gmail.com> |
||
|
|
560bf50299 |
chore(Makefile): refactor common make targets (#7858)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
716dba94b4 |
feat(whisper): Add prompt to condition transcription output (#7624)
* chore(makefile): Add buildargs for sd and cuda when building backend Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(whisper): Add prompt to condition transcription output Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> |