Files
LocalAI/README.md
LocalAI [bot] de2ec2f136 feat(backends): add voice-detect + face-detect ggml backends (replace Python insightface/speaker-recognition) (#10441)
* feat(voice-detect): add Go purego backend for voice-detect.cpp

Add backend/go/voice-detect implementing the Backend gRPC voice subset
(VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego,
mirroring the parakeet-cpp / omnivoice-cpp backends.

The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and
float-vector returns are owned by Go and released through the matching capi
free functions, with the per-ctx last error surfaced into Go errors. Calls are
serialized via base.SingleThread since the C context is not reentrant.

Proto field mapping:
- VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model.
- VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the
  verify_threshold option, default 0.25) -> verify_paths -> verified/distance/
  threshold/confidence/model/processing_time_ms.
- VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion
  document maps to a single VoiceAnalysis segment (start/end 0; gender "label"
  -> dominant_gender with the remaining float scores as the gender map; emotion
  label/scores -> dominant_emotion/emotion).

The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so
with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external
libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo
tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs
gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(voice-detect): wire backend into index, gallery and build

Register the voice-detect.cpp speaker-recognition + voice-analysis
backend (added in Voice-INT-A) into LocalAI's distribution surfaces,
mirroring the ced backend (the closest mudler C++/ggml audio analogue):

- backend/index.yaml: add the &voicedetect meta-backend (capabilities
  platform map, no top-level uri) plus the full set of concrete per-arch
  image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the
  -development variants). Referential integrity audited - every alias
  target resolves.
- gallery/index.yaml: add 5 model entries on backend voice-detect -
  ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the
  wav2vec2 age/gender/emotion analyze model. The engine architecture is
  read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are
  not yet published: each files: entry points at the intended
  mudler/voice-detect-gguf location with a TODO to fill sha256 after
  upload (no fabricated hashes).
- .github/backend-matrix.yml: add the linux build matrix block + the
  darwin metal entry mirroring ced.
- .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via
  VOICEDETECT_VERSION (pin 47546430, = 4754643).
- core/config/backend_capabilities.go: register voice-detect in the
  backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze ->
  speaker_recognition), mirroring speaker-recognition.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(face-detect): add purego Go backend for face-detect.cpp

Add the LocalAI Go backend that dlopens libfacedetect.so (the flat
facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect
backend. Implements the Face subset of the Backend gRPC service:

- Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path
  -> L2-normalized ArcFace embedding.
- Detect(DetectOptions): src -> detect_path_json -> Detection boxes
  (class_name "face", [x1,y1,x2,y2] -> x/y/w/h).
- FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof ->
  verify_paths; best-effort img areas via detect.
- FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face
  age + gender ("M"/"F" normalized to "Man"/"Woman").

The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib
with ggml + vendored libjpeg-turbo static (PIC), so the .so is
ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_
symbols). Gated Ginkgo e2e mirrors voice-detect.

Note for the gallery-wiring task: backend registration (index.yaml,
gallery, core/config/backend_capabilities.go) is intentionally not
touched here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(voice-detect): replace em dashes in net-new descriptions

Project style forbids em/en dashes. Replace the three U+2014 chars
introduced by the voice-detect gallery/index wiring with `-`/`:`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(face-detect): wire backend into index, gallery and build

Register the face-detect.cpp face detection / embedding / verification /
analysis backend (added in Face-INT-A) into LocalAI's distribution
surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml
recognition analogue):

- backend/index.yaml: add the &facedetect meta-backend (capabilities
  platform map, no top-level uri to avoid the meta-backend gotcha) plus
  the full set of concrete per-arch image entries (cpu/cuda12/cuda13/
  metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants),
  22 entries. Referential integrity audited: every alias target resolves.
- gallery/index.yaml: add 4 model entries on backend face-detect -
  face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL)
  and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the
  commercial-friendly alternative). The detector/embedder architecture is
  read from GGUF metadata (facedetect.arch) at load; only the real
  verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF
  artifacts are not yet published: each files: entry points at the
  intended mudler/face-detect-gguf location with a TODO to fill sha256
  after upload (no fabricated hashes).
- core/config/backend_capabilities.go: register face-detect in the
  backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze ->
  face_recognition), mirroring insightface.
- .github/backend-matrix.yml: add the linux build matrix block + the
  darwin metal entry mirroring voice-detect.
- .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via
  FACEDETECT_VERSION (pin 636a1963).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(recon): voice-detect metal build branch + face-detect gallery usecases

Add the missing metal BUILD_TYPE branch to the voice-detect Makefile
forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the
darwin metal CI artifact is built with the Metal backend instead of
CPU-only.

Expand the 4 face-detect gallery models' known_usecases to
[face_recognition, detection, embeddings] to match the backend
capabilities map and the mirrored insightface-buffalo entries, so
auto-selection for /v1/detect and /embeddings works.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(recon): document voice-detect and face-detect ggml backends

Document the new standalone C++/ggml biometric backends as the
recommended/default option for face and voice recognition, keeping the
existing Python insightface / speaker-recognition backends framed as the
legacy path.

- features/face-recognition.md: add a face-detect (ggml) backend section
  with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface
  Apache-2.0), licensing, and verify/detect/analyze quickstart.
- features/voice-recognition.md: add a voice-detect (ggml) backend
  section with the gallery entries (ecapa-tdnn, wespeaker-resnet34,
  eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial
  analyze head) and quickstart.
- reference/compatibility-table.md: add face-detect.cpp and
  voice-detect.cpp rows to the Vision, Detection & Recognition table.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(gallery): publish recon backend GGUF uris + sha256

Fill in the published HuggingFace GGUF uris and verified sha256 for the
9 recon gallery entries (voice-detect-* and face-detect-*), and remove
the TODO publish markers. Correct the eres2net, campplus, and
emotion-wav2vec2 uris to the actual published filenames.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): re-embed buffalo anti-spoof + add audeering age/gender voice model

Update the 3 buffalo face-detect GGUF sha256 (anti-spoof ensemble now
embedded and re-uploaded under the same filenames/uris) and note the
FaceVerify anti_spoof request flag in each description. Add a new
voice-detect-age-gender-wav2vec2 gallery entry mirroring the emotion
model.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): add face-detect-buffalo-sc and antelopev2 packs

Add gallery entries for two newly-published insightface face packs on
the face-detect backend: buffalo_sc (smallest pack, SCRFD-500M + small
ArcFace) and antelopev2 (higher-accuracy, SCRFD-10G + ArcFace glint360k
R100, 512-d). Both are non-commercial research-only.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(recon): honor LocalAI per-model threads in voice/face-detect backends

LocalAI spawns one backend process per model and serves requests
concurrently, so the engines' own min(hardware_concurrency, 8) default
can oversubscribe cores. Forward the per-model Threads value from the
gRPC LoadModel options into the engine via VOICEDETECT_THREADS /
FACEDETECT_THREADS (read at backend construction) before the capi load.
A non-positive Threads is treated as unset, leaving the engine default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to CPU-optimized engine commits

voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached
pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads).
Brings the CPU optimizations into the LocalAI backend builds. GGUF format and
parity unchanged, so the published HF GGUFs remain valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-2 CPU-optimized engines

voice-detect.cpp -> fe7e6a3 (ERes2Net 1x1->mul_mat, CAM++ layout+context,
wav2vec2 conv-LN, ECAPA capture-drop, AVX512 dispatch opt-in); face-detect.cpp
-> 9c8adb7 (AVX2 Winograd F(2x2,3x3) for SCRFD/ArcFace 3x3 convs, ArcFace
BN-fold). Parity unchanged (cosine=1.0); GGUF format unchanged, HF GGUFs valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-3 Winograd engines

voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3
convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for
SCRFD large maps, -22% @1t on top of F(2x2), more load-stable). Parity held
(cosine=1.0); GGUF format unchanged, HF GGUFs valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-4 Winograd engines (CPU opt complete)

voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%);
face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD
detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is
now exhausted (parity held cosine=1.0; GGUF/parity unchanged, HF GGUFs valid).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump face-detect pin to deep-kernel engine (7ae5c4d)

face-detect.cpp -> 7ae5c4d: register-blocked winograd-domain GEMM microkernel
(2.8x isolated GFLOP/s), AVX-512 zmm evolution behind runtime CPUID dispatch
(ship-safe, AVX2 fallback bit-identical), bias/relu fused into the winograd
output transform, and SFace Conv+BN fold + bias/PReLU fusion. SCRFD detect
~1.4x faster end-to-end vs the round-4 baseline; parity bit-exact; portable
single binary (function-multiversioned, no global -mavx512f).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice-detect pin to ECAPA operand-order win (e9c56ae)

voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32
conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1t / ~1.3x @8t,
parity cosine=1.0. Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker
do not call conv1d_same so are provably unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to FMA-throughput engines (voice f7b9f89, face 2d2d5f0)

face -> 2d2d5f0: route ArcFace 3x3 body convs through the AVX-512 winograd
microkernel (kWinoMinSize 80->14); ArcFace 1.62x @1t, SCRFD detect to 0.966 of
MLAS @1t, no regression. voice -> f7b9f89: runtime-CPUID-dispatched AVX-512
winograd-GEMM microkernel (ship-safe, AVX2 fallback bit-identical); WeSpeaker
1.90x @1t. Parity cosine=1.0 throughout; portable single binaries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to MLAS-class direct-conv engines (voice 7ecfd07, face be22d67)

Hand-tuned nChw16c AVX-512 register-tiled direct-conv microkernel (~263 GFLOP/s,
within 6-7% of MLAS per-op efficiency), runtime-CPUID-dispatched + AVX2 fallback,
fused bias/relu. voice 7ecfd07: default 3x3-s1 kernel for WeSpeaker (+37%/+32%)
+ ERes2Net, CAM++ pinned to Winograd. face be22d67: shape-gated to the ArcFace
recognizer body (+25-27% @8t); SCRFD detector stays on Winograd (no regression).
Parity cosine=1.0 / detect <=1px on AVX-512 + AVX2 paths. Portable single binaries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice pin to Phase-A blocked backbone (f4e7eef)

WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs
~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1t /
+17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is
already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is
sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to breadth blocked-backbone (voice 7f66871, face d80092b)

voice 7f66871: AVX2-vectorized (ymm) blocked island - AVX2-only hosts now run
the blocked backbone for WeSpeaker (2.3x over per-conv-AVX2, cosine=1.0);
ERes2Net stays per-conv (blocked regresses, opt-in only); CAM++ Winograd-pinned.
face d80092b: ArcFace recognizer blocked island, AVX-512 default (-13% @8t, ~0.90x
MLAS, the closest conv result), auto per-conv on AVX2; SCRFD untouched on Winograd
(0 island invocations during detect). Parity cosine=1.0 / detect <=1px throughout.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to small-spatial + stem conv kernels (voice 99b1804, face 47fdab6)

Measured-gap-driven conv kernels: small-spatial (fill the register tile when
output width <= tile width) + small-IC stem + strided-1x1/downsample recovery.
ArcFace recognizer 0.57 -> 0.70x MLAS @1t (the closest conv model), WeSpeaker
0.65 -> 0.79x @1t. Parity cosine=1.0 / detect <=1px. The OC-block-sharing lever
was a measured dead-end (deep stride-1 is L3-weight-bandwidth bound, not
read-port bound) and was NOT shipped. Kernel ceiling reached; further gap needs
an algorithm-class change (cache-blocked weight-stationary GEMM, or q8 weights).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to GPU persistent-graph + multi-model-safe cache (voice 45d2e6b, face 0a4799a)

GPU wins (CUDA/ggml backend, no CPU-path change): persistent per-shape graph+context
cache in Backend::compute() eliminates the per-call cudaGraph re-instantiation churn
-> wav2vec2 emotion+age-gender now AT GPU parity with torch-cuDNN on GB10 (0.97-0.98x),
CAM++ -5.7ms; bit-identical parity. Cache hardened multi-model-safe (invalidate-on-free
keyed by the ModelLoader weights buffer) so LocalAI multi-model hosting cannot stale-hit.
Conv models still trail cuDNN (im2col-materialization-bound) - cuDNN implicit-GEMM lever next.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to cuDNN-conv-capable engines (voice b6e4356, face 6107a24)

Adds the opt-in cuDNN implicit-GEMM conv path (VOICEDETECT_GGML_CUDNN /
FACEDETECT_GGML_CUDNN, DEFAULT OFF -> zero build/runtime dep until enabled).
On GPU it kills the im2col-materialization bottleneck and reaches torch-cuDNN
parity on the spill-bound convs: SCRFD detect 14.8->6.4ms (2.3x, ~parity),
WeSpeaker ~parity, ERes2Net beats torch (1.10x); ArcFace/CAM++ neutral (no
spill). Parity exact (SCRFD <=1px, cosine=1.0). To USE it in LocalAI, the CUDA
backend build must enable the flag AND bundle libcudnn - deferred until a
cuDNN-bundled GPU image; flag stays OFF here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(recon): enable cuDNN conv path on arm64+CUDA13 recon backends

The voice-detect.cpp / face-detect.cpp engines have an opt-in cuDNN
implicit-GEMM conv path behind VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN
(default OFF) that kills im2col on the GPU and reaches torch-cuDNN parity
(SCRFD 2.3x, WeSpeaker/ERes2Net parity), measured on the GB10
(arm64, CUDA 13, sm_121a).

Enable it for the CUDA build, but only where cuDNN actually ships: the
arm64 + CUDA 13 image (GB10/Jetson/L4T). x86 CUDA images carry no cuDNN,
so flipping it on globally for BUILD_TYPE=cublas would be a link failure.
The Makefiles gate on CUDA_MAJOR_VERSION=13 + arch (TARGETARCH from the
matrix/Docker build, uname -m fallback for local builds).

backend/Dockerfile.golang already installs the runtime libcudnn9-cuda-13
in the arm64+CUDA13 apt block; add the matching libcudnn9-dev-cuda-13 so
the build-time link resolves.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice-detect pin to ERes2Net blocked-default (30beecd)

Defaults VD_ERES2NET_BLOCKED ON: routes the ERes2Net Res2Net body through the
blocked nChw16c AVX-512 directconv island instead of the 1x1 mul_mat fast path
(CONT-transpose + skinny low-K GEMM). On the shipped GGML_NATIVE=OFF build (ggml
mul_mat is AVX2-only) this wins ~2x at every thread count (2.07x@1t, 2.2x@4t,
2.05x@8t); pure-AVX2 fallback still 1.3-1.62x. Parity exact (cosine=1.000000 vs
golden), so registered voices + verify/identify thresholds are unaffected. The
prior default-OFF rested on a stale comment whose 23pct regression only held on
the non-shipping GGML_NATIVE=ON build.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(readme): announce native voice-detect + face-detect backends in Latest News

Add a Latest News entry for the new from-scratch C++/ggml biometric backends
(voice-detect.cpp + face-detect.cpp) that replace the Python insightface and
speaker-recognition backends: no Python/onnxruntime at inference, self-contained
GGUF, bit-exact parity, GPU cuDNN parity. Mirrors the parakeet.cpp /
locate-anything.cpp native-backend news entries. Refs PR #10441.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): re-pin to the squashed engine release commits

The voice-detect.cpp and face-detect.cpp histories were squashed to a single
release commit, which orphaned the previous pins (voice 30beecd, face 6107a24).
Re-pin to the new single-commit SHAs (voice 3d51077, face 06914b0); the tree is
identical, so the backend build is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 09:29:08 +02:00

21 KiB




LocalAI stars LocalAI License

Follow LocalAI_API Join LocalAI Discord Community

mudler%2FLocalAI | Trendshift

Deutsch | Español | français | 日本語 | 한국어 | Português | Русский | 中文

LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

A small core, not a bundle. Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.

  • Composable by design: backends are separate and pulled on demand, so you install only what your model needs
  • Open and extensible: load any model, or build your own backend in any language against an open interface
  • Drop-in API compatibility: OpenAI, Anthropic, and ElevenLabs APIs across every backend
  • Any model, any modality: LLMs, vision, voice, image, and video behind one API
  • Any hardware: NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
  • Multi-user ready: API key auth, user quotas, role-based access
  • Built-in AI agents: autonomous agents with tool use, RAG, MCP, and skills
  • Privacy-first: your data never leaves your infrastructure

A small LocalAI core with backends (llama.cpp, vLLM, MLX, whisper.cpp, stable-diffusion, kokoro, parakeet.cpp...) plugged in as separate on-demand images

Created by Ettore Di Giacinto and maintained by the LocalAI team.

📖 Documentation | 💬 Discord | 💻 Quickstart | 🖼️ Models | FAQ

Guided tour

https://github.com/user-attachments/assets/08cbb692-57da-48f7-963d-2e7b43883c18

Click to see more!

User and auth

https://github.com/user-attachments/assets/228fa9ad-81a3-4d43-bfb9-31557e14a36c

Agents

https://github.com/user-attachments/assets/6270b331-e21d-4087-a540-6290006b381a

Usage metrics per user

https://github.com/user-attachments/assets/cbb03379-23b4-4e3d-bd26-d152f057007f

Fine-tuning and Quantization

https://github.com/user-attachments/assets/5ba4ace9-d3df-4795-b7d4-b0b404ea71ee

WebRTC

https://github.com/user-attachments/assets/ed88e34c-fed3-4b83-8a67-4716a9feeb7b

Quickstart

macOS

Download LocalAI for macOS

Note: The DMG is not signed by Apple. After installing, run: sudo xattr -d com.apple.quarantine /Applications/LocalAI.app. See #6268 for details.

Containers (Docker, podman, ...)

Already ran LocalAI before? Use docker start -i local-ai to restart an existing container.

CPU only:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest

NVIDIA GPU:

# CUDA 13
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-13

# CUDA 12
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12

# NVIDIA Jetson ARM64 (CUDA 12, for AGX Orin and similar)
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64

# NVIDIA Jetson ARM64 (CUDA 13, for DGX Spark)
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64-cuda-13

AMD GPU (ROCm):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-gpu-hipblas

Intel GPU (oneAPI):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/dri/card1 --device=/dev/dri/renderD128 localai/localai:latest-gpu-intel

Vulkan GPU:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-vulkan

Loading models

# From the model gallery (see available models with `local-ai models list` or at https://models.localai.io)
local-ai run llama-3.2-1b-instruct:q4_k_m
# From Huggingface
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
# From the Ollama OCI registry
local-ai run ollama://gemma:2b
# From a YAML config
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
# From a standard OCI registry (e.g., Docker Hub)
local-ai run oci://localai/phi-2:latest

To test a running LocalAI server from the terminal, open an interactive chat session from another shell. Inside the prompt, /models lists installed models and /model <name> switches between them.

# Terminal 1
local-ai run llama-3.2-1b-instruct:q4_k_m

# Terminal 2
local-ai chat --model llama-3.2-1b-instruct:q4_k_m

Automatic Backend Detection: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see GPU Acceleration.

For more details, see the Getting Started guide.

Latest News

For older news and full release notes, see GitHub Releases and the News page.

Features

Supported Backends & Acceleration

LocalAI supports 60+ backends including llama.cpp, vLLM, SGLang, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for NVIDIA (CUDA 12/13), AMD (ROCm), Intel (oneAPI/SYCL), Apple Silicon (Metal), Vulkan, and NVIDIA Jetson (L4T). All backends can be installed on-the-fly from the Backend Gallery.

See the full Backend & Model Compatibility Table and GPU Acceleration guide.

Backends built by us

Most backends wrap a best-in-class upstream engine. A handful of them are native C/C++/GGML engines (no Python at inference) developed and maintained by the LocalAI project itself:

Backend What it does
parakeet.cpp C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription
ced.cpp C++/GGML port of the CED audio-tagging models: sound-event classification (527-class AudioSet) over REST and the realtime API for live recognition
voxtral.c Voxtral Realtime 4B speech-to-text in pure C
vibevoice.cpp Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization
rf-detr.cpp Native RF-DETR object detection and instance segmentation
locate-anything.cpp Open-vocabulary object detection and visual grounding (LocateAnything-3B)
depth-anything.cpp Depth Anything 3 monocular metric depth + camera pose estimation
privacy-filter.cpp Standalone GGML PII/NER token-classification engine powering LocalAI's PII redaction tier
LocalVQE Joint acoustic echo cancellation, noise suppression, and dereverberation
local-store Local-first vector database for embeddings (shipped in-tree)

We also maintain apex-quant, a per-tensor, per-layer quantization recipe for Mixture-of-Experts models that exploits their structural sparsity to produce GGUFs matching or beating Q8_0 quality - and they run out of the box on stock llama.cpp.

Resources

Team

LocalAI is maintained by a small team of humans, together with the wider community of contributors.

A huge thank you to everyone who contributes code, reviews PRs, files issues, and helps users in Discord — LocalAI is a community-driven project and wouldn't exist without you. See the full contributors list.

Citation

If you utilize this repository, data in a downstream project, please consider citing it with:

@misc{localai,
  author = {Ettore Di Giacinto},
  title = {LocalAI: The free, Open source OpenAI alternative},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/go-skynet/LocalAI}},

Sponsors

Do you find LocalAI useful?

Support the project by becoming a backer or sponsor. Your logo will show up here with a link to your website.

A huge thank you to our generous sponsors who support this project covering CI expenses, and our Sponsor list:

Past sponsors


Individual sponsors

A special thanks to individual sponsors, a full list is on GitHub and buymeacoffee. Special shout out to drikster80 for being generous. Thank you everyone!

Star history

LocalAI Star history Chart

License

LocalAI is a community-driven project created by Ettore Di Giacinto and maintained by the LocalAI team.

MIT - Author Ettore Di Giacinto mudler@localai.io

Acknowledgements

LocalAI couldn't have been built without the help of great software already available from the community. Thank you!

Contributors

This is a community project, a special thanks to our contributors!