Commit Graph

627 Commits

Author SHA1 Message Date
Ettore Di Giacinto
b28b448c68 docs(paged): record mmq shape serving profile
Assisted-by: Codex:gpt-5
2026-07-01 04:36:04 +00:00
Ettore Di Giacinto
2148fa466b feat(paged): add moe mmq shape trace patch
Assisted-by: Codex:gpt-5
2026-07-01 04:32:12 +00:00
Ettore Di Giacinto
3b9ec3e1f1 docs(paged): record mmq occupancy rejection
Assisted-by: Codex:gpt-5
2026-07-01 04:18:12 +00:00
Ettore Di Giacinto
3c2cb9f4ab docs(paged): record graph-node serving profile
Record the Phase 27 current-stack llama.cpp n128 serving profile captured with CUDA graph node tracing and gated before and after the run.

Assisted-by: Codex:gpt-5
2026-07-01 04:00:14 +00:00
Ettore Di Giacinto
ace1ffab28 docs(paged): record audited current snapshot
Record the Phase 26 current-stack paged-vs-vLLM serving snapshot with hardware classification and compact pre/post inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 03:48:27 +00:00
Ettore Di Giacinto
a0194125f5 chore(paged): summarize snapshot inference gates
Emit a compact gate_summary.tsv from current serving snapshots so each artifact records the pre/post MoE md5, dense md5, and backend op checks. Add a summary-only mode for auditing existing artifacts and document the Phase 25 backfill on the Phase 20 snapshot.

Assisted-by: Codex:gpt-5
2026-07-01 03:35:54 +00:00
Ettore Di Giacinto
7108b68a70 chore(paged): record snapshot hardware class
Add a hardware report to the current serving snapshot harness so every paged-vs-vLLM artifact records the GPU identity and conservative hardware class before any server starts. Document the Phase 24 dry run and the GB10 classification for future parity comparisons.

Assisted-by: Codex:gpt-5
2026-07-01 03:31:11 +00:00
Ettore Di Giacinto
7aa15ce539 docs(paged): refresh parity handoff coordinates
Update the paged parity handoff to the current fork head, patch count, mirror invariant, current serving harness, and LocalAI AI-attribution policy after Phases 20-22.

Assisted-by: Codex:gpt-5
2026-07-01 03:25:14 +00:00
Ettore Di Giacinto
6c165747a9 docs(paged): verify patch-series mirror invariant
Record the Phase 22 strict git-apply mirror check proving the LocalAI paged patch series reconstructs the canonical llama.cpp fork tree after patch 0055.

Assisted-by: Codex:gpt-5
2026-07-01 03:22:43 +00:00
Ettore Di Giacinto
ff3f0620de chore(paged): add current serving snapshot harness
Add a reusable current-stack paged-vs-vLLM serving snapshot harness that targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post inference gates, and records ratio summaries.

Assisted-by: Codex:gpt-5
2026-07-01 03:19:36 +00:00
Ettore Di Giacinto
c99678da42 docs(paged): refresh current serving snapshot
Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision.

Assisted-by: Codex:gpt-5
2026-07-01 03:15:30 +00:00
Ettore Di Giacinto
310eb3c866 docs(paged): reject MTP draft-shape scheduler
Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on measured shape distribution, throughput regression, and green inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 03:03:49 +00:00
Ettore Di Giacinto
cced07c7fe docs(paged): add MTP shape trace patch
Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results.

Assisted-by: Codex:gpt-5
2026-07-01 02:54:29 +00:00
Ettore Di Giacinto
6e35476340 docs(paged): scope MTP graph-shape follow-up
Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment.

Assisted-by: Codex:gpt-5
2026-07-01 02:37:21 +00:00
Ettore Di Giacinto
ae76d42a96 docs(paged): profile MTP graph reuse loss
Record Phase 16 nsys evidence that current MTP serving loses paged decode graph reuse and increases GPU work, explaining the Phase 15 serving regression.

Assisted-by: Codex:gpt-5
2026-07-01 02:32:49 +00:00
Ettore Di Giacinto
4d171e62bb docs(paged): reject MTP serving lever
Add the repeatable MTP serving A/B runner and record Phase 15 results showing current llama-server MTP regresses GB10 serving throughput despite passing inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 02:29:28 +00:00
Ettore Di Giacinto
70394364a3 docs(paged): gate MTP rollback safety
Record Phase 14 MTP rollback evidence, normalized greedy-prefix checks, and canonical inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 02:15:11 +00:00
Ettore Di Giacinto
ede23df333 docs(paged): close W4A16 pad checklist
Mark the rejected-branch disposition as not taken because Phase 4 was kept as patch 0050 with recorded md5, op, perf, and mirror gates.

Assisted-by: Codex:gpt-5
2026-07-01 01:58:22 +00:00
Ettore Di Giacinto
abc70c209e docs(paged): close ragged MoE dispatch shortcut
Record the Phase 8 safety rerun, canonical transcript md5 gates, full and ragged MUL_MAT_ID op gates, and the no-production-patch decision for metadata-only fused dispatch work.

Assisted-by: Codex:gpt-5
2026-07-01 01:57:45 +00:00
Ettore Di Giacinto
2074b4fb5b docs(paged): reject GDN global Ai32 prototype
Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10.

Assisted-by: Codex:gpt-5
2026-07-01 01:51:53 +00:00
Ettore Di Giacinto
adabd11919 docs(paged): scope GDN global Ai32 prototype
Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.

Assisted-by: Codex:gpt-5
2026-07-01 01:38:51 +00:00
Ettore Di Giacinto
1b5ae227eb docs(paged): reject GDN M5 QS-early phase
Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact.

Assisted-by: Codex:gpt-5
2026-07-01 01:29:44 +00:00
Ettore Di Giacinto
24e778de47 docs(paged): scope GDN M5 state-boundary phase
Add the Phase 11 design and implementation plan for a default-off C16 M5 QS-early GDN experiment after rejecting C32 slabs.

Assisted-by: Codex:gpt-5
2026-07-01 01:21:05 +00:00
Ettore Di Giacinto
3da3b169fb docs(paged): reject GDN C32 slab phase
Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch.

Assisted-by: Codex:gpt-5
2026-07-01 01:15:00 +00:00
Ettore Di Giacinto
ff3ad84191 docs(paged): record GDN C32 slab baseline
Record the Phase 10 current-M5 prefill baseline and the source inspection finding that C32 M5 needs a real U-staging implementation rather than a launcher-only shortcut.

Assisted-by: Codex:gpt-5
2026-07-01 00:58:54 +00:00
Ettore Di Giacinto
9bbe02c161 fix(paged): gate MTP backend sampling
Record the Phase 9 MTP smoke gate, mirror the fork patch that disables backend sampling for MTP drafts, and scope the follow-up C32 slab GDN prefill phase.

Assisted-by: Codex:gpt-5
2026-07-01 00:54:25 +00:00
Ettore Di Giacinto
b862e2c568 docs(paged): stop ragged dispatch source shortcut
Assisted-by: Codex:gpt-5
2026-07-01 00:42:36 +00:00
Ettore Di Giacinto
b009de0ee0 test(paged): mirror ragged MoE dispatch gate
Assisted-by: Codex:gpt-5
2026-07-01 00:41:21 +00:00
Ettore Di Giacinto
89ef3a4020 docs(paged): record ragged MoE profile gate
Assisted-by: Codex:gpt-5
2026-07-01 00:35:21 +00:00
Ettore Di Giacinto
ef14748f06 docs(paged): scope ragged MoE dispatch phase
Assisted-by: Codex:gpt-5
2026-07-01 00:26:01 +00:00
Ettore Di Giacinto
b6885aa446 docs(paged): reject weighted combine fusion candidate
Assisted-by: Codex:gpt-5
2026-07-01 00:20:53 +00:00
Ettore Di Giacinto
4b6fc0fa1c test(paged): mirror MoE weighted combine gate
Assisted-by: Codex:gpt-5
2026-06-30 23:51:52 +00:00
Ettore Di Giacinto
22a93ce1a3 docs(paged): select weighted combine candidate
Assisted-by: Codex:gpt-5
2026-06-30 23:47:34 +00:00
Ettore Di Giacinto
3cf7fa1715 docs(paged): reject swiglu down fusion candidate
Assisted-by: Codex:gpt-5
2026-06-30 23:41:38 +00:00
Ettore Di Giacinto
d0fa463eac test(paged): mirror MoE swiglu down gate
Mirror the llama.cpp Phase 7 test gate for the merged MoE gate_up/SWIGLU/down chain and record the DGX md5/op gate evidence.

Assisted-by: Codex:gpt-5
2026-06-30 23:20:52 +00:00
Ettore Di Giacinto
34c4b5ce8d docs(paged): scope phase7 serving candidates
Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.

Assisted-by: Codex:gpt-5
2026-06-30 23:12:09 +00:00
Ettore Di Giacinto
b647460dee docs(paged): record phase6 serving classifier
Record both-engine serving nsys buckets, rejected sampler short-circuit, and rejected GDN/MMQ env grids for the GB10 parity work.

Assisted-by: Codex:gpt-5
2026-06-30 23:04:15 +00:00
Ettore Di Giacinto
f9e015d8e2 docs(paged): record W4A16 Wq padding rejection
Record the Phase 5 Wq shared-memory padding experiment, its gates, sub-threshold benchmark gain, and the decision to ship no 0051 patch.

Assisted-by: Codex:gpt-5
2026-06-30 22:23:14 +00:00
Ettore Di Giacinto
85c88320ef patches(paged): pad W4A16 A shared tile stride
Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification.

Assisted-by: Codex:gpt-5
2026-06-30 22:15:21 +00:00
Ettore Di Giacinto
8b413d1cbd docs(paged): record W4A16 scale broadcast rejection
Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID gates, the prefill regression, and the decision to ship no 0050 patch.

Assisted-by: Codex:gpt-5
2026-06-30 22:06:17 +00:00
Ettore Di Giacinto
c5f2545cdd patches(paged): tune W4A16 grouped tile shape
Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification.

Assisted-by: Codex:gpt-5
2026-06-30 21:57:42 +00:00
Ettore Di Giacinto
d8edc615e7 patches(paged): mirror W4A16 packed metadata
Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off.

Assisted-by: Codex:gpt-5
2026-06-30 21:21:53 +00:00
Ettore Di Giacinto
1c0709b700 docs(paged): record W4A16 phase1 kill gate
Record the clean forced W4A16 baseline, default comparison, selected metadata target, and completed plan checkpoint for the GB10 parity reopen.

Assisted-by: Codex:gpt-5
2026-06-30 20:40:40 +00:00
Ettore Di Giacinto
337ebb8a37 docs(paged): record phase0 decode repro
Record comparable graph-node-traced paged and vLLM decode difference-method artifacts for the GB10 parity reopen.

Assisted-by: Codex:gpt-5
2026-06-30 20:35:43 +00:00
Ettore Di Giacinto
ef5d4af203 docs(paged): record phase0 prefill baseline
Record clean-source MoE and dense prefill baselines for the GB10 parity reopen and mark the plan checkpoint complete.

Assisted-by: Codex:gpt-5
2026-06-30 20:22:18 +00:00
Ettore Di Giacinto
a9a2efb296 docs(paged): record phase0 clean build gates
Record the clean DGX build retry, binary provenance, canonical greedy md5 gates, and completed plan steps for the GB10 parity reopen.

Assisted-by: Codex:gpt-5
2026-06-30 20:19:14 +00:00
Ettore Di Giacinto
d288a0300f docs(paged): add GB10 parity implementation plan
Add the Superpowers implementation plan for the GB10 parity reopen, including Phase 0 provenance, decode repro, W4A16 kill gates, and later kernel workstream entry criteria.

Assisted-by: Codex:gpt-5
2026-06-30 15:50:01 +00:00
Ettore Di Giacinto
4cd90bfae9 paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)
The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is
removed from the llama-cpp-localai-paged patch series. Clean re-measurement after
the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16
(tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs
780.0 t/s. The mode engages but adds zero speed because it is subsumed by the
fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau
was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and
extra CUDA template-instantiation compile cost with no offsetting benefit.

Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only
mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn,
which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025,
0028-0030) applies clean with git apply --check against the pin
0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob
(patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are
already absent).

Removed:
- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared
  grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no
  longer reads)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows
  (README + final_benchmark.csv), the ssm_bf16_tau option text in backend
  index.yaml, the gallery NOTE block, and the docs/features/backends.md mention.

The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat)
in the backend README section 5, the paged-backend agent guide, and the
vLLM-parity methodology, so it is not re-tried.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 16:06:06 +00:00
Ettore Di Giacinto
ea72a56e2c Merge origin/master + pin-sync paged backend to 0ed235ea
master auto-bumped the stock llama-cpp pin 9d5d882d -> 0ed235ea and updated the
shared grpc-server.cpp. The paged backend's pin must track the stock pin (the
grpc-server.cpp is shared), so bump its LLAMA_VERSION to match. All 28 paged
patches apply clean on 0ed235ea (verified against a fresh upstream clone). The
bf16-tau state-serialization fix (patch 0026) is included. Bit-exact gate + full
grpc-server build verify on GPU/CI to follow.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 07:56:47 +00:00
LocalAI [bot]
de2ec2f136 feat(backends): add voice-detect + face-detect ggml backends (replace Python insightface/speaker-recognition) (#10441)
* feat(voice-detect): add Go purego backend for voice-detect.cpp

Add backend/go/voice-detect implementing the Backend gRPC voice subset
(VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego,
mirroring the parakeet-cpp / omnivoice-cpp backends.

The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and
float-vector returns are owned by Go and released through the matching capi
free functions, with the per-ctx last error surfaced into Go errors. Calls are
serialized via base.SingleThread since the C context is not reentrant.

Proto field mapping:
- VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model.
- VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the
  verify_threshold option, default 0.25) -> verify_paths -> verified/distance/
  threshold/confidence/model/processing_time_ms.
- VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion
  document maps to a single VoiceAnalysis segment (start/end 0; gender "label"
  -> dominant_gender with the remaining float scores as the gender map; emotion
  label/scores -> dominant_emotion/emotion).

The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so
with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external
libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo
tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs
gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(voice-detect): wire backend into index, gallery and build

Register the voice-detect.cpp speaker-recognition + voice-analysis
backend (added in Voice-INT-A) into LocalAI's distribution surfaces,
mirroring the ced backend (the closest mudler C++/ggml audio analogue):

- backend/index.yaml: add the &voicedetect meta-backend (capabilities
  platform map, no top-level uri) plus the full set of concrete per-arch
  image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the
  -development variants). Referential integrity audited - every alias
  target resolves.
- gallery/index.yaml: add 5 model entries on backend voice-detect -
  ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the
  wav2vec2 age/gender/emotion analyze model. The engine architecture is
  read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are
  not yet published: each files: entry points at the intended
  mudler/voice-detect-gguf location with a TODO to fill sha256 after
  upload (no fabricated hashes).
- .github/backend-matrix.yml: add the linux build matrix block + the
  darwin metal entry mirroring ced.
- .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via
  VOICEDETECT_VERSION (pin 47546430, = 4754643).
- core/config/backend_capabilities.go: register voice-detect in the
  backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze ->
  speaker_recognition), mirroring speaker-recognition.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(face-detect): add purego Go backend for face-detect.cpp

Add the LocalAI Go backend that dlopens libfacedetect.so (the flat
facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect
backend. Implements the Face subset of the Backend gRPC service:

- Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path
  -> L2-normalized ArcFace embedding.
- Detect(DetectOptions): src -> detect_path_json -> Detection boxes
  (class_name "face", [x1,y1,x2,y2] -> x/y/w/h).
- FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof ->
  verify_paths; best-effort img areas via detect.
- FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face
  age + gender ("M"/"F" normalized to "Man"/"Woman").

The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib
with ggml + vendored libjpeg-turbo static (PIC), so the .so is
ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_
symbols). Gated Ginkgo e2e mirrors voice-detect.

Note for the gallery-wiring task: backend registration (index.yaml,
gallery, core/config/backend_capabilities.go) is intentionally not
touched here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(voice-detect): replace em dashes in net-new descriptions

Project style forbids em/en dashes. Replace the three U+2014 chars
introduced by the voice-detect gallery/index wiring with `-`/`:`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(face-detect): wire backend into index, gallery and build

Register the face-detect.cpp face detection / embedding / verification /
analysis backend (added in Face-INT-A) into LocalAI's distribution
surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml
recognition analogue):

- backend/index.yaml: add the &facedetect meta-backend (capabilities
  platform map, no top-level uri to avoid the meta-backend gotcha) plus
  the full set of concrete per-arch image entries (cpu/cuda12/cuda13/
  metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants),
  22 entries. Referential integrity audited: every alias target resolves.
- gallery/index.yaml: add 4 model entries on backend face-detect -
  face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL)
  and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the
  commercial-friendly alternative). The detector/embedder architecture is
  read from GGUF metadata (facedetect.arch) at load; only the real
  verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF
  artifacts are not yet published: each files: entry points at the
  intended mudler/face-detect-gguf location with a TODO to fill sha256
  after upload (no fabricated hashes).
- core/config/backend_capabilities.go: register face-detect in the
  backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze ->
  face_recognition), mirroring insightface.
- .github/backend-matrix.yml: add the linux build matrix block + the
  darwin metal entry mirroring voice-detect.
- .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via
  FACEDETECT_VERSION (pin 636a1963).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(recon): voice-detect metal build branch + face-detect gallery usecases

Add the missing metal BUILD_TYPE branch to the voice-detect Makefile
forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the
darwin metal CI artifact is built with the Metal backend instead of
CPU-only.

Expand the 4 face-detect gallery models' known_usecases to
[face_recognition, detection, embeddings] to match the backend
capabilities map and the mirrored insightface-buffalo entries, so
auto-selection for /v1/detect and /embeddings works.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(recon): document voice-detect and face-detect ggml backends

Document the new standalone C++/ggml biometric backends as the
recommended/default option for face and voice recognition, keeping the
existing Python insightface / speaker-recognition backends framed as the
legacy path.

- features/face-recognition.md: add a face-detect (ggml) backend section
  with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface
  Apache-2.0), licensing, and verify/detect/analyze quickstart.
- features/voice-recognition.md: add a voice-detect (ggml) backend
  section with the gallery entries (ecapa-tdnn, wespeaker-resnet34,
  eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial
  analyze head) and quickstart.
- reference/compatibility-table.md: add face-detect.cpp and
  voice-detect.cpp rows to the Vision, Detection & Recognition table.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(gallery): publish recon backend GGUF uris + sha256

Fill in the published HuggingFace GGUF uris and verified sha256 for the
9 recon gallery entries (voice-detect-* and face-detect-*), and remove
the TODO publish markers. Correct the eres2net, campplus, and
emotion-wav2vec2 uris to the actual published filenames.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): re-embed buffalo anti-spoof + add audeering age/gender voice model

Update the 3 buffalo face-detect GGUF sha256 (anti-spoof ensemble now
embedded and re-uploaded under the same filenames/uris) and note the
FaceVerify anti_spoof request flag in each description. Add a new
voice-detect-age-gender-wav2vec2 gallery entry mirroring the emotion
model.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): add face-detect-buffalo-sc and antelopev2 packs

Add gallery entries for two newly-published insightface face packs on
the face-detect backend: buffalo_sc (smallest pack, SCRFD-500M + small
ArcFace) and antelopev2 (higher-accuracy, SCRFD-10G + ArcFace glint360k
R100, 512-d). Both are non-commercial research-only.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(recon): honor LocalAI per-model threads in voice/face-detect backends

LocalAI spawns one backend process per model and serves requests
concurrently, so the engines' own min(hardware_concurrency, 8) default
can oversubscribe cores. Forward the per-model Threads value from the
gRPC LoadModel options into the engine via VOICEDETECT_THREADS /
FACEDETECT_THREADS (read at backend construction) before the capi load.
A non-positive Threads is treated as unset, leaving the engine default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to CPU-optimized engine commits

voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached
pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads).
Brings the CPU optimizations into the LocalAI backend builds. GGUF format and
parity unchanged, so the published HF GGUFs remain valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-2 CPU-optimized engines

voice-detect.cpp -> fe7e6a3 (ERes2Net 1x1->mul_mat, CAM++ layout+context,
wav2vec2 conv-LN, ECAPA capture-drop, AVX512 dispatch opt-in); face-detect.cpp
-> 9c8adb7 (AVX2 Winograd F(2x2,3x3) for SCRFD/ArcFace 3x3 convs, ArcFace
BN-fold). Parity unchanged (cosine=1.0); GGUF format unchanged, HF GGUFs valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-3 Winograd engines

voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3
convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for
SCRFD large maps, -22% @1t on top of F(2x2), more load-stable). Parity held
(cosine=1.0); GGUF format unchanged, HF GGUFs valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-4 Winograd engines (CPU opt complete)

voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%);
face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD
detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is
now exhausted (parity held cosine=1.0; GGUF/parity unchanged, HF GGUFs valid).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump face-detect pin to deep-kernel engine (7ae5c4d)

face-detect.cpp -> 7ae5c4d: register-blocked winograd-domain GEMM microkernel
(2.8x isolated GFLOP/s), AVX-512 zmm evolution behind runtime CPUID dispatch
(ship-safe, AVX2 fallback bit-identical), bias/relu fused into the winograd
output transform, and SFace Conv+BN fold + bias/PReLU fusion. SCRFD detect
~1.4x faster end-to-end vs the round-4 baseline; parity bit-exact; portable
single binary (function-multiversioned, no global -mavx512f).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice-detect pin to ECAPA operand-order win (e9c56ae)

voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32
conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1t / ~1.3x @8t,
parity cosine=1.0. Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker
do not call conv1d_same so are provably unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to FMA-throughput engines (voice f7b9f89, face 2d2d5f0)

face -> 2d2d5f0: route ArcFace 3x3 body convs through the AVX-512 winograd
microkernel (kWinoMinSize 80->14); ArcFace 1.62x @1t, SCRFD detect to 0.966 of
MLAS @1t, no regression. voice -> f7b9f89: runtime-CPUID-dispatched AVX-512
winograd-GEMM microkernel (ship-safe, AVX2 fallback bit-identical); WeSpeaker
1.90x @1t. Parity cosine=1.0 throughout; portable single binaries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to MLAS-class direct-conv engines (voice 7ecfd07, face be22d67)

Hand-tuned nChw16c AVX-512 register-tiled direct-conv microkernel (~263 GFLOP/s,
within 6-7% of MLAS per-op efficiency), runtime-CPUID-dispatched + AVX2 fallback,
fused bias/relu. voice 7ecfd07: default 3x3-s1 kernel for WeSpeaker (+37%/+32%)
+ ERes2Net, CAM++ pinned to Winograd. face be22d67: shape-gated to the ArcFace
recognizer body (+25-27% @8t); SCRFD detector stays on Winograd (no regression).
Parity cosine=1.0 / detect <=1px on AVX-512 + AVX2 paths. Portable single binaries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice pin to Phase-A blocked backbone (f4e7eef)

WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs
~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1t /
+17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is
already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is
sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to breadth blocked-backbone (voice 7f66871, face d80092b)

voice 7f66871: AVX2-vectorized (ymm) blocked island - AVX2-only hosts now run
the blocked backbone for WeSpeaker (2.3x over per-conv-AVX2, cosine=1.0);
ERes2Net stays per-conv (blocked regresses, opt-in only); CAM++ Winograd-pinned.
face d80092b: ArcFace recognizer blocked island, AVX-512 default (-13% @8t, ~0.90x
MLAS, the closest conv result), auto per-conv on AVX2; SCRFD untouched on Winograd
(0 island invocations during detect). Parity cosine=1.0 / detect <=1px throughout.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to small-spatial + stem conv kernels (voice 99b1804, face 47fdab6)

Measured-gap-driven conv kernels: small-spatial (fill the register tile when
output width <= tile width) + small-IC stem + strided-1x1/downsample recovery.
ArcFace recognizer 0.57 -> 0.70x MLAS @1t (the closest conv model), WeSpeaker
0.65 -> 0.79x @1t. Parity cosine=1.0 / detect <=1px. The OC-block-sharing lever
was a measured dead-end (deep stride-1 is L3-weight-bandwidth bound, not
read-port bound) and was NOT shipped. Kernel ceiling reached; further gap needs
an algorithm-class change (cache-blocked weight-stationary GEMM, or q8 weights).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to GPU persistent-graph + multi-model-safe cache (voice 45d2e6b, face 0a4799a)

GPU wins (CUDA/ggml backend, no CPU-path change): persistent per-shape graph+context
cache in Backend::compute() eliminates the per-call cudaGraph re-instantiation churn
-> wav2vec2 emotion+age-gender now AT GPU parity with torch-cuDNN on GB10 (0.97-0.98x),
CAM++ -5.7ms; bit-identical parity. Cache hardened multi-model-safe (invalidate-on-free
keyed by the ModelLoader weights buffer) so LocalAI multi-model hosting cannot stale-hit.
Conv models still trail cuDNN (im2col-materialization-bound) - cuDNN implicit-GEMM lever next.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to cuDNN-conv-capable engines (voice b6e4356, face 6107a24)

Adds the opt-in cuDNN implicit-GEMM conv path (VOICEDETECT_GGML_CUDNN /
FACEDETECT_GGML_CUDNN, DEFAULT OFF -> zero build/runtime dep until enabled).
On GPU it kills the im2col-materialization bottleneck and reaches torch-cuDNN
parity on the spill-bound convs: SCRFD detect 14.8->6.4ms (2.3x, ~parity),
WeSpeaker ~parity, ERes2Net beats torch (1.10x); ArcFace/CAM++ neutral (no
spill). Parity exact (SCRFD <=1px, cosine=1.0). To USE it in LocalAI, the CUDA
backend build must enable the flag AND bundle libcudnn - deferred until a
cuDNN-bundled GPU image; flag stays OFF here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(recon): enable cuDNN conv path on arm64+CUDA13 recon backends

The voice-detect.cpp / face-detect.cpp engines have an opt-in cuDNN
implicit-GEMM conv path behind VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN
(default OFF) that kills im2col on the GPU and reaches torch-cuDNN parity
(SCRFD 2.3x, WeSpeaker/ERes2Net parity), measured on the GB10
(arm64, CUDA 13, sm_121a).

Enable it for the CUDA build, but only where cuDNN actually ships: the
arm64 + CUDA 13 image (GB10/Jetson/L4T). x86 CUDA images carry no cuDNN,
so flipping it on globally for BUILD_TYPE=cublas would be a link failure.
The Makefiles gate on CUDA_MAJOR_VERSION=13 + arch (TARGETARCH from the
matrix/Docker build, uname -m fallback for local builds).

backend/Dockerfile.golang already installs the runtime libcudnn9-cuda-13
in the arm64+CUDA13 apt block; add the matching libcudnn9-dev-cuda-13 so
the build-time link resolves.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice-detect pin to ERes2Net blocked-default (30beecd)

Defaults VD_ERES2NET_BLOCKED ON: routes the ERes2Net Res2Net body through the
blocked nChw16c AVX-512 directconv island instead of the 1x1 mul_mat fast path
(CONT-transpose + skinny low-K GEMM). On the shipped GGML_NATIVE=OFF build (ggml
mul_mat is AVX2-only) this wins ~2x at every thread count (2.07x@1t, 2.2x@4t,
2.05x@8t); pure-AVX2 fallback still 1.3-1.62x. Parity exact (cosine=1.000000 vs
golden), so registered voices + verify/identify thresholds are unaffected. The
prior default-OFF rested on a stale comment whose 23pct regression only held on
the non-shipping GGML_NATIVE=ON build.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(readme): announce native voice-detect + face-detect backends in Latest News

Add a Latest News entry for the new from-scratch C++/ggml biometric backends
(voice-detect.cpp + face-detect.cpp) that replace the Python insightface and
speaker-recognition backends: no Python/onnxruntime at inference, self-contained
GGUF, bit-exact parity, GPU cuDNN parity. Mirrors the parakeet.cpp /
locate-anything.cpp native-backend news entries. Refs PR #10441.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): re-pin to the squashed engine release commits

The voice-detect.cpp and face-detect.cpp histories were squashed to a single
release commit, which orphaned the previous pins (voice 30beecd, face 6107a24).
Re-pin to the new single-commit SHAs (voice 3d51077, face 06914b0); the tree is
identical, so the backend build is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 09:29:08 +02:00