Yesterday two PRs (#9724 llama.cpp bump, #9731 llama-cpp-darwin
consolidation) merged 11 seconds apart. Both shared the same
backend.yml concurrency group (ci-backends-refs/heads/master-...) due
to "${{ github.head_ref || github.ref }}" — empty head_ref on push
events falls through to the static refs/heads/master. With
cancel-in-progress: true that meant the second merge cancelled the
first's in-flight backend builds. The first PR's CI never finished;
the second PR only touched CI files so its run was a no-op.
Two changes per workflow:
- group: replace "${{ github.head_ref || github.ref }}" with
"${{ github.event.pull_request.number || github.sha }}". On PRs
this groups by PR number (same as before, just keyed on number not
branch name); on push events it groups per-commit, so two master
pushes never share a group.
- cancel-in-progress: gate on github.event_name == 'pull_request' so
rapid pushes to a PR still cancel old runs (newer push wins) but
master pushes never cancel each other.
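For reference, the resulting concurrency block looks roughly like this
(the ci-backends prefix is taken from the old group name quoted above;
the other workflows use their own prefixes):

    concurrency:
      # PRs group by PR number; push events group per-commit (SHA), so
      # two master pushes never share a group
      group: ci-backends-${{ github.event.pull_request.number || github.sha }}
      # only PR runs cancel their predecessors; master pushes run to completion
      cancel-in-progress: ${{ github.event_name == 'pull_request' }}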
Trade-off vs alternatives:
- Merge queue would also solve this and additionally test the merged
commit before it lands. Heavier process change; out of scope here.
- Allowing per-commit master concurrency means two simultaneous master
runs may overlap and race on tag pushes, but each commit's manifest
digest is unique and the registry is last-writer-wins on tags: the
newer commit's tag overwrites the older one.
Applied to 11 workflows that share the same concurrency pattern:
backend.yml, backend_pr.yml, image.yml, image-pr.yml, lint.yml,
test.yml, test-extra.yml, tests-e2e.yml, tests-aio.yml,
tests-ui-e2e.yml, generate_intel_image.yaml.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: add per-arch + manifest-merge support for LocalAI server image
Mirror the backend_build.yml + backend_merge.yml pattern shipped in
PR #9726 for the LocalAI server image:
- image_build.yml accepts optional platform-tag (default ''), scopes
registry cache to cache-localai<suffix>-<platform-tag>, and pushes
by canonical digest only on push events. Digests upload as artifacts
named digests-localai<suffix>-<platform-tag>, with a "-core"
placeholder when tag-suffix is empty so the merge job's download
pattern doesn't over-match across multiple suffixes.
- image_merge.yml is a new reusable workflow that downloads matching
digest artifacts and assembles the final tagged manifest list via
docker buildx imagetools create.
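The merge step boils down to the documented buildx pattern, roughly as
follows (the /tmp/digests layout and the TAG variable are illustrative;
the real workflow derives its tags from inputs):

    - name: Create and push manifest list
      working-directory: /tmp/digests
      run: |
        # each downloaded digest artifact is an empty file named after a
        # per-arch image digest; stitch them into one tagged manifest list
        docker buildx imagetools create \
          --tag quay.io/go-skynet/local-ai:${TAG} \
          $(printf 'quay.io/go-skynet/local-ai@sha256:%s ' *)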
Image names differ from backend_*.yml: the LocalAI server is published
under quay.io/go-skynet/local-ai and localai/localai (not -backends).
Not yet wired into image.yml / image-pr.yml — Commit C does that.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: fan out per-arch split to remaining 34 backends
Convert all remaining linux/amd64,linux/arm64 entries in
backend-matrix.yml to per-arch + manifest-merge form. Each was a
single matrix entry running both arches on x86 under QEMU emulation;
it now becomes two: amd64 on ubuntu-latest, arm64 on ubuntu-24.04-arm
(native).
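Schematically, for one backend (field names illustrative; "foo" is a
placeholder and real entries carry more keys):

    # before: one entry, arm64 built under QEMU on x86
    - backend: "foo"
      platforms: "linux/amd64,linux/arm64"
      runs-on: "ubuntu-latest"

    # after: two native legs, merged into one manifest list downstream
    - backend: "foo"
      platforms: "linux/amd64"
      runs-on: "ubuntu-latest"
    - backend: "foo"
      platforms: "linux/arm64"
      runs-on: "ubuntu-24.04-arm"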
Four backends that were on bigger-runner (-cpu-llama-cpp,
-cpu-turboquant, -gpu-vulkan-llama-cpp, -gpu-vulkan-turboquant) have
both legs moved to free tier as part of the same change. They are
compile-only (no torch/CUDA install) and fit comfortably with the
setup-build-disk /mnt relocation. Phase 4 (next commit) retires the
remaining 5 single-arch bigger-runner entries.
After this commit:
- 271 total matrix entries (was 237)
- 0 multi-arch entries left
- 36 per-arch pairs (34 new + 2 pilots from PR #9727)
- 5 bigger-runner entries remaining (single-arch, Phase 4 target)
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: split LocalAI image multi-arch entries per arch + merge
Mirror the backend per-arch split for the main LocalAI image:
- image.yml's core-image-build matrix: split the core ('') and
-gpu-vulkan entries into amd64 + arm64 legs each. amd64 on
ubuntu-latest, arm64 on ubuntu-24.04-arm (native).
- New top-level core-image-merge and gpu-vulkan-image-merge jobs
call image_merge.yml after core-image-build completes (see the sketch
after this list).
- image-pr.yml's image-build matrix: split the -vulkan-core entry.
No merge job is added on the PR side: image_build.yml only pushes
digests on push events, so a PR-side merge would have nothing to
download.
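Roughly, in image.yml (job names from this commit; input plumbing
abbreviated):

    core-image-merge:
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
        # must match the suffix the build legs used so the artifact
        # download pattern picks up only the matching digests
        tag-suffix: ""
      secrets: inherit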
After this commit, no workflow file references
linux/amd64,linux/arm64 in a single matrix slot.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: retire bigger-runner from backend matrix (Phase 4)
Migrate the remaining 5 single-arch bigger-runner entries to
ubuntu-latest. Combined with the Phase 3 setup-build-disk /mnt
relocation (PR #9726), free-tier ubuntu-latest now has ~100 GB of
working space — enough for ROCm dev image (~16 GB), CUDA toolkit
(~5 GB), and the per-backend compile/install steps these entries do.
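The /mnt relocation itself (PR #9726) amounts to pointing Docker's
data-root at the runner's large ephemeral disk. A minimal sketch of the
idea as a plain workflow step (the repo wraps this in the
setup-build-disk action, whose internals may differ):

    - name: Relocate Docker storage to /mnt
      run: |
        # GitHub-hosted runners mount a large ephemeral disk at /mnt;
        # moving Docker's data-root there frees up working space
        sudo systemctl stop docker
        sudo mkdir -p /mnt/docker
        echo '{"data-root": "/mnt/docker"}' | sudo tee /etc/docker/daemon.json
        sudo systemctl start docker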
Backends migrated:
- -gpu-nvidia-cuda-12-llama-cpp
- -gpu-nvidia-cuda-12-turboquant
- -gpu-rocm-hipblas-faster-whisper
- -gpu-rocm-hipblas-coqui
- -cpu-ik-llama-cpp
After this commit, .github/backend-matrix.yml has zero bigger-runner
references. The bigger-runner used in tests-vibevoice-cpp-grpc-
transcription (test-extra.yml) is a separate concern handled in a
follow-up.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: migrate 9 Intel oneAPI backends to free tier (Phase 5.1)
Intel oneAPI base image is ~6 GB; each backend's wheel install
stays well within the ~100 GB working space provided by Phase 3's
setup-build-disk /mnt relocation. Lowest-risk batch of the
arc-runner-set retirement.
Backends migrated:
vllm, sglang, vibevoice, qwen-asr, nemo, qwen-tts,
fish-speech, voxcpm, pocket-tts (all -gpu-intel-* variants).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: migrate 15 ROCm Python backends to free tier (Phase 5.2)
ROCm dev image (~16 GB) plus per-backend torch/wheels install fits
on ubuntu-latest with the /mnt-relocated Docker root. These entries
include the heavier vLLM/sglang/transformers/diffusers stack on
ROCm; if any specific backend OOMs or runs out of disk, its entry
can be flipped back to arc-runner-set individually.
Backends migrated: all 15 -gpu-rocm-hipblas-* entries previously on
arc-runner-set (vllm/vllm-omni/sglang/transformers/diffusers/
ace-step/kokoro/vibevoice/qwen-asr/nemo/qwen-tts/fish-speech/
voxcpm/pocket-tts/neutts).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: migrate 6 CUDA Python backends to free tier (Phase 5.3)
vLLM/sglang stacks on CUDA 12 and CUDA 13 are the heaviest
backends in the matrix — flash-attn intermediate layers can spike
disk usage during build. setup-build-disk's /mnt relocation gives
~100 GB working space which fits the documented peak.
Highest-risk batch of the arc-runner-set retirement; if any
backend fails to build on free tier, the per-entry runs-on flip
is the unit of revert.
Backends migrated: -gpu-nvidia-cuda-{12,13}-{vllm,vllm-omni,sglang}.
After this commit, .github/backend-matrix.yml has zero references
to arc-runner-set or bigger-runner. The migration is complete.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: disable provenance on multi-registry digest pushes
Root-caused on master via PR #9727's pilot: when docker/build-push-action@v7
pushes a single build to TWO registries simultaneously with
push-by-digest=true, buildx generates a per-registry provenance
attestation manifest (because mode=max — the default for push:true —
includes the runner ID). That makes the resulting manifest-list digest
diverge across registries:
arm64 -cpu-faster-whisper build:
image manifest: sha256:d3bdd34b... (identical, content-only)
quay manifest list: sha256:66b4cfc8... (with quay attestation)
dockerhub manifest list: sha256:e0733c3b... (with dockerhub attestation)
steps.build.outputs.digest returns only one of the list digests
(empirically the dockerhub one). The merge job then requests
"quay.io/...@sha256:e0733c3b...", which doesn't exist on quay (that
list has digest 66b4cfc8 there). Result: imagetools create fails with
"not found" and the merge job fails (run 25581983094, job 75110021491).
Setting provenance: false drops the per-registry attestation; the
manifest-list digest becomes pure content, identical across both
registries, and steps.build.outputs.digest works on either lookup.
Applied to backend_build.yml and image_build.yml — both refactored
to use the same multi-registry digest-push pattern in the prior PRs.
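The fix in the build step, sketched (inputs per docker/build-push-action;
fields abbreviated, the matrix.platform reference is illustrative, and
only one registry is shown):

    - name: Build and push by digest
      uses: docker/build-push-action@v7
      with:
        platforms: ${{ matrix.platform }}
        # push by digest only; tags are applied later by the merge job
        outputs: type=image,name=quay.io/go-skynet/local-ai,push-by-digest=true,push=true
        # no provenance attestation: the manifest-list digest stays pure
        # content and therefore identical across registries
        provenance: false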
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
- Switch cache-from/cache-to in backend_build.yml and image_build.yml
from the unused gha cache to type=registry pointing at
quay.io/go-skynet/ci-cache:cache<tag-suffix>, mode=max with
ignore-error=true. Master/tag builds populate their own
per-matrix-entry cache; PR builds are read-only (sketch after this list).
- Drop the broken generate_grpc_cache.yaml cron. It targeted a `grpc`
Dockerfile stage that was removed by b1fc5acd in July 2025, has been
failing every night since, and never populated the gha cache. The new
registry-cache scheme is self-warming, so no separate populator is
needed.
- Remove the dead GRPC_VERSION / GRPC_BASE_IMAGE / GRPC_MAKEFLAGS
build-args from image_build.yml and the orphan ARG GRPC_BASE_IMAGE in
the root Dockerfile (the root Dockerfile no longer compiles gRPC; the
source build now lives in backend/Dockerfile.{llama-cpp,
ik-llama-cpp, turboquant} only and uses its own ARG defaults).
- Drop the unused grpc-base-image input from image_build.yml plus the
matrix passthroughs in image.yml / image-pr.yml.
- Drop the unused GRPC_VERSION env in test.yml.
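Sketch of the new cache wiring in the build step (the tag-suffix input
name is assumed; the gating of cache-to to master/tag builds is omitted):

    - name: Build
      uses: docker/build-push-action@v7
      with:
        # all builds read the per-matrix-entry cache
        cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}
        # only master/tag builds write it back; ignore-error keeps a
        # missing or unwritable cache from failing the build
        cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }},mode=max,ignore-error=true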
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7-1m
AIO images have fallen behind, and maintaining them takes effort. The
wizard and model installation have been simplified massively, so AIO
images have lost their purpose.
This allows us to be more laser-focused on the main images and relieves
stress on CI.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Some datacenter setups might be stuck with the 5.x kernel, which doesn't
play well with CUDA >=12.9. To increase compatibility with the CUDA 12.x
branch, downgrade to 12.8. For newer systems, it is still suggested to
use CUDA 13.x wherever compatible.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This image is for HW prior to Jetpack 7. Jetpack 7 broke compatibility
with older devices that are still in use, such as AGX Orin and other
Jetsons. While we do have l4t-cuda-13 images with sbsa support for newer
Nvidia devices (Thor, DGX, etc.), for older HW we are forced to keep old
images around, as 24.04 does not seem to be supported.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: use ubuntu 24.04 for cuda13 l4t images
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop openblas from containers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ci): add cuda13 jobs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add to pipelines and to capabilities. Start to work on the gallery
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* gallery
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* capabilities: try to detect by looking at /usr/local
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* neutts
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* backends.yaml
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* add cuda13 l4t requirements.txt
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* add cuda13 requirements.txt
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Pin vllm
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Not all backends are compatible
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* add vllm to requirements
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* vllm is not pre-compiled for cuda 13
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: split remaining backends and drop embedded backends
- Drop silero-vad, huggingface, and stores backend from embedded
binaries
- Refactor Makefile and Dockerfile to avoid building grpc backends
- Drop golang code that was used to embed backends
- Simplify building by using goreleaser
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(gallery): be specific with llama-cpp backend templates
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(docs): update
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ci): minor fixes
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: drop all ffmpeg references
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: run protogen-go
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Always enable p2p mode
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Update gorelease file
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(stores): do not always load
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix linting issues
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Simplify
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Mac OS fixup
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: Add backend gallery
This PR adds support for managing backends much like models: a backend
gallery is now available which can be used to install and remove extra
backends.
The backend gallery can be configured similarly to a model gallery, and
API calls allow installing and removing backends at runtime as well as
during LocalAI's startup phase.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add backends docs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* wip: Backend Dockerfile for python backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: drop extras images, build python backends separately
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixup on all backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test CI
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Tweaks
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop old backends leftovers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixup CI
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Move dockerfile upper
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix proto
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Feature dropped for consistency - we prefer model galleries
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add missing packages in the build image
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* exllama is only available on cublas
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* pin torch on chatterbox
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups to index
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* CI
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Debug CI
* Install accelerator deps
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add target arch
* Add cuda minor version
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Use self-hosted runners
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: use quay for test images
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups for vllm and chatterbox
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Small fixups on CI
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chatterbox is only available for nvidia
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Simplify CI builds
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Adapt test, use qwen3
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(model gallery): add jina-reranker-v1-tiny-en-gguf
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Use reranker from llama.cpp in AIO images
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Limit concurrent jobs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* fix(cuda): downgrade to 12.0 to increase compatibility range
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* improve messaging
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>