The Dockerfile's HEALTHCHECK probes http://localhost:8080/readyz, which
is the OpenAI API server port. When the same image runs as a worker, it
listens on the gRPC base port (50051) and an HTTP file transfer server
on port-1 (50050) — nothing on 8080 — so docker always reports the
container as unhealthy.
Add unauthenticated /readyz and /healthz endpoints to the worker's HTTP
file transfer server, and override HEALTHCHECK_ENDPOINT for worker-1 in
the distributed compose file. Disable the healthcheck for agent-worker
since it is NATS-only and exposes no HTTP server.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
In distributed mode the Backends gallery used to fan every install out to
every worker — fine for auto-resolving (meta) backends like llama-cpp where
each node picks its own variant, but wrong for hardware-specific builds
like cpu-llama-cpp that would silently land on every GPU node.
Adds a node-targeted install path through the existing
POST /api/nodes/:id/backends/install plumbing, with two entry points:
- Backends gallery row gets a split-button in distributed mode. Auto-
resolving keeps "Install on all nodes" as the primary; chevron menu
opens the picker. Hardware-specific routes the primary directly to the
picker — no fan-out path on the row.
- Nodes-page drawer gets a "+ Add backend" button that navigates to
/app/backends?target=<node-id>; the gallery scopes itself to that node
(banner, single per-row install button, Reinstall/Remove for already-
installed). One gallery, two scopes — no second UI to maintain.
The picker (new NodeInstallPicker) shows a 3-state suitability column
(Compatible / Override / Installed), an auto-expanding variant override
disclosure that fires when selected nodes have no working GPU, parallel
per-node installs with inline status and Retry-failed-nodes, and a
mismatch confirm that names the consequence on the button itself.
A 409 fan-out guard on /api/backends/apply protects CLI/Terraform/script
users from the same footgun: hardware-specific installs in distributed
mode now return code "concrete_backend_requires_target" with a human-
readable error and a meta_alternative pointer.
The gallery list payload now surfaces capabilities, metaBackendFor and
per-row nodes (NodeBackendRef) so the picker and the new Nodes column
have everything they need without re-walking the gallery client-side.
GODEBUG=netdns=go is set on the compose services because the cgo DNS
resolver follows the container's nsswitch.conf to host systemd-resolved
(127.0.0.53), unreachable from inside the container; the pure-Go
resolver reads /etc/resolv.conf directly and uses Docker's embedded DNS.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude Code:claude-opus-4-7[1m] [Edit] [Bash] [Read] [Write]
Workers on NVIDIA unified-memory hardware (DGX Spark / GB10, Jetson AGX Thor,
Jetson Orin/Xavier/Nano) were reporting `available_vram=0` back to the frontend,
so the Nodes UI showed the node as fully used even when most of the unified
memory was actually free.
Three causes addressed:
* `isTegraDevice` only matched `/sys/devices/soc0/family == "Tegra"`. DGX Spark
(SBSA) reports JEDEC codes there instead — `jep106:0426` for the NVIDIA
manufacturer — so the Tegra/unified-memory fallback never ran. Renamed to
`isNVIDIAIntegratedGPU` and extended to also match `jep106:0426[:*]` via
`/sys/devices/soc0/soc_id`.
* The unified-iGPU code defaulted the device name to `"NVIDIA Jetson"` when
`/proc/device-tree/model` was missing. That's what happens for Thor inside a
docker container, and always on DGX Spark. New `nvidiaIntegratedGPUName`
resolves via dt-model → `/sys/devices/soc0/machine` → `soc_id` lookup
(`jep106:0426:8901` → `"NVIDIA GB10"`) so the Nodes UI labels the box
correctly.
* Worker heartbeat sent `available_vram=0` (or total-as-available) when VRAM
usage was momentarily unknown — e.g. when `nvidia-smi` intermittently failed
with `waitid: no child processes` under containers without `--init`. Each
such heartbeat overwrote the DB and made the UI flip to "fully used".
`heartbeatBody` now omits `available_vram` in that case so the DB keeps its
last good value.
Also updates the commented GPU blocks in both compose files with
`NVIDIA_DRIVER_CAPABILITIES=compute,utility`, `capabilities: [gpu, utility]`,
and `init: true`, and documents the requirement in the distributed-mode and
nvidia-l4t pages. Without `utility`, NVML/`nvidia-smi` are absent inside the
container, which is what put the DGX Spark worker into the buggy fallback in
the first place.
Detection verified on live hardware (dgx.casa / GB10 and 192.168.68.23 / Thor)
by running a cross-compiled probe of the new helpers on both host and inside
the worker container.
Assisted-by: Claude:opus-4.7 [Claude Code]
* feat: add distributed mode (experimental)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix data races, mutexes, transactions
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix events and tool stream in agent chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* use ginkgo
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(cron): compute correctly time boundaries avoiding re-triggering
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* enhancements, refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* do not flood of healthy checks
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* do not list obvious backends as text backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* tests fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop redundant healthcheck
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* enhancements, refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>