mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
* fix(grpc): self-terminate backend workers when LocalAI dies non-gracefully
Symptom: a backend model-worker subprocess (the per-model gRPC server LocalAI
spawns) can be orphaned and linger — holding VRAM and its listen port — if the
LocalAI process is killed non-gracefully (e.g. a supervisor's graceful-shutdown
grace period elapses and LocalAI is SIGKILLed) before its own teardown runs.
Root cause: LocalAI's graceful teardown (pkg/signals/handler.go installs the
SIGINT/SIGTERM handler; core/cli/run.go registers app.Shutdown ->
ModelLoader.StopAllGRPC -> process.Stop in pkg/model/process.go) only runs when
LocalAI receives a catchable signal and survives long enough to run its
handlers. Backends are spawned via github.com/mudler/go-processmanager v0.1.1,
whose getSysProcAttr() sets Setpgid:true (own process group, so the group can be
signalled) but never PR_SET_PDEATHSIG/Pdeathsig, and exposes no Config field or
option for a caller to inject/extend SysProcAttr. LocalAI fully delegates
spawning to that library (it never builds the exec.Cmd itself), so it cannot set
a kernel parent-death signal at the spawn site. If LocalAI is SIGKILLed, nothing
tells the backend to exit and it is reparented to init.
Fix: add a best-effort, backend-side safety net at the one shared choke point
every out-of-process Go backend routes through — grpc.StartServer / RunServer in
pkg/grpc. On startup it captures getppid() and polls; when the process is
reparented (getppid() changes or becomes 1, the standard POSIX indication that
the original parent died) it logs and self-terminates. getppid() reparent
detection is portable (Linux + macOS), unlike Linux-only PR_SET_PDEATHSIG. Toggle via
LOCALAI_BACKEND_PARENT_WATCH (default on; off on Windows) and
LOCALAI_BACKEND_PARENT_WATCH_INTERVAL. This is strictly a backstop alongside the
existing graceful SIGTERM->grace->SIGKILL teardown, which is unchanged.
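The polling backstop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in pkg/grpc: the names reparented and watchParent are hypothetical, and the "skip if already orphaned" guard in main is borrowed from the description of the C++/Python port.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// reparented reports whether the parent PID observed now (cur) shows that the
// original parent (orig) is gone: the PID changed, or it is 1 (init).
func reparented(orig, cur int) bool {
	return cur != orig || cur == 1
}

// watchParent polls os.Getppid() every interval and calls onDeath once the
// process has been reparented. Hypothetical helper for this sketch only.
func watchParent(orig int, interval time.Duration, onDeath func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		if reparented(orig, os.Getppid()) {
			onDeath()
			return
		}
	}
}

func main() {
	orig := os.Getppid()
	if orig <= 1 {
		// Already orphaned at startup: there is no parent worth watching.
		fmt.Println("already orphaned; watcher not armed")
		return
	}
	go watchParent(orig, 2*time.Second, func() {
		fmt.Fprintln(os.Stderr, "parent died; exiting")
		os.Exit(1)
	})
	fmt.Println("watcher armed for parent", orig)
	// A real backend would now serve gRPC; the sketch simply returns.
}
```

Polling trades a detection delay of up to one interval for portability, since os.Getppid() behaves the same on Linux and macOS.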
Scope/limitations: covers Go-based backends (everything using pkg/grpc). The
C++ backends (e.g. llama-cpp) and Python backends do not route through
pkg/grpc and are not covered by this mechanism — they would each need an
equivalent parent-death check (follow-up). The fully general fix is for
go-processmanager to expose SysProcAttr injection so LocalAI can set Pdeathsig
at spawn for every backend regardless of language (suggested upstream follow-up;
out of scope for this LocalAI-only PR).
Test: pkg/grpc/parentwatch_test.go builds a real test -> middle -> grandchild
process tree, lets the middle process exit to orphan the grandchild running the
real watchParentDeath, and asserts it detects the reparent and self-terminates.
Unix-only (build-tagged), runs in CI (Linux).
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(process): extend parent-death backstop to C++ and Python backends
The Go parent-death watcher (pkg/grpc/parentwatch.go, commit 772b435d5)
only protects backends that route through pkg/grpc. C++ and Python
backends don't, so the originally-reported case — the llama.cpp gRPC
worker surviving a non-graceful LocalAI death — was still uncovered.
Extend the same best-effort backstop to both languages, reusing the
exact mechanism and semantics:
- capture getppid() at startup, skip if already orphaned (<=1)
- a background thread polls getppid() and self-exits on reparenting
  (getppid() != orig || getppid() == 1), portable across Linux/macOS,
  no-op on Windows
- same env vars: LOCALAI_BACKEND_PARENT_WATCH (default on; the falsy
  values false/0/no/off disable it) and
  LOCALAI_BACKEND_PARENT_WATCH_INTERVAL (default 2s; accepts Go-style
  durations like 500ms/2s/1m)
C++: implemented in backend/cpp/llama-cpp (the reported, most-used C++
backend) as a dependency-free header parent_watch.h, wired into
grpc-server.cpp's main() and copied at build time via prepare.sh. C++
backends have no shared server scaffolding, so other C++ backends
(ds4, ik-llama-cpp, privacy-filter, ...) are not yet covered and would
each need the same one-line include+call as follow-ups.
Python: implemented once in the shared common/parent_watch.py and armed
from common/grpc_auth.py's get_auth_interceptors() — the single helper
every one of the 35 Python backends invokes while building its gRPC
server — so all Python backends (and future ones) are covered with no
per-backend edits and no duplicated implementation.
Tests (real process-tree reparent detection, mirroring the Go test):
- backend/cpp/llama-cpp/parent_watch_test.cpp (via run-unit-tests.sh)
- backend/python/common/parent_watch_test.py (python -m unittest)
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Sonnet 5 <noreply@anthropic.com>
164 lines
5.1 KiB
Go
//go:build !windows

package grpc

import (
	"os"
	"os/exec"
	"path/filepath"
	"runtime"
	"strconv"
	"syscall"
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// These env vars drive the helper roles this test binary re-executes itself as
// (see the init() dispatcher). They are only set for the spawned child/
// grandchild processes, never for the normal `go test` invocation.
const (
	envRole   = "LOCALAI_PARENTWATCH_TEST_ROLE"
	envReady  = "LOCALAI_PARENTWATCH_TEST_READY"  // grandchild writes its PID here once the watcher is armed
	envExited = "LOCALAI_PARENTWATCH_TEST_EXITED" // grandchild writes here when it detects reparenting
)

// init dispatches the helper roles when this test binary is re-executed with a
// role set. It runs before the testing/Ginkgo machinery, and is a no-op during
// a normal test run (role unset).
func init() {
	switch os.Getenv(envRole) {
	case "middle":
		runMiddleRole()
	case "grandchild":
		runGrandchildRole()
	}
}

// childEnv returns the current environment with the parentwatch test role set
// to the given value (replacing any inherited role), leaving the ready/exited
// file paths inherited.
func childEnv(role string) []string {
	out := make([]string, 0, len(os.Environ())+1)
	for _, kv := range os.Environ() {
		if len(kv) > len(envRole) && kv[:len(envRole)+1] == envRole+"=" {
			continue
		}
		out = append(out, kv)
	}
	return append(out, envRole+"="+role)
}

// runGrandchildRole arms the REAL watchParentDeath against its current parent
// (the "middle" process), signals readiness, then blocks. When middle exits and
// we are reparented, the watcher fires and we record it before exiting.
func runGrandchildRole() {
	exitedFile := os.Getenv(envExited)
	readyFile := os.Getenv(envReady)

	origPPID := os.Getppid()
	go watchParentDeath(origPPID, 50*time.Millisecond, func() {
		_ = os.WriteFile(exitedFile, []byte("1"), 0o644)
		os.Exit(7)
	})

	// Safety valve: never linger if something goes wrong with the test.
	go func() {
		time.Sleep(30 * time.Second)
		os.Exit(2)
	}()

	// Signal readiness only after the watcher captured origPPID, so middle
	// won't exit before we've recorded it as our original parent.
	_ = os.WriteFile(readyFile, []byte(strconv.Itoa(os.Getpid())), 0o644)

	select {} // block until the watcher terminates us
}

// runMiddleRole spawns the grandchild (which arms the watcher against us),
// waits until it is ready, then exits, orphaning the grandchild so it gets
// reparented, which is what the watcher must detect.
func runMiddleRole() {
	readyFile := os.Getenv(envReady)

	self, err := os.Executable()
	if err != nil {
		os.Exit(3)
	}
	cmd := exec.Command(self)
	cmd.Env = childEnv("grandchild")
	// Own process group, mirroring how real backends are spawned, and discard
	// std streams so the grandchild doesn't keep any parent pipe open.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		os.Exit(4)
	}

	if !waitForFile(readyFile, 10*time.Second) {
		os.Exit(5)
	}
	os.Exit(0) // orphan the grandchild
}

func waitForFile(path string, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if _, err := os.Stat(path); err == nil {
			return true
		}
		time.Sleep(20 * time.Millisecond)
	}
	return false
}

// This spec builds a genuine two-level process tree (test -> middle ->
// grandchild), lets the middle process die, and asserts the grandchild's
// watchParentDeath detects the reparenting and self-terminates.
var _ = Describe("watchParentDeath", func() {
	It("detects reparenting and self-terminates the orphaned process", func() {
		if runtime.GOOS == "windows" {
			Skip("parent-death watcher is not supported on windows")
		}

		dir := GinkgoT().TempDir()
		readyFile := filepath.Join(dir, "ready")
		exitedFile := filepath.Join(dir, "exited")

		self, err := os.Executable()
		Expect(err).NotTo(HaveOccurred(), "cannot resolve test executable")

		middle := exec.Command(self)
		middle.Env = append(childEnv("middle"),
			envReady+"="+readyFile,
			envExited+"="+exitedFile,
		)
		// Discard the helpers' output; keep the test log clean.
		middle.Stdout = nil
		middle.Stderr = nil

		Expect(middle.Start()).To(Succeed(), "failed to start middle helper")
		// Wait only for the middle process; the grandchild is intentionally left
		// orphaned. No pipes are shared, so this returns as soon as middle exits.
		Expect(middle.Wait()).To(Succeed(), "middle helper exited with error")

		// The grandchild must have armed the watcher (and thus captured middle as
		// its parent) before middle exited.
		_, err = os.Stat(readyFile)
		Expect(err).NotTo(HaveOccurred(), "grandchild never signaled readiness")

		// Best-effort cleanup in case the watcher somehow doesn't fire.
		DeferCleanup(func() {
			if b, err := os.ReadFile(readyFile); err == nil {
				if pid, err := strconv.Atoi(string(b)); err == nil {
					_ = syscall.Kill(pid, syscall.SIGKILL)
				}
			}
		})

		// Now that middle is gone, the grandchild has been reparented; the watcher
		// must notice and write the exited marker.
		Expect(waitForFile(exitedFile, 10*time.Second)).To(BeTrue(), "watcher did not detect parent death within timeout")
	})
})