Ettore Di Giacinto a867b3d2a8 fix(distributed): per-request node ID rendezvous for streaming header
Streaming chat and completion handlers used to set the X-LocalAI-Node
response header on the request goroutine, BEFORE the worker goroutine
called ml.Load. The header therefore reflected the previous request's
routing decision (or nothing on a cold cache), not THIS request's. Under
two replicas serving one model, the header consistently mis-attributed
the served-by node, defeating the whole purpose of the feature for the
most-debugged code path.

Fix: introduce a per-request `nodeIDCh chan string` (buffered, size 1)
plumbed from the handler into processStream / processStreamWithTools and
into the completion process closure. The worker publishes the picked
node ID exactly once, on the first token callback invocation, which is
guaranteed to fire AFTER backend.ModelInferenceFunc has returned (and
thus AFTER ml.Load has set the per-modelID node stamp via
ModelRouterAdapter.Route). The handler does a non-blocking read of
nodeIDCh before every response writer Write/Flush and attaches the
header before the first Flush() locks the headers. Because the state is
per-request, two concurrent requests for the same model cannot clobber
each other's header value.
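
The rendezvous described above can be sketched roughly as follows. This
is a minimal illustration, not the actual LocalAI code: newTokenCallback,
tryReadNodeID, serve and lookupNodeID are hypothetical names, and the
returned header string stands in for setting X-LocalAI-Node on a real
response writer.

```go
package main

import (
	"fmt"
	"sync"
)

// newTokenCallback wraps emit so that the first invocation publishes the
// node ID before forwarding the token. lookupNodeID is a hypothetical
// stand-in for reading the stamp set during model load. The callback is
// assumed to run on a single worker goroutine, so a plain bool suffices.
func newTokenCallback(nodeIDCh chan<- string, lookupNodeID func() string, emit func(string)) func(string) {
	published := false
	return func(token string) {
		if !published {
			published = true
			select {
			case nodeIDCh <- lookupNodeID(): // buffer has room: value is stored
			default: // buffer full: drop rather than block the worker
			}
		}
		emit(token)
	}
}

// tryReadNodeID performs a non-blocking receive from the per-request channel.
func tryReadNodeID(nodeIDCh <-chan string) (string, bool) {
	select {
	case id := <-nodeIDCh:
		return id, true
	default:
		return "", false
	}
}

// serve simulates one streaming request. It returns the header value that
// was available before the first write, plus the streamed chunks.
func serve(nodeID string, tokens []string) (header string, body []string) {
	nodeIDCh := make(chan string, 1) // per-request, buffered, size 1
	responses := make(chan string)

	go func() { // worker goroutine
		defer close(responses)
		cb := newTokenCallback(nodeIDCh,
			func() string { return nodeID },
			func(t string) { responses <- t })
		for _, t := range tokens {
			cb(t)
		}
	}()

	headersLocked := false
	for chunk := range responses {
		if !headersLocked {
			// Last chance to attach the header: the first flush locks it.
			if id, ok := tryReadNodeID(nodeIDCh); ok {
				header = id
			}
			headersLocked = true
		}
		body = append(body, chunk)
	}
	return header, body
}

func main() {
	nodes := []string{"node-a", "node-b"}
	headers := make([]string, len(nodes))
	var wg sync.WaitGroup
	for i, n := range nodes {
		wg.Add(1)
		go func(i int, n string) {
			defer wg.Done()
			headers[i], _ = serve(n, []string{"Hel", "lo"})
		}(i, n)
	}
	wg.Wait()
	for i, n := range nodes {
		fmt.Printf("request routed to %s got header %q\n", n, headers[i])
	}
}
```

In this sketch the publish happens before the first push onto the
responses channel, so by the time the handler receives the first chunk
the node ID is already sitting in the buffer.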

The eager "role=assistant" initial chunk emitted at the top of
processStream had to move into the first token callback as well: that
chunk was previously sent before ml.Load ran, so its responses-channel
push raced ahead of the node ID signal and caused the handler to flush
headers too early. processStreamWithTools already deferred its initial
role chunk behind sentInitialRole, so only processStream changed shape.
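
The reshaped ordering can be illustrated with a small sketch. It assumes
a simplified chunk type and hypothetical streamTokens/collect helpers;
only the sentInitialRole flag name is taken from the description above.
The point is that the role chunk is emitted lazily, after the node ID
publish, rather than eagerly at the top of the stream function.

```go
package main

import "fmt"

// chunk is a simplified stand-in for a streamed chat completion chunk.
type chunk struct {
	Role    string
	Content string
}

// streamTokens emits the initial role chunk from inside the first token
// iteration, after the node ID has been published, instead of eagerly
// before any token exists.
func streamTokens(tokens []string, nodeID string, nodeIDCh chan<- string, responses chan<- chunk) {
	defer close(responses)
	sentInitialRole := false
	for _, tok := range tokens {
		if !sentInitialRole {
			select {
			case nodeIDCh <- nodeID: // publish first ...
			default:
			}
			responses <- chunk{Role: "assistant"} // ... then the role chunk
			sentInitialRole = true
		}
		responses <- chunk{Content: tok}
	}
}

// collect plays the handler: it checks for the node ID before handling
// the very first chunk and records everything it receives.
func collect(tokens []string, nodeID string) (string, []chunk) {
	nodeIDCh := make(chan string, 1)
	responses := make(chan chunk)
	go streamTokens(tokens, nodeID, nodeIDCh, responses)

	var header string
	var out []chunk
	for c := range responses {
		if len(out) == 0 {
			select {
			case header = <-nodeIDCh:
			default:
			}
		}
		out = append(out, c)
	}
	return header, out
}

func main() {
	header, chunks := collect([]string{"hi"}, "node-b")
	fmt.Println("header:", header)
	for _, c := range chunks {
		fmt.Printf("role=%q content=%q\n", c.Role, c.Content)
	}
}
```

Had the role chunk been pushed before the loop, the handler in this
sketch would have seen a first chunk with an empty nodeIDCh and locked
the headers without the node ID.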

Best-effort caveats:
  - The model store keeps only the latest routing decision per modelID,
    so two interleaved routes still admit a small read-after-overwrite
    window: another worker's Load can overwrite the stamp between
    worker B's Load returning and worker B reading NodeID. This is a
    fundamental data-model limit; further tightening would require
    returning the node ID directly from ml.Load.
  - The chan publish is non-blocking; if the handler hasn't reached its
    first chunk read by the time the worker publishes, the value sits in
    the buffer (capacity 1) until the handler drains it. No deadlock.
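
Both caveats can be reproduced in a small, deterministic sketch. The
nodeStore type, its Route/NodeID methods and tryPublish are hypothetical
simplifications chosen for illustration; they are not the real model
store or ModelRouterAdapter API.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeStore keeps only the latest routing decision per model ID, which is
// the property that makes the read-after-overwrite window possible.
type nodeStore struct {
	mu     sync.Mutex
	latest map[string]string
}

func newNodeStore() *nodeStore {
	return &nodeStore{latest: map[string]string{}}
}

// Route records nodeID as the latest decision for modelID (last write wins).
func (s *nodeStore) Route(modelID, nodeID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.latest[modelID] = nodeID
}

// NodeID returns the latest recorded decision for modelID.
func (s *nodeStore) NodeID(modelID string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.latest[modelID]
}

// tryPublish does a non-blocking send and reports whether the value was
// stored in the channel buffer.
func tryPublish(ch chan<- string, id string) bool {
	select {
	case ch <- id:
		return true
	default:
		return false
	}
}

func main() {
	// Caveat 1: interleaving replayed sequentially.
	store := newNodeStore()
	store.Route("model", "node-2") // worker B's load returns
	store.Route("model", "node-1") // worker A's load overwrites the stamp
	fmt.Println("worker B reads:", store.NodeID("model"))

	// Caveat 2: the publish never blocks, even with no reader yet.
	ch := make(chan string, 1)
	first := tryPublish(ch, "node-2")  // fits in the buffer
	second := tryPublish(ch, "node-3") // buffer full, dropped
	fmt.Println("stored:", first, second, "drained:", <-ch)
}
```

Returning the node ID directly from the load call, as the first caveat
suggests, would remove the shared per-model slot that the sketch
overwrites.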

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]
2026-05-24 20:47:26 +00:00