LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-13 03:09:03 -04:00


Ettore Di Giacinto 390664ff72 fix(responses): classify streamed reasoning as a reasoning item live (#9658)

In the /v1/responses streaming handler, a reasoning model's thinking
monologue was streamed to the client as normal message text (a msg_
output item with output_text.delta) and only reclassified into a
reasoning item after the stream completed. Subsequent output_text.delta
events also kept referencing the old msg_ item id instead of the
reasoning_ id.

Root causes:

1. The live reasoning item was gated on extractor.Reasoning(), which is
   only updated by the Go-side raw-tag parser (ProcessToken). When the
   C++ autoparser drives reasoning through reasoning_content ChatDeltas,
   the reasoning delta is computed via ProcessChatDeltaReasoning into a
   separate accumulator, so extractor.Reasoning() stays empty and the
   gate never fired. The reasoning item was thus only reconstructed at
   end-of-stream.

2. The non-tool-call path created the message/msg_ output item eagerly
   before any token, forcing reasoning to a higher output index and
   making mis-split <think> text land on the pre-existing message item.

3. Neither path carried the sticky preferAutoparser flag, so a
   content-only autoparser (the non-jinja pure-content fallback, #9985)
   could leak <think>...</think> tokens into content.

Extract the per-token reasoning-vs-message classification into a pure,
unit-tested streamReasoningRouter (mirroring chooseDeferredReasoning and
processStream in the chat streaming worker): it gates the reasoning item
on the reasoning delta, opens the message item lazily on the first
content delta, and keeps a sticky preferAutoparser fallback. Both
streaming paths now route reasoning deltas to the reasoning_ id and order
the reasoning item ahead of the message at completion.
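
The routing rules described above can be illustrated with a minimal
sketch. It is not the code from this commit: the tokenInput and
routedDelta types, the Route and CompletedOrder methods, and the item
ids are assumptions made for the example. Only the three behaviours come
from the text above (gate on the reasoning delta, open the message item
lazily, keep preferAutoparser sticky).

```go
package main

import "fmt"

type itemKind string

const (
	kindReasoning itemKind = "reasoning"
	kindMessage   itemKind = "message"
)

// routedDelta is one event the handler would emit (hypothetical shape).
type routedDelta struct {
	Kind   itemKind
	ItemID string
	Text   string
	Opens  bool // true when this is the first delta for its item
}

// tokenInput carries both candidate parses for a single token
// (hypothetical shape; the real handler's inputs may differ).
type tokenInput struct {
	HasAuto       bool   // the autoparser produced a ChatDelta
	AutoReasoning string // reasoning_content delta from the autoparser
	AutoContent   string // content delta from the autoparser
	RawReasoning  string // reasoning delta from the raw-tag parser
	RawContent    string // content delta from the raw-tag parser
}

// streamReasoningRouter is an illustrative stand-in for the router
// named in the commit message.
type streamReasoningRouter struct {
	reasoningID, messageID     string
	reasoningOpen, messageOpen bool
	preferAutoparser           bool
}

// Route classifies one token into reasoning and/or message deltas.
func (r *streamReasoningRouter) Route(in tokenInput) []routedDelta {
	if in.HasAuto {
		r.preferAutoparser = true // sticky: never falls back to raw tags
	}
	reasoning, content := in.RawReasoning, in.RawContent
	if r.preferAutoparser {
		reasoning, content = in.AutoReasoning, in.AutoContent
	}
	var out []routedDelta
	if reasoning != "" { // gate on the delta, not on an accumulator
		out = append(out, routedDelta{Kind: kindReasoning,
			ItemID: r.reasoningID, Text: reasoning, Opens: !r.reasoningOpen})
		r.reasoningOpen = true
	}
	if content != "" { // the message item opens on first content only
		out = append(out, routedDelta{Kind: kindMessage,
			ItemID: r.messageID, Text: content, Opens: !r.messageOpen})
		r.messageOpen = true
	}
	return out
}

// CompletedOrder lists the opened items, reasoning ahead of message.
func (r *streamReasoningRouter) CompletedOrder() []string {
	var ids []string
	if r.reasoningOpen {
		ids = append(ids, r.reasoningID)
	}
	if r.messageOpen {
		ids = append(ids, r.messageID)
	}
	return ids
}

func main() {
	r := &streamReasoningRouter{reasoningID: "reasoning_1", messageID: "msg_1"}
	tokens := []tokenInput{
		{HasAuto: true, AutoReasoning: "Let me think."},
		{RawContent: "<think>leak</think>"}, // dropped: autoparser is preferred
		{HasAuto: true, AutoContent: "Answer: 4"},
	}
	for _, t := range tokens {
		for _, d := range r.Route(t) {
			fmt.Printf("%s %s opens=%v %q\n", d.Kind, d.ItemID, d.Opens, d.Text)
		}
	}
	fmt.Println(r.CompletedOrder())
}
```

In this sketch the second token emits nothing, because the router has
already seen an autoparser delta and no longer trusts the raw-tag
content, and the message item only comes into existence on the third
token.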

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-12 21:39:42 +00:00