Mirror of https://github.com/mudler/LocalAI.git (synced 2026-04-16 21:08:16 -04:00)
When TASK_RESPONSE_TYPE_OAI_CHAT is used, the first streaming token produces a JSON array with two elements: a role-init chunk and the actual content chunk. The grpc-server loop called attach_chat_deltas for both elements with the same raw_result pointer, stamping the first token's ChatDelta.Content onto both replies. The Go side then accumulated both replies and emitted the first content token twice to SSE clients.

Fix: in the array-iteration loops in PredictStream, detect role-init elements (the delta has a "role" key) and skip attach_chat_deltas for them. Only content/reasoning elements get chat deltas attached. Reasoning models are unaffected because their first token goes into reasoning_content, not content.