mirror of https://github.com/mudler/LocalAI.git synced 2026-06-22 07:39:02 -04:00

Files

pos-ei-don b4c0dc67fe feat(vllm): progressive streaming via parser.extract_tool_calls_streaming (follow-up to #10346 ) (#10351 )

* fix(vllm): don't stream raw tool-call markup as content when a tool parser is active

When a tool_parser is configured and the request carries tools, the streaming
loop emitted every text delta as delta.content — including the model's raw
tool-call markup (e.g. <tool_call>...) — because extract_tool_calls only runs
on the full output after the stream. Clients streaming a tool call therefore
saw the unparsed tool-call syntax as assistant content.

Buffer the text while a tool parser is active for the request; the existing
end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned
content), which the Go side converts to SSE deltas. Non-tool-parser streaming
is unchanged.

Add a server-less regression test covering both the tool-call case (no raw
markup leaked as content) and the plain-text case (content delivered exactly
once — guards against double-emitting the buffered content).

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>

* test(vllm): add expectedFailure test for progressive streaming with tool parser (Case 3, #582)

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>

* test(vllm): add Cases 4+5 — marker split across chunks + false-positive prefix (TDD, Option B state machine, #582)

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>

* feat(vllm): progressive streaming via parser.extract_tool_calls_streaming

When a tool parser is active for a tool-enabled streaming request,
#10346 buffers the entire generation and surfaces it on the final
chunk to prevent raw tool-call markup from leaking as delta.content.
This is correct but turns the request into effectively non-streaming
for plain-text responses — the client sees nothing until the model
stops.

Every concrete tool parser shipped with vLLM 0.23+ already implements
extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba,
Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate
the parser before the streaming loop and call its streaming method per
delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…])
when the parser is ready.

Falls back to the existing #10346 buffer path when:
  - the parser does not have extract_tool_calls_streaming, OR
  - extract_tool_calls_streaming raises mid-stream (logged, the
    rest of the request finishes via post-loop extract_tool_calls).

Tests (TestStreamingToolParser):
  1. Buffer path: no markup leaked, no content duplication
  2. Native streaming: plain-text response streams progressively
  3. Native streaming: tool_call structured, no markup leaked
  4. Native streaming exception → graceful fallback, no markup, no crash
  5. No tool parser → unchanged per-delta content stream

E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>

* docs(vllm): add server-side TTFT benchmark for the streaming tool-parser path

Self-contained stdlib-only script that measures time-to-first-token (TTFT)
for the vLLM backend's two streaming scenarios:

  - tool_call:  request mentions a tool; model is expected to call it
  - plain_text: request offers a tool but explicitly asks for prose

Use this to compare:
  - the buffer-all path (#10346)         → plain_text TTFT ≈ total response time
  - the native-streaming path (this PR)  → plain_text TTFT ≈ true first-token time

  python examples/vllm-bench/ttft_streaming_tool_parser.py \\
      --url http://localhost:8080 --model my-coder --runs 3

Lives under examples/ so it does not interfere with the test suite.

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>

* examples/vllm-bench: add long-text scenario (8 paragraphs, 1500 tokens)

The long-text scenario shows the buffering vs streaming difference most
dramatically: with the buffer-all path, the client receives nothing for
20+ seconds and then the entire 1500-token response at once. With native
streaming, the first token arrives in tens of milliseconds and the
response flows progressively.

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>

---------

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Philipp Wacker <philipp.wacker@ibf-solutions.com>

2026-06-21 17:07:15 +02:00

2.4 KiB

Raw Blame History

vLLM streaming + tool-parser benchmark

A small, self-contained Python script (stdlib only) that measures time-to-first-token (TTFT) for the vLLM backend's streaming path with a tool parser configured.

Why this exists

When a vLLM tool parser is active and a streaming chat completion is requested, LocalAI used to buffer the full generation to prevent raw tool-call markup (e.g. <tool_call>...) from leaking as delta.content. That was correct for tool-call responses, but it turned plain-text responses into effectively non-streaming — the client received nothing until the model finished.

With native parser-side streaming (parser.extract_tool_calls_streaming, implemented by every concrete vLLM 0.23+ tool parser), each delta can be classified per-token: emit as content, emit as a structured tool_call, or suppress.

Three scenarios

Scenario	Request	Expected outcome
`tool_call`	"What is the weather in Paris? Please use the tool."	Model calls `get_weather`. `delta.tool_calls` chunks; no content leak.
`plain_text_short`	"Explain in 3 short sentences what a hash table is. Do NOT call any tool."	Model writes ~3 sentences.
`plain_text_long`	"Write a thorough 8-paragraph explanation of how Python's GIL works…"	Model writes ~1500 tokens of prose.

The long scenario is where the streaming/buffering difference is most dramatic: with the buffer-all path, the client sees nothing for 20+ seconds and then everything at once; with native streaming, the first token arrives in <100ms and the response flows progressively.

What the script reports

For each scenario, across N runs:

ttf_content_s — time until the first delta.content chunk
ttf_tool_s — time until the first delta.tool_calls chunk
n_content_chunks — total content deltas (1 = bundled, >>1 = streamed)
n_tool_chunks — total tool_call deltas
total_s — total wall-clock until [DONE]
finish_reason — tool_calls / stop / length

The big tell is n_content_chunks vs total_s ratio:

Buffer-all: n_content_chunks ≈ 1, ttf_content_s ≈ total_s (one chunk at end)
Streaming: n_content_chunks ≈ token count, ttf_content_s ≈ first-token latency

Usage

python ttft_streaming_tool_parser.py --url http://localhost:8080 --model my-coder --runs 3

JSON results are written to ttft_bench_<label>.json (default label: run).

2.4 KiB Raw Blame History