mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 01:47:18 -04:00
* fix(vllm): don't stream raw tool-call markup as content when a tool parser is active When a tool_parser is configured and the request carries tools, the streaming loop emitted every text delta as delta.content — including the model's raw tool-call markup (e.g. <tool_call>...) — because extract_tool_calls only runs on the full output after the stream. Clients streaming a tool call therefore saw the unparsed tool-call syntax as assistant content. Buffer the text while a tool parser is active for the request; the existing end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned content), which the Go side converts to SSE deltas. Non-tool-parser streaming is unchanged. Add a server-less regression test covering both the tool-call case (no raw markup leaked as content) and the plain-text case (content delivered exactly once — guards against double-emitting the buffered content). Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> * test(vllm): add expectedFailure test for progressive streaming with tool parser (Case 3, #582) Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> * test(vllm): add Cases 4+5 — marker split across chunks + false-positive prefix (TDD, Option B state machine, #582) Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> * feat(vllm): progressive streaming via parser.extract_tool_calls_streaming When a tool parser is active for a tool-enabled streaming request, #10346 buffers the entire generation and surfaces it on the final chunk to prevent raw tool-call markup from leaking as delta.content. This is correct but turns the request into effectively non-streaming for plain-text responses — the client sees nothing until the model stops. Every concrete tool parser shipped with vLLM 0.23+ already implements extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba, Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate the parser before the streaming loop and call its streaming method per delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…]) when the parser is ready. Falls back to the existing #10346 buffer path when: - the parser does not have extract_tool_calls_streaming, OR - extract_tool_calls_streaming raises mid-stream (logged, the rest of the request finishes via post-loop extract_tool_calls). Tests (TestStreamingToolParser): 1. Buffer path: no markup leaked, no content duplication 2. Native streaming: plain-text response streams progressively 3. Native streaming: tool_call structured, no markup leaked 4. Native streaming exception → graceful fallback, no markup, no crash 5. No tool parser → unchanged per-delta content stream E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13). Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> * docs(vllm): add server-side TTFT benchmark for the streaming tool-parser path Self-contained stdlib-only script that measures time-to-first-token (TTFT) for the vLLM backend's two streaming scenarios: - tool_call: request mentions a tool; model is expected to call it - plain_text: request offers a tool but explicitly asks for prose Use this to compare: - the buffer-all path (#10346) → plain_text TTFT ≈ total response time - the native-streaming path (this PR) → plain_text TTFT ≈ true first-token time python examples/vllm-bench/ttft_streaming_tool_parser.py \\ --url http://localhost:8080 --model my-coder --runs 3 Lives under examples/ so it does not interfere with the test suite. Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> * examples/vllm-bench: add long-text scenario (8 paragraphs, 1500 tokens) The long-text scenario shows the buffering vs streaming difference most dramatically: with the buffer-all path, the client receives nothing for 20+ seconds and then the entire 1500-token response at once. With native streaming, the first token arrives in tens of milliseconds and the response flows progressively. Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> --------- Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> Co-authored-by: Philipp Wacker <philipp.wacker@ibf-solutions.com>
55 lines
2.4 KiB
Markdown
55 lines
2.4 KiB
Markdown
# vLLM streaming + tool-parser benchmark
|
|
|
|
A small, self-contained Python script (stdlib only) that measures
|
|
time-to-first-token (TTFT) for the vLLM backend's streaming path with
|
|
a tool parser configured.
|
|
|
|
## Why this exists
|
|
|
|
When a vLLM tool parser is active and a streaming chat completion is requested,
|
|
LocalAI used to buffer the full generation to prevent raw tool-call markup
|
|
(e.g. `<tool_call>...`) from leaking as `delta.content`. That was correct
|
|
for tool-call responses, but it turned plain-text responses into effectively
|
|
non-streaming — the client received nothing until the model finished.
|
|
|
|
With native parser-side streaming (`parser.extract_tool_calls_streaming`,
|
|
implemented by every concrete vLLM 0.23+ tool parser), each delta can be
|
|
classified per-token: emit as content, emit as a structured tool_call, or
|
|
suppress.
|
|
|
|
## Three scenarios
|
|
|
|
| Scenario | Request | Expected outcome |
|
|
|---|---|---|
|
|
| `tool_call` | "What is the weather in Paris? Please use the tool." | Model calls `get_weather`. `delta.tool_calls` chunks; no content leak. |
|
|
| `plain_text_short` | "Explain in 3 short sentences what a hash table is. Do NOT call any tool." | Model writes ~3 sentences. |
|
|
| `plain_text_long` | "Write a thorough 8-paragraph explanation of how Python's GIL works…" | Model writes ~1500 tokens of prose. |
|
|
|
|
The **long scenario** is where the streaming/buffering difference is most
|
|
dramatic: with the buffer-all path, the client sees nothing for 20+ seconds
|
|
and then everything at once; with native streaming, the first token arrives
|
|
in <100ms and the response flows progressively.
|
|
|
|
## What the script reports
|
|
|
|
For each scenario, across N runs:
|
|
|
|
- `ttf_content_s` — time until the first `delta.content` chunk
|
|
- `ttf_tool_s` — time until the first `delta.tool_calls` chunk
|
|
- `n_content_chunks` — total content deltas (1 = bundled, >>1 = streamed)
|
|
- `n_tool_chunks` — total tool_call deltas
|
|
- `total_s` — total wall-clock until `[DONE]`
|
|
- `finish_reason` — `tool_calls` / `stop` / `length`
|
|
|
|
The big tell is **`n_content_chunks` vs `total_s` ratio**:
|
|
- Buffer-all: `n_content_chunks` ≈ 1, `ttf_content_s` ≈ `total_s` (one chunk at end)
|
|
- Streaming: `n_content_chunks` ≈ token count, `ttf_content_s` ≈ first-token latency
|
|
|
|
## Usage
|
|
|
|
```bash
|
|
python ttft_streaming_tool_parser.py --url http://localhost:8080 --model my-coder --runs 3
|
|
```
|
|
|
|
JSON results are written to `ttft_bench_<label>.json` (default label: `run`).
|