feat: refactor shared helpers and enhance MLX backend functionality (#9335)

* refactor(backends): extract python_utils + add mlx_utils shared helpers

Move parse_options() and messages_to_dicts() out of vllm_utils.py into a new framework-agnostic python_utils.py, and re-export them from vllm_utils so existing vllm / vllm-omni imports keep working.

Add mlx_utils.py with split_reasoning() and parse_tool_calls() — ported from mlx_vlm/server.py's process_tool_calls. These work with any mlx-lm / mlx-vlm tool module (anything exposing tool_call_start, tool_call_end, parse_tool_call). Used by the mlx and mlx-vlm backends in later commits to emit structured ChatDelta.tool_calls without reimplementing per-model parsing.

Shared smoke tests confirm:
- parse_options round-trips bool/int/float/string
- vllm_utils re-exports are identity-equal to python_utils originals
- mlx_utils parse_tool_calls handles <tool_call>...</tool_call> with a shim module and produces a correctly-indexed list with JSON arguments
- mlx_utils split_reasoning extracts <think> blocks and leaves clean content

* feat(mlx): wire native tool parsers + ChatDelta + token usage + logprobs

Bring the MLX backend up to the same structured-output contract as vLLM and llama.cpp: emit Reply.chat_deltas so the OpenAI HTTP layer sees tool_calls and reasoning_content, not just raw text.

Key insight: mlx_lm.load() returns a TokenizerWrapper that already auto-detects the right tool parser from the model's chat template (_infer_tool_parser in mlx_lm/tokenizer_utils.py). The wrapper exposes has_tool_calling, has_thinking, tool_parser, tool_call_start, tool_call_end, think_start, think_end — no user configuration needed, unlike vLLM.

Changes in backend/python/mlx/backend.py:
- Imports: replace inline parse_options / messages_to_dicts with the shared helpers from python_utils. Pull split_reasoning / parse_tool_calls from the new mlx_utils shared module.
- LoadModel: log the auto-detected has_tool_calling / has_thinking / tool_parser_type for observability. Drop the local is_float / is_int duplicates.
- _prepare_prompt: run request.Messages through messages_to_dicts so tool_call_id / tool_calls / reasoning_content survive the conversion, and pass tools=json.loads(request.Tools) + enable_thinking=True (when request.Metadata says so) to apply_chat_template. Falls back on TypeError for tokenizers whose template doesn't accept those kwargs.
- _build_generation_params: return an additional (logits_params, stop_words) pair. Maps RepetitionPenalty / PresencePenalty / FrequencyPenalty to mlx_lm.sample_utils.make_logits_processors and threads StopPrompts through to post-decode truncation.
- New _tool_module_from_tokenizer / _finalize_output / _truncate_at_stop helpers. _finalize_output runs split_reasoning when has_thinking is true and parse_tool_calls (using a SimpleNamespace shim around the wrapper's tool_parser callable) when has_tool_calling is true, then extracts prompt_tokens, generation_tokens and (best-effort) logprobs from the last GenerationResponse chunk.
- Predict: use make_logits_processors, accumulate text + last_response, finalize into a structured Reply carrying chat_deltas, prompt_tokens, tokens, logprobs. Early-stops on user stop sequences.
- PredictStream: per-chunk Reply still carries raw message bytes for back-compat but now also emits chat_deltas=[ChatDelta(content=delta)]. On loop exit, emit a terminal Reply with structured reasoning_content / tool_calls / token counts / logprobs — so the Go side sees tool calls without needing the regex fallback.
- TokenizeString RPC: uses the TokenizerWrapper's encode(); returns length + tokens or FAILED_PRECONDITION if the model isn't loaded.
- Free RPC: drops model / tokenizer / lru_cache, runs gc.collect(), calls mx.metal.clear_cache() when available, and best-effort clears torch.cuda as a belt-and-suspenders.

* feat(mlx-vlm): mirror MLX parity (tool parsers + ChatDelta + samplers)

Same treatment as the MLX backend: emit structured Reply.chat_deltas, tool_calls, reasoning_content, token counts and logprobs, and extend sampling parameter coverage beyond the temp/top_p pair the backend used to handle.

- Imports: drop the inline is_float/is_int helpers, pull parse_options / messages_to_dicts from python_utils and split_reasoning / parse_tool_calls from mlx_utils. Also import make_sampler and make_logits_processors from mlx_lm.sample_utils — mlx-vlm re-uses them.
- LoadModel: use parse_options; call mlx_vlm.tool_parsers._infer_tool_parser / load_tool_module to auto-detect a tool module from the processor's chat_template. Stash think_start / think_end / has_thinking so later finalisation can split reasoning blocks without duck-typing on each call. Logs the detected parser type.
- _prepare_prompt: convert proto Messages via messages_to_dicts (so tool_call_id / tool_calls survive), pass tools=json.loads(request.Tools) and enable_thinking=True to apply_chat_template when present, fall back on TypeError for older mlx-vlm versions. Also handle the prompt-only + media and empty-prompt + media paths consistently.
- _build_generation_params: return (max_tokens, sampler_params, logits_params, stop_words). Maps repetition_penalty / presence_penalty / frequency_penalty and passes them through make_logits_processors.
- _finalize_output / _truncate_at_stop: common helper used by Predict and PredictStream to split reasoning, run parse_tool_calls against the auto-detected tool module, build ToolCallDelta list, and extract token counts + logprobs from the last GenerationResult.
- Predict / PredictStream: switch from mlx_vlm.generate to mlx_vlm.stream_generate in both paths, accumulate text + last_response, pass sampler and logits_processors through, emit content-only ChatDelta per streaming chunk followed by a terminal Reply carrying reasoning_content, tool_calls, prompt_tokens, tokens and logprobs. Non-streaming Predict returns the same structured Reply shape.
- New helper _collect_media extracted from the duplicated base64 image / audio decode loop.
- New TokenizeString RPC using the processor's tokenizer.encode and Free RPC that drops model/processor/config, runs gc + Metal cache clear + best-effort torch.cuda cache clear.

* feat(importer/mlx): auto-set tool_parser/reasoning_parser on import

Mirror what core/gallery/importers/vllm.go does: after applying the shared inference defaults, look up the model URI in parser_defaults.json and append matching tool_parser:/reasoning_parser: entries to Options.

The MLX backends auto-detect tool parsers from the chat template at runtime so they don't actually consume these options — but surfacing them in the generated YAML:
- keeps the import experience consistent with vllm
- gives users a single visible place to override
- documents the intended parser for a given model family

* test(mlx): add helper unit tests + TokenizeString/Free + e2e make targets

- backend/python/mlx/test.py: add TestSharedHelpers with server-less unit tests for parse_options, messages_to_dicts, split_reasoning and parse_tool_calls (using a SimpleNamespace shim to fake a tool module without requiring a model). Plus test_tokenize_string and test_free RPC tests that load a tiny MLX-quantized Llama and exercise the new RPCs end-to-end.
- backend/python/mlx-vlm/test.py: same helper unit tests + cleanup of the duplicated import block at the top of the file.
- Makefile: register BACKEND_MLX and BACKEND_MLX_VLM (they were missing from the docker-build-target eval list — only mlx-distributed had a generated target before).
  Add test-extra-backend-mlx and test-extra-backend-mlx-vlm convenience targets that build the respective image and run tests/e2e-backends with the tools capability against mlx-community/Qwen2.5-0.5B-Instruct-4bit. The MLX backend auto-detects the tool parser from the chat template so no BACKEND_TEST_OPTIONS is needed (unlike vllm).

* fix(libbackend): don't pass --copies to venv unless PORTABLE_PYTHON=true

backend/python/common/libbackend.sh:ensureVenv() always invoked 'python -m venv --copies', but macOS system python (and some other builds) refuses with:

    Error: This build of python cannot create venvs without using symlinks

--copies only matters when _makeVenvPortable later relocates the venv, which only happens when PORTABLE_PYTHON=true. Make --copies conditional on that flag and fall back to default (symlinked) venv otherwise.

Caught while bringing up the mlx backend on Apple Silicon — the same build path is used by every Python backend with USE_PIP=true.

* fix(mlx): support mlx-lm 0.29.x tool calling + drop deprecated clear_cache

The released mlx-lm 0.29.x ships a much simpler tool-calling API than HEAD: TokenizerWrapper detects the <tool_call>...</tool_call> markers from the tokenizer vocab and exposes has_tool_calling / tool_call_start / tool_call_end, but does NOT expose a tool_parser callable on the wrapper and does NOT ship a mlx_lm.tool_parsers subpackage at all (those only exist on main).

Caught while running the smoke test on Apple Silicon with the released mlx-lm 0.29.1: tokenizer.tool_parser raised AttributeError (falling through to the underlying HF tokenizer), so _tool_module_from_tokenizer always returned None and tool calls slipped through as raw <tool_call>...</tool_call> text in Reply.message instead of being parsed into ChatDelta.tool_calls.

Fix: when has_tool_calling is True but tokenizer.tool_parser is missing, default the parse_tool_call callable to json.loads(body.strip()) — that's exactly what mlx_lm.tool_parsers.json_tools.parse_tool_call does on HEAD and covers the only format 0.29 detects (<tool_call>JSON</tool_call>). Future mlx-lm releases that ship more parsers will be picked up automatically via the tokenizer.tool_parser attribute when present.

Also tighten the LoadModel logging — the old log line read init_kwargs.get('tool_parser_type') which doesn't exist on 0.29 and showed None even when has_tool_calling was True. Log the actual tool_call_start / tool_call_end markers instead.

While here, switch Free()'s Metal cache clear from the deprecated mx.metal.clear_cache to mx.clear_cache (mlx >= 0.30), with a fallback for older releases. Mirrored to the mlx-vlm backend.

* feat(mlx-distributed): mirror MLX parity (tool calls + ChatDelta + sampler)

Same treatment as the mlx and mlx-vlm backends: emit Reply.chat_deltas with structured tool_calls / reasoning_content / token counts / logprobs, expand sampling parameter coverage beyond temp+top_p, and add the missing TokenizeString and Free RPCs.

Notes specific to mlx-distributed:
- Rank 0 is the only rank that owns a sampler — workers participate in the pipeline-parallel forward pass via mx.distributed and don't re-implement sampling. So the new logits_params (repetition_penalty, presence_penalty, frequency_penalty) and stop_words apply on rank 0 only; we don't need to extend coordinator.broadcast_generation_params, which still ships only max_tokens / temperature / top_p to workers (everything else is a rank-0 concern).
- Free() now broadcasts CMD_SHUTDOWN to workers when a coordinator is active, so they release the model on their end too. The constant is already defined and handled by the existing worker loop in backend.py:633 (CMD_SHUTDOWN = -1).
- Drop the locally-defined is_float / is_int / parse_options trio in favor of python_utils.parse_options, re-exported under the module name for back-compat with anything that imported it directly.
- _prepare_prompt: route through messages_to_dicts so tool_call_id / tool_calls / reasoning_content survive, pass tools=json.loads(request.Tools) and enable_thinking=True to apply_chat_template, fall back on TypeError for templates that don't accept those kwargs.
- New _tool_module_from_tokenizer (with the json.loads fallback for mlx-lm 0.29.x), _finalize_output, _truncate_at_stop helpers — same contract as the mlx backend.
- LoadModel logs the auto-detected has_tool_calling / has_thinking / tool_call_start / tool_call_end so users can see what the wrapper picked up for the loaded model.
- backend/python/mlx-distributed/test.py: add the same TestSharedHelpers unit tests (parse_options, messages_to_dicts, split_reasoning, parse_tool_calls) that exist for mlx and mlx-vlm.
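
The json.loads fallback for mlx-lm 0.29.x described in this message can be sketched roughly as follows. This is an illustration only, not the code in backend.py: build_tool_module and the fake tokenizer object are hypothetical names, and only the attribute names (has_tool_calling, tool_call_start, tool_call_end, tool_parser) are taken from the message above.

```python
import json
from types import SimpleNamespace


def build_tool_module(tokenizer):
    """Return a shim exposing tool_call_start / tool_call_end / parse_tool_call.

    Hypothetical helper for illustration. Returns None when the tokenizer
    does not report tool-calling support or has no start marker.
    """
    if not getattr(tokenizer, "has_tool_calling", False):
        return None
    start = getattr(tokenizer, "tool_call_start", None)
    end = getattr(tokenizer, "tool_call_end", None)
    if not start:
        return None

    # getattr with a default also absorbs the AttributeError raised when the
    # wrapper has no tool_parser attribute.
    parser = getattr(tokenizer, "tool_parser", None)
    if parser is None:
        # Fallback: treat the body between the markers as plain JSON.
        def parser(body, tools=None):
            return json.loads(body.strip())

    return SimpleNamespace(
        tool_call_start=start,
        tool_call_end=end,
        parse_tool_call=parser,
    )


# A stand-in for a tokenizer wrapper that has markers but no tool_parser.
fake_tokenizer = SimpleNamespace(
    has_tool_calling=True,
    tool_call_start="<tool_call>",
    tool_call_end="</tool_call>",
)
module = build_tool_module(fake_tokenizer)
call = module.parse_tool_call(
    ' {"name": "get_weather", "arguments": {"city": "Rome"}} ', None
)
```

The shim has the same three attributes that parse_tool_calls in mlx_utils.py looks up with getattr, so under these assumptions it could be passed as the tool_module argument.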
2026-07-16 19:23:53 -04:00 · 2026-04-13 18:44:03 +02:00
parent daa0272f2e
commit 016da02845
12 changed files with 1380 additions and 398 deletions
--- a/backend/python/common/libbackend.sh
+++ b/backend/python/common/libbackend.sh
@@ -344,7 +344,16 @@ function ensureVenv() {

    if [ ! -d "${EDIR}/venv" ]; then
        if [ "x${USE_PIP}" == "xtrue" ]; then
-            "${interpreter}" -m venv --copies "${EDIR}/venv"
+            # --copies is only needed when we will later relocate the venv via
+            # _makeVenvPortable (PORTABLE_PYTHON=true). Some Python builds,
+            # notably macOS system Python, refuse to create a venv with
+            # --copies because they cannot create venvs without symlinks.
+            # Use the default (symlinked) venv unless PORTABLE_PYTHON=true.
+            local venv_args=""
+            if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
+                venv_args="--copies"
+            fi
+            "${interpreter}" -m venv ${venv_args} "${EDIR}/venv"
            source "${EDIR}/venv/bin/activate"
            "${interpreter}" -m pip install --upgrade pip
        else
--- a/backend/python/common/mlx_utils.py
+++ b/backend/python/common/mlx_utils.py
@@ -0,0 +1,100 @@
+"""Shared utilities for the mlx and mlx-vlm gRPC backends.
+
+These helpers wrap mlx-lm's and mlx-vlm's native tool-parser modules, which
+auto-detect the right parser from the model's chat template. Each tool
+module exposes ``tool_call_start``, ``tool_call_end`` and
+``parse_tool_call(text, tools) -> dict | list[dict]``.
+
+The split-reasoning helper is generic enough to work with any think-start /
+think-end delimiter pair.
+"""
+import json
+import re
+import sys
+import uuid
+
+
+def split_reasoning(text, think_start, think_end):
+    """Split ``<think>...</think>`` blocks out of ``text``.
+
+    Returns ``(reasoning_content, remaining_text)``. When ``think_start`` is
+    empty or not found, returns ``("", text)`` unchanged.
+    """
+    if not think_start or not text or think_start not in text:
+        return "", text
+    pattern = re.compile(
+        re.escape(think_start) + r"(.*?)" + re.escape(think_end or ""),
+        re.DOTALL,
+    )
+    reasoning_parts = pattern.findall(text)
+    if not reasoning_parts:
+        return "", text
+    remaining = pattern.sub("", text).strip()
+    return "\n".join(p.strip() for p in reasoning_parts), remaining
+
+
+def parse_tool_calls(text, tool_module, tools):
+    """Extract tool calls from ``text`` using an mlx-lm tool module.
+
+    Ports the ``process_tool_calls`` logic from
+    ``mlx_vlm/server.py`` (v0.10 onwards). ``tool_module`` must expose
+    ``tool_call_start``, ``tool_call_end`` and ``parse_tool_call``.
+
+    Returns ``(calls, remaining_text)`` where ``calls`` is a list of dicts:
+
+        [{"index": int, "id": str, "name": str, "arguments": str (JSON)}]
+
+    and ``remaining_text`` is the free-form text with the tool call blocks
+    removed. ``([], text)`` is returned if ``tool_module`` is ``None`` or
+    the start delimiter isn't present.
+    """
+    if tool_module is None or not text:
+        return [], text
+    start = getattr(tool_module, "tool_call_start", None)
+    end = getattr(tool_module, "tool_call_end", None)
+    parse_fn = getattr(tool_module, "parse_tool_call", None)
+    if not start or parse_fn is None or start not in text:
+        return [], text
+
+    if end == "" or end is None:
+        pattern = re.compile(
+            re.escape(start) + r".*?(?:\n|$)",
+            re.DOTALL,
+        )
+    else:
+        pattern = re.compile(
+            re.escape(start) + r".*?" + re.escape(end),
+            re.DOTALL,
+        )
+
+    matches = pattern.findall(text)
+    if not matches:
+        return [], text
+
+    remaining = pattern.sub(" ", text).strip()
+    calls = []
+    for match in matches:
+        call_body = match.strip().removeprefix(start)
+        if end:
+            call_body = call_body.removesuffix(end)
+        call_body = call_body.strip()
+        try:
+            parsed = parse_fn(call_body, tools)
+        except Exception as e:
+            print(
+                f"[mlx_utils] Invalid tool call: {call_body!r} ({e})",
+                file=sys.stderr,
+            )
+            continue
+        if not isinstance(parsed, list):
+            parsed = [parsed]
+        for tc in parsed:
+            calls.append(
+                {
+                    "index": len(calls),
+                    "id": str(uuid.uuid4()),
+                    "name": (tc.get("name") or "").strip(),
+                    "arguments": json.dumps(tc.get("arguments", {}), ensure_ascii=False),
+                }
+            )
+    return calls, remaining
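
The two regex shapes used by parse_tool_calls above (one with an end delimiter, one without) can be exercised in isolation with a small sketch. extract_blocks is a hypothetical, condensed stand-in written for this example; it keeps only the delimiter handling and skips the parse_fn call and the id / index bookkeeping. The "[TOOL]" marker in the second call is made up to show the no-end-delimiter branch. str.removeprefix / str.removesuffix need Python 3.9 or newer.

```python
import re


def extract_blocks(text, start, end):
    """Return (bodies, remaining) using the same two patterns as the diff."""
    if end:
        # Delimited form: everything between start and end, across newlines.
        pattern = re.compile(
            re.escape(start) + r".*?" + re.escape(end), re.DOTALL
        )
    else:
        # No end delimiter: the block runs to the next newline or end of text.
        pattern = re.compile(re.escape(start) + r".*?(?:\n|$)", re.DOTALL)

    bodies = []
    for match in pattern.findall(text):
        body = match.strip().removeprefix(start)
        if end:
            body = body.removesuffix(end)
        bodies.append(body.strip())

    # Each block is replaced by a single space, then the ends are trimmed.
    remaining = pattern.sub(" ", text).strip()
    return bodies, remaining


delimited = 'Sure. <tool_call>{"name": "add", "arguments": {"a": 1}}</tool_call> Done.'
bodies, remaining = extract_blocks(delimited, "<tool_call>", "</tool_call>")

open_ended = "[TOOL]get_time {}\nThanks"
bodies_open, remaining_open = extract_blocks(open_ended, "[TOOL]", "")
```

Because each block is replaced by a space and only the ends are stripped, text that had spaces on both sides of a block keeps a run of spaces in the middle; callers that care about exact whitespace would need to normalise it themselves.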
--- a/backend/python/common/python_utils.py
+++ b/backend/python/common/python_utils.py
@@ -0,0 +1,65 @@
+"""Generic utilities shared across Python gRPC backends.
+
+These helpers don't depend on any specific inference framework and can be
+imported by any backend that needs to parse LocalAI gRPC options or build a
+chat-template-compatible message list from proto Message objects.
+"""
+import json
+
+
+def parse_options(options_list):
+    """Parse Options[] list of ``key:value`` strings into a dict.
+
+    Supports type inference for common cases (bool, int, float). Booleans
+    are matched case-insensitively; anything else is kept as a string.
+
+    Used by LoadModel to extract backend-specific options passed via
+    ``ModelOptions.Options`` in ``backend.proto``.
+    """
+    opts = {}
+    for opt in options_list:
+        if ":" not in opt:
+            continue
+        key, value = opt.split(":", 1)
+        key = key.strip()
+        value = value.strip()
+        # Try type conversion
+        if value.lower() in ("true", "false"):
+            opts[key] = value.lower() == "true"
+        else:
+            try:
+                opts[key] = int(value)
+            except ValueError:
+                try:
+                    opts[key] = float(value)
+                except ValueError:
+                    opts[key] = value
+    return opts
+
+
+def messages_to_dicts(proto_messages):
+    """Convert proto ``Message`` objects to dicts suitable for ``apply_chat_template``.
+
+    Handles: ``role``, ``content``, ``name``, ``tool_call_id``,
+    ``reasoning_content``, ``tool_calls`` (JSON string → Python list).
+
+    HuggingFace chat templates (and their MLX/vLLM wrappers) expect a list of
+    plain dicts — proto Message objects don't work directly with Jinja, so
+    this conversion is needed before every ``apply_chat_template`` call.
+    """
+    result = []
+    for msg in proto_messages:
+        d = {"role": msg.role, "content": msg.content or ""}
+        if msg.name:
+            d["name"] = msg.name
+        if msg.tool_call_id:
+            d["tool_call_id"] = msg.tool_call_id
+        if msg.reasoning_content:
+            d["reasoning_content"] = msg.reasoning_content
+        if msg.tool_calls:
+            try:
+                d["tool_calls"] = json.loads(msg.tool_calls)
+            except json.JSONDecodeError:
+                pass
+        result.append(d)
+    return result
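
The type-inference order in parse_options (bool, then int, then float, then plain string) is easy to check with a condensed sketch. coerce and parse below are hypothetical stand-ins that mirror the logic in the diff so the example runs without the backend tree on the import path; the option names are made up.

```python
def coerce(value):
    """Apply the bool -> int -> float -> str order used by parse_options."""
    value = value.strip()
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse(options):
    """Condensed stand-in for parse_options: split on the first ':' only."""
    result = {}
    for opt in options:
        if ":" not in opt:
            continue  # entries without a colon are ignored
        key, _, value = opt.partition(":")
        result[key.strip()] = coerce(value)
    return result


opts = parse(
    [
        "max_tokens:512",
        "temperature: 0.7",
        "trust_remote_code:True",
        "chat_template:a:b",
        "ignored",
    ]
)
```

Two consequences of leaning on int() and float() are worth keeping in mind: a value such as 1e3 comes back as the float 1000.0, and strings that float() accepts, such as nan or inf, come back as floats, not strings.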
--- a/backend/python/common/vllm_utils.py
+++ b/backend/python/common/vllm_utils.py
@@ -1,63 +1,22 @@
-"""Shared utilities for vLLM-based backends."""
-import json
+"""vLLM-specific helpers for the vllm and vllm-omni gRPC backends.
+
+Generic helpers (``parse_options``, ``messages_to_dicts``) live in
+``python_utils`` and are re-exported here for backwards compatibility with
+existing imports in both backends.
+"""
 import sys

+from python_utils import messages_to_dicts, parse_options

-def parse_options(options_list):
-    """Parse Options[] list of 'key:value' strings into a dict.
-
-    Supports type inference for common cases (bool, int, float).
-    Used by LoadModel to extract backend-specific options.
-    """
-    opts = {}
-    for opt in options_list:
-        if ":" not in opt:
-            continue
-        key, value = opt.split(":", 1)
-        key = key.strip()
-        value = value.strip()
-        # Try type conversion
-        if value.lower() in ("true", "false"):
-            opts[key] = value.lower() == "true"
-        else:
-            try:
-                opts[key] = int(value)
-            except ValueError:
-                try:
-                    opts[key] = float(value)
-                except ValueError:
-                    opts[key] = value
-    return opts
-
-
-def messages_to_dicts(proto_messages):
-    """Convert proto Message objects to list of dicts for apply_chat_template().
-
-    Handles: role, content, name, tool_call_id, reasoning_content, tool_calls (JSON string -> list).
-    """
-    result = []
-    for msg in proto_messages:
-        d = {"role": msg.role, "content": msg.content or ""}
-        if msg.name:
-            d["name"] = msg.name
-        if msg.tool_call_id:
-            d["tool_call_id"] = msg.tool_call_id
-        if msg.reasoning_content:
-            d["reasoning_content"] = msg.reasoning_content
-        if msg.tool_calls:
-            try:
-                d["tool_calls"] = json.loads(msg.tool_calls)
-            except json.JSONDecodeError:
-                pass
-        result.append(d)
-    return result
+__all__ = ["parse_options", "messages_to_dicts", "setup_parsers"]


 def setup_parsers(opts):
-    """Return (tool_parser_cls, reasoning_parser_cls) tuple from opts dict.
+    """Return ``(tool_parser_cls, reasoning_parser_cls)`` from an opts dict.

-    Uses vLLM's native ToolParserManager and ReasoningParserManager.
-    Returns (None, None) if vLLM is not installed or parsers not available.
+    Uses vLLM's native ``ToolParserManager`` / ``ReasoningParserManager``.
+    Returns ``(None, None)`` if vLLM isn't installed or the requested
+    parser name can't be resolved.
    """
    tool_parser_cls = None
    reasoning_parser_cls = None