mirror of
https://github.com/mudler/LocalAI.git
synced 2026-04-17 05:18:53 -04:00
* refactor(backends): extract python_utils + add mlx_utils shared helpers
Move parse_options() and messages_to_dicts() out of vllm_utils.py into a
new framework-agnostic python_utils.py, and re-export them from vllm_utils
so existing vllm / vllm-omni imports keep working.
Add mlx_utils.py with split_reasoning() and parse_tool_calls() — ported
from mlx_vlm/server.py's process_tool_calls. These work with any
mlx-lm / mlx-vlm tool module (anything exposing tool_call_start,
tool_call_end, parse_tool_call). Used by the mlx and mlx-vlm backends in
later commits to emit structured ChatDelta.tool_calls without
reimplementing per-model parsing.
Shared smoke tests confirm:
- parse_options round-trips bool/int/float/string
- vllm_utils re-exports are identity-equal to python_utils originals
- mlx_utils parse_tool_calls handles <tool_call>...</tool_call> with a
shim module and produces a correctly-indexed list with JSON arguments
- mlx_utils split_reasoning extracts <think> blocks and leaves clean
content
* feat(mlx): wire native tool parsers + ChatDelta + token usage + logprobs
Bring the MLX backend up to the same structured-output contract as vLLM
and llama.cpp: emit Reply.chat_deltas so the OpenAI HTTP layer sees
tool_calls and reasoning_content, not just raw text.
Key insight: mlx_lm.load() returns a TokenizerWrapper that already auto-
detects the right tool parser from the model's chat template
(_infer_tool_parser in mlx_lm/tokenizer_utils.py). The wrapper exposes
has_tool_calling, has_thinking, tool_parser, tool_call_start,
tool_call_end, think_start, think_end — no user configuration needed,
unlike vLLM.
Changes in backend/python/mlx/backend.py:
- Imports: replace inline parse_options / messages_to_dicts with the
shared helpers from python_utils. Pull split_reasoning / parse_tool_calls
from the new mlx_utils shared module.
- LoadModel: log the auto-detected has_tool_calling / has_thinking /
tool_parser_type for observability. Drop the local is_float / is_int
duplicates.
- _prepare_prompt: run request.Messages through messages_to_dicts so
tool_call_id / tool_calls / reasoning_content survive the conversion,
and pass tools=json.loads(request.Tools) + enable_thinking=True (when
request.Metadata says so) to apply_chat_template. Falls back on
TypeError for tokenizers whose template doesn't accept those kwargs.
- _build_generation_params: return an additional (logits_params,
stop_words) pair. Maps RepetitionPenalty / PresencePenalty /
FrequencyPenalty to mlx_lm.sample_utils.make_logits_processors and
threads StopPrompts through to post-decode truncation.
- New _tool_module_from_tokenizer / _finalize_output / _truncate_at_stop
helpers. _finalize_output runs split_reasoning when has_thinking is
true and parse_tool_calls (using a SimpleNamespace shim around the
wrapper's tool_parser callable) when has_tool_calling is true, then
extracts prompt_tokens, generation_tokens and (best-effort) logprobs
from the last GenerationResponse chunk.
- Predict: use make_logits_processors, accumulate text + last_response,
finalize into a structured Reply carrying chat_deltas,
prompt_tokens, tokens, logprobs. Early-stops on user stop sequences.
- PredictStream: per-chunk Reply still carries raw message bytes for
back-compat but now also emits chat_deltas=[ChatDelta(content=delta)].
On loop exit, emit a terminal Reply with structured
reasoning_content / tool_calls / token counts / logprobs — so the Go
side sees tool calls without needing the regex fallback.
- TokenizeString RPC: uses the TokenizerWrapper's encode(); returns
length + tokens or FAILED_PRECONDITION if the model isn't loaded.
- Free RPC: drops model / tokenizer / lru_cache, runs gc.collect(),
calls mx.metal.clear_cache() when available, and best-effort clears
torch.cuda as a belt-and-suspenders.
* feat(mlx-vlm): mirror MLX parity (tool parsers + ChatDelta + samplers)
Same treatment as the MLX backend: emit structured Reply.chat_deltas,
tool_calls, reasoning_content, token counts and logprobs, and extend
sampling parameter coverage beyond the temp/top_p pair the backend
used to handle.
- Imports: drop the inline is_float/is_int helpers, pull parse_options /
messages_to_dicts from python_utils and split_reasoning /
parse_tool_calls from mlx_utils. Also import make_sampler and
make_logits_processors from mlx_lm.sample_utils — mlx-vlm re-uses them.
- LoadModel: use parse_options; call mlx_vlm.tool_parsers._infer_tool_parser
/ load_tool_module to auto-detect a tool module from the processor's
chat_template. Stash think_start / think_end / has_thinking so later
finalisation can split reasoning blocks without duck-typing on each
call. Logs the detected parser type.
- _prepare_prompt: convert proto Messages via messages_to_dicts (so
tool_call_id / tool_calls survive), pass tools=json.loads(request.Tools)
and enable_thinking=True to apply_chat_template when present, fall
back on TypeError for older mlx-vlm versions. Also handle the
prompt-only + media and empty-prompt + media paths consistently.
- _build_generation_params: return (max_tokens, sampler_params,
logits_params, stop_words). Maps repetition_penalty / presence_penalty /
frequency_penalty and passes them through make_logits_processors.
- _finalize_output / _truncate_at_stop: common helper used by Predict
and PredictStream to split reasoning, run parse_tool_calls against the
auto-detected tool module, build ToolCallDelta list, and extract token
counts + logprobs from the last GenerationResult.
- Predict / PredictStream: switch from mlx_vlm.generate to mlx_vlm.stream_generate
in both paths, accumulate text + last_response, pass sampler and
logits_processors through, emit content-only ChatDelta per streaming
chunk followed by a terminal Reply carrying reasoning_content,
tool_calls, prompt_tokens, tokens and logprobs. Non-streaming Predict
returns the same structured Reply shape.
- New helper _collect_media extracted from the duplicated base64 image /
audio decode loop.
- New TokenizeString RPC using the processor's tokenizer.encode and
Free RPC that drops model/processor/config, runs gc + Metal cache
clear + best-effort torch.cuda cache clear.
* feat(importer/mlx): auto-set tool_parser/reasoning_parser on import
Mirror what core/gallery/importers/vllm.go does: after applying the
shared inference defaults, look up the model URI in parser_defaults.json
and append matching tool_parser:/reasoning_parser: entries to Options.
The MLX backends auto-detect tool parsers from the chat template at
runtime so they don't actually consume these options — but surfacing
them in the generated YAML:
- keeps the import experience consistent with vllm
- gives users a single visible place to override
- documents the intended parser for a given model family
* test(mlx): add helper unit tests + TokenizeString/Free + e2e make targets
- backend/python/mlx/test.py: add TestSharedHelpers with server-less
unit tests for parse_options, messages_to_dicts, split_reasoning and
parse_tool_calls (using a SimpleNamespace shim to fake a tool module
without requiring a model). Plus test_tokenize_string and test_free
RPC tests that load a tiny MLX-quantized Llama and exercise the new
RPCs end-to-end.
- backend/python/mlx-vlm/test.py: same helper unit tests + cleanup of
the duplicated import block at the top of the file.
- Makefile: register BACKEND_MLX and BACKEND_MLX_VLM (they were missing
from the docker-build-target eval list — only mlx-distributed had a
generated target before). Add test-extra-backend-mlx and
test-extra-backend-mlx-vlm convenience targets that build the
respective image and run tests/e2e-backends with the tools capability
against mlx-community/Qwen2.5-0.5B-Instruct-4bit. The MLX backend
auto-detects the tool parser from the chat template so no
BACKEND_TEST_OPTIONS is needed (unlike vllm).
* fix(libbackend): don't pass --copies to venv unless PORTABLE_PYTHON=true
backend/python/common/libbackend.sh:ensureVenv() always invoked
'python -m venv --copies', but macOS system python (and some other
builds) refuses with:
Error: This build of python cannot create venvs without using symlinks
--copies only matters when _makeVenvPortable later relocates the venv,
which only happens when PORTABLE_PYTHON=true. Make --copies conditional
on that flag and fall back to default (symlinked) venv otherwise.
Caught while bringing up the mlx backend on Apple Silicon — the same
build path is used by every Python backend with USE_PIP=true.
* fix(mlx): support mlx-lm 0.29.x tool calling + drop deprecated clear_cache
The released mlx-lm 0.29.x ships a much simpler tool-calling API than
HEAD: TokenizerWrapper detects the <tool_call>...</tool_call> markers
from the tokenizer vocab and exposes has_tool_calling /
tool_call_start / tool_call_end, but does NOT expose a tool_parser
callable on the wrapper and does NOT ship a mlx_lm.tool_parsers
subpackage at all (those only exist on main).
Caught while running the smoke test on Apple Silicon with the
released mlx-lm 0.29.1: tokenizer.tool_parser raised AttributeError
(falling through to the underlying HF tokenizer), so
_tool_module_from_tokenizer always returned None and tool calls slipped
through as raw <tool_call>...</tool_call> text in Reply.message instead
of being parsed into ChatDelta.tool_calls.
Fix: when has_tool_calling is True but tokenizer.tool_parser is missing,
default the parse_tool_call callable to json.loads(body.strip()) — that's
exactly what mlx_lm.tool_parsers.json_tools.parse_tool_call does on HEAD
and covers the only format 0.29 detects (<tool_call>JSON</tool_call>).
Future mlx-lm releases that ship more parsers will be picked up
automatically via the tokenizer.tool_parser attribute when present.
Also tighten the LoadModel logging — the old log line read
init_kwargs.get('tool_parser_type') which doesn't exist on 0.29 and
showed None even when has_tool_calling was True. Log the actual
tool_call_start / tool_call_end markers instead.
While here, switch Free()'s Metal cache clear from the deprecated
mx.metal.clear_cache to mx.clear_cache (mlx >= 0.30), with a
fallback for older releases. Mirrored to the mlx-vlm backend.
* feat(mlx-distributed): mirror MLX parity (tool calls + ChatDelta + sampler)
Same treatment as the mlx and mlx-vlm backends: emit Reply.chat_deltas
with structured tool_calls / reasoning_content / token counts /
logprobs, expand sampling parameter coverage beyond temp+top_p, and
add the missing TokenizeString and Free RPCs.
Notes specific to mlx-distributed:
- Rank 0 is the only rank that owns a sampler — workers participate in
the pipeline-parallel forward pass via mx.distributed and don't
re-implement sampling. So the new logits_params (repetition_penalty,
presence_penalty, frequency_penalty) and stop_words apply on rank 0
only; we don't need to extend coordinator.broadcast_generation_params,
which still ships only max_tokens / temperature / top_p to workers
(everything else is a rank-0 concern).
- Free() now broadcasts CMD_SHUTDOWN to workers when a coordinator is
active, so they release the model on their end too. The constant is
already defined and handled by the existing worker loop in
backend.py:633 (CMD_SHUTDOWN = -1).
- Drop the locally-defined is_float / is_int / parse_options trio in
favor of python_utils.parse_options, re-exported under the module
name for back-compat with anything that imported it directly.
- _prepare_prompt: route through messages_to_dicts so tool_call_id /
tool_calls / reasoning_content survive, pass tools=json.loads(
request.Tools) and enable_thinking=True to apply_chat_template, fall
back on TypeError for templates that don't accept those kwargs.
- New _tool_module_from_tokenizer (with the json.loads fallback for
mlx-lm 0.29.x), _finalize_output, _truncate_at_stop helpers — same
contract as the mlx backend.
- LoadModel logs the auto-detected has_tool_calling / has_thinking /
tool_call_start / tool_call_end so users can see what the wrapper
picked up for the loaded model.
- backend/python/mlx-distributed/test.py: add the same TestSharedHelpers
unit tests (parse_options, messages_to_dicts, split_reasoning,
parse_tool_calls) that exist for mlx and mlx-vlm.
199 lines
7.5 KiB
Python
199 lines
7.5 KiB
Python
import os
|
|
import sys
|
|
import types
|
|
import unittest
|
|
import subprocess
|
|
import time
|
|
|
|
import grpc
|
|
import backend_pb2
|
|
import backend_pb2_grpc
|
|
|
|
# Make the shared helpers importable so we can unit-test them without a
|
|
# running gRPC server.
|
|
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
|
|
from python_utils import messages_to_dicts, parse_options
|
|
from mlx_utils import parse_tool_calls, split_reasoning
|
|
|
|
class TestBackendServicer(unittest.TestCase):
|
|
"""
|
|
TestBackendServicer is the class that tests the gRPC service.
|
|
|
|
This class contains methods to test the startup and shutdown of the gRPC service.
|
|
"""
|
|
def setUp(self):
|
|
self.service = subprocess.Popen(["python", "backend.py", "--addr", "localhost:50051"])
|
|
time.sleep(10)
|
|
|
|
def tearDown(self) -> None:
|
|
self.service.terminate()
|
|
self.service.wait()
|
|
|
|
def test_server_startup(self):
|
|
try:
|
|
self.setUp()
|
|
with grpc.insecure_channel("localhost:50051") as channel:
|
|
stub = backend_pb2_grpc.BackendStub(channel)
|
|
response = stub.Health(backend_pb2.HealthMessage())
|
|
self.assertEqual(response.message, b'OK')
|
|
except Exception as err:
|
|
print(err)
|
|
self.fail("Server failed to start")
|
|
finally:
|
|
self.tearDown()
|
|
def test_load_model(self):
|
|
"""
|
|
This method tests if the model is loaded successfully
|
|
"""
|
|
try:
|
|
self.setUp()
|
|
with grpc.insecure_channel("localhost:50051") as channel:
|
|
stub = backend_pb2_grpc.BackendStub(channel)
|
|
response = stub.LoadModel(backend_pb2.ModelOptions(Model="facebook/opt-125m"))
|
|
self.assertTrue(response.success)
|
|
self.assertEqual(response.message, "Model loaded successfully")
|
|
except Exception as err:
|
|
print(err)
|
|
self.fail("LoadModel service failed")
|
|
finally:
|
|
self.tearDown()
|
|
|
|
def test_text(self):
|
|
"""
|
|
This method tests if the embeddings are generated successfully
|
|
"""
|
|
try:
|
|
self.setUp()
|
|
with grpc.insecure_channel("localhost:50051") as channel:
|
|
stub = backend_pb2_grpc.BackendStub(channel)
|
|
response = stub.LoadModel(backend_pb2.ModelOptions(Model="facebook/opt-125m"))
|
|
self.assertTrue(response.success)
|
|
req = backend_pb2.PredictOptions(Prompt="The capital of France is")
|
|
resp = stub.Predict(req)
|
|
self.assertIsNotNone(resp.message)
|
|
except Exception as err:
|
|
print(err)
|
|
self.fail("text service failed")
|
|
finally:
|
|
self.tearDown()
|
|
|
|
def test_sampling_params(self):
|
|
"""
|
|
This method tests if all sampling parameters are correctly processed
|
|
NOTE: this does NOT test for correctness, just that we received a compatible response
|
|
"""
|
|
try:
|
|
self.setUp()
|
|
with grpc.insecure_channel("localhost:50051") as channel:
|
|
stub = backend_pb2_grpc.BackendStub(channel)
|
|
response = stub.LoadModel(backend_pb2.ModelOptions(Model="facebook/opt-125m"))
|
|
self.assertTrue(response.success)
|
|
|
|
req = backend_pb2.PredictOptions(
|
|
Prompt="The capital of France is",
|
|
TopP=0.8,
|
|
Tokens=50,
|
|
Temperature=0.7,
|
|
TopK=40,
|
|
PresencePenalty=0.1,
|
|
FrequencyPenalty=0.2,
|
|
RepetitionPenalty=1.1,
|
|
MinP=0.05,
|
|
Seed=42,
|
|
StopPrompts=["\n"],
|
|
StopTokenIds=[50256],
|
|
BadWords=["badword"],
|
|
IncludeStopStrInOutput=True,
|
|
IgnoreEOS=True,
|
|
MinTokens=5,
|
|
Logprobs=5,
|
|
PromptLogprobs=5,
|
|
SkipSpecialTokens=True,
|
|
SpacesBetweenSpecialTokens=True,
|
|
TruncatePromptTokens=10,
|
|
GuidedDecoding=True,
|
|
N=2,
|
|
)
|
|
resp = stub.Predict(req)
|
|
self.assertIsNotNone(resp.message)
|
|
self.assertIsNotNone(resp.logprobs)
|
|
except Exception as err:
|
|
print(err)
|
|
self.fail("sampling params service failed")
|
|
finally:
|
|
self.tearDown()
|
|
|
|
|
|
def test_embedding(self):
|
|
"""
|
|
This method tests if the embeddings are generated successfully
|
|
"""
|
|
try:
|
|
self.setUp()
|
|
with grpc.insecure_channel("localhost:50051") as channel:
|
|
stub = backend_pb2_grpc.BackendStub(channel)
|
|
response = stub.LoadModel(backend_pb2.ModelOptions(Model="intfloat/e5-mistral-7b-instruct"))
|
|
self.assertTrue(response.success)
|
|
embedding_request = backend_pb2.PredictOptions(Embeddings="This is a test sentence.")
|
|
embedding_response = stub.Embedding(embedding_request)
|
|
self.assertIsNotNone(embedding_response.embeddings)
|
|
# assert that is a list of floats
|
|
self.assertIsInstance(embedding_response.embeddings, list)
|
|
# assert that the list is not empty
|
|
self.assertTrue(len(embedding_response.embeddings) > 0)
|
|
except Exception as err:
|
|
print(err)
|
|
self.fail("Embedding service failed")
|
|
finally:
|
|
self.tearDown()
|
|
|
|
|
|
class TestSharedHelpers(unittest.TestCase):
|
|
"""Server-less unit tests for the helpers the mlx-vlm backend depends on."""
|
|
|
|
def test_parse_options_typed(self):
|
|
opts = parse_options(["temperature:0.7", "max_tokens:128", "trust:true", "name:hello"])
|
|
self.assertEqual(opts["temperature"], 0.7)
|
|
self.assertEqual(opts["max_tokens"], 128)
|
|
self.assertIs(opts["trust"], True)
|
|
self.assertEqual(opts["name"], "hello")
|
|
|
|
def test_messages_to_dicts_roundtrip(self):
|
|
msgs = [
|
|
backend_pb2.Message(role="user", content="hi"),
|
|
backend_pb2.Message(
|
|
role="assistant",
|
|
content="",
|
|
tool_calls='[{"id":"call_1","type":"function","function":{"name":"f","arguments":"{}"}}]',
|
|
),
|
|
backend_pb2.Message(
|
|
role="tool",
|
|
content="42",
|
|
tool_call_id="call_1",
|
|
name="f",
|
|
),
|
|
]
|
|
out = messages_to_dicts(msgs)
|
|
self.assertEqual(out[0], {"role": "user", "content": "hi"})
|
|
self.assertEqual(out[1]["tool_calls"][0]["function"]["name"], "f")
|
|
self.assertEqual(out[2]["tool_call_id"], "call_1")
|
|
|
|
def test_split_reasoning(self):
|
|
r, c = split_reasoning("<think>plan</think>final", "<think>", "</think>")
|
|
self.assertEqual(r, "plan")
|
|
self.assertEqual(c, "final")
|
|
|
|
def test_parse_tool_calls_with_shim(self):
|
|
tm = types.SimpleNamespace(
|
|
tool_call_start="<tool_call>",
|
|
tool_call_end="</tool_call>",
|
|
parse_tool_call=lambda body, tools: {"name": "get_weather", "arguments": {"location": body.strip()}},
|
|
)
|
|
calls, remaining = parse_tool_calls(
|
|
"<tool_call>Paris</tool_call>",
|
|
tm,
|
|
tools=None,
|
|
)
|
|
self.assertEqual(len(calls), 1)
|
|
self.assertEqual(calls[0]["name"], "get_weather")
|
|
self.assertEqual(calls[0]["arguments"], '{"location": "Paris"}') |