fix(mlx): route vision-language models to the mlx-vlm backend (#10274)

Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit declare the "image-text-to-text" pipeline tag on HuggingFace. The mlx importer hardcoded backend "mlx" for every mlx-community model, so these VLMs were served by the text-only mlx-lm backend whose tokenizer does not carry the processor chat template. The template was never applied and the model produced degenerate, looping output that echoed the prompt.

Detect the "image-text-to-text" pipeline tag in the importer and route those models to mlx-vlm, which applies the processor-aware chat template. An explicit backend preference still wins.

As a defensive backstop, the mlx backend now warns loudly when the loaded model has no chat template, so a misrouted VLM surfaces the problem instead of silently looping.

Fixes #10269

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
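
The routing rule described in the message can be sketched roughly as follows. This is an illustration only: the importer change itself is not part of the hunk shown here, and the function name `choose_mlx_backend`, its parameters, and the `VLM_PIPELINE_TAG` constant are hypothetical names chosen for the example, not the project's actual API.

```python
from typing import Optional

# Pipeline tag that vision-language checkpoints declare on the Hub.
VLM_PIPELINE_TAG = "image-text-to-text"


def choose_mlx_backend(
    pipeline_tag: Optional[str],
    preferred_backend: Optional[str] = None,
) -> str:
    """Pick a backend name for an mlx-community model (hypothetical helper)."""
    # An explicit backend preference always wins over detection.
    if preferred_backend:
        return preferred_backend
    # VLMs need the processor-aware chat template that mlx-vlm applies.
    if pipeline_tag == VLM_PIPELINE_TAG:
        return "mlx-vlm"
    # Everything else keeps the previous default.
    return "mlx"
```

Under these assumptions, a model tagged "image-text-to-text" resolves to `mlx-vlm` unless the user asked for a specific backend, and a model with any other tag (or no tag) stays on `mlx`.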
2026-06-13 03:09:03 -04:00 · 2026-06-12 23:12:42 +02:00
parent cec93d2e00
commit a7a7bd646b
3 changed files with 83 additions and 0 deletions
--- a/backend/python/mlx/backend.py
+++ b/backend/python/mlx/backend.py
@@ -407,6 +407,24 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
        if not request.Prompt and request.UseTokenizerTemplate and request.Messages:
            messages = messages_to_dicts(request.Messages)

+            # The mlx-lm tokenizer only carries a text-LM chat template. A
+            # vision-language checkpoint (e.g. gemma-4 E4B) loaded here has no
+            # usable template, so apply_chat_template silently passes the raw
+            # text through and the model just echoes/loops (issue #10269).
+            # Warn loudly so the misroute is visible; such models belong on the
+            # mlx-vlm backend.
+            chat_template = getattr(self.tokenizer, "chat_template", None)
+            if not chat_template:
+                underlying = getattr(self.tokenizer, "_tokenizer", None)
+                chat_template = getattr(underlying, "chat_template", None)
+            if not chat_template:
+                print(
+                    "WARNING: this model has no chat template; output may be "
+                    "degenerate. Vision-language models (e.g. gemma-4 E4B) must "
+                    "use the 'mlx-vlm' backend instead of 'mlx'.",
+                    file=sys.stderr,
+                )
+
            kwargs = {"tokenize": False, "add_generation_prompt": True}
            if request.Tools:
                try: