Adds the 8bit variant missing from #1907 — the safetensors index is now
live on HF.
- `mlx-community/Qwen3.6-35B-A3B-8bit` (~35 GB)
Architectural fields match the existing 4bit/5bit/bf16 cards.
`storage_size.in_bytes` is taken from `metadata.total_size` of the
upstream `model.safetensors.index.json`.
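The extraction step can be sketched as follows; the index layout (`metadata.total_size` plus a `weight_map`) is the standard HF sharded-checkpoint format. The HTTP fetch is elided and the byte count below is a placeholder, not the real 8bit size:

```python
import json

# Sketch: storage_size.in_bytes comes from metadata.total_size in the
# upstream model.safetensors.index.json. Fetching the file is elided;
# the sample total below is illustrative only.
def total_size_from_index(index_body: str) -> int:
    return json.loads(index_body)["metadata"]["total_size"]

sample_index = '{"metadata": {"total_size": 35000000000}, "weight_map": {}}'
```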
## Motivation
`mlx-community` has just published the new **Qwen3.6-35B-A3B**
multimodal MoE family on HuggingFace. Without static model cards exo
doesn't surface these models in the dashboard picker or match its
placement / prefill logic, so users can't one-click launch them. This PR
adds cards for the three quants whose safetensors indexes are already
live on HF (4bit / 5bit / bf16).
## Changes
Three new TOML files in `resources/inference_model_cards/`:
- `mlx-community--Qwen3.6-35B-A3B-4bit.toml` (~19 GB)
- `mlx-community--Qwen3.6-35B-A3B-5bit.toml` (~23 GB)
- `mlx-community--Qwen3.6-35B-A3B-bf16.toml` (~65 GB)
All three share the same architectural fields (`n_layers = 40`,
`hidden_size = 2048`, `num_key_value_heads = 2`, `context_length =
262144`, capabilities `text, thinking, thinking_toggle, vision`,
`base_model = "Qwen3.6 35B A3B"`) — only `model_id`, `quantization`, and
`storage_size.in_bytes` differ between variants.
## Why It Works
- Qwen3.6-35B-A3B reuses the `qwen3_5_moe` architecture
(`Qwen3_5MoeForConditionalGeneration`) — the same one already wired into
exo's MLX runner at `src/exo/worker/engines/mlx/auto_parallel.py:47` via
`Qwen3_5MoeModel`. The architectural fields are taken verbatim from the
HF `config.json.text_config` and match the existing `Qwen3.5-35B-A3B-*`
cards.
- Storage sizes are the exact `metadata.total_size` read from each
variant's `model.safetensors.index.json` on HF, so download progress and
cluster-memory-fit checks are accurate.
- Vision support is flagged in `capabilities`; the `[vision]` block is
auto-detected by `ModelCard._autodetect_vision` from the upstream
`config.json`, so no hand-written vision config is required.
- The card loader (`_refresh_card_cache` in
`src/exo/shared/models/model_cards.py`) globs every `.toml` in
`resources/inference_model_cards/` on startup, so nothing else needs to
change — the `/models` endpoint and the dashboard picker pick them up
automatically.
The `mxfp4` / `mxfp8` / `nvfp4` variants are still uploading upstream
(their index JSONs currently 404) and can be added in a follow-up PR
once the uploads complete.
## Test Plan
### Manual Testing
Hardware: MacBook Pro M4 Max, 48 GB unified memory.
- Built the dashboard, ran `uv run exo`, waited for the API to come up
on `http://localhost:52415`.
- `curl -s http://localhost:52415/models` returns the three new model
ids (`mlx-community/Qwen3.6-35B-A3B-{4bit,5bit,bf16}`) alongside
existing models.
- Opened the dashboard, clicked SELECT MODEL, typed "Qwen3.6" into the
search box. A single **"Qwen3.6 35B A3B"** group appears showing `3
variants (19GB-65GB)`. Expanding it lists the `4bit` / `5bit` / `bf16`
quants with sizes `19GB` / `23GB` / `65GB`, exactly as expected.

- Programmatically loaded each TOML via `ModelCard.load_from_path(...)`
and confirmed the parsed fields (layers / hidden / KV heads / context /
quant / base_model / caps / bytes) match what's written in the files.
### Automated Testing
No code paths were touched — these are pure TOML data files that plug
into the existing model-card loader. The existing pytest suite covers
TOML parsing and card serving; adding new TOMLs doesn't require new test
scaffolding. `uv run ruff check` and `nix fmt` are clean.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ryuichi Leo Takashige <rl.takashige@gmail.com>
## Motivation
The mlx-community [MiniMax-M2.7
collection](https://huggingface.co/collections/mlx-community/minimax-m27)
landed but exo didn't have model cards for any of the variants yet, so
they weren't selectable from the dashboard model picker. Adding cards
also makes them discoverable under the existing MiniMax family entry.
## Changes
Added 6 new model cards in `resources/inference_model_cards/`, one per
quant of MiniMax M2.7:
- `mlx-community--MiniMax-M2.7.toml` (bf16, full precision — 457 GB)
- `mlx-community--MiniMax-M2.7-4bit.toml` (128 GB)
- `mlx-community--MiniMax-M2.7-4bit-mxfp4.toml` (121 GB)
- `mlx-community--MiniMax-M2.7-5bit.toml` (157 GB)
- `mlx-community--MiniMax-M2.7-6bit.toml` (185 GB)
- `mlx-community--MiniMax-M2.7-8bit.toml` (243 GB)
All six use `family = "minimax"` and share `base_model = "MiniMax M2.7"`
so they collapse into a single group in the picker with the existing
MiniMax logo. Architecture fields (`n_layers = 62`, `hidden_size =
3072`, `num_key_value_heads = 8`, `context_length = 196608`) were read
from each repo's `config.json`; `storage_size.in_bytes` was summed from
the HF tree API per repo.
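The summing step amounts to something like the sketch below. The entry shape mirrors the public `/api/models/{repo}/tree` response (file entries carry a `size` field, directories don't); the HTTP call is elided and the sizes are toy values:

```python
# Sketch of summing storage_size.in_bytes from an HF tree listing.
# Only file entries contribute; the actual fetch is elided.
def repo_total_bytes(tree_entries: list[dict]) -> int:
    return sum(
        entry["size"]
        for entry in tree_entries
        if entry.get("type") == "file"
    )

sample_tree = [
    {"type": "file", "path": "model-00001-of-00002.safetensors", "size": 90},
    {"type": "file", "path": "model-00002-of-00002.safetensors", "size": 60},
    {"type": "directory", "path": "figures"},
]
```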
`capabilities = ["text", "thinking"]` follows the existing MiniMax M2.5
cards — the chat template always emits `<think>` tags (no toggle),
matching M2.5 behavior.
## Why It Works
Model cards in `resources/inference_model_cards/` are auto-loaded by
`src/exo/shared/models/model_cards.py::get_model_cards`. The dashboard
picker groups by `base_model` and filters by `family`, so sharing both
across all six variants gives a single "MiniMax M2.7" group under the
MiniMax sidebar entry, with the quant variants exposed as selectable
sub-options.
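The grouping described above reduces to roughly this (card shape simplified; the real picker code lives in the dashboard):

```python
from collections import defaultdict

# Sketch of the picker-side grouping: filter by family, then collapse by
# base_model so quant variants appear together as one group.
def group_cards(cards: list[dict], family: str) -> dict[str, list[dict]]:
    groups: dict[str, list[dict]] = defaultdict(list)
    for card in cards:
        if card["family"] == family:
            groups[card["base_model"]].append(card)
    return dict(groups)

cards = [
    {"family": "minimax", "base_model": "MiniMax M2.7", "quantization": q}
    for q in ("bf16", "4bit", "4bit-mxfp4", "5bit", "6bit", "8bit")
]
groups = group_cards(cards, family="minimax")
```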
## Test Plan
### Manual Testing
Hardware: MacBook Pro M3 Max.
- Ran `uv run python -c "…await get_model_cards()…"` and confirmed all 6
new cards load with `family=minimax`, `base_model="MiniMax M2.7"`, and
correct quant + byte sizes.
- `cd dashboard && npm run build` then `uv run exo`, opened the model
picker → **MiniMax** family → **MiniMax M2.7** group shows all six quant
variants.
### Automated Testing
- No new automated tests — these are data files validated by the
existing Pydantic `ModelCard` schema at load time.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Motivation
Add support for Gemma 4, including VLM!
## Changes
- Add auto parallel strategies and model cards for Gemma 4
- Normalise Gemma 4's special Vision Transformer handling to be in line
with the rest of our vision processors.
- Also adds reprs to messages and base64 hashes to prevent log spam.
## Test Plan
### Manual Testing
Tested manually on 4bit E2B and 8bit 26B
### Automated Testing
Model onboarding shows small logit diffs.
---------
Co-authored-by: Evan <evanev7@gmail.com>
## Summary
DeepSeek V3.2 (`DeepseekV32ForCausalLM`) is already supported by exo's
inference engine (architecture whitelisted in `model_cards.py`, DSML
encoding added in #1548), but **doesn't work out of the box** due to two
bugs:
### Bug 1: `warmup_inference` passes empty model ID
`warmup_inference()` in `generate.py` accepts `model_id: ModelId` as a
parameter but creates `TextGenerationTaskParams(model=ModelId(""), ...)`
instead of using it. Since `_needs_dsml_encoding()` checks
`"deepseek-v3.2" in task_params.model.lower()`, the empty string never
matches → falls back to `tokenizer.apply_chat_template()` →
**ValueError** because V3.2 has no Jinja chat template.
**Fix:** `model=ModelId("")` → `model=model_id` (one line).
### Bug 2: `_needs_dsml_encoding` limited to tool calling
`_needs_dsml_encoding()` returns `True` only when `task_params.tools` is
present or tool messages exist in `chat_template_messages`. For warmup
and regular chat requests without tools → `return False` → Jinja
fallback → **ValueError**.
Unlike V3.1 (which has a `.jinja` chat template file that transformers
picks up automatically), V3.2 **has no Jinja template at all** — it uses
Python-based DSML encoding for all message types.
**Fix:** For V3.2, always return `True` — DSML encoding handles all
message types.
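The fixed predicate reduces to roughly the following sketch. The real `_needs_dsml_encoding` also scans `chat_template_messages` for tool messages; the signature here is simplified:

```python
# Reduced sketch of the fix: V3.2 always takes the DSML path, everything
# else keeps the tools-only behavior.
def needs_dsml_encoding(model: str, has_tools: bool) -> bool:
    if "deepseek-v3.2" in model.lower():
        # No Jinja chat template exists; DSML handles all message types.
        return True
    return has_tools
```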
### Catalog cards
Added inference model cards for:
- `mlx-community/DeepSeek-V3.2-8bit`
- `mlx-community/DeepSeek-V3.2-4bit`
Parameters taken from model `config.json` on HuggingFace, storage sizes
from HF API. Capabilities include `thinking_toggle` (related: #1456).
## Notes
- The model ID string matching approach (`"deepseek-v3.2" in
model.lower()`) is acknowledged tech debt — see #1371 for the planned
architecture-based approach.
## Test plan
- [x] Start exo with DeepSeek V3.2 model → warmup should complete
without crash
- [x] Send a regular chat message (no tools) → should get a response
- [x] Send a chat message with tools → should work as before
- [x] V3.2 cards should appear in the dashboard model catalog
---------
Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
## Problem
Models with fewer KV heads than nodes crash during tensor parallelism.
For example, Qwen3.5 MoE models have only 2 KV heads — trying to shard
across 4 nodes produces empty tensors and a reshape error at runtime.
The placement system already validates `hidden_size % num_nodes == 0`
but doesn't check KV heads, so it creates configurations that look valid
but blow up when the worker tries to split the attention heads.
Affected models include Qwen3.5-35B-A3B, Qwen3.5-122B-A10B,
Qwen3.5-397B-A17B, Qwen3-Next-80B-A3B, and Qwen3-Coder-Next (all have 2
KV heads).
## Changes
**Placement validation** (`src/exo/master/placement.py`):
- Combined KV heads divisibility check with the existing hidden_size
filter in a single pass
- Cycles where `num_key_value_heads % len(cycle) != 0` are now excluded
for tensor sharding
- Error message includes both constraints when no valid cycle is found
**Model card schema** (`src/exo/shared/models/model_cards.py`):
- Added optional `num_key_value_heads` field to `ModelCard` and
`ConfigData`
- Extracted from HuggingFace `config.json` (handles both top-level and
`text_config` nesting)
- Passed through in `fetch_from_hf()` for dynamically fetched cards
**All 68 inference model cards**
(`resources/inference_model_cards/*.toml`):
- Populated `num_key_value_heads` from each model's HuggingFace config
**Utility script** (`scripts/fetch_kv_heads.py`):
- Fetches `num_key_value_heads` from HuggingFace and updates TOML cards
- `--missing`: only fills in cards that don't have the field yet
- `--all`: re-fetches and overwrites everything
- Uses tomlkit for safe TOML editing and ThreadPoolExecutor for parallel
fetches
## Behavior
- Instance previews no longer show tensor options for models that can't
split their KV heads across the cluster size
- `place_instance()` rejects with a clear error instead of crash-looping
- Pipeline parallelism is unaffected
- 2-node tensor still works for 2-KV-head models (2 ÷ 2 = 1)
- Field is optional — existing custom cards without it continue to work
(validation is skipped when `None`)
## Motivation
Qwen3.5 MoE models (e.g., `Qwen3.5-397B-A17B-6bit`) are now supported by
`mlx-lm` via `qwen3_5_moe` model type, but exo lacks tensor parallel
sharding support for this architecture. This prevents running large
Qwen3.5 models across multiple nodes.
Qwen3.5 uses a GatedDeltaNet hybrid attention mechanism similar to
Qwen3-Next, but with a different projection layout — separate
`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a` instead of
Qwen3-Next's combined `in_proj_qkvz` and `in_proj_ba`. This requires
architecture-aware sharding logic.
## Changes (evan summary)
- enable qwen3_5 dense + moe tensor parallelism from config
- defensively skip evalling _cache.keys if it doesn't exist
- ignore kwargs in qwen35 pipeline masking and ensure pipeline segments match global model parameters for mask creation
- add sharding for qwen3_5 moe linear attention
- added another 6 million model cards
## Why It Works
Qwen3.5's GatedDeltaNet has an `in_proj_qkv` linear layer with three
concatenated sections: `[q(key_dim), k(key_dim), v(value_dim)]`. A naive
contiguous split (`segments=1`) would slice across section boundaries,
corrupting q/k/v values and producing garbled output.
By passing `segments=[key_dim, key_dim + key_dim]` to `shard_linear()`,
each section is split independently before distributing across devices.
This ensures every rank receives correctly aligned q, k, and v
components.
The remaining separate projections (`in_proj_z`, `in_proj_b`,
`in_proj_a`) and the MoE layers follow the same `all_to_sharded` /
`sharded_to_all` pattern already used for Qwen3-Next.
Some pipeline splits contained neither an SSM layer nor a linear layer, so that segment of the model behaved as if it shouldn't create the masks the next layer needs. We patch the model to create those masks manually.
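The segmented split can be illustrated with plain row indices. Dimensions here are toy values, not Qwen3.5's real ones, and `shard_linear` in exo operates on weight matrices, but the slicing logic is the same:

```python
# Toy illustration of segments=[key_dim, 2 * key_dim]: each of the q, k,
# and v sections is split across ranks independently, so no rank's slice
# straddles a section boundary. Ints stand in for weight matrix rows.
def shard_rows(rows: list[int], segments: list[int],
               rank: int, world_size: int) -> list[int]:
    bounds = [0, *segments, len(rows)]
    picked: list[int] = []
    for start, end in zip(bounds, bounds[1:]):
        section = rows[start:end]
        step = len(section) // world_size
        picked += section[rank * step:(rank + 1) * step]
    return picked

key_dim, value_dim = 4, 8  # toy sizes
rows = list(range(2 * key_dim + value_dim))  # [q | k | v] concatenated
rank0 = shard_rows(rows, [key_dim, 2 * key_dim], rank=0, world_size=2)
```

Each rank ends up with half of q, half of k, and half of v rather than one contiguous slab, which is exactly what a naive `segments=1` split would get wrong.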
## Test Plan
Tensor sharded models across 2, 3, and 4 nodes, and pipeline sharded across 2, 3, and 4 nodes, with a simple eval.
---------
Co-authored-by: hw <hw@hwStudio1.local>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
The Qwen3-Coder-Next model card TOML files were missing family,
quantization, base_model, and capabilities fields. This caused them not
to appear under the Qwen family filter in the dashboard's model picker.
Added the missing metadata to all five variants (4bit, 5bit, 6bit, 8bit,
bf16), matching the format used by the existing Qwen3-Coder-480B model
cards.
Test plan:
- Eyeballs
## Summary
- Adds model cards for MiniMax M2.5 in three quantizations: 4bit (~129
GB), 6bit (~186 GB), 8bit (~243 GB)
- No code changes needed — `MiniMaxM2ForCausalLM` is already in the
tensor parallel whitelist and `MiniMaxShardingStrategy` is already
implemented in `auto_parallel.py`
- Credit to @vskiwi for confirming MiniMax M2.5 works out of the box
with existing code
Closes #1480
## Test plan
- [x] `basedpyright` passes with 0 errors
- [x] `ruff check` passes
- [x] `pytest` passes (260 passed, 1 skipped)
- [ ] Verify MiniMax M2.5 models appear in model selector on dashboard
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Summary
- Adds `thinking_toggle` capability to 26 model cards that support
toggling thinking mode on/off
- GPT-OSS models (20b, 120b) excluded — they always think and don't
support toggling
- Dashboard UI updated to check for `thinking_toggle` capability before
showing the toggle button
## Test plan
- [x] `uv run basedpyright` — 0 errors
- [x] `uv run ruff check` — all checks passed
- [x] `nix fmt` — 0 files changed
- [x] `uv run pytest` — 188 passed, 0 failed
- [x] Security review passed (no secrets, eval/exec, innerHTML, or dep
changes)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Add GLM 5 support, superseding #1513
## Motivation
Working version of #1366
## Changes
Add Step 3.5 Flash
## Test Plan
### Manual Testing
Works!
### Automated Testing
Running two processes tensor/pipeline sharded gives same logits as
single process.
## Motivation
Image models (FLUX, Qwen Image) had no family grouping or quantization
metadata in the dashboard
## Changes
- Added family, quantization, base_model, and capabilities fields to all
18 image model TOML cards (FLUX.1 variants + Qwen Image variants)
- Added FLUX and Qwen Image SVG logos to FamilyLogos.svelte
- Added "flux" and "qwen-image" families to the sidebar and family sort
order
- Added "Image Gen" and "Image Edit" capability filters in
ModelFilterPopover.svelte
- Added image edit icon/badge to ModelPickerGroup.svelte
- Made the model category sidebar scrollable to accommodate the new
entries
- Hidden scrollbars on model list panels
## Why It Works
Reuses the existing family/quantization grouping infrastructure that
LLMs already use, extending it to image models with appropriate metadata
and icons
## Test Plan
### Manual Testing
Verified image models behave like text models in the model list dialog
---------
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
## Motivation
Add support for FLUX.1-Kontext-dev, an image editing variant of
FLUX.1-dev
## Changes
- New FluxKontextModelAdapter: Handles Kontext's image-to-image workflow
- encodes input image as conditioning latents with special position IDs,
generates from pure noise
- Model config: 57 transformer blocks (19 joint + 38 single), guidance
scale 4.0, ImageToImage task
- Pipeline updates: Added kontext_image_ids property to PromptData
interface, passed through diffusion runner
- Model cards: Added TOML configs for base, 4-bit, and 8-bit variants
- Dependency: mflux 0.15.4 → 0.15.5
- Utility: tmp/quantize_and_upload.py for quantizing and uploading
models to HuggingFace
## Test Plan
### Manual Testing
Works better than Qwen-Image-Edit
## Motivation
Qwen3-Coder-Next just dropped on mlx-community in several quantizations.
It's an 80B MoE model (Qwen3NextForCausalLM) which we already have
tensor parallelism support for via QwenShardingStrategy — just needs
model cards.
## Changes
Added model cards for all 5 available quantizations:
- `mlx-community/Qwen3-Coder-Next-4bit` (~46GB)
- `mlx-community/Qwen3-Coder-Next-5bit` (~58GB)
- `mlx-community/Qwen3-Coder-Next-6bit` (~69GB)
- `mlx-community/Qwen3-Coder-Next-8bit` (~89GB)
- `mlx-community/Qwen3-Coder-Next-bf16` (~158GB)
All with `supports_tensor = true` since the architecture is already
supported.
## Why It Works
`Qwen3NextForCausalLM` is already handled by QwenShardingStrategy in
auto_parallel.py and is in the supports_tensor allowlist in
model_cards.py. No code changes needed — just the TOML card files.
## Test Plan
### Manual Testing
n/a - model card addition only
### Automated Testing
- `basedpyright` — 0 errors
- `ruff check` — passes
- `nix fmt` — no changes
- `pytest` — 173 passed, 1 skipped
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Enable parallel classifier-free guidance (CFG) for Qwen image models.
CFG requires two forward passes (positive/negative prompts) - this
allows them to run on separate nodes simultaneously, reducing latency.
## Changes
- Added uses_cfg flag to ModelCard to identify CFG-based models
- Extended PipelineShardMetadata with CFG topology fields (cfg_rank,
cfg_world_size, peer device info)
- Updated placement to create two CFG groups with reversed ordering
(places CFG peers as ring neighbors)
- Refactored DiffusionRunner to process CFG branches separately with
exchange at last pipeline stage
- Added get_cfg_branch_data() to PromptData for single-branch embeddings
- Fixed seed handling in API for distributed consistency
- Fixed image yield to only emit from CFG rank 0 at last stage
- Increased num_sync_steps_factor from 0.125 to 0.25 for Qwen
## Why It Works
- 2 nodes + CFG: Both run all layers, process different CFG branches in
parallel
- 4+ even nodes + CFG: Hybrid - 2 CFG groups × N/2 pipeline stages
- Odd nodes or non-CFG: Falls back to pure pipeline parallelism
Ring topology places CFG peers as neighbors to enable direct exchange.
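The reversed-group layout can be sketched like this (node names hypothetical; the real placement code carries far more metadata per shard):

```python
# Sketch of the two-group CFG placement on a ring n0-n1-n2-n3-n0. The
# second group's stage order is reversed so the two last-stage peers,
# which must exchange CFG branches, end up as ring neighbors.
def cfg_pipeline_groups(ring: list[str]) -> tuple[list[str], list[str]]:
    half = len(ring) // 2
    positive = ring[:half]                  # stages 0..half-1
    negative = list(reversed(ring[half:]))  # stages 0..half-1, reversed
    return positive, negative

pos, neg = cfg_pipeline_groups(["n0", "n1", "n2", "n3"])
# Last-stage peers are n1 and n2: adjacent on the ring, so the exchange
# at the final pipeline stage is a direct neighbor transfer.
```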
## Test Plan
### Manual Testing
Verified performance gains for Qwen-Image on 2-node and 4-node clusters.
Non-CFG models still work.
### Automated Testing
Added tests in test_placement_utils.py covering 2-node CFG parallel,
4-node hybrid, odd-node fallback, and non-CFG pipeline modes.
## Motivation
Reimplements the model picker modal from #1191 on top of the custom
model support branch. Replaces the inline model dropdown with a
full-featured modal that groups models by base model, supports
filtering, favorites, and HuggingFace Hub search.
## Changes
**Backend:**
- Add `family`, `quantization`, `base_model`, `capabilities` metadata
fields to `ModelCard` and all 40 TOML model cards
- Pass new fields through `ModelListModel` and `get_models()` API
response
- Add `GET /models/search` endpoint using
`huggingface_hub.list_models()`
**Dashboard (7 new files):**
- `ModelPickerModal.svelte` — Main modal with search, family filtering,
HuggingFace Hub tab
- `ModelPickerGroup.svelte` — Expandable model group row with
quantization variants
- `FamilySidebar.svelte` — Vertical sidebar with family icons (All,
Favorites, Hub, model families)
- `FamilyLogos.svelte` — SVG icons for each model family
- `ModelFilterPopover.svelte` — Capability and size range filters
- `HuggingFaceResultItem.svelte` — HF search result item with
download/like counts
- `favorites.svelte.ts` — localStorage-backed favorites store
**Integration:**
- Replace inline dropdown in `+page.svelte` with button that opens
`ModelPickerModal`
- Custom models shown in Hub tab with delete support
**Polish:**
- Real brand logos (Meta, Qwen, DeepSeek, OpenAI, GLM, MiniMax, Kimi,
HuggingFace) from Simple Icons / LobeHub
- Clean SVG stroke icons for capabilities (thinking, code, vision, image
gen)
- Consistent `border-exo-yellow/10` borders, descriptive tooltips
throughout
- Cluster memory (used/total) shown in modal header
- Selected model highlight with checkmark for both single and
multi-variant groups
- Cursor pointer on all interactive elements, fix filter popover
click-outside bug
- Custom models now appear in All tab alongside built-in models
## Bug Fix: Gemma 3 EOS tokens
Also included in this branch: fix for Gemma 3 models generating infinite
`<end_of_turn>` tokens. The tokenizer's `eos_token_ids` was missing
token ID 106 (`<end_of_turn>`), so generation never stopped. The fix
appends this token to the EOS list after loading the tokenizer. Also
handles `eos_token_ids` being a `set` (not just a `list`).
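The fix amounts to roughly the following (token ID 106 is `<end_of_turn>` in Gemma 3's vocabulary; the function name is simplified from the real tokenizer-loading code):

```python
# Sketch of the Gemma 3 EOS patch: append <end_of_turn> (ID 106) to the
# tokenizer's EOS collection after loading, handling both list and set.
END_OF_TURN_ID = 106

def patch_eos_token_ids(eos_token_ids):
    if isinstance(eos_token_ids, set):
        return eos_token_ids | {END_OF_TURN_ID}
    if END_OF_TURN_ID not in eos_token_ids:
        return [*eos_token_ids, END_OF_TURN_ID]
    return eos_token_ids
```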
## Why It Works
Model metadata (family, capabilities, etc.) is stored directly in TOML
cards rather than derived from heuristics, ensuring accuracy. The modal
groups models by `base_model` field so quantization variants appear
together. Custom models are separated into the Hub tab since they lack
grouping metadata.
## Test Plan
### Manual Testing
- Open dashboard, click model selector to open modal
- Browse models by family sidebar, search, and filters
- Expand model groups to see quantization variants
- Star favorites and verify persistence across page reloads
- Navigate to Hub tab, search and add models
- Verify error messages shown for invalid model IDs
- Run a Gemma 3 model and verify generation stops at `<end_of_turn>`
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — clean
- `uv run pytest src/` — 173 passed
- `cd dashboard && npm run build` — builds successfully
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>