Compare commits

...

18 Commits

Author SHA1 Message Date
Evan
2a0b9dc727 move runner ack after status update 2026-02-05 00:07:43 +00:00
ciaranbor
6177550c34 Ciaran/parallel cfg (#1361)
## Motivation

Enable parallel classifier-free guidance (CFG) for Qwen image models.
CFG requires two forward passes (positive/negative prompts) - this
allows them to run on separate nodes simultaneously, reducing latency.

## Changes

  - Added uses_cfg flag to ModelCard to identify CFG-based models
- Extended PipelineShardMetadata with CFG topology fields (cfg_rank,
cfg_world_size, peer device info)
- Updated placement to create two CFG groups with reversed ordering
(places CFG peers as ring neighbors)
- Refactored DiffusionRunner to process CFG branches separately with
exchange at last pipeline stage
- Added get_cfg_branch_data() to PromptData for single-branch embeddings
  - Fixed seed handling in API for distributed consistency
  - Fixed image yield to only emit from CFG rank 0 at last stage
  - Increased num_sync_steps_factor from 0.125 to 0.25 for Qwen

## Why It Works

- 2 nodes + CFG: Both run all layers, process different CFG branches in
parallel
  - 4+ even nodes + CFG: Hybrid - 2 CFG groups × N/2 pipeline stages
  - Odd nodes or non-CFG: Falls back to pure pipeline parallelism
 
 Ring topology places CFG peers as neighbors to enable direct exchange.

## Test Plan

### Manual Testing

Verified performance gain for Qwen-Image for 2 node and 4 node cluster.
Non-CFG models still work

### Automated Testing
 
Added tests in test_placement_utils.py covering 2-node CFG parallel,
4-node hybrid, odd-node fallback, and non-CFG pipeline modes.
2026-02-04 21:16:35 +00:00
Evan Quiney
7b6cad94c6 add resources dir to nix (#1376)
add resources directory to the nix exo package, and fixes the env for the dashboard dir
2026-02-04 16:38:43 +00:00
Alex Cheema
41ed7afb3b feat: add model picker modal with grouped models and HF Hub search (#1369)
## Motivation

Reimplements the model picker modal from #1191 on top of the custom
model support branch. Replaces the inline model dropdown with a
full-featured modal that groups models by base model, supports
filtering, favorites, and HuggingFace Hub search.

## Changes

**Backend:**
- Add `family`, `quantization`, `base_model`, `capabilities` metadata
fields to `ModelCard` and all 40 TOML model cards
- Pass new fields through `ModelListModel` and `get_models()` API
response
- Add `GET /models/search` endpoint using
`huggingface_hub.list_models()`

**Dashboard (7 new files):**
- `ModelPickerModal.svelte` — Main modal with search, family filtering,
HuggingFace Hub tab
- `ModelPickerGroup.svelte` — Expandable model group row with
quantization variants
- `FamilySidebar.svelte` — Vertical sidebar with family icons (All,
Favorites, Hub, model families)
- `FamilyLogos.svelte` — SVG icons for each model family
- `ModelFilterPopover.svelte` — Capability and size range filters
- `HuggingFaceResultItem.svelte` — HF search result item with
download/like counts
- `favorites.svelte.ts` — localStorage-backed favorites store

**Integration:**
- Replace inline dropdown in `+page.svelte` with button that opens
`ModelPickerModal`
- Custom models shown in Hub tab with delete support

**Polish:**
- Real brand logos (Meta, Qwen, DeepSeek, OpenAI, GLM, MiniMax, Kimi,
HuggingFace) from Simple Icons / LobeHub
- Clean SVG stroke icons for capabilities (thinking, code, vision, image
gen)
- Consistent `border-exo-yellow/10` borders, descriptive tooltips
throughout
- Cluster memory (used/total) shown in modal header
- Selected model highlight with checkmark for both single and
multi-variant groups
- Cursor pointer on all interactive elements, fix filter popover
click-outside bug
- Custom models now appear in All tab alongside built-in models

## Bug Fix: Gemma 3 EOS tokens

Also included in this branch: fix for Gemma 3 models generating infinite
`<end_of_turn>` tokens. The tokenizer's `eos_token_ids` was missing
token ID 106 (`<end_of_turn>`), so generation never stopped. The fix
appends this token to the EOS list after loading the tokenizer. Also
handles `eos_token_ids` being a `set` (not just a `list`).

## Why It Works

Model metadata (family, capabilities, etc.) is stored directly in TOML
cards rather than derived from heuristics, ensuring accuracy. The modal
groups models by `base_model` field so quantization variants appear
together. Custom models are separated into the Hub tab since they lack
grouping metadata.

## Test Plan

### Manual Testing
- Open dashboard, click model selector to open modal
- Browse models by family sidebar, search, and filters
- Expand model groups to see quantization variants
- Star favorites and verify persistence across page reloads
- Navigate to Hub tab, search and add models
- Verify error messages shown for invalid model IDs
- Run a Gemma 3 model and verify generation stops at `<end_of_turn>`

### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — clean
- `uv run pytest src/` — 173 passed
- `cd dashboard && npm run build` — builds successfully

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 05:56:23 -08:00
Alex Cheema
2063278906 feat: add custom HuggingFace model support (#1368)
## Motivation

Users should be able to run any HuggingFace model, not just the ones we
ship TOML cards for. Continues the aim of #1191 with a minimal
implementation on top of the current TOML model card system.

Custom cards are saved to `~/.exo/custom_model_cards/` rather than the
bundled `resources/inference_model_cards/` because `RESOURCES_DIR` is
read-only in PyInstaller bundles (`sys._MEIPASS`). This also fixes
`fetch_from_hf` which was saving cards to the wrong path (`resources/`
root instead of `resources/inference_model_cards/`).

## Changes

- Add `EXO_CUSTOM_MODEL_CARDS_DIR` constant
(`~/.exo/custom_model_cards/`)
- Update `model_cards.py`: add custom dir to search path, fix
`save_to_custom_dir`, add `delete_custom_card`/`is_custom_card`
- Add `POST /models/add` and `DELETE /models/custom/{model_id}` API
endpoints
- Add `is_custom` field to `ModelListModel` API response
- Dashboard: add custom model input form in dropdown, delete button for
custom models, show actual API errors, auto-select newly added model

## Why It Works

Two separate directories for model cards: the bundled read-only
`resources/inference_model_cards/` for built-in cards, and user-writable
`~/.exo/custom_model_cards/` for custom cards. Both are scanned when
listing models. This works in all environments including PyInstaller
bundles where `RESOURCES_DIR` points to `sys._MEIPASS`.

## Test Plan

### Manual Testing
- Add a custom model via the dropdown (e.g.
`mlx-community/Llama-3.2-1B-Instruct-4bit`)
- Verify it appears in the model list with the delete (x) button
- Delete it and verify it disappears
- Try adding an invalid model ID and verify the actual error is shown

### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `uv run pytest src/` — passes
- `cd dashboard && npm run build` — builds

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 05:06:15 -08:00
rltakashige
a0f4f36355 Reduce reliance on internet (#1363)
## Motivation

Offline users currently have to wait for every retry to fail before
being able to launch a model.
For users that restart clusters often or share API keys between devices,
we also spam HuggingFace with downloads every 5 minutes.
These issues are caused by _emit_existing_download_progress being
inefficient.

## Changes

- Only query HuggingFace once while EXO is running (assumption being
that a change should only be reflected on a new EXO session)
- Only query HuggingFace when there is an internet connection (polling
connectivity every 10 seconds)
- Request download progress if we switch from no connectivity ->
connected to reduce the wait.
- Reduce download progress sleep as it's no longer expensive (queries
cache most of the time).
- Reduce retries as 30 is way too many.

## Test Plan

### Manual Testing
Manually tested the behaviour.

### Automated Testing
None, should I add any? We do have some tests for this folder, but they
are probably not too helpful.
2026-02-03 20:03:29 +00:00
Alex Cheema
acb97127bf Normalize TextGenerationTaskParams.input to list[InputMessage] (#1360)
## Motivation

With the addition of the Responses API, we introduced `str |
list[InputMessage]` as the type for `TextGenerationTaskParams.input`
since the Responses API supports sending input as a plain string. But
there was no reason to leak that flexibility past the API adapter
boundary — it just meant every downstream consumer had to do `if
isinstance(messages, str):` checks, adding complexity for no benefit.

## Changes

- Changed `TextGenerationTaskParams.input` from `str |
list[InputMessage]` to `list[InputMessage]`
- Each API adapter (Chat Completions, Claude Messages, Responses) now
normalizes to `list[InputMessage]` at the boundary
- Removed `isinstance(task_params.input, str)` branches in
`utils_mlx.py` and `runner.py`
- Wrapped string inputs in `[InputMessage(role="user", content=...)]` in
the warmup path and all test files

## Why It Works

The API adapters are the only place where we deal with raw user input
formats. By normalizing there, all downstream code (worker, runner, MLX
engine) can just assume `list[InputMessage]` and skip the type-checking
branches. The type system (`basedpyright`) catches any missed call sites
at compile time.

## Test Plan

### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — applied
- `uv run pytest` — 174 passed, 1 skipped

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 06:01:56 -08:00
Evan Quiney
d90605f198 migrate model cards to .toml files (#1354) 2026-02-03 12:32:06 +00:00
Evan Quiney
f400b4d7c5 fix InstanceViewModel.swift (#1359)
wasn't caught when we merged the API changes
2026-02-02 18:43:27 +00:00
Evan Quiney
d97bca88e6 improve distributed testing (#1300)
Our distributed test now does a full query cycle for every model loaded
onto the relevant machine. This will help find bugs early, as it already
has found one with Qwen3 Next! I didn't write down what the error was
though. Gooooooood luck with that!

Co-authored-by: rltakashige <rl.takashige@gmail.com>
2026-02-02 18:25:39 +00:00
Alex Cheema
dfce188d99 fix: handle unclosed tool calls and GLM arg parsing edge cases (#1344)
## Motivation

Tool-call requests can hang indefinitely when `max_tokens` truncates
generation mid-tool-call.

## Reproduction

1. Send a chat completion with `tools` and a low `max_tokens` (e.g. 65)
to Qwen3-0.6B
2. Model generates `<think>...</think>` then starts `<tool_call>` but
`max_tokens` cuts it off before `</tool_call>`
3. **Before this fix:** `parse_tool_calls` buffers tokens after
`<tool_call>`, generator exhausts, buffered tokens (including
`finish_reason`) are silently dropped → stream hangs forever
4. **After this fix:** buffered tokens are flushed as regular text with
`finish_reason` propagated → response returns normally with
`finish_reason: "length"`

Confirmed with fresh local testing: 4 unclosed tool call flushes
triggered in a single session. Also confirmed via production logs from
Jan 29 (2 occurrences).

## Changes

1. **`parse_tool_calls` unclosed tool call flush** — when the generator
exhausts inside an open `<tool_call>` block, flush buffered tokens as
regular text and propagate `finish_reason`
2. **GLM regex fix** — match literal `\n` (not escaped `\\n`) between
arg tags; handle missing `</arg_value>` via lookahead
3. **7 new unit tests** for `parse_tool_calls` covering unclosed,
closed, passthrough, and failed-parse scenarios

## Why It Works

- `parse_tool_calls` now has a post-loop check: if `in_tool_call` is
still true, it yields the buffered text with the tracked `finish_reason`
instead of silently dropping it
- The GLM regex now matches real-world output where newlines appear
between tags and `</arg_value>` may be absent

## Test Plan

### Manual Testing
- Qwen3-0.6B-4bit with `tools` + various `max_tokens` values (61-75)
- Confirmed responses return with `finish_reason: "length"` instead of
hanging
- Log output shows `"generator exhausted inside unclosed tool call,
flushing buffered text"`

### Automated Testing
- 7 new tests in `test_parse_tool_calls.py`
- Full test suite passes (`uv run pytest`)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
2026-02-02 17:45:51 +00:00
Evan Quiney
54b19879a0 create config home when checking for config file (#1353)
we didn't check before, raising a critical exception.
now we create ~/.config/exo on linux systems before touching config.toml.

this wasn't caught before since everything lives in ~/.exo on macos, and we no longer write the keypair to CONFIG_HOME, so config.toml has to do init work it avoided before.
2026-02-02 17:36:51 +00:00
ciaranbor
19965c7ba5 Ciaran/profiling (#1345)
## Motivation

Know what the hell is going on

## Changes

- Tracing library (src/exo/shared/tracing.py): trace() context manager,
Chrome Trace Format export, statistics computation
- Runner instrumentation
(src/exo/worker/engines/image/pipeline/runner.py): Wrapped sync/async
steps, compute blocks, and send/recv operations
- Trace collection: Workers send traces to master after task completion;
merged into ~/.exo/traces/trace_{task_id}.json
  - API endpoints: List, fetch, stats, and raw download at /v1/traces/*
  - Dashboard: Trace list and detail pages with Perfetto integration

## Why It Works

<img width="1236" height="767" alt="Screenshot 2026-01-30 at 19 00 09"
src="https://github.com/user-attachments/assets/73e6e46d-ba10-4e83-ba99-ff1c3f62ab05"
/>
<img width="1659" height="89" alt="Screenshot 2026-01-30 at 19 00 58"
src="https://github.com/user-attachments/assets/c0fd0e65-e4fc-4fd5-920d-b43b2887d591"
/>
2026-02-02 17:19:45 +00:00
Evan Quiney
3e27ead705 remove mdns discovered peers from appearing in state (#1312)
## motivation
eagerly discovered peers through gossipsub were added to state. this left things looking broken from one-sided connections

## changes
the worker no longer writes topology edges from these gossipsub messages
we now strictly rely on http-discovered topology, which tends to be more reflective of the actual state of the systems connectivity
2026-02-02 16:58:53 +00:00
Alex Cheema
d826d309b3 chore: gitignore hosts_*.json files (#1343)
## Motivation

`hosts_*.json` files are local host configuration snapshots that
shouldn't be tracked in version control.

## Changes

Added `hosts_*.json` pattern to `.gitignore`.

## Why It Works

The glob pattern `hosts_*.json` matches any file starting with `hosts_`
and ending with `.json` in the repo root.

## Test Plan

### Manual Testing
- Verified that `hosts_*.json` files are ignored by git after this
change.

### Automated Testing
- No automated tests needed for a `.gitignore` change.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 16:14:11 +00:00
Alex Cheema
c3537980bd feat: add Claude Messages API and OpenAI Responses API support (#1167)
## Motivation

Add support for Claude Messages API and OpenAI Responses API to allow
users to interact with exo using these popular API formats. This enables
broader compatibility with existing tooling and SDKs that expect these
API formats.

## Architecture

Adapter logic lives exclusively in the API layer
(`src/exo/master/adapters/`). On the way in, each adapter converts its
API-specific request type (`ChatCompletionRequest`,
`ClaudeMessagesRequest`, `ResponsesRequest`) into
`TextGenerationTaskParams`. On the way out, each adapter converts the
`TokenChunk` stream back into its API-specific response format.
Everything inside the application — commands, worker, runner, event
sourcing — only sees `TextGenerationTaskParams` and `TokenChunk`. No
API-specific types cross the boundary.

```
                          API layer                         │  Application internals
                                                            │
Chat Completions → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ChatCompletionResponse
Claude Messages  → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ClaudeMessagesResponse
Responses API    → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ResponsesResponse
```

## Changes

### New Files
- `src/exo/shared/types/claude_api.py` - Pydantic types for Claude
Messages API
- `src/exo/shared/types/openai_responses.py` - Pydantic types for OpenAI
Responses API
- `src/exo/shared/types/text_generation.py` - Shared
`TextGenerationTaskParams` internal type
- `src/exo/master/adapters/chat_completions.py` - Chat Completions
adapter (streaming/non-streaming)
- `src/exo/master/adapters/claude.py` - Claude Messages adapter
(streaming/non-streaming)
- `src/exo/master/adapters/responses.py` - OpenAI Responses adapter
(streaming/non-streaming)

### Modified Files
- `src/exo/master/api.py` - Refactored to use adapters uniformly for all
endpoints; extracted `_resolve_and_validate_text_model` helper to
deduplicate model validation across all text endpoints; removed ad-hoc
`try/except ValueError` blocks from non-streaming paths

### New Endpoints
- `POST /v1/messages` - Claude Messages API (streaming and
non-streaming)
- `POST /v1/responses` - OpenAI Responses API (streaming and
non-streaming)

## Why It Works

All APIs are implemented as pure conversion adapters at the edge of the
application:
1. Adapter functions in `src/exo/master/adapters/` convert incoming
requests to `TextGenerationTaskParams`
2. `api.py` wraps the params in a `TextGeneration` command and sends it
through the existing command/event flow
3. The worker, runner, and event sourcing layers only handle
`TextGenerationTaskParams` and `TokenChunk` — they have no awareness of
Chat Completions, Claude, or Responses API formats
4. On response, adapter functions convert the `TokenChunk` stream back
to the caller's expected format
5. Model validation is handled by a single shared helper
(`_resolve_and_validate_text_model`), mirroring the existing
`_validate_image_model` pattern for image endpoints

No changes to core inference logic were needed.

### Streaming Formats
- **Chat Completions**: Uses `data: {...}\n\n` with `[DONE]` terminator
- **Claude**: Uses event types `message_start`, `content_block_start`,
`content_block_delta`, `content_block_stop`, `message_delta`,
`message_stop`
- **OpenAI Responses**: Uses event types `response.created`,
`response.in_progress`, `response.output_item.added`,
`response.content_part.added`, `response.output_text.delta`,
`response.output_text.done`, `response.content_part.done`,
`response.output_item.done`, `response.completed`

## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max

**Non-streaming tests:**
```bash
# Chat Completions API
curl -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'

# Claude Messages API
curl -X POST http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}]}'

# OpenAI Responses API
curl -X POST http://localhost:52415/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "input": "Hello", "max_output_tokens": 20}'
```

**Streaming tests:**
```bash
# Chat Completions API (streaming)
curl -N -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "stream": true, "max_tokens": 20}'

# Claude Messages API (streaming)
curl -N -X POST http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

# OpenAI Responses API (streaming)
curl -N -X POST http://localhost:52415/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "input": "Hello", "stream": true, "max_output_tokens": 20}'
```

All endpoints tested successfully with proper response formats and
streaming events.

### Automated Testing
- Tests in `src/exo/master/tests/` all pass (85 tests)
- Type checker (basedpyright) passes with 0 errors
- Linter (ruff) passes

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
2026-02-02 15:58:37 +00:00
rltakashige
21d477f1cb Update exo bench (#1357)
## Motivation

Make exo bench faster for longer prompts, lengthen default timeouts and
use pairs for pp and tg.

## Changes

- Uses binary search to find correct prompt
- Flag to force all combinations if that is desired
2026-02-02 15:46:15 +00:00
Jake Hillion
b2579c78fe nix: add macmon to PATH in wrapper scripts on Darwin
`nix run .#exo` couldn't find `macmon` because the Nix wrapper scripts
didn't include it in PATH, causing `shutil.which("macmon")` to fail.

Added `--prefix PATH : ${pkgs.macmon}/bin` to the `makeWrapper` call,
conditional on Darwin via `lib.optionalString`, so macmon's binary is
available at runtime without modifying the user's system PATH.

Test plan:
- Verified `nix build .#exo` succeeds
- Checked wrapper script contains macmon store path in PATH prefix
2026-02-02 13:42:36 +00:00
131 changed files with 8568 additions and 2029 deletions

View File

@@ -142,4 +142,6 @@ jobs:
# Run pytest outside sandbox (needs GPU access for MLX)
export HOME="$RUNNER_TEMP"
export EXO_TESTS=1
export EXO_DASHBOARD_DIR="$PWD/dashboard/"
export EXO_RESOURCES_DIR="$PWD/resources"
$TEST_ENV/bin/python -m pytest src -m "not slow" --import-mode=importlib

4
.gitignore vendored
View File

@@ -28,3 +28,7 @@ target/
dashboard/build/
dashboard/node_modules/
dashboard/.svelte-kit/
# host config snapshots
hosts_*.json
.swp

View File

@@ -108,6 +108,7 @@ class TokenizerWrapper:
_tokenizer: PreTrainedTokenizerFast
eos_token_id: int | None
eos_token: str | None
eos_token_ids: list[int] | set[int] | None
bos_token_id: int | None
bos_token: str | None
vocab_size: int
@@ -117,7 +118,7 @@ class TokenizerWrapper:
self,
tokenizer: Any,
detokenizer_class: Any = ...,
eos_token_ids: list[int] | None = ...,
eos_token_ids: list[int] | set[int] | None = ...,
chat_template: Any = ...,
tool_parser: Any = ...,
tool_call_start: str | None = ...,

View File

@@ -293,7 +293,7 @@ struct ClusterTask {
let modelName: String?
let promptPreview: String?
let errorMessage: String?
let parameters: ChatCompletionTaskParameters?
let parameters: TextGenerationTaskParameters?
var sortPriority: Int {
switch status {
@@ -330,12 +330,12 @@ struct ClusterTaskPayload: Decodable {
let taskStatus: TaskStatus?
let instanceId: String?
let commandId: String?
let taskParams: ChatCompletionTaskParameters?
let taskParams: TextGenerationTaskParameters?
let errorType: String?
let errorMessage: String?
}
struct ChatCompletionTaskParameters: Decodable, Equatable {
struct TextGenerationTaskParameters: Decodable, Equatable {
let model: String?
let messages: [ChatCompletionMessage]?
let maxTokens: Int?
@@ -374,7 +374,7 @@ extension ClusterTask {
guard let id = payload.taskId else { return nil }
let status = payload.taskStatus ?? .unknown
switch kindKey {
case "ChatCompletion":
case "TextGeneration":
self.init(
id: id,
status: status,

View File

@@ -216,7 +216,7 @@ struct InstanceTaskViewModel: Identifiable, Equatable {
let promptPreview: String?
let errorMessage: String?
let subtitle: String?
let parameters: ChatCompletionTaskParameters?
let parameters: TextGenerationTaskParameters?
var title: String {
switch kind {

View File

@@ -5,10 +5,13 @@ from __future__ import annotations
import argparse
import contextlib
import http.client
import itertools
import json
import os
import sys
import time
from collections.abc import Callable
from pathlib import Path
from statistics import mean
from typing import Any
from urllib.parse import urlencode
@@ -16,6 +19,84 @@ from urllib.parse import urlencode
from loguru import logger
from transformers import AutoTokenizer
# Monkey-patch for transformers 5.x compatibility
# Kimi's tokenization_kimi.py imports bytes_to_unicode from the old location
# which was moved in transformers 5.0.0rc2
try:
import transformers.models.gpt2.tokenization_gpt2 as gpt2_tokenization
from transformers.convert_slow_tokenizer import bytes_to_unicode
if not hasattr(gpt2_tokenization, "bytes_to_unicode"):
gpt2_tokenization.bytes_to_unicode = bytes_to_unicode # type: ignore[attr-defined]
except ImportError:
pass # transformers < 5.0 or bytes_to_unicode not available
def load_tokenizer_for_bench(model_id: str) -> Any:
"""
Load tokenizer for benchmarking, with special handling for Kimi models.
Kimi uses a custom TikTokenTokenizer that transformers 5.x can't load via AutoTokenizer.
This function replicates the logic from utils_mlx.py for bench compatibility.
"""
model_id_lower = model_id.lower()
if "kimi-k2" in model_id_lower:
import importlib.util
import types
from huggingface_hub import snapshot_download
# Download/get the model path
model_path = Path(
snapshot_download(
model_id,
allow_patterns=["*.json", "*.py", "*.tiktoken"],
)
)
sys.path.insert(0, str(model_path))
# Load tool_declaration_ts first (tokenization_kimi imports it with relative import)
tool_decl_path = model_path / "tool_declaration_ts.py"
if tool_decl_path.exists():
spec = importlib.util.spec_from_file_location(
"tool_declaration_ts", tool_decl_path
)
if spec and spec.loader:
tool_decl_module = importlib.util.module_from_spec(spec)
sys.modules["tool_declaration_ts"] = tool_decl_module
spec.loader.exec_module(tool_decl_module)
# Load tokenization_kimi with patched source (convert relative to absolute import)
tok_path = model_path / "tokenization_kimi.py"
source = tok_path.read_text()
source = source.replace("from .tool_declaration_ts", "from tool_declaration_ts")
spec = importlib.util.spec_from_file_location("tokenization_kimi", tok_path)
if spec:
tok_module = types.ModuleType("tokenization_kimi")
tok_module.__file__ = str(tok_path)
sys.modules["tokenization_kimi"] = tok_module
exec(compile(source, tok_path, "exec"), tok_module.__dict__) # noqa: S102
TikTokenTokenizer = tok_module.TikTokenTokenizer # noqa: N806
else:
from tokenization_kimi import TikTokenTokenizer # type: ignore[import-not-found] # noqa: I001
hf_tokenizer: Any = TikTokenTokenizer.from_pretrained(model_path)
# Patch encode to use internal tiktoken model directly
# transformers 5.x has a bug in the encode->pad path for slow tokenizers
def _patched_encode(text: str, **kwargs: object) -> list[int]:
# Pass allowed_special="all" to handle special tokens like <|im_user|>
return list(hf_tokenizer.model.encode(text, allowed_special="all"))
hf_tokenizer.encode = _patched_encode
return hf_tokenizer
# Default: use AutoTokenizer
return AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
class ExoHttpError(RuntimeError):
def __init__(self, status: int, reason: str, body_preview: str):
@@ -24,7 +105,7 @@ class ExoHttpError(RuntimeError):
class ExoClient:
def __init__(self, host: str, port: int, timeout_s: float = 600.0):
def __init__(self, host: str, port: int, timeout_s: float = 7200.0):
self.host = host
self.port = port
self.timeout_s = timeout_s
@@ -180,14 +261,7 @@ def parse_int_list(values: list[str]) -> list[int]:
part = part.strip()
if part:
items.append(int(part))
seen: set[int] = set()
out: list[int] = []
for x in items:
if x not in seen:
out.append(x)
seen.add(x)
return out
return items
def resolve_model_short_id(client: ExoClient, model_arg: str) -> tuple[str, str]:
@@ -240,7 +314,11 @@ def run_one_completion(
stats = out.get("generation_stats")
preview = (out.get("choices") or [{}])[0]["message"]["content"][:200]
# Extract preview, handling None content (common for thinking models)
choices = out.get("choices") or [{}]
message = choices[0].get("message", {}) if choices else {}
content = message.get("content") or ""
preview = content[:200] if content else ""
return {
"elapsed_s": elapsed,
@@ -277,12 +355,29 @@ class PromptSizer:
f"Target ({target}) is smaller than template overhead ({self.base_tokens})."
)
content = ""
tok = self.count_fn(content)
# Estimate tokens per atom using a sample
sample_count = 100
sample_content = self.atom * sample_count
sample_tokens = self.count_fn(sample_content) - self.base_tokens
tokens_per_atom = sample_tokens / sample_count
while tok < target:
content += self.atom
tok = self.count_fn(content)
# Estimate starting point
needed_tokens = target - self.base_tokens
estimated_atoms = int(needed_tokens / tokens_per_atom)
# Binary search to find exact atom count
low, high = 0, estimated_atoms * 2 + 100
while low < high:
mid = (low + high) // 2
tok = self.count_fn(self.atom * mid)
if tok < target:
low = mid + 1
else:
high = mid
content = self.atom * low
tok = self.count_fn(content)
logger.info(f"{tok=}")
if tok != target:
raise RuntimeError(
@@ -348,7 +443,7 @@ def main() -> int:
help="Warmup runs per placement (uses first pp/tg).",
)
ap.add_argument(
"--timeout", type=float, default=600.0, help="HTTP timeout (seconds)."
"--timeout", type=float, default=7200.0, help="HTTP timeout (seconds)."
)
ap.add_argument(
"--json-out",
@@ -358,6 +453,11 @@ def main() -> int:
ap.add_argument(
"--dry-run", action="store_true", help="List selected placements and exit."
)
ap.add_argument(
"--all-combinations",
action="store_true",
help="Force all pp×tg combinations (cartesian product) even when lists have equal length.",
)
args = ap.parse_args()
pp_list = parse_int_list(args.pp)
@@ -369,6 +469,15 @@ def main() -> int:
logger.error("--repeat must be >= 1")
return 2
# Log pairing mode
use_combinations = args.all_combinations or len(pp_list) != len(tg_list)
if use_combinations:
logger.info(
f"pp/tg mode: combinations (product) - {len(pp_list) * len(tg_list)} pairs"
)
else:
logger.info(f"pp/tg mode: tandem (zip) - {len(pp_list)} pairs")
client = ExoClient(args.host, args.port, timeout_s=args.timeout)
short_id, full_model_id = resolve_model_short_id(client, args.model)
@@ -377,10 +486,7 @@ def main() -> int:
)
previews = previews_resp.get("previews") or []
tokenizer = AutoTokenizer.from_pretrained(
full_model_id,
trust_remote_code=True,
)
tokenizer = load_tokenizer_for_bench(full_model_id)
if tokenizer is None:
raise RuntimeError("[exo-bench] tokenizer load failed")
@@ -486,60 +592,55 @@ def main() -> int:
)
logger.debug(f" warmup {i + 1}/{args.warmup} done")
for pp in pp_list:
# if (
# pp * n_nodes > 2048
# and "ring" in instance_meta.lower()
# and "tensor" in sharding.lower()
# ):
# model_card = MODEL_CARDS[short_id]
# if model_card.metadata.storage_size > Memory.from_gb(10):
# logger.info(
# f"Skipping tensor ring as this is too slow for model of size {model_card.metadata.storage_size} on {n_nodes=}"
# )
# continue
for tg in tg_list:
runs: list[dict[str, Any]] = []
for r in range(args.repeat):
time.sleep(3)
try:
row, actual_pp_tokens = run_one_completion(
client, full_model_id, pp, tg, prompt_sizer
)
except Exception as e:
logger.error(e)
continue
row.update(
{
"model_short_id": short_id,
"model_id": full_model_id,
"placement_sharding": sharding,
"placement_instance_meta": instance_meta,
"placement_nodes": n_nodes,
"instance_id": instance_id,
"pp_tokens": actual_pp_tokens,
"tg": tg,
"repeat_index": r,
}
)
runs.append(row)
all_rows.append(row)
# If pp and tg lists have same length, run in tandem (zip)
# Otherwise (or if --all-combinations), run all combinations (cartesian product)
if use_combinations:
pp_tg_pairs = list(itertools.product(pp_list, tg_list))
else:
pp_tg_pairs = list(zip(pp_list, tg_list, strict=True))
if runs:
prompt_tps = mean(x["stats"]["prompt_tps"] for x in runs)
gen_tps = mean(x["stats"]["generation_tps"] for x in runs)
ptok = mean(x["stats"]["prompt_tokens"] for x in runs)
gtok = mean(x["stats"]["generation_tokens"] for x in runs)
peak = mean(
x["stats"]["peak_memory_usage"]["inBytes"] for x in runs
for pp, tg in pp_tg_pairs:
runs: list[dict[str, Any]] = []
for r in range(args.repeat):
time.sleep(3)
try:
row, actual_pp_tokens = run_one_completion(
client, full_model_id, pp, tg, prompt_sizer
)
except Exception as e:
logger.error(e)
continue
row.update(
{
"model_short_id": short_id,
"model_id": full_model_id,
"placement_sharding": sharding,
"placement_instance_meta": instance_meta,
"placement_nodes": n_nodes,
"instance_id": instance_id,
"pp_tokens": actual_pp_tokens,
"tg": tg,
"repeat_index": r,
}
)
runs.append(row)
all_rows.append(row)
logger.info(
f"prompt_tps={prompt_tps:.2f} gen_tps={gen_tps:.2f} "
f"prompt_tokens={ptok} gen_tokens={gtok} "
f"peak_memory={format_peak_memory(peak)}\n"
)
time.sleep(2)
if runs:
prompt_tps = mean(x["stats"]["prompt_tps"] for x in runs)
gen_tps = mean(x["stats"]["generation_tps"] for x in runs)
ptok = mean(x["stats"]["prompt_tokens"] for x in runs)
gtok = mean(x["stats"]["generation_tokens"] for x in runs)
peak = mean(
x["stats"]["peak_memory_usage"]["inBytes"] for x in runs
)
logger.info(
f"prompt_tps={prompt_tps:.2f} gen_tps={gen_tps:.2f} "
f"prompt_tokens={ptok} gen_tokens={gtok} "
f"peak_memory={format_peak_memory(peak)}\n"
)
time.sleep(2)
finally:
try:
client.request_json("DELETE", f"/instance/{instance_id}")

View File

@@ -0,0 +1,73 @@
<script lang="ts">
type FamilyLogoProps = {
family: string;
class?: string;
};
let { family, class: className = "" }: FamilyLogoProps = $props();
</script>
{#if family === "favorites"}
<svg class="w-6 h-6 {className}" viewBox="0 0 24 24" fill="currentColor">
<path
d="M12 2l3.09 6.26L22 9.27l-5 4.87 1.18 6.88L12 17.77l-6.18 3.25L7 14.14 2 9.27l6.91-1.01L12 2z"
/>
</svg>
{:else if family === "llama" || family === "meta"}
<svg class="w-6 h-6 {className}" viewBox="0 0 24 24" fill="currentColor">
<path
d="M6.915 4.03c-1.968 0-3.683 1.28-4.871 3.113C.704 9.208 0 11.883 0 14.449c0 .706.07 1.369.21 1.973a6.624 6.624 0 0 0 .265.86 5.297 5.297 0 0 0 .371.761c.696 1.159 1.818 1.927 3.593 1.927 1.497 0 2.633-.671 3.965-2.444.76-1.012 1.144-1.626 2.663-4.32l.756-1.339.186-.325c.061.1.121.196.183.3l2.152 3.595c.724 1.21 1.665 2.556 2.47 3.314 1.046.987 1.992 1.22 3.06 1.22 1.075 0 1.876-.355 2.455-.843a3.743 3.743 0 0 0 .81-.973c.542-.939.861-2.127.861-3.745 0-2.72-.681-5.357-2.084-7.45-1.282-1.912-2.957-2.93-4.716-2.93-1.047 0-2.088.467-3.053 1.308-.652.57-1.257 1.29-1.82 2.05-.69-.875-1.335-1.547-1.958-2.056-1.182-.966-2.315-1.303-3.454-1.303zm10.16 2.053c1.147 0 2.188.758 2.992 1.999 1.132 1.748 1.647 4.195 1.647 6.4 0 1.548-.368 2.9-1.839 2.9-.58 0-1.027-.23-1.664-1.004-.496-.601-1.343-1.878-2.832-4.358l-.617-1.028a44.908 44.908 0 0 0-1.255-1.98c.07-.109.141-.224.211-.327 1.12-1.667 2.118-2.602 3.358-2.602zm-10.201.553c1.265 0 2.058.791 2.675 1.446.307.327.737.871 1.234 1.579l-1.02 1.566c-.757 1.163-1.882 3.017-2.837 4.338-1.191 1.649-1.81 1.817-2.486 1.817-.524 0-1.038-.237-1.383-.794-.263-.426-.464-1.13-.464-2.046 0-2.221.63-4.535 1.66-6.088.454-.687.964-1.226 1.533-1.533a2.264 2.264 0 0 1 1.088-.285z"
/>
</svg>
{:else if family === "qwen"}
<svg class="w-6 h-6 {className}" viewBox="0 0 24 24" fill="currentColor">
<path
d="M12.604 1.34c.393.69.784 1.382 1.174 2.075a.18.18 0 00.157.091h5.552c.174 0 .322.11.446.327l1.454 2.57c.19.337.24.478.024.837-.26.43-.513.864-.76 1.3l-.367.658c-.106.196-.223.28-.04.512l2.652 4.637c.172.301.111.494-.043.77-.437.785-.882 1.564-1.335 2.34-.159.272-.352.375-.68.37-.777-.016-1.552-.01-2.327.016a.099.099 0 00-.081.05 575.097 575.097 0 01-2.705 4.74c-.169.293-.38.363-.725.364-.997.003-2.002.004-3.017.002a.537.537 0 01-.465-.271l-1.335-2.323a.09.09 0 00-.083-.049H4.982c-.285.03-.553-.001-.805-.092l-1.603-2.77a.543.543 0 01-.002-.54l1.207-2.12a.198.198 0 000-.197 550.951 550.951 0 01-1.875-3.272l-.79-1.395c-.16-.31-.173-.496.095-.965.465-.813.927-1.625 1.387-2.436.132-.234.304-.334.584-.335a338.3 338.3 0 012.589-.001.124.124 0 00.107-.063l2.806-4.895a.488.488 0 01.422-.246c.524-.001 1.053 0 1.583-.006L11.704 1c.341-.003.724.032.9.34zm-3.432.403a.06.06 0 00-.052.03L6.254 6.788a.157.157 0 01-.135.078H3.253c-.056 0-.07.025-.041.074l5.81 10.156c.025.042.013.062-.034.063l-2.795.015a.218.218 0 00-.2.116l-1.32 2.31c-.044.078-.021.118.068.118l5.716.008c.046 0 .08.02.104.061l1.403 2.454c.046.081.092.082.139 0l5.006-8.76.783-1.382a.055.055 0 01.096 0l1.424 2.53a.122.122 0 00.107.062l2.763-.02a.04.04 0 00.035-.02.041.041 0 000-.04l-2.9-5.086a.108.108 0 010-.113l.293-.507 1.12-1.977c.024-.041.012-.062-.035-.062H9.2c-.059 0-.073-.026-.043-.077l1.434-2.505a.107.107 0 000-.114L9.225 1.774a.06.06 0 00-.053-.031zm6.29 8.02c.046 0 .058.02.034.06l-.832 1.465-2.613 4.585a.056.056 0 01-.05.029.058.058 0 01-.05-.029L8.498 9.841c-.02-.034-.01-.052.028-.054l.216-.012 6.722-.012z"
/>
</svg>
{:else if family === "deepseek"}
<svg class="w-6 h-6 {className}" viewBox="0 0 24 24" fill="currentColor">
<path
d="M23.748 4.482c-.254-.124-.364.113-.512.234-.051.039-.094.09-.137.136-.372.397-.806.657-1.373.626-.829-.046-1.537.214-2.163.848-.133-.782-.575-1.248-1.247-1.548-.352-.156-.708-.311-.955-.65-.172-.241-.219-.51-.305-.774-.055-.16-.11-.323-.293-.35-.2-.031-.278.136-.356.276-.313.572-.434 1.202-.422 1.84.027 1.436.633 2.58 1.838 3.393.137.093.172.187.129.323-.082.28-.18.552-.266.833-.055.179-.137.217-.329.14a5.526 5.526 0 01-1.736-1.18c-.857-.828-1.631-1.742-2.597-2.458a11.365 11.365 0 00-.689-.471c-.985-.957.13-1.743.388-1.836.27-.098.093-.432-.779-.428-.872.004-1.67.295-2.687.684a3.055 3.055 0 01-.465.137 9.597 9.597 0 00-2.883-.102c-1.885.21-3.39 1.102-4.497 2.623C.082 8.606-.231 10.684.152 12.85c.403 2.284 1.569 4.175 3.36 5.653 1.858 1.533 3.997 2.284 6.438 2.14 1.482-.085 3.133-.284 4.994-1.86.47.234.962.327 1.78.397.63.059 1.236-.03 1.705-.128.735-.156.684-.837.419-.961-2.155-1.004-1.682-.595-2.113-.926 1.096-1.296 2.746-2.642 3.392-7.003.05-.347.007-.565 0-.845-.004-.17.035-.237.23-.256a4.173 4.173 0 001.545-.475c1.396-.763 1.96-2.015 2.093-3.517.02-.23-.004-.467-.247-.588zM11.581 18c-2.089-1.642-3.102-2.183-3.52-2.16-.392.024-.321.471-.235.763.09.288.207.486.371.739.114.167.192.416-.113.603-.673.416-1.842-.14-1.897-.167-1.361-.802-2.5-1.86-3.301-3.307-.774-1.393-1.224-2.887-1.298-4.482-.02-.386.093-.522.477-.592a4.696 4.696 0 011.529-.039c2.132.312 3.946 1.265 5.468 2.774.868.86 1.525 1.887 2.202 2.891.72 1.066 1.494 2.082 2.48 2.914.348.292.625.514.891.677-.802.09-2.14.11-3.054-.614zm1-6.44a.306.306 0 01.415-.287.302.302 0 01.2.288.306.306 0 01-.31.307.303.303 0 01-.304-.308zm3.11 1.596c-.2.081-.399.151-.59.16a1.245 1.245 0 01-.798-.254c-.274-.23-.47-.358-.552-.758a1.73 1.73 0 01.016-.588c.07-.327-.008-.537-.239-.727-.187-.156-.426-.199-.688-.199a.559.559 0 01-.254-.078c-.11-.054-.2-.19-.114-.358.028-.054.16-.186.192-.21.356-.202.767-.136 1.146.016.352.144.618.408 1.001.782.391.451.462.576.685.914.176.265.336.537.445.848.067.195-.019.354-.25.452z"
/>
</svg>
{:else if family === "openai" || family === "gpt-oss"}
<svg class="w-6 h-6 {className}" viewBox="0 0 24 24" fill="currentColor">
<path
d="M22.2819 9.8211a5.9847 5.9847 0 0 0-.5157-4.9108 6.0462 6.0462 0 0 0-6.5098-2.9A6.0651 6.0651 0 0 0 4.9807 4.1818a5.9847 5.9847 0 0 0-3.9977 2.9 6.0462 6.0462 0 0 0 .7427 7.0966 5.98 5.98 0 0 0 .511 4.9107 6.051 6.051 0 0 0 6.5146 2.9001A5.9847 5.9847 0 0 0 13.2599 24a6.0557 6.0557 0 0 0 5.7718-4.2058 5.9894 5.9894 0 0 0 3.9977-2.9001 6.0557 6.0557 0 0 0-.7475-7.0729zm-9.022 12.6081a4.4755 4.4755 0 0 1-2.8764-1.0408l.1419-.0804 4.7783-2.7582a.7948.7948 0 0 0 .3927-.6813v-6.7369l2.02 1.1686a.071.071 0 0 1 .038.052v5.5826a4.504 4.504 0 0 1-4.4945 4.4944zm-9.6607-4.1254a4.4708 4.4708 0 0 1-.5346-3.0137l.142.0852 4.783 2.7582a.7712.7712 0 0 0 .7806 0l5.8428-3.3685v2.3324a.0804.0804 0 0 1-.0332.0615L9.74 19.9502a4.4992 4.4992 0 0 1-6.1408-1.6464zM2.3408 7.8956a4.485 4.485 0 0 1 2.3655-1.9728V11.6a.7664.7664 0 0 0 .3879.6765l5.8144 3.3543-2.0201 1.1685a.0757.0757 0 0 1-.071 0l-4.8303-2.7865A4.504 4.504 0 0 1 2.3408 7.872zm16.5963 3.8558L13.1038 8.364 15.1192 7.2a.0757.0757 0 0 1 .071 0l4.8303 2.7913a4.4944 4.4944 0 0 1-.6765 8.1042v-5.6772a.79.79 0 0 0-.407-.667zm2.0107-3.0231l-.142-.0852-4.7735-2.7818a.7759.7759 0 0 0-.7854 0L9.409 9.2297V6.8974a.0662.0662 0 0 1 .0284-.0615l4.8303-2.7866a4.4992 4.4992 0 0 1 6.6802 4.66zM8.3065 12.863l-2.02-1.1638a.0804.0804 0 0 1-.038-.0567V6.0742a4.4992 4.4992 0 0 1 7.3757-3.4537l-.142.0805L8.704 5.459a.7948.7948 0 0 0-.3927.6813zm1.0976-2.3654l2.602-1.4998 2.6069 1.4998v2.9994l-2.5974 1.4997-2.6067-1.4997Z"
/>
</svg>
{:else if family === "glm"}
<svg class="w-6 h-6 {className}" viewBox="0 0 24 24" fill="currentColor">
<path
d="M11.991 23.503a.24.24 0 00-.244.248.24.24 0 00.244.249.24.24 0 00.245-.249.24.24 0 00-.22-.247l-.025-.001zM9.671 5.365a1.697 1.697 0 011.099 2.132l-.071.172-.016.04-.018.054c-.07.16-.104.32-.104.498-.035.71.47 1.279 1.186 1.314h.366c1.309.053 2.338 1.173 2.286 2.523-.052 1.332-1.152 2.38-2.478 2.327h-.174c-.715.018-1.274.64-1.239 1.368 0 .124.018.23.053.337.209.373.54.658.96.8.75.23 1.517-.125 1.9-.782l.018-.035c.402-.64 1.17-.96 1.92-.711.854.284 1.378 1.226 1.099 2.167a1.661 1.661 0 01-2.077 1.102 1.711 1.711 0 01-.907-.711l-.017-.035c-.2-.323-.463-.58-.851-.711l-.056-.018a1.646 1.646 0 00-1.954.746 1.66 1.66 0 01-1.065.764 1.677 1.677 0 01-1.989-1.279c-.209-.906.332-1.83 1.257-2.043a1.51 1.51 0 01.296-.035h.018c.68-.071 1.151-.622 1.116-1.333a1.307 1.307 0 00-.227-.693 2.515 2.515 0 01-.366-1.403 2.39 2.39 0 01.366-1.208c.14-.195.21-.444.227-.693.018-.71-.506-1.261-1.186-1.332l-.07-.018a1.43 1.43 0 01-.299-.07l-.05-.019a1.7 1.7 0 01-1.047-2.114 1.68 1.68 0 012.094-1.101zm-5.575 10.11c.26-.264.639-.367.994-.27.355.096.633.379.728.74.095.362-.007.748-.267 1.013-.402.41-1.053.41-1.455 0a1.062 1.062 0 010-1.482zm14.845-.294c.359-.09.738.024.992.297.254.274.344.665.237 1.025-.107.36-.396.634-.756.718-.551.128-1.1-.22-1.23-.781a1.05 1.05 0 01.757-1.26zm-.064-4.39c.314.32.49.753.49 1.206 0 .452-.176.886-.49 1.206-.315.32-.74.5-1.185.5-.444 0-.87-.18-1.184-.5a1.727 1.727 0 010-2.412 1.654 1.654 0 012.369 0zm-11.243.163c.364.484.447 1.128.218 1.691a1.665 1.665 0 01-2.188.923c-.855-.36-1.26-1.358-.907-2.228a1.68 1.68 0 011.33-1.038c.593-.08 1.183.169 1.547.652zm11.545-4.221c.368 0 .708.2.892.524.184.324.184.724 0 1.048a1.026 1.026 0 01-.892.524c-.568 0-1.03-.47-1.03-1.048 0-.579.462-1.048 1.03-1.048zm-14.358 0c.368 0 .707.2.891.524.184.324.184.724 0 1.048a1.026 1.026 0 01-.891.524c-.569 0-1.03-.47-1.03-1.048 0-.579.461-1.048 1.03-1.048zm10.031-1.475c.925 0 1.675.764 1.675 1.706s-.75 1.705-1.675 1.705-1.674-.763-1.674-1.705c0-.942.75-1.706 1.674-1.706zm-2.626-.684c.362-.082.653-.356.761-.718a1.062 1.062 0 00-.238-1.028 1.017 1.017 0 00-.996-.294c-.547.14-.881.7-.752 1.257.13.558.675.907 1.225.783zm0 16.876c.359-.087.644-.36.75-.72a1.062 1.062 0 00-.237-1.019 1.018 1.018 0 00-.985-.301 1.037 1.037 0 00-.762.717c-.108.361-.017.754.239 1.028.245.263.606.377.953.305l.043-.01zM17.19 3.5a.631.631 0 00.628-.64c0-.355-.279-.64-.628-.64a.631.631 0 00-.628.64c0 .355.28.64.628.64zm-10.38 0a.631.631 0 00.628-.64c0-.355-.28-.64-.628-.64a.631.631 0 00-.628.64c0 .355.279.64.628.64zm-5.182 7.852a.631.631 0 00-.628.64c0 .354.28.639.628.639a.63.63 0 00.627-.606l.001-.034a.62.62 0 00-.628-.64zm5.182 9.13a.631.631 0 00-.628.64c0 .355.279.64.628.64a.631.631 0 00.628-.64c0-.355-.28-.64-.628-.64zm10.38.018a.631.631 0 00-.628.64c0 .355.28.64.628.64a.631.631 0 00.628-.64c0-.355-.279-.64-.628-.64zm5.182-9.148a.631.631 0 00-.628.64c0 .354.279.639.628.639a.631.631 0 00.628-.64c0-.355-.28-.64-.628-.64zm-.384-4.992a.24.24 0 00.244-.249.24.24 0 00-.244-.249.24.24 0 00-.244.249c0 .142.122.249.244.249zM11.991.497a.24.24 0 00.245-.248A.24.24 0 0011.99 0a.24.24 0 00-.244.249c0 .133.108.236.223.247l.021.001zM2.011 6.36a.24.24 0 00.245-.249.24.24 0 00-.244-.249.24.24 0 00-.244.249.24.24 0 00.244.249zm0 11.263a.24.24 0 00-.243.248.24.24 0 00.244.249.24.24 0 00.244-.249.252.252 0 00-.244-.248zm19.995-.018a.24.24 0 00-.245.248.24.24 0 00.245.25.24.24 0 00.244-.25.252.252 0 00-.244-.248z"
/>
</svg>
{:else if family === "minimax"}
<svg class="w-6 h-6 {className}" viewBox="0 0 24 24" fill="currentColor">
<path
d="M16.278 2c1.156 0 2.093.927 2.093 2.07v12.501a.74.74 0 00.744.709.74.74 0 00.743-.709V9.099a2.06 2.06 0 012.071-2.049A2.06 2.06 0 0124 9.1v6.561a.649.649 0 01-.652.645.649.649 0 01-.653-.645V9.1a.762.762 0 00-.766-.758.762.762 0 00-.766.758v7.472a2.037 2.037 0 01-2.048 2.026 2.037 2.037 0 01-2.048-2.026v-12.5a.785.785 0 00-.788-.753.785.785 0 00-.789.752l-.001 15.904A2.037 2.037 0 0113.441 22a2.037 2.037 0 01-2.048-2.026V18.04c0-.356.292-.645.652-.645.36 0 .652.289.652.645v1.934c0 .263.142.506.372.638.23.131.514.131.744 0a.734.734 0 00.372-.638V4.07c0-1.143.937-2.07 2.093-2.07zm-5.674 0c1.156 0 2.093.927 2.093 2.07v11.523a.648.648 0 01-.652.645.648.648 0 01-.652-.645V4.07a.785.785 0 00-.789-.78.785.785 0 00-.789.78v14.013a2.06 2.06 0 01-2.07 2.048 2.06 2.06 0 01-2.071-2.048V9.1a.762.762 0 00-.766-.758.762.762 0 00-.766.758v3.8a2.06 2.06 0 01-2.071 2.049A2.06 2.06 0 010 12.9v-1.378c0-.357.292-.646.652-.646.36 0 .653.29.653.646V12.9c0 .418.343.757.766.757s.766-.339.766-.757V9.099a2.06 2.06 0 012.07-2.048 2.06 2.06 0 012.071 2.048v8.984c0 .419.343.758.767.758.423 0 .766-.339.766-.758V4.07c0-1.143.937-2.07 2.093-2.07z"
/>
</svg>
{:else if family === "kimi"}
<svg class="w-6 h-6 {className}" viewBox="0 0 24 24" fill="currentColor">
<path
d="M19.738 5.776c.163-.209.306-.4.457-.585.07-.087.064-.153-.004-.244-.655-.861-.717-1.817-.34-2.787.283-.73.909-1.072 1.674-1.145.477-.045.945.004 1.379.236.57.305.902.77 1.01 1.412.086.512.07 1.012-.075 1.508-.257.878-.888 1.333-1.753 1.448-.718.096-1.446.108-2.17.157-.056.004-.113 0-.178 0z"
/>
<path
d="M17.962 1.844h-4.326l-3.425 7.81H5.369V1.878H1.5V22h3.87v-8.477h6.824a3.025 3.025 0 002.743-1.75V22h3.87v-8.477a3.87 3.87 0 00-3.588-3.86v-.01h-2.125a3.94 3.94 0 002.323-2.12l2.545-5.689z"
/>
</svg>
{:else if family === "huggingface"}
<svg class="w-6 h-6 {className}" viewBox="0 0 24 24" fill="currentColor">
<path
d="M12.025 1.13c-5.77 0-10.449 4.647-10.449 10.378 0 1.112.178 2.181.503 3.185.064-.222.203-.444.416-.577a.96.96 0 0 1 .524-.15c.293 0 .584.124.84.284.278.173.48.408.71.694.226.282.458.611.684.951v-.014c.017-.324.106-.622.264-.874s.403-.487.762-.543c.3-.047.596.06.787.203s.31.313.4.467c.15.257.212.468.233.542.01.026.653 1.552 1.657 2.54.616.605 1.01 1.223 1.082 1.912.055.537-.096 1.059-.38 1.572.637.121 1.294.187 1.967.187.657 0 1.298-.063 1.921-.178-.287-.517-.44-1.041-.384-1.581.07-.69.465-1.307 1.081-1.913 1.004-.987 1.647-2.513 1.657-2.539.021-.074.083-.285.233-.542.09-.154.208-.323.4-.467a1.08 1.08 0 0 1 .787-.203c.359.056.604.29.762.543s.247.55.265.874v.015c.225-.34.457-.67.683-.952.23-.286.432-.52.71-.694.257-.16.547-.284.84-.285a.97.97 0 0 1 .524.151c.228.143.373.388.43.625l.006.04a10.3 10.3 0 0 0 .534-3.273c0-5.731-4.678-10.378-10.449-10.378M8.327 6.583a1.5 1.5 0 0 1 .713.174 1.487 1.487 0 0 1 .617 2.013c-.183.343-.762-.214-1.102-.094-.38.134-.532.914-.917.71a1.487 1.487 0 0 1 .69-2.803m7.486 0a1.487 1.487 0 0 1 .689 2.803c-.385.204-.536-.576-.916-.71-.34-.12-.92.437-1.103.094a1.487 1.487 0 0 1 .617-2.013 1.5 1.5 0 0 1 .713-.174m-10.68 1.55a.96.96 0 1 1 0 1.921.96.96 0 0 1 0-1.92m13.838 0a.96.96 0 1 1 0 1.92.96.96 0 0 1 0-1.92M8.489 11.458c.588.01 1.965 1.157 3.572 1.164 1.607-.007 2.984-1.155 3.572-1.164.196-.003.305.12.305.454 0 .886-.424 2.328-1.563 3.202-.22-.756-1.396-1.366-1.63-1.32q-.011.001-.02.006l-.044.026-.01.008-.03.024q-.018.017-.035.036l-.032.04a1 1 0 0 0-.058.09l-.014.025q-.049.088-.11.19a1 1 0 0 1-.083.116 1.2 1.2 0 0 1-.173.18q-.035.029-.075.058a1.3 1.3 0 0 1-.251-.243 1 1 0 0 1-.076-.107c-.124-.193-.177-.363-.337-.444-.034-.016-.104-.008-.2.022q-.094.03-.216.087-.06.028-.125.063l-.13.074q-.067.04-.136.086a3 3 0 0 0-.135.096 3 3 0 0 0-.26.219 2 2 0 0 0-.12.121 2 2 0 0 0-.106.128l-.002.002a2 2 0 0 0-.09.132l-.001.001a1.2 1.2 0 0 0-.105.212q-.013.036-.024.073c-1.139-.875-1.563-2.317-1.563-3.203 0-.334.109-.457.305-.454m.836 10.354c.824-1.19.766-2.082-.365-3.194-1.13-1.112-1.789-2.738-1.789-2.738s-.246-.945-.806-.858-.97 1.499.202 2.362c1.173.864-.233 1.45-.685.64-.45-.812-1.683-2.896-2.322-3.295s-1.089-.175-.938.647 2.822 2.813 2.562 3.244-1.176-.506-1.176-.506-2.866-2.567-3.49-1.898.473 1.23 2.037 2.16c1.564.932 1.686 1.178 1.464 1.53s-3.675-2.511-4-1.297c-.323 1.214 3.524 1.567 3.287 2.405-.238.839-2.71-1.587-3.216-.642-.506.946 3.49 2.056 3.522 2.064 1.29.33 4.568 1.028 5.713-.624m5.349 0c-.824-1.19-.766-2.082.365-3.194 1.13-1.112 1.789-2.738 1.789-2.738s.246-.945.806-.858.97 1.499-.202 2.362c-1.173.864.233 1.45.685.64.451-.812 1.683-2.896 2.322-3.295s1.089-.175.938.647-2.822 2.813-2.562 3.244 1.176-.506 1.176-.506 2.866-2.567 3.49-1.898-.473 1.23-2.037 2.16c-1.564.932-1.686 1.178-1.464 1.53s3.675-2.511 4-1.297c.323 1.214-3.524 1.567-3.287 2.405.238.839 2.71-1.587 3.216-.642.506.946-3.49 2.056-3.522 2.064-1.29.33-4.568 1.028-5.713-.624"
/>
</svg>
{:else}
<svg class="w-6 h-6 {className}" viewBox="0 0 24 24" fill="currentColor">
<path
d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-2 15l-5-5 1.41-1.41L10 14.17l7.59-7.59L19 8l-9 9z"
/>
</svg>
{/if}

View File

@@ -0,0 +1,142 @@
<script lang="ts">
import FamilyLogos from "./FamilyLogos.svelte";
type FamilySidebarProps = {
families: string[];
selectedFamily: string | null;
hasFavorites: boolean;
onSelect: (family: string | null) => void;
};
let { families, selectedFamily, hasFavorites, onSelect }: FamilySidebarProps =
$props();
// Family display names
const familyNames: Record<string, string> = {
favorites: "Favorites",
huggingface: "Hub",
llama: "Meta",
qwen: "Qwen",
deepseek: "DeepSeek",
"gpt-oss": "OpenAI",
glm: "GLM",
minimax: "MiniMax",
kimi: "Kimi",
};
function getFamilyName(family: string): string {
return (
familyNames[family] || family.charAt(0).toUpperCase() + family.slice(1)
);
}
</script>
<div
class="flex flex-col gap-1 py-2 px-1 border-r border-exo-yellow/10 bg-exo-medium-gray/30 min-w-[64px]"
>
<!-- All models (no filter) -->
<button
type="button"
onclick={() => onSelect(null)}
class="group flex flex-col items-center justify-center p-2 rounded transition-all duration-200 cursor-pointer {selectedFamily ===
null
? 'bg-exo-yellow/20 border-l-2 border-exo-yellow'
: 'hover:bg-white/5 border-l-2 border-transparent'}"
title="All models"
>
<svg
class="w-5 h-5 {selectedFamily === null
? 'text-exo-yellow'
: 'text-white/50 group-hover:text-white/70'}"
viewBox="0 0 24 24"
fill="currentColor"
>
<path
d="M4 8h4V4H4v4zm6 12h4v-4h-4v4zm-6 0h4v-4H4v4zm0-6h4v-4H4v4zm6 0h4v-4h-4v4zm6-10v4h4V4h-4zm-6 4h4V4h-4v4zm6 6h4v-4h-4v4zm0 6h4v-4h-4v4z"
/>
</svg>
<span
class="text-[9px] font-mono mt-0.5 {selectedFamily === null
? 'text-exo-yellow'
: 'text-white/40 group-hover:text-white/60'}">All</span
>
</button>
<!-- Favorites (only show if has favorites) -->
{#if hasFavorites}
<button
type="button"
onclick={() => onSelect("favorites")}
class="group flex flex-col items-center justify-center p-2 rounded transition-all duration-200 cursor-pointer {selectedFamily ===
'favorites'
? 'bg-exo-yellow/20 border-l-2 border-exo-yellow'
: 'hover:bg-white/5 border-l-2 border-transparent'}"
title="Show favorited models"
>
<FamilyLogos
family="favorites"
class={selectedFamily === "favorites"
? "text-amber-400"
: "text-white/50 group-hover:text-amber-400/70"}
/>
<span
class="text-[9px] font-mono mt-0.5 {selectedFamily === 'favorites'
? 'text-amber-400'
: 'text-white/40 group-hover:text-white/60'}">Faves</span
>
</button>
{/if}
<!-- HuggingFace Hub -->
<button
type="button"
onclick={() => onSelect("huggingface")}
class="group flex flex-col items-center justify-center p-2 rounded transition-all duration-200 cursor-pointer {selectedFamily ===
'huggingface'
? 'bg-orange-500/20 border-l-2 border-orange-400'
: 'hover:bg-white/5 border-l-2 border-transparent'}"
title="Browse and add models from Hugging Face"
>
<FamilyLogos
family="huggingface"
class={selectedFamily === "huggingface"
? "text-orange-400"
: "text-white/50 group-hover:text-orange-400/70"}
/>
<span
class="text-[9px] font-mono mt-0.5 {selectedFamily === 'huggingface'
? 'text-orange-400'
: 'text-white/40 group-hover:text-white/60'}">Hub</span
>
</button>
<div class="h-px bg-exo-yellow/10 my-1"></div>
<!-- Model families -->
{#each families as family}
<button
type="button"
onclick={() => onSelect(family)}
class="group flex flex-col items-center justify-center p-2 rounded transition-all duration-200 cursor-pointer {selectedFamily ===
family
? 'bg-exo-yellow/20 border-l-2 border-exo-yellow'
: 'hover:bg-white/5 border-l-2 border-transparent'}"
title={getFamilyName(family)}
>
<FamilyLogos
{family}
class={selectedFamily === family
? "text-exo-yellow"
: "text-white/50 group-hover:text-white/70"}
/>
<span
class="text-[9px] font-mono mt-0.5 truncate max-w-full {selectedFamily ===
family
? 'text-exo-yellow'
: 'text-white/40 group-hover:text-white/60'}"
>
{getFamilyName(family)}
</span>
</button>
{/each}
</div>

View File

@@ -0,0 +1,127 @@
<script lang="ts">
interface HuggingFaceModel {
id: string;
author: string;
downloads: number;
likes: number;
last_modified: string;
tags: string[];
}
type HuggingFaceResultItemProps = {
model: HuggingFaceModel;
isAdded: boolean;
isAdding: boolean;
onAdd: () => void;
onSelect: () => void;
};
let {
model,
isAdded,
isAdding,
onAdd,
onSelect,
}: HuggingFaceResultItemProps = $props();
function formatNumber(num: number): string {
if (num >= 1000000) {
return `${(num / 1000000).toFixed(1)}M`;
} else if (num >= 1000) {
return `${(num / 1000).toFixed(1)}k`;
}
return num.toString();
}
// Extract model name from full ID (e.g., "mlx-community/Llama-3.2-1B" -> "Llama-3.2-1B")
const modelName = $derived(model.id.split("/").pop() || model.id);
</script>
<div
class="flex items-center justify-between gap-3 px-3 py-2.5 hover:bg-white/5 transition-colors border-b border-white/5 last:border-b-0"
>
<div class="flex-1 min-w-0">
<div class="flex items-center gap-2">
<span class="text-sm font-mono text-white truncate" title={model.id}
>{modelName}</span
>
{#if isAdded}
<span
class="px-1.5 py-0.5 text-[10px] font-mono bg-green-500/20 text-green-400 rounded"
>Added</span
>
{/if}
</div>
<div class="flex items-center gap-3 mt-0.5 text-xs text-white/40">
<span class="truncate">{model.author}</span>
<span
class="flex items-center gap-1 shrink-0"
title="Downloads in the last 30 days"
>
<svg
class="w-3 h-3"
fill="none"
stroke="currentColor"
viewBox="0 0 24 24"
>
<path
stroke-linecap="round"
stroke-linejoin="round"
stroke-width="2"
d="M4 16v1a3 3 0 003 3h10a3 3 0 003-3v-1m-4-4l-4 4m0 0l-4-4m4 4V4"
/>
</svg>
{formatNumber(model.downloads)}
</span>
<span
class="flex items-center gap-1 shrink-0"
title="Community likes on Hugging Face"
>
<svg
class="w-3 h-3"
fill="none"
stroke="currentColor"
viewBox="0 0 24 24"
>
<path
stroke-linecap="round"
stroke-linejoin="round"
stroke-width="2"
d="M4.318 6.318a4.5 4.5 0 000 6.364L12 20.364l7.682-7.682a4.5 4.5 0 00-6.364-6.364L12 7.636l-1.318-1.318a4.5 4.5 0 00-6.364 0z"
/>
</svg>
{formatNumber(model.likes)}
</span>
</div>
</div>
<div class="flex items-center gap-2 shrink-0">
{#if isAdded}
<button
type="button"
onclick={onSelect}
class="px-3 py-1.5 text-xs font-mono tracking-wider uppercase bg-exo-yellow/10 text-exo-yellow border border-exo-yellow/30 hover:bg-exo-yellow/20 transition-colors rounded cursor-pointer"
>
Select
</button>
{:else}
<button
type="button"
onclick={onAdd}
disabled={isAdding}
class="px-3 py-1.5 text-xs font-mono tracking-wider uppercase bg-orange-500/10 text-orange-400 border border-orange-400/30 hover:bg-orange-500/20 transition-colors rounded cursor-pointer disabled:opacity-50 disabled:cursor-not-allowed"
>
{#if isAdding}
<span class="flex items-center gap-1.5">
<span
class="w-3 h-3 border-2 border-orange-400 border-t-transparent rounded-full animate-spin"
></span>
Adding...
</span>
{:else}
+ Add
{/if}
</button>
{/if}
</div>
</div>

View File

@@ -0,0 +1,182 @@
<script lang="ts">
import { fly } from "svelte/transition";
import { cubicOut } from "svelte/easing";
interface FilterState {
capabilities: string[];
sizeRange: { min: number; max: number } | null;
}
type ModelFilterPopoverProps = {
filters: FilterState;
onChange: (filters: FilterState) => void;
onClear: () => void;
onClose: () => void;
};
let { filters, onChange, onClear, onClose }: ModelFilterPopoverProps =
$props();
// Available capabilities
const availableCapabilities = [
{ id: "text", label: "Text" },
{ id: "thinking", label: "Thinking" },
{ id: "code", label: "Code" },
{ id: "vision", label: "Vision" },
];
// Size ranges
const sizeRanges = [
{ label: "< 10GB", min: 0, max: 10 },
{ label: "10-50GB", min: 10, max: 50 },
{ label: "50-200GB", min: 50, max: 200 },
{ label: "> 200GB", min: 200, max: 10000 },
];
function toggleCapability(cap: string) {
const next = filters.capabilities.includes(cap)
? filters.capabilities.filter((c) => c !== cap)
: [...filters.capabilities, cap];
onChange({ ...filters, capabilities: next });
}
function selectSizeRange(range: { min: number; max: number } | null) {
// Toggle off if same range is clicked
if (
filters.sizeRange &&
range &&
filters.sizeRange.min === range.min &&
filters.sizeRange.max === range.max
) {
onChange({ ...filters, sizeRange: null });
} else {
onChange({ ...filters, sizeRange: range });
}
}
function handleClickOutside(e: MouseEvent) {
const target = e.target as HTMLElement;
if (
!target.closest(".filter-popover") &&
!target.closest(".filter-toggle")
) {
onClose();
}
}
</script>
<svelte:window onclick={handleClickOutside} />
<!-- svelte-ignore a11y_no_static_element_interactions -->
<div
class="filter-popover absolute right-0 top-full mt-2 w-64 bg-exo-dark-gray border border-exo-yellow/10 rounded-lg shadow-xl z-10"
transition:fly={{ y: -10, duration: 200, easing: cubicOut }}
onclick={(e) => e.stopPropagation()}
role="dialog"
aria-label="Filter options"
>
<div class="p-3 space-y-4">
<!-- Capabilities -->
<div>
<h4 class="text-xs font-mono text-white/50 mb-2">Capabilities</h4>
<div class="flex flex-wrap gap-1.5">
{#each availableCapabilities as cap}
{@const isSelected = filters.capabilities.includes(cap.id)}
<button
type="button"
class="px-2 py-1 text-xs font-mono rounded transition-colors {isSelected
? 'bg-exo-yellow/20 text-exo-yellow border border-exo-yellow/30'
: 'bg-white/5 text-white/60 hover:bg-white/10 border border-transparent'}"
onclick={() => toggleCapability(cap.id)}
>
{#if cap.id === "text"}
<svg
class="w-3.5 h-3.5 inline-block"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="1.5"
><path
d="M21 15a2 2 0 0 1-2 2H7l-4 4V5a2 2 0 0 1 2-2h14a2 2 0 0 1 2 2z"
stroke-linecap="round"
stroke-linejoin="round"
/></svg
>
{:else if cap.id === "thinking"}
<svg
class="w-3.5 h-3.5 inline-block"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="1.5"
><path
d="M12 2a7 7 0 0 0-7 7c0 2.38 1.19 4.47 3 5.74V17a1 1 0 0 0 1 1h6a1 1 0 0 0 1-1v-2.26c1.81-1.27 3-3.36 3-5.74a7 7 0 0 0-7-7zM9 20h6M10 22h4"
stroke-linecap="round"
stroke-linejoin="round"
/></svg
>
{:else if cap.id === "code"}
<svg
class="w-3.5 h-3.5 inline-block"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="1.5"
><path
d="M16 18l6-6-6-6M8 6l-6 6 6 6"
stroke-linecap="round"
stroke-linejoin="round"
/></svg
>
{:else if cap.id === "vision"}
<svg
class="w-3.5 h-3.5 inline-block"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="1.5"
><path
d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"
stroke-linecap="round"
stroke-linejoin="round"
/><circle cx="12" cy="12" r="3" /></svg
>
{/if}
<span class="ml-1">{cap.label}</span>
</button>
{/each}
</div>
</div>
<!-- Size range -->
<div>
<h4 class="text-xs font-mono text-white/50 mb-2">Model Size</h4>
<div class="flex flex-wrap gap-1.5">
{#each sizeRanges as range}
{@const isSelected =
filters.sizeRange &&
filters.sizeRange.min === range.min &&
filters.sizeRange.max === range.max}
<button
type="button"
class="px-2 py-1 text-xs font-mono rounded transition-colors {isSelected
? 'bg-exo-yellow/20 text-exo-yellow border border-exo-yellow/30'
: 'bg-white/5 text-white/60 hover:bg-white/10 border border-transparent'}"
onclick={() => selectSizeRange(range)}
>
{range.label}
</button>
{/each}
</div>
</div>
<!-- Clear button -->
<button
type="button"
class="w-full py-1.5 text-xs font-mono text-white/50 hover:text-white/70 hover:bg-white/5 rounded transition-colors"
onclick={onClear}
>
Clear all filters
</button>
</div>
</div>

View File

@@ -0,0 +1,324 @@
<script lang="ts">
interface ModelInfo {
id: string;
name?: string;
storage_size_megabytes?: number;
base_model?: string;
quantization?: string;
supports_tensor?: boolean;
capabilities?: string[];
family?: string;
is_custom?: boolean;
}
interface ModelGroup {
id: string;
name: string;
capabilities: string[];
family: string;
variants: ModelInfo[];
smallestVariant: ModelInfo;
hasMultipleVariants: boolean;
}
type ModelPickerGroupProps = {
group: ModelGroup;
isExpanded: boolean;
isFavorite: boolean;
selectedModelId: string | null;
canModelFit: (id: string) => boolean;
onToggleExpand: () => void;
onSelectModel: (modelId: string) => void;
onToggleFavorite: (baseModelId: string) => void;
onShowInfo: (group: ModelGroup) => void;
};
let {
group,
isExpanded,
isFavorite,
selectedModelId,
canModelFit,
onToggleExpand,
onSelectModel,
onToggleFavorite,
onShowInfo,
}: ModelPickerGroupProps = $props();
// Format storage size
function formatSize(mb: number | undefined): string {
if (!mb) return "";
if (mb >= 1024) {
return `${(mb / 1024).toFixed(0)}GB`;
}
return `${mb}MB`;
}
// Check if any variant can fit
const anyVariantFits = $derived(
group.variants.some((v) => canModelFit(v.id)),
);
// Check if this group's model is currently selected (for single-variant groups)
const isMainSelected = $derived(
!group.hasMultipleVariants &&
group.variants.some((v) => v.id === selectedModelId),
);
</script>
<div
class="border-b border-white/5 last:border-b-0 {!anyVariantFits
? 'opacity-50'
: ''}"
>
<!-- Main row -->
<div
class="flex items-center gap-2 px-3 py-2.5 transition-colors {anyVariantFits
? 'hover:bg-white/5 cursor-pointer'
: 'cursor-not-allowed'} {isMainSelected
? 'bg-exo-yellow/10 border-l-2 border-exo-yellow'
: 'border-l-2 border-transparent'}"
onclick={() => {
if (group.hasMultipleVariants) {
onToggleExpand();
} else {
const modelId = group.variants[0]?.id;
if (modelId && canModelFit(modelId)) {
onSelectModel(modelId);
}
}
}}
role="button"
tabindex="0"
onkeydown={(e) => {
if (e.key === "Enter" || e.key === " ") {
e.preventDefault();
if (group.hasMultipleVariants) {
onToggleExpand();
} else {
const modelId = group.variants[0]?.id;
if (modelId && canModelFit(modelId)) {
onSelectModel(modelId);
}
}
}
}}
>
<!-- Expand/collapse chevron (for groups with variants) -->
{#if group.hasMultipleVariants}
<svg
class="w-4 h-4 text-white/40 transition-transform duration-200 flex-shrink-0 {isExpanded
? 'rotate-90'
: ''}"
viewBox="0 0 24 24"
fill="currentColor"
>
<path d="M8.59 16.59L13.17 12 8.59 7.41 10 6l6 6-6 6-1.41-1.41z" />
</svg>
{:else}
<div class="w-4 flex-shrink-0"></div>
{/if}
<!-- Model name -->
<div class="flex-1 min-w-0">
<div class="flex items-center gap-2">
<span class="font-mono text-sm text-white truncate">
{group.name}
</span>
<!-- Capability icons -->
{#each group.capabilities.filter((c) => c !== "text") as cap}
{#if cap === "thinking"}
<svg
class="w-3.5 h-3.5 text-white/40 flex-shrink-0"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="1.5"
title="Supports Thinking"
>
<path
d="M12 2a7 7 0 0 0-7 7c0 2.38 1.19 4.47 3 5.74V17a1 1 0 0 0 1 1h6a1 1 0 0 0 1-1v-2.26c1.81-1.27 3-3.36 3-5.74a7 7 0 0 0-7-7zM9 20h6M10 22h4"
stroke-linecap="round"
stroke-linejoin="round"
/>
</svg>
{:else if cap === "code"}
<svg
class="w-3.5 h-3.5 text-white/40 flex-shrink-0"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="1.5"
title="Supports code generation"
>
<path
d="M16 18l6-6-6-6M8 6l-6 6 6 6"
stroke-linecap="round"
stroke-linejoin="round"
/>
</svg>
{:else if cap === "vision"}
<svg
class="w-3.5 h-3.5 text-white/40 flex-shrink-0"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="1.5"
title="Supports image input"
>
<path
d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"
stroke-linecap="round"
stroke-linejoin="round"
/>
<circle cx="12" cy="12" r="3" />
</svg>
{:else if cap === "image_gen"}
<svg
class="w-3.5 h-3.5 text-white/40 flex-shrink-0"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="1.5"
title="Supports image generation"
>
<rect x="3" y="3" width="18" height="18" rx="2" ry="2" />
<circle cx="8.5" cy="8.5" r="1.5" />
<path d="M21 15l-5-5L5 21" />
</svg>
{/if}
{/each}
</div>
</div>
<!-- Size indicator (smallest variant) -->
{#if !group.hasMultipleVariants && group.smallestVariant?.storage_size_megabytes}
<span class="text-xs font-mono text-white/30 flex-shrink-0">
{formatSize(group.smallestVariant.storage_size_megabytes)}
</span>
{/if}
<!-- Variant count -->
{#if group.hasMultipleVariants}
<span class="text-xs font-mono text-white/30 flex-shrink-0">
{group.variants.length} variants
</span>
{/if}
<!-- Check mark if selected (single-variant) -->
{#if isMainSelected}
<svg
class="w-4 h-4 text-exo-yellow flex-shrink-0"
viewBox="0 0 24 24"
fill="currentColor"
>
<path d="M9 16.17L4.83 12l-1.42 1.41L9 19 21 7l-1.41-1.41L9 16.17z" />
</svg>
{/if}
<!-- Favorite star -->
<button
type="button"
class="p-1 rounded hover:bg-white/10 transition-colors flex-shrink-0"
onclick={(e) => {
e.stopPropagation();
onToggleFavorite(group.id);
}}
title={isFavorite ? "Remove from favorites" : "Add to favorites"}
>
{#if isFavorite}
<svg
class="w-4 h-4 text-amber-400"
viewBox="0 0 24 24"
fill="currentColor"
>
<path
d="M12 2l3.09 6.26L22 9.27l-5 4.87 1.18 6.88L12 17.77l-6.18 3.25L7 14.14 2 9.27l6.91-1.01L12 2z"
/>
</svg>
{:else}
<svg
class="w-4 h-4 text-white/30 hover:text-white/50"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="2"
>
<path
d="M12 2l3.09 6.26L22 9.27l-5 4.87 1.18 6.88L12 17.77l-6.18 3.25L7 14.14 2 9.27l6.91-1.01L12 2z"
/>
</svg>
{/if}
</button>
<!-- Info button -->
<button
type="button"
class="p-1 rounded hover:bg-white/10 transition-colors flex-shrink-0"
onclick={(e) => {
e.stopPropagation();
onShowInfo(group);
}}
title="View model details"
>
<svg
class="w-4 h-4 text-white/30 hover:text-white/50"
viewBox="0 0 24 24"
fill="currentColor"
>
<path
d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm1 15h-2v-6h2v6zm0-8h-2V7h2v2z"
/>
</svg>
</button>
</div>
<!-- Expanded variants -->
{#if isExpanded && group.hasMultipleVariants}
<div class="bg-black/20 border-t border-white/5">
{#each group.variants as variant}
{@const modelCanFit = canModelFit(variant.id)}
{@const isSelected = selectedModelId === variant.id}
<button
type="button"
class="w-full flex items-center gap-3 px-3 py-2 pl-10 hover:bg-white/5 transition-colors text-left {!modelCanFit
? 'opacity-50 cursor-not-allowed'
: 'cursor-pointer'} {isSelected
? 'bg-exo-yellow/10 border-l-2 border-exo-yellow'
: 'border-l-2 border-transparent'}"
disabled={!modelCanFit}
onclick={() => {
if (modelCanFit) {
onSelectModel(variant.id);
}
}}
>
<!-- Quantization badge -->
<span
class="text-xs font-mono px-1.5 py-0.5 rounded bg-white/10 text-white/70 flex-shrink-0"
>
{variant.quantization || "default"}
</span>
<!-- Size -->
<span class="text-xs font-mono text-white/40 flex-1">
{formatSize(variant.storage_size_megabytes)}
</span>
<!-- Check mark if selected -->
{#if isSelected}
<svg
class="w-4 h-4 text-exo-yellow"
viewBox="0 0 24 24"
fill="currentColor"
>
<path
d="M9 16.17L4.83 12l-1.42 1.41L9 19 21 7l-1.41-1.41L9 16.17z"
/>
</svg>
{/if}
</button>
{/each}
</div>
{/if}
</div>

View File

@@ -0,0 +1,748 @@
<script lang="ts">
import { fade, fly } from "svelte/transition";
import { cubicOut } from "svelte/easing";
import FamilySidebar from "./FamilySidebar.svelte";
import ModelPickerGroup from "./ModelPickerGroup.svelte";
import ModelFilterPopover from "./ModelFilterPopover.svelte";
import HuggingFaceResultItem from "./HuggingFaceResultItem.svelte";
interface ModelInfo {
id: string;
name?: string;
storage_size_megabytes?: number;
base_model?: string;
quantization?: string;
supports_tensor?: boolean;
capabilities?: string[];
family?: string;
is_custom?: boolean;
tasks?: string[];
hugging_face_id?: string;
}
interface ModelGroup {
id: string;
name: string;
capabilities: string[];
family: string;
variants: ModelInfo[];
smallestVariant: ModelInfo;
hasMultipleVariants: boolean;
}
interface FilterState {
capabilities: string[];
sizeRange: { min: number; max: number } | null;
}
interface HuggingFaceModel {
id: string;
author: string;
downloads: number;
likes: number;
last_modified: string;
tags: string[];
}
type ModelPickerModalProps = {
isOpen: boolean;
models: ModelInfo[];
selectedModelId: string | null;
favorites: Set<string>;
existingModelIds: Set<string>;
canModelFit: (modelId: string) => boolean;
onSelect: (modelId: string) => void;
onClose: () => void;
onToggleFavorite: (baseModelId: string) => void;
onAddModel: (modelId: string) => Promise<void>;
onDeleteModel: (modelId: string) => Promise<void>;
totalMemoryGB: number;
usedMemoryGB: number;
};
let {
isOpen,
models,
selectedModelId,
favorites,
existingModelIds,
canModelFit,
onSelect,
onClose,
onToggleFavorite,
onAddModel,
onDeleteModel,
totalMemoryGB,
usedMemoryGB,
}: ModelPickerModalProps = $props();
// Local state
let searchQuery = $state("");
let selectedFamily = $state<string | null>(null);
let expandedGroups = $state<Set<string>>(new Set());
let showFilters = $state(false);
let filters = $state<FilterState>({ capabilities: [], sizeRange: null });
let infoGroup = $state<ModelGroup | null>(null);
// HuggingFace Hub state
let hfSearchQuery = $state("");
let hfSearchResults = $state<HuggingFaceModel[]>([]);
let hfTrendingModels = $state<HuggingFaceModel[]>([]);
let hfIsSearching = $state(false);
let hfIsLoadingTrending = $state(false);
let addingModelId = $state<string | null>(null);
let hfSearchDebounceTimer: ReturnType<typeof setTimeout> | null = null;
let manualModelId = $state("");
let addModelError = $state<string | null>(null);
// Reset state when modal opens
$effect(() => {
if (isOpen) {
searchQuery = "";
selectedFamily = null;
expandedGroups = new Set();
showFilters = false;
hfSearchQuery = "";
hfSearchResults = [];
manualModelId = "";
addModelError = null;
}
});
// Fetch trending models when HuggingFace is selected
$effect(() => {
if (
selectedFamily === "huggingface" &&
hfTrendingModels.length === 0 &&
!hfIsLoadingTrending
) {
fetchTrendingModels();
}
});
async function fetchTrendingModels() {
hfIsLoadingTrending = true;
try {
const response = await fetch("/models/search?query=&limit=20");
if (response.ok) {
hfTrendingModels = await response.json();
}
} catch (error) {
console.error("Failed to fetch trending models:", error);
} finally {
hfIsLoadingTrending = false;
}
}
async function searchHuggingFace(query: string) {
if (query.length < 2) {
hfSearchResults = [];
return;
}
hfIsSearching = true;
try {
const response = await fetch(
`/models/search?query=${encodeURIComponent(query)}&limit=20`,
);
if (response.ok) {
hfSearchResults = await response.json();
} else {
hfSearchResults = [];
}
} catch (error) {
console.error("Failed to search models:", error);
hfSearchResults = [];
} finally {
hfIsSearching = false;
}
}
function handleHfSearchInput(query: string) {
hfSearchQuery = query;
addModelError = null;
if (hfSearchDebounceTimer) {
clearTimeout(hfSearchDebounceTimer);
}
if (query.length >= 2) {
hfSearchDebounceTimer = setTimeout(() => {
searchHuggingFace(query);
}, 300);
} else {
hfSearchResults = [];
}
}
async function handleAddModel(modelId: string) {
addingModelId = modelId;
addModelError = null;
try {
await onAddModel(modelId);
} catch (error) {
addModelError =
error instanceof Error ? error.message : "Failed to add model";
} finally {
addingModelId = null;
}
}
async function handleAddManualModel() {
if (!manualModelId.trim()) return;
await handleAddModel(manualModelId.trim());
if (!addModelError) {
manualModelId = "";
}
}
function handleSelectHfModel(modelId: string) {
onSelect(modelId);
onClose();
}
// Models to display in HuggingFace view
const hfDisplayModels = $derived.by((): HuggingFaceModel[] => {
if (hfSearchQuery.length >= 2) {
return hfSearchResults;
}
return hfTrendingModels;
});
// Group models by base_model
const groupedModels = $derived.by((): ModelGroup[] => {
const groups = new Map<string, ModelGroup>();
for (const model of models) {
const groupId = model.base_model || model.id;
const groupName = model.base_model || model.name || model.id;
if (!groups.has(groupId)) {
groups.set(groupId, {
id: groupId,
name: groupName,
capabilities: model.capabilities || ["text"],
family: model.family || "",
variants: [],
smallestVariant: model,
hasMultipleVariants: false,
});
}
const group = groups.get(groupId)!;
group.variants.push(model);
// Track smallest variant
if (
(model.storage_size_megabytes || 0) <
(group.smallestVariant.storage_size_megabytes || Infinity)
) {
group.smallestVariant = model;
}
// Update capabilities if not set
if (
group.capabilities.length <= 1 &&
model.capabilities &&
model.capabilities.length > 1
) {
group.capabilities = model.capabilities;
}
if (!group.family && model.family) {
group.family = model.family;
}
}
// Sort variants within each group by size
for (const group of groups.values()) {
group.variants.sort(
(a, b) =>
(a.storage_size_megabytes || 0) - (b.storage_size_megabytes || 0),
);
group.hasMultipleVariants = group.variants.length > 1;
}
// Convert to array and sort by smallest variant size (biggest first)
return Array.from(groups.values()).sort((a, b) => {
return (
(b.smallestVariant.storage_size_megabytes || 0) -
(a.smallestVariant.storage_size_megabytes || 0)
);
});
});
// Get unique families
const uniqueFamilies = $derived.by((): string[] => {
const families = new Set<string>();
for (const group of groupedModels) {
if (group.family) {
families.add(group.family);
}
}
const familyOrder = [
"kimi",
"qwen",
"glm",
"minimax",
"deepseek",
"gpt-oss",
"llama",
];
return Array.from(families).sort((a, b) => {
const aIdx = familyOrder.indexOf(a);
const bIdx = familyOrder.indexOf(b);
if (aIdx === -1 && bIdx === -1) return a.localeCompare(b);
if (aIdx === -1) return 1;
if (bIdx === -1) return -1;
return aIdx - bIdx;
});
});
// Filter models based on search, family, and filters
const filteredGroups = $derived.by((): ModelGroup[] => {
let result: ModelGroup[] = [...groupedModels];
// Filter by family
if (selectedFamily === "favorites") {
result = result.filter((g) => favorites.has(g.id));
} else if (selectedFamily && selectedFamily !== "huggingface") {
result = result.filter((g) => g.family === selectedFamily);
}
// Filter by search query
if (searchQuery.trim()) {
const query = searchQuery.toLowerCase().trim();
result = result.filter(
(g) =>
g.name.toLowerCase().includes(query) ||
g.variants.some(
(v) =>
v.id.toLowerCase().includes(query) ||
(v.name || "").toLowerCase().includes(query),
),
);
}
// Filter by capabilities
if (filters.capabilities.length > 0) {
result = result.filter((g) =>
filters.capabilities.every((cap) => g.capabilities.includes(cap)),
);
}
// Filter by size range
if (filters.sizeRange) {
const { min, max } = filters.sizeRange;
result = result.filter((g) => {
const sizeGB = (g.smallestVariant.storage_size_megabytes || 0) / 1024;
return sizeGB >= min && sizeGB <= max;
});
}
// Sort: models that fit first, then by size (largest first)
result.sort((a, b) => {
const aFits = a.variants.some((v) => canModelFit(v.id));
const bFits = b.variants.some((v) => canModelFit(v.id));
if (aFits && !bFits) return -1;
if (!aFits && bFits) return 1;
return (
(b.smallestVariant.storage_size_megabytes || 0) -
(a.smallestVariant.storage_size_megabytes || 0)
);
});
return result;
});
// Check if any favorites exist
const hasFavorites = $derived(favorites.size > 0);
function toggleGroupExpanded(groupId: string) {
const next = new Set(expandedGroups);
if (next.has(groupId)) {
next.delete(groupId);
} else {
next.add(groupId);
}
expandedGroups = next;
}
function handleSelect(modelId: string) {
onSelect(modelId);
onClose();
}
function handleKeydown(e: KeyboardEvent) {
if (e.key === "Escape") {
onClose();
}
}
function handleFiltersChange(newFilters: FilterState) {
filters = newFilters;
}
function clearFilters() {
filters = { capabilities: [], sizeRange: null };
}
const hasActiveFilters = $derived(
filters.capabilities.length > 0 || filters.sizeRange !== null,
);
</script>
<svelte:window onkeydown={handleKeydown} />
{#if isOpen}
<!-- Backdrop -->
<div
class="fixed inset-0 z-50 bg-black/80 backdrop-blur-sm"
transition:fade={{ duration: 200 }}
onclick={onClose}
role="presentation"
></div>
<!-- Modal -->
<div
class="fixed z-50 top-1/2 left-1/2 -translate-x-1/2 -translate-y-1/2 w-[min(90vw,600px)] h-[min(80vh,700px)] bg-exo-dark-gray border border-exo-yellow/10 rounded-lg shadow-2xl overflow-hidden flex flex-col"
transition:fly={{ y: 20, duration: 300, easing: cubicOut }}
role="dialog"
aria-modal="true"
aria-label="Select a model"
>
<!-- Header with search -->
<div
class="flex items-center gap-2 p-3 border-b border-exo-yellow/10 bg-exo-medium-gray/30"
>
{#if selectedFamily === "huggingface"}
<!-- HuggingFace search -->
<svg
class="w-5 h-5 text-orange-400/60 flex-shrink-0"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="2"
>
<circle cx="11" cy="11" r="8" />
<path d="M21 21l-4.35-4.35" />
</svg>
<input
type="search"
class="flex-1 bg-transparent border-none outline-none text-sm font-mono text-white placeholder-white/40"
placeholder="Search mlx-community models..."
value={hfSearchQuery}
oninput={(e) => handleHfSearchInput(e.currentTarget.value)}
/>
{#if hfIsSearching}
<div class="flex-shrink-0">
<span
class="w-4 h-4 border-2 border-orange-400 border-t-transparent rounded-full animate-spin block"
></span>
</div>
{/if}
{:else}
<!-- Normal model search -->
<svg
class="w-5 h-5 text-white/40 flex-shrink-0"
viewBox="0 0 24 24"
fill="none"
stroke="currentColor"
stroke-width="2"
>
<circle cx="11" cy="11" r="8" />
<path d="M21 21l-4.35-4.35" />
</svg>
<input
type="search"
class="flex-1 bg-transparent border-none outline-none text-sm font-mono text-white placeholder-white/40"
placeholder="Search models..."
bind:value={searchQuery}
/>
<!-- Cluster memory -->
<span
class="text-xs font-mono flex-shrink-0"
title="Cluster memory usage"
><span class="text-exo-yellow">{Math.round(usedMemoryGB)}GB</span
><span class="text-white/40">/{Math.round(totalMemoryGB)}GB</span
></span
>
<!-- Filter button -->
<div class="relative filter-toggle">
<button
type="button"
class="p-1.5 rounded hover:bg-white/10 transition-colors {hasActiveFilters
? 'text-exo-yellow'
: 'text-white/50'}"
onclick={() => (showFilters = !showFilters)}
title="Filter by capability or size"
>
<svg class="w-5 h-5" viewBox="0 0 24 24" fill="currentColor">
<path d="M10 18h4v-2h-4v2zM3 6v2h18V6H3zm3 7h12v-2H6v2z" />
</svg>
</button>
{#if showFilters}
<ModelFilterPopover
{filters}
onChange={handleFiltersChange}
onClear={clearFilters}
onClose={() => (showFilters = false)}
/>
{/if}
</div>
{/if}
<!-- Close button -->
<button
type="button"
class="p-1.5 rounded hover:bg-white/10 transition-colors text-white/50 hover:text-white/70"
onclick={onClose}
title="Close model picker"
>
<svg class="w-5 h-5" viewBox="0 0 24 24" fill="currentColor">
<path
d="M19 6.41L17.59 5 12 10.59 6.41 5 5 6.41 10.59 12 5 17.59 6.41 19 12 13.41 17.59 19 19 17.59 13.41 12 19 6.41z"
/>
</svg>
</button>
</div>
<!-- Body -->
<div class="flex flex-1 overflow-hidden">
<!-- Family sidebar -->
<FamilySidebar
families={uniqueFamilies}
{selectedFamily}
{hasFavorites}
onSelect={(family) => (selectedFamily = family)}
/>
<!-- Model list -->
<div class="flex-1 overflow-y-auto flex flex-col">
{#if selectedFamily === "huggingface"}
<!-- HuggingFace Hub view -->
<div class="flex-1 flex flex-col min-h-0">
<!-- Section header -->
<div
class="sticky top-0 z-10 px-3 py-2 bg-exo-dark-gray/95 border-b border-exo-yellow/10"
>
<span class="text-xs font-mono text-white/40">
{#if hfSearchQuery.length >= 2}
Search results for "{hfSearchQuery}"
{:else}
Trending on mlx-community
{/if}
</span>
</div>
<!-- Results list -->
<div class="flex-1 overflow-y-auto">
{#if hfIsLoadingTrending && hfTrendingModels.length === 0}
<div
class="flex items-center justify-center py-12 text-white/40"
>
<span
class="w-5 h-5 border-2 border-orange-400 border-t-transparent rounded-full animate-spin mr-2"
></span>
<span class="font-mono text-sm"
>Loading trending models...</span
>
</div>
{:else if hfDisplayModels.length === 0}
<div
class="flex flex-col items-center justify-center py-12 text-white/40"
>
<svg
class="w-10 h-10 mb-2"
viewBox="0 0 24 24"
fill="currentColor"
>
<path
d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-2 13.5c-.83 0-1.5-.67-1.5-1.5s.67-1.5 1.5-1.5 1.5.67 1.5 1.5-.67 1.5-1.5 1.5zm4 0c-.83 0-1.5-.67-1.5-1.5s.67-1.5 1.5-1.5 1.5.67 1.5 1.5-.67 1.5-1.5 1.5zm2-4.5H8c0-2.21 1.79-4 4-4s4 1.79 4 4z"
/>
</svg>
<p class="font-mono text-sm">No models found</p>
{#if hfSearchQuery}
<p class="font-mono text-xs mt-1">
Try a different search term
</p>
{/if}
</div>
{:else}
{#each hfDisplayModels as model}
<HuggingFaceResultItem
{model}
isAdded={existingModelIds.has(model.id)}
isAdding={addingModelId === model.id}
onAdd={() => handleAddModel(model.id)}
onSelect={() => handleSelectHfModel(model.id)}
/>
{/each}
{/if}
</div>
<!-- Manual input footer -->
<div
class="sticky bottom-0 border-t border-exo-yellow/10 bg-exo-dark-gray p-3"
>
{#if addModelError}
<div
class="bg-red-500/10 border border-red-500/30 rounded px-3 py-2 mb-2"
>
<p class="text-red-400 text-xs font-mono break-words">
{addModelError}
</p>
</div>
{/if}
<div class="flex gap-2">
<input
type="text"
class="flex-1 bg-exo-black/60 border border-exo-yellow/30 rounded px-3 py-1.5 text-xs font-mono text-white placeholder-white/30 focus:outline-none focus:border-exo-yellow/50"
placeholder="Or paste model ID directly..."
bind:value={manualModelId}
onkeydown={(e) => {
if (e.key === "Enter") handleAddManualModel();
}}
/>
<button
type="button"
onclick={handleAddManualModel}
disabled={!manualModelId.trim() || addingModelId !== null}
class="px-3 py-1.5 text-xs font-mono tracking-wider uppercase bg-orange-500/10 text-orange-400 border border-orange-400/30 hover:bg-orange-500/20 transition-colors rounded disabled:opacity-50 disabled:cursor-not-allowed"
>
Add
</button>
</div>
</div>
</div>
{:else if filteredGroups.length === 0}
<div
class="flex flex-col items-center justify-center h-full text-white/40 p-8"
>
<svg class="w-12 h-12 mb-3" viewBox="0 0 24 24" fill="currentColor">
<path
d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-2 15l-5-5 1.41-1.41L10 14.17l7.59-7.59L19 8l-9 9z"
/>
</svg>
<p class="font-mono text-sm">No models found</p>
{#if hasActiveFilters || searchQuery}
<button
type="button"
class="mt-2 text-xs text-exo-yellow hover:underline"
onclick={() => {
searchQuery = "";
clearFilters();
}}
>
Clear filters
</button>
{/if}
</div>
{:else}
{#each filteredGroups as group}
<ModelPickerGroup
{group}
isExpanded={expandedGroups.has(group.id)}
isFavorite={favorites.has(group.id)}
{selectedModelId}
{canModelFit}
onToggleExpand={() => toggleGroupExpanded(group.id)}
onSelectModel={handleSelect}
{onToggleFavorite}
onShowInfo={(g) => (infoGroup = g)}
/>
{/each}
{/if}
</div>
</div>
<!-- Footer with active filters indicator -->
{#if hasActiveFilters}
<div
class="flex items-center gap-2 px-3 py-2 border-t border-exo-yellow/10 bg-exo-medium-gray/20 text-xs font-mono text-white/50"
>
<span>Filters:</span>
{#each filters.capabilities as cap}
<span class="px-1.5 py-0.5 bg-exo-yellow/20 text-exo-yellow rounded"
>{cap}</span
>
{/each}
{#if filters.sizeRange}
<span class="px-1.5 py-0.5 bg-exo-yellow/20 text-exo-yellow rounded">
{filters.sizeRange.min}GB - {filters.sizeRange.max}GB
</span>
{/if}
<button
type="button"
class="ml-auto text-white/40 hover:text-white/60"
onclick={clearFilters}
>
Clear all
</button>
</div>
{/if}
</div>
<!-- Info modal -->
{#if infoGroup}
<div
class="fixed inset-0 z-[60] bg-black/60"
transition:fade={{ duration: 150 }}
onclick={() => (infoGroup = null)}
role="presentation"
></div>
<div
class="fixed z-[60] top-1/2 left-1/2 -translate-x-1/2 -translate-y-1/2 w-[min(80vw,400px)] bg-exo-dark-gray border border-exo-yellow/10 rounded-lg shadow-2xl p-4"
transition:fly={{ y: 10, duration: 200, easing: cubicOut }}
role="dialog"
aria-modal="true"
>
<div class="flex items-start justify-between mb-3">
<h3 class="font-mono text-lg text-white">{infoGroup.name}</h3>
<button
type="button"
class="p-1 rounded hover:bg-white/10 transition-colors text-white/50"
onclick={() => (infoGroup = null)}
title="Close model details"
aria-label="Close info dialog"
>
<svg class="w-4 h-4" viewBox="0 0 24 24" fill="currentColor">
<path
d="M19 6.41L17.59 5 12 10.59 6.41 5 5 6.41 10.59 12 5 17.59 6.41 19 12 13.41 17.59 19 19 17.59 13.41 12 19 6.41z"
/>
</svg>
</button>
</div>
<div class="space-y-2 text-xs font-mono">
<div class="flex items-center gap-2">
<span class="text-white/40">Family:</span>
<span class="text-white/70">{infoGroup.family || "Unknown"}</span>
</div>
<div class="flex items-center gap-2">
<span class="text-white/40">Capabilities:</span>
<span class="text-white/70">{infoGroup.capabilities.join(", ")}</span>
</div>
<div class="flex items-center gap-2">
<span class="text-white/40">Variants:</span>
<span class="text-white/70">{infoGroup.variants.length}</span>
</div>
{#if infoGroup.variants.length > 0}
<div class="mt-3 pt-3 border-t border-exo-yellow/10">
<span class="text-white/40">Available quantizations:</span>
<div class="flex flex-wrap gap-1 mt-1">
{#each infoGroup.variants as variant}
<span
class="px-1.5 py-0.5 bg-white/10 text-white/60 rounded text-[10px]"
>
{variant.quantization || "default"} ({Math.round(
(variant.storage_size_megabytes || 0) / 1024,
)}GB)
</span>
{/each}
</div>
</div>
{/if}
</div>
</div>
{/if}
{/if}

View File

@@ -6,3 +6,9 @@ export { default as ChatSidebar } from "./ChatSidebar.svelte";
export { default as ModelCard } from "./ModelCard.svelte";
export { default as MarkdownContent } from "./MarkdownContent.svelte";
export { default as ImageParamsPanel } from "./ImageParamsPanel.svelte";
export { default as FamilyLogos } from "./FamilyLogos.svelte";
export { default as FamilySidebar } from "./FamilySidebar.svelte";
export { default as HuggingFaceResultItem } from "./HuggingFaceResultItem.svelte";
export { default as ModelFilterPopover } from "./ModelFilterPopover.svelte";
export { default as ModelPickerGroup } from "./ModelPickerGroup.svelte";
export { default as ModelPickerModal } from "./ModelPickerModal.svelte";

View File

@@ -178,6 +178,36 @@ interface ImageApiResponse {
data: Array<{ b64_json?: string; url?: string }>;
}
// Trace API response types
export interface TraceCategoryStats {
totalUs: number;
count: number;
minUs: number;
maxUs: number;
avgUs: number;
}
export interface TraceRankStats {
byCategory: Record<string, TraceCategoryStats>;
}
export interface TraceStatsResponse {
taskId: string;
totalWallTimeUs: number;
byCategory: Record<string, TraceCategoryStats>;
byRank: Record<number, TraceRankStats>;
}
export interface TraceListItem {
taskId: string;
createdAt: string;
fileSize: number;
}
export interface TraceListResponse {
traces: TraceListItem[];
}
interface RawStateResponse {
topology?: RawTopology;
instances?: Record<
@@ -2555,6 +2585,49 @@ class AppStore {
throw error;
}
}
/**
* List all available traces
*/
async listTraces(): Promise<TraceListResponse> {
const response = await fetch("/v1/traces");
if (!response.ok) {
throw new Error(`Failed to list traces: ${response.status}`);
}
return (await response.json()) as TraceListResponse;
}
/**
* Check if a trace exists for a given task ID
*/
async checkTraceExists(taskId: string): Promise<boolean> {
try {
const response = await fetch(`/v1/traces/${encodeURIComponent(taskId)}`);
return response.ok;
} catch {
return false;
}
}
/**
* Get computed statistics for a task's trace
*/
async fetchTraceStats(taskId: string): Promise<TraceStatsResponse> {
const response = await fetch(
`/v1/traces/${encodeURIComponent(taskId)}/stats`,
);
if (!response.ok) {
throw new Error(`Failed to fetch trace stats: ${response.status}`);
}
return (await response.json()) as TraceStatsResponse;
}
/**
* Get the URL for the raw trace file (for Perfetto)
*/
getTraceRawUrl(taskId: string): string {
return `/v1/traces/${encodeURIComponent(taskId)}/raw`;
}
}
export const appStore = new AppStore();
@@ -2666,3 +2739,12 @@ export const startDownload = (nodeId: string, shardMetadata: object) =>
appStore.startDownload(nodeId, shardMetadata);
export const deleteDownload = (nodeId: string, modelId: string) =>
appStore.deleteDownload(nodeId, modelId);
// Trace actions
export const listTraces = () => appStore.listTraces();
export const checkTraceExists = (taskId: string) =>
appStore.checkTraceExists(taskId);
export const fetchTraceStats = (taskId: string) =>
appStore.fetchTraceStats(taskId);
export const getTraceRawUrl = (taskId: string) =>
appStore.getTraceRawUrl(taskId);

View File

@@ -0,0 +1,97 @@
/**
* FavoritesStore - Manages favorite models with localStorage persistence
*/
import { browser } from "$app/environment";
const FAVORITES_KEY = "exo-favorite-models";
class FavoritesStore {
favorites = $state<Set<string>>(new Set());
constructor() {
if (browser) {
this.loadFromStorage();
}
}
private loadFromStorage() {
try {
const stored = localStorage.getItem(FAVORITES_KEY);
if (stored) {
const parsed = JSON.parse(stored) as string[];
this.favorites = new Set(parsed);
}
} catch (error) {
console.error("Failed to load favorites:", error);
}
}
private saveToStorage() {
try {
const array = Array.from(this.favorites);
localStorage.setItem(FAVORITES_KEY, JSON.stringify(array));
} catch (error) {
console.error("Failed to save favorites:", error);
}
}
add(baseModelId: string) {
const next = new Set(this.favorites);
next.add(baseModelId);
this.favorites = next;
this.saveToStorage();
}
remove(baseModelId: string) {
const next = new Set(this.favorites);
next.delete(baseModelId);
this.favorites = next;
this.saveToStorage();
}
toggle(baseModelId: string) {
if (this.favorites.has(baseModelId)) {
this.remove(baseModelId);
} else {
this.add(baseModelId);
}
}
isFavorite(baseModelId: string): boolean {
return this.favorites.has(baseModelId);
}
getAll(): string[] {
return Array.from(this.favorites);
}
getSet(): Set<string> {
return new Set(this.favorites);
}
hasAny(): boolean {
return this.favorites.size > 0;
}
clearAll() {
this.favorites = new Set();
this.saveToStorage();
}
}
export const favoritesStore = new FavoritesStore();
export const favorites = () => favoritesStore.favorites;
export const hasFavorites = () => favoritesStore.hasAny();
export const isFavorite = (baseModelId: string) =>
favoritesStore.isFavorite(baseModelId);
export const toggleFavorite = (baseModelId: string) =>
favoritesStore.toggle(baseModelId);
export const addFavorite = (baseModelId: string) =>
favoritesStore.add(baseModelId);
export const removeFavorite = (baseModelId: string) =>
favoritesStore.remove(baseModelId);
export const getFavorites = () => favoritesStore.getAll();
export const getFavoritesSet = () => favoritesStore.getSet();
export const clearFavorites = () => favoritesStore.clearAll();

View File

@@ -5,7 +5,13 @@
ChatMessages,
ChatSidebar,
ModelCard,
ModelPickerModal,
} from "$lib/components";
import {
favorites,
toggleFavorite,
getFavoritesSet,
} from "$lib/stores/favorites.svelte";
import {
hasStartedChat,
isTopologyMinimized,
@@ -100,6 +106,11 @@
storage_size_megabytes?: number;
tasks?: string[];
hugging_face_id?: string;
is_custom?: boolean;
family?: string;
quantization?: string;
base_model?: string;
capabilities?: string[];
}>
>([]);
@@ -211,9 +222,11 @@
let launchingModelId = $state<string | null>(null);
let instanceDownloadExpandedNodes = $state<Set<string>>(new Set());
// Custom dropdown state
let isModelDropdownOpen = $state(false);
let modelDropdownSearch = $state("");
// Model picker modal state
let isModelPickerOpen = $state(false);
// Favorites state (reactive)
const favoritesSet = $derived(getFavoritesSet());
// Slider dragging state
let isDraggingSlider = $state(false);
@@ -530,6 +543,47 @@
}
}
async function addModelFromPicker(modelId: string) {
const response = await fetch("/models/add", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ model_id: modelId }),
});
if (!response.ok) {
let message = `Failed to add model (${response.status}: ${response.statusText})`;
try {
const err = await response.json();
if (err.detail) message = err.detail;
} catch {
// use default message
}
throw new Error(message);
}
await fetchModels();
}
async function deleteCustomModel(modelId: string) {
try {
const response = await fetch(
`/models/custom/${encodeURIComponent(modelId)}`,
{ method: "DELETE" },
);
if (response.ok) {
await fetchModels();
}
} catch {
console.error("Failed to delete custom model");
}
}
function handleModelPickerSelect(modelId: string) {
selectPreviewModel(modelId);
saveLaunchDefaults();
isModelPickerOpen = false;
}
async function launchInstance(
modelId: string,
specificPreview?: PlacementPreview | null,
@@ -2360,14 +2414,12 @@
>
</div>
<!-- Model Dropdown (Custom) -->
<div class="flex-shrink-0 mb-3 relative">
<!-- Model Picker Button -->
<div class="flex-shrink-0 mb-3">
<button
type="button"
onclick={() => (isModelDropdownOpen = !isModelDropdownOpen)}
class="w-full bg-exo-medium-gray/50 border border-exo-yellow/30 rounded pl-3 pr-8 py-2.5 text-sm font-mono text-left tracking-wide cursor-pointer transition-all duration-200 hover:border-exo-yellow/50 focus:outline-none focus:border-exo-yellow/70 {isModelDropdownOpen
? 'border-exo-yellow/70'
: ''}"
onclick={() => (isModelPickerOpen = true)}
class="w-full bg-exo-medium-gray/50 border border-exo-yellow/30 rounded pl-3 pr-8 py-2.5 text-sm font-mono text-left tracking-wide cursor-pointer transition-all duration-200 hover:border-exo-yellow/50 focus:outline-none focus:border-exo-yellow/70 relative"
>
{#if selectedModelId}
{@const foundModel = models.find(
@@ -2375,54 +2427,12 @@
)}
{#if foundModel}
{@const sizeGB = getModelSizeGB(foundModel)}
{@const isImageModel = modelSupportsImageGeneration(
foundModel.id,
)}
{@const isImageEditModel = modelSupportsImageEditing(
foundModel.id,
)}
<span
class="flex items-center justify-between gap-2 w-full pr-4"
>
<span
class="flex items-center gap-2 text-exo-light-gray truncate"
>
{#if isImageModel}
<svg
class="w-4 h-4 flex-shrink-0 text-exo-yellow"
fill="none"
viewBox="0 0 24 24"
stroke="currentColor"
stroke-width="2"
>
<rect
x="3"
y="3"
width="18"
height="18"
rx="2"
ry="2"
/>
<circle cx="8.5" cy="8.5" r="1.5" />
<polyline points="21 15 16 10 5 21" />
</svg>
{/if}
{#if isImageEditModel}
<svg
class="w-4 h-4 flex-shrink-0 text-exo-yellow"
fill="none"
viewBox="0 0 24 24"
stroke="currentColor"
stroke-width="2"
>
<path
d="M11 4H4a2 2 0 0 0-2 2v14a2 2 0 0 0 2 2h14a2 2 0 0 0 2-2v-7"
/>
<path
d="M18.5 2.5a2.121 2.121 0 0 1 3 3L12 15l-4 1 1-4 9.5-9.5z"
/>
</svg>
{/if}
<span class="truncate"
>{foundModel.name || foundModel.id}</span
>
@@ -2439,142 +2449,24 @@
{:else}
<span class="text-white/50"> SELECT MODEL </span>
{/if}
</button>
<div
class="absolute right-3 top-1/2 -translate-y-1/2 pointer-events-none transition-transform duration-200 {isModelDropdownOpen
? 'rotate-180'
: ''}"
>
<svg
class="w-4 h-4 text-exo-yellow/60"
fill="none"
viewBox="0 0 24 24"
stroke="currentColor"
>
<path
stroke-linecap="round"
stroke-linejoin="round"
stroke-width="2"
d="M19 9l-7 7-7-7"
/>
</svg>
</div>
{#if isModelDropdownOpen}
<!-- Backdrop to close dropdown -->
<button
type="button"
class="fixed inset-0 z-40 cursor-default"
onclick={() => (isModelDropdownOpen = false)}
aria-label="Close dropdown"
></button>
<!-- Dropdown Panel -->
<div
class="absolute top-full left-0 right-0 mt-1 bg-exo-dark-gray border border-exo-yellow/30 rounded shadow-lg shadow-black/50 z-50 max-h-64 overflow-y-auto"
class="absolute right-3 top-1/2 -translate-y-1/2 pointer-events-none"
>
<!-- Search within dropdown -->
<div
class="sticky top-0 bg-exo-dark-gray border-b border-exo-medium-gray/30 p-2"
<svg
class="w-4 h-4 text-exo-yellow/60"
fill="none"
viewBox="0 0 24 24"
stroke="currentColor"
>
<input
type="text"
placeholder="Search models..."
bind:value={modelDropdownSearch}
class="w-full bg-exo-dark-gray/60 border border-exo-medium-gray/30 rounded px-2 py-1.5 text-xs font-mono text-white/80 placeholder:text-white/40 focus:outline-none focus:border-exo-yellow/50"
<path
stroke-linecap="round"
stroke-linejoin="round"
stroke-width="2"
d="M19 9l-7 7-7-7"
/>
</div>
<!-- Options -->
<div class="py-1">
{#each sortedModels().filter((m) => !modelDropdownSearch || (m.name || m.id)
.toLowerCase()
.includes(modelDropdownSearch.toLowerCase())) as model}
{@const sizeGB = getModelSizeGB(model)}
{@const modelCanFit = hasEnoughMemory(model)}
{@const isImageModel = modelSupportsImageGeneration(
model.id,
)}
{@const isImageEditModel = modelSupportsImageEditing(
model.id,
)}
<button
type="button"
onclick={() => {
if (modelCanFit) {
selectPreviewModel(model.id);
saveLaunchDefaults();
isModelDropdownOpen = false;
modelDropdownSearch = "";
}
}}
disabled={!modelCanFit}
class="w-full px-3 py-2 text-left text-sm font-mono tracking-wide transition-colors duration-100 flex items-center justify-between gap-2 {selectedModelId ===
model.id
? 'bg-transparent text-exo-yellow cursor-pointer'
: modelCanFit
? 'text-white/80 hover:text-exo-yellow cursor-pointer'
: 'text-white/30 cursor-default'}"
>
<span class="flex items-center gap-2 truncate flex-1">
{#if isImageModel}
<svg
class="w-4 h-4 flex-shrink-0 text-exo-yellow"
fill="none"
viewBox="0 0 24 24"
stroke="currentColor"
stroke-width="2"
aria-label="Image generation model"
>
<rect
x="3"
y="3"
width="18"
height="18"
rx="2"
ry="2"
/>
<circle cx="8.5" cy="8.5" r="1.5" />
<polyline points="21 15 16 10 5 21" />
</svg>
{/if}
{#if isImageEditModel}
<svg
class="w-4 h-4 flex-shrink-0 text-exo-yellow"
fill="none"
viewBox="0 0 24 24"
stroke="currentColor"
stroke-width="2"
aria-label="Image editing model"
>
<path
d="M11 4H4a2 2 0 0 0-2 2v14a2 2 0 0 0 2 2h14a2 2 0 0 0 2-2v-7"
/>
<path
d="M18.5 2.5a2.121 2.121 0 0 1 3 3L12 15l-4 1 1-4 9.5-9.5z"
/>
</svg>
{/if}
<span class="truncate">{model.name || model.id}</span>
</span>
<span
class="flex-shrink-0 text-xs {modelCanFit
? 'text-white/50'
: 'text-red-400/60'}"
>
{sizeGB >= 1
? sizeGB.toFixed(0)
: sizeGB.toFixed(1)}GB
</span>
</button>
{:else}
<div class="px-3 py-2 text-xs text-white/50 font-mono">
No models found
</div>
{/each}
</div>
</svg>
</div>
{/if}
</button>
</div>
<!-- Configuration Options -->
@@ -3354,3 +3246,22 @@
{/if}
</main>
</div>
<ModelPickerModal
isOpen={isModelPickerOpen}
{models}
{selectedModelId}
favorites={favoritesSet}
existingModelIds={new Set(models.map((m) => m.id))}
canModelFit={(modelId) => {
const model = models.find((m) => m.id === modelId);
return model ? hasEnoughMemory(model) : false;
}}
onSelect={handleModelPickerSelect}
onClose={() => (isModelPickerOpen = false)}
onToggleFavorite={toggleFavorite}
onAddModel={addModelFromPicker}
onDeleteModel={deleteCustomModel}
totalMemoryGB={clusterMemory().total / (1024 * 1024 * 1024)}
usedMemoryGB={clusterMemory().used / (1024 * 1024 * 1024)}
/>

View File

@@ -0,0 +1,190 @@
<script lang="ts">
import { onMount } from "svelte";
import {
listTraces,
getTraceRawUrl,
type TraceListItem,
} from "$lib/stores/app.svelte";
import HeaderNav from "$lib/components/HeaderNav.svelte";
let traces = $state<TraceListItem[]>([]);
let loading = $state(true);
let error = $state<string | null>(null);
function formatBytes(bytes: number): string {
if (!bytes || bytes <= 0) return "0B";
const units = ["B", "KB", "MB", "GB"];
const i = Math.min(
Math.floor(Math.log(bytes) / Math.log(1024)),
units.length - 1,
);
const val = bytes / Math.pow(1024, i);
return `${val.toFixed(val >= 10 ? 0 : 1)}${units[i]}`;
}
function formatDate(isoString: string): string {
const date = new Date(isoString);
return date.toLocaleString();
}
async function downloadTrace(taskId: string) {
const response = await fetch(getTraceRawUrl(taskId));
const blob = await response.blob();
const url = URL.createObjectURL(blob);
const a = document.createElement("a");
a.href = url;
a.download = `trace_${taskId}.json`;
a.click();
URL.revokeObjectURL(url);
}
async function openInPerfetto(taskId: string) {
// Fetch trace data from our local API
const response = await fetch(getTraceRawUrl(taskId));
const traceData = await response.arrayBuffer();
// Open Perfetto UI
const perfettoWindow = window.open("https://ui.perfetto.dev");
if (!perfettoWindow) {
alert("Failed to open Perfetto. Please allow popups.");
return;
}
// Wait for Perfetto to be ready, then send trace via postMessage
const onMessage = (e: MessageEvent) => {
if (e.data === "PONG") {
window.removeEventListener("message", onMessage);
perfettoWindow.postMessage(
{
perfetto: {
buffer: traceData,
title: `Trace ${taskId}`,
},
},
"https://ui.perfetto.dev",
);
}
};
window.addEventListener("message", onMessage);
// Ping Perfetto until it responds
const pingInterval = setInterval(() => {
perfettoWindow.postMessage("PING", "https://ui.perfetto.dev");
}, 50);
// Clean up after 10 seconds
setTimeout(() => {
clearInterval(pingInterval);
window.removeEventListener("message", onMessage);
}, 10000);
}
async function refresh() {
loading = true;
error = null;
try {
const response = await listTraces();
traces = response.traces;
} catch (e) {
error = e instanceof Error ? e.message : "Failed to load traces";
} finally {
loading = false;
}
}
onMount(() => {
refresh();
});
</script>
<div class="min-h-screen bg-exo-dark-gray text-white">
<HeaderNav showHome={true} />
<div class="max-w-7xl mx-auto px-4 lg:px-8 py-6 space-y-6">
<div class="flex items-center justify-between gap-4 flex-wrap">
<div>
<h1
class="text-2xl font-mono tracking-[0.2em] uppercase text-exo-yellow"
>
Traces
</h1>
</div>
<div class="flex items-center gap-3">
<button
type="button"
class="text-xs font-mono text-exo-light-gray hover:text-exo-yellow transition-colors uppercase border border-exo-medium-gray/40 px-2 py-1 rounded"
onclick={refresh}
disabled={loading}
>
Refresh
</button>
</div>
</div>
{#if loading}
<div
class="rounded border border-exo-medium-gray/30 bg-exo-black/30 p-6 text-center text-exo-light-gray"
>
<div class="text-sm">Loading traces...</div>
</div>
{:else if error}
<div
class="rounded border border-red-500/30 bg-red-500/10 p-6 text-center text-red-400"
>
<div class="text-sm">{error}</div>
</div>
{:else if traces.length === 0}
<div
class="rounded border border-exo-medium-gray/30 bg-exo-black/30 p-6 text-center text-exo-light-gray space-y-2"
>
<div class="text-sm">No traces found.</div>
<div class="text-xs text-exo-light-gray/70">
Run exo with EXO_TRACING_ENABLED=1 to collect traces.
</div>
</div>
{:else}
<div class="space-y-3">
{#each traces as trace}
<div
class="rounded border border-exo-medium-gray/30 bg-exo-black/30 p-4 flex items-center justify-between gap-4"
>
<div class="min-w-0 flex-1">
<a
href="#/traces/{trace.taskId}"
class="text-sm font-mono text-white hover:text-exo-yellow transition-colors truncate block"
>
{trace.taskId}
</a>
<div class="text-xs text-exo-light-gray font-mono mt-1">
{formatDate(trace.createdAt)} &bull; {formatBytes(
trace.fileSize,
)}
</div>
</div>
<div class="flex items-center gap-2 shrink-0">
<a
href="#/traces/{trace.taskId}"
class="text-xs font-mono text-exo-light-gray hover:text-exo-yellow transition-colors uppercase border border-exo-medium-gray/40 px-2 py-1 rounded"
>
View Stats
</a>
<button
type="button"
class="text-xs font-mono text-exo-light-gray hover:text-exo-yellow transition-colors uppercase border border-exo-medium-gray/40 px-2 py-1 rounded"
onclick={() => downloadTrace(trace.taskId)}
>
Download
</button>
<button
type="button"
class="text-xs font-mono text-exo-dark-gray bg-exo-yellow hover:bg-exo-yellow/90 transition-colors uppercase px-2 py-1 rounded font-semibold"
onclick={() => openInPerfetto(trace.taskId)}
>
View Trace
</button>
</div>
</div>
{/each}
</div>
{/if}
</div>
</div>

View File

@@ -0,0 +1,367 @@
<script lang="ts">
import { page } from "$app/stores";
import { onMount } from "svelte";
import {
fetchTraceStats,
getTraceRawUrl,
type TraceStatsResponse,
type TraceCategoryStats,
} from "$lib/stores/app.svelte";
import HeaderNav from "$lib/components/HeaderNav.svelte";
const taskId = $derived($page.params.taskId);
let stats = $state<TraceStatsResponse | null>(null);
let loading = $state(true);
let error = $state<string | null>(null);
function formatDuration(us: number): string {
if (us < 1000) return `${us.toFixed(0)}us`;
if (us < 1_000_000) return `${(us / 1000).toFixed(2)}ms`;
return `${(us / 1_000_000).toFixed(2)}s`;
}
function formatPercentage(part: number, total: number): string {
if (total === 0) return "0.0%";
return `${((part / total) * 100).toFixed(1)}%`;
}
// Parse hierarchical categories like "sync/compute" into phases
type PhaseData = {
name: string;
subcategories: { name: string; stats: TraceCategoryStats }[];
totalUs: number; // From outer span (e.g., "sync" category)
stepCount: number; // Count of outer span events
};
function parsePhases(
byCategory: Record<string, TraceCategoryStats>,
): PhaseData[] {
const phases = new Map<
string,
{
subcats: Map<string, TraceCategoryStats>;
outerStats: TraceCategoryStats | null;
}
>();
for (const [category, catStats] of Object.entries(byCategory)) {
if (category.includes("/")) {
const [phase, subcat] = category.split("/", 2);
if (!phases.has(phase)) {
phases.set(phase, { subcats: new Map(), outerStats: null });
}
phases.get(phase)!.subcats.set(subcat, catStats);
} else {
// Outer span - this IS the phase total
if (!phases.has(category)) {
phases.set(category, { subcats: new Map(), outerStats: null });
}
phases.get(category)!.outerStats = catStats;
}
}
return Array.from(phases.entries())
.filter(([_, data]) => data.outerStats !== null) // Only phases with outer spans
.map(([name, data]) => ({
name,
subcategories: Array.from(data.subcats.entries())
.map(([subName, subStats]) => ({ name: subName, stats: subStats }))
.sort((a, b) => b.stats.totalUs - a.stats.totalUs),
totalUs: data.outerStats!.totalUs, // Outer span total
stepCount: data.outerStats!.count, // Number of steps
}))
.sort((a, b) => b.totalUs - a.totalUs);
}
async function downloadTrace() {
if (!taskId) return;
const response = await fetch(getTraceRawUrl(taskId));
const blob = await response.blob();
const url = URL.createObjectURL(blob);
const a = document.createElement("a");
a.href = url;
a.download = `trace_${taskId}.json`;
a.click();
URL.revokeObjectURL(url);
}
async function openInPerfetto() {
if (!taskId) return;
// Fetch trace data from our local API
const response = await fetch(getTraceRawUrl(taskId));
const traceData = await response.arrayBuffer();
// Open Perfetto UI
const perfettoWindow = window.open("https://ui.perfetto.dev");
if (!perfettoWindow) {
alert("Failed to open Perfetto. Please allow popups.");
return;
}
// Wait for Perfetto to be ready, then send trace via postMessage
const onMessage = (e: MessageEvent) => {
if (e.data === "PONG") {
window.removeEventListener("message", onMessage);
perfettoWindow.postMessage(
{
perfetto: {
buffer: traceData,
title: `Trace ${taskId}`,
},
},
"https://ui.perfetto.dev",
);
}
};
window.addEventListener("message", onMessage);
// Ping Perfetto until it responds
const pingInterval = setInterval(() => {
perfettoWindow.postMessage("PING", "https://ui.perfetto.dev");
}, 50);
// Clean up after 10 seconds
setTimeout(() => {
clearInterval(pingInterval);
window.removeEventListener("message", onMessage);
}, 10000);
}
onMount(async () => {
if (!taskId) {
error = "No task ID provided";
loading = false;
return;
}
try {
stats = await fetchTraceStats(taskId);
} catch (e) {
error = e instanceof Error ? e.message : "Failed to load trace";
} finally {
loading = false;
}
});
const phases = $derived(stats ? parsePhases(stats.byCategory) : []);
const sortedRanks = $derived(
stats
? Object.keys(stats.byRank)
.map(Number)
.sort((a, b) => a - b)
: [],
);
const nodeCount = $derived(sortedRanks.length || 1);
</script>
<div class="min-h-screen bg-exo-dark-gray text-white">
<HeaderNav showHome={true} />
<div class="max-w-7xl mx-auto px-4 lg:px-8 py-6 space-y-6">
<div class="flex items-center justify-between gap-4 flex-wrap">
<div>
<h1
class="text-2xl font-mono tracking-[0.2em] uppercase text-exo-yellow"
>
Trace
</h1>
<p class="text-sm text-exo-light-gray font-mono truncate max-w-lg">
{taskId}
</p>
</div>
<div class="flex items-center gap-3">
<a
href="#/traces"
class="text-xs font-mono text-exo-light-gray hover:text-exo-yellow transition-colors uppercase border border-exo-medium-gray/40 px-3 py-1.5 rounded"
>
All Traces
</a>
<button
type="button"
class="text-xs font-mono text-exo-light-gray hover:text-exo-yellow transition-colors uppercase border border-exo-medium-gray/40 px-3 py-1.5 rounded"
onclick={downloadTrace}
disabled={loading || !!error}
>
Download
</button>
<button
type="button"
class="text-xs font-mono text-exo-dark-gray bg-exo-yellow hover:bg-exo-yellow/90 transition-colors uppercase px-3 py-1.5 rounded font-semibold"
onclick={openInPerfetto}
disabled={loading || !!error}
>
View Trace
</button>
</div>
</div>
{#if loading}
<div
class="rounded border border-exo-medium-gray/30 bg-exo-black/30 p-6 text-center text-exo-light-gray"
>
<div class="text-sm">Loading trace data...</div>
</div>
{:else if error}
<div
class="rounded border border-red-500/30 bg-red-500/10 p-6 text-center text-red-400"
>
<div class="text-sm">{error}</div>
</div>
{:else if stats}
<!-- Wall Time Summary -->
<div
class="rounded border border-exo-medium-gray/30 bg-exo-black/30 p-4 space-y-2"
>
<h2
class="text-sm font-mono uppercase tracking-wider text-exo-light-gray"
>
Summary
</h2>
<div class="text-3xl font-mono text-exo-yellow">
{formatDuration(stats.totalWallTimeUs)}
</div>
<div class="text-xs text-exo-light-gray">Total wall time</div>
</div>
<!-- By Phase -->
{#if phases.length > 0}
<div
class="rounded border border-exo-medium-gray/30 bg-exo-black/30 p-4 space-y-4"
>
<h2
class="text-sm font-mono uppercase tracking-wider text-exo-light-gray"
>
By Phase <span class="text-exo-light-gray/50">(avg per node)</span>
</h2>
<div class="space-y-4">
{#each phases as phase}
{@const normalizedTotal = phase.totalUs / nodeCount}
{@const normalizedStepCount = phase.stepCount / nodeCount}
<div class="space-y-2">
<div class="flex items-center justify-between">
<span class="text-sm font-mono text-white">{phase.name}</span>
<span class="text-sm font-mono">
<span class="text-exo-yellow"
>{formatDuration(normalizedTotal)}</span
>
<span class="text-exo-light-gray ml-2">
({normalizedStepCount} steps, {formatDuration(
normalizedTotal / normalizedStepCount,
)}/step)
</span>
</span>
</div>
{#if phase.subcategories.length > 0}
<div class="pl-4 space-y-1.5">
{#each phase.subcategories as subcat}
{@const normalizedSubcat =
subcat.stats.totalUs / nodeCount}
{@const pct = formatPercentage(
normalizedSubcat,
normalizedTotal,
)}
{@const perStep = normalizedSubcat / normalizedStepCount}
<div
class="flex items-center justify-between text-xs font-mono"
>
<span class="text-exo-light-gray">{subcat.name}</span>
<span class="text-white">
{formatDuration(normalizedSubcat)}
<span class="text-exo-light-gray ml-2">({pct})</span>
<span class="text-exo-light-gray/60 ml-2"
>{formatDuration(perStep)}/step</span
>
</span>
</div>
<!-- Progress bar -->
<div
class="relative h-1.5 bg-exo-black/60 rounded-sm overflow-hidden"
>
<div
class="absolute inset-y-0 left-0 bg-gradient-to-r from-exo-yellow to-exo-yellow/70 transition-all duration-300"
style="width: {pct}"
></div>
</div>
{/each}
</div>
{/if}
</div>
{/each}
</div>
</div>
{/if}
<!-- By Rank -->
{#if sortedRanks.length > 0}
<div
class="rounded border border-exo-medium-gray/30 bg-exo-black/30 p-4 space-y-4"
>
<h2
class="text-sm font-mono uppercase tracking-wider text-exo-light-gray"
>
By Rank
</h2>
<div class="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-4">
{#each sortedRanks as rank}
{@const rankStats = stats.byRank[rank]}
{@const rankPhases = parsePhases(rankStats.byCategory)}
<div
class="rounded border border-exo-medium-gray/20 bg-exo-dark-gray/60 p-3 space-y-3"
>
<div class="text-sm font-mono text-exo-yellow">
Rank {rank}
</div>
<div class="space-y-2">
{#each rankPhases as phase}
<div class="space-y-1">
<div class="flex items-center justify-between text-xs">
<span class="font-mono text-exo-light-gray"
>{phase.name}</span
>
<span class="font-mono text-white">
{formatDuration(phase.totalUs)}
<span class="text-exo-light-gray/50 ml-1">
({phase.stepCount}x)
</span>
</span>
</div>
{#if phase.subcategories.length > 0}
<div class="pl-2 space-y-0.5">
{#each phase.subcategories as subcat}
{@const pct = formatPercentage(
subcat.stats.totalUs,
phase.totalUs,
)}
{@const perStep =
subcat.stats.totalUs / phase.stepCount}
<div
class="flex items-center justify-between text-[10px] font-mono"
>
<span class="text-exo-light-gray/70"
>{subcat.name}</span
>
<span class="text-exo-light-gray">
{formatDuration(subcat.stats.totalUs)}
<span class="text-exo-light-gray/50"
>({pct})</span
>
<span class="text-exo-light-gray/30 ml-1"
>{formatDuration(perStep)}/step</span
>
</span>
</div>
{/each}
</div>
{/if}
</div>
{/each}
</div>
</div>
{/each}
</div>
</div>
{/if}
{/if}
</div>
</div>

View File

@@ -10,6 +10,7 @@ PROJECT_ROOT = Path.cwd()
SOURCE_ROOT = PROJECT_ROOT / "src"
ENTRYPOINT = SOURCE_ROOT / "exo" / "__main__.py"
DASHBOARD_DIR = PROJECT_ROOT / "dashboard" / "build"
RESOURCES_DIR = PROJECT_ROOT / "resources"
EXO_SHARED_MODELS_DIR = SOURCE_ROOT / "exo" / "shared" / "models"
if not ENTRYPOINT.is_file():
@@ -18,6 +19,9 @@ if not ENTRYPOINT.is_file():
if not DASHBOARD_DIR.is_dir():
raise SystemExit(f"Dashboard assets are missing: {DASHBOARD_DIR}")
if not RESOURCES_DIR.is_dir():
raise SystemExit(f"Resource assets are missing: {RESOURCES_DIR}")
if not EXO_SHARED_MODELS_DIR.is_dir():
raise SystemExit(f"Shared model assets are missing: {EXO_SHARED_MODELS_DIR}")
@@ -58,6 +62,7 @@ HIDDEN_IMPORTS = sorted(
DATAS: list[tuple[str, str]] = [
(str(DASHBOARD_DIR), "dashboard"),
(str(RESOURCES_DIR), "resources"),
(str(MLX_LIB_DIR), "mlx/lib"),
(str(EXO_SHARED_MODELS_DIR), "exo/shared/models"),
]

View File

@@ -69,7 +69,9 @@
# Create wrapper scripts
for script in exo exo-master exo-worker; do
makeWrapper ${exoVenv}/bin/$script $out/bin/$script \
--set DASHBOARD_DIR ${self'.packages.dashboard}
--set EXO_DASHBOARD_DIR ${self'.packages.dashboard} \
--set EXO_RESOURCES_DIR ${inputs.self + "/resources"} \
${lib.optionalString pkgs.stdenv.isDarwin "--prefix PATH : ${pkgs.macmon}/bin"}
done
'';
in

View File

@@ -0,0 +1,45 @@
model_id = "exolabs/FLUX.1-Krea-dev-4bit"
n_layers = 57
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
[storage_size]
in_bytes = 15475325472
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 0
[[components]]
component_name = "text_encoder_2"
component_path = "text_encoder_2/"
n_layers = 24
can_shard = false
safetensors_index_filename = "model.safetensors.index.json"
[components.storage_size]
in_bytes = 9524621312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 57
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 5950704160
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,45 @@
model_id = "exolabs/FLUX.1-Krea-dev-8bit"
n_layers = 57
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
[storage_size]
in_bytes = 21426029632
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 0
[[components]]
component_name = "text_encoder_2"
component_path = "text_encoder_2/"
n_layers = 24
can_shard = false
safetensors_index_filename = "model.safetensors.index.json"
[components.storage_size]
in_bytes = 9524621312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 57
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 11901408320
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,45 @@
model_id = "exolabs/FLUX.1-Krea-dev"
n_layers = 57
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
[storage_size]
in_bytes = 33327437952
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 0
[[components]]
component_name = "text_encoder_2"
component_path = "text_encoder_2/"
n_layers = 24
can_shard = false
safetensors_index_filename = "model.safetensors.index.json"
[components.storage_size]
in_bytes = 9524621312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 57
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 23802816640
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,45 @@
model_id = "exolabs/FLUX.1-dev-4bit"
n_layers = 57
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
[storage_size]
in_bytes = 15475325472
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 0
[[components]]
component_name = "text_encoder_2"
component_path = "text_encoder_2/"
n_layers = 24
can_shard = false
safetensors_index_filename = "model.safetensors.index.json"
[components.storage_size]
in_bytes = 9524621312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 57
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 5950704160
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,45 @@
model_id = "exolabs/FLUX.1-dev-8bit"
n_layers = 57
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
[storage_size]
in_bytes = 21426029632
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 0
[[components]]
component_name = "text_encoder_2"
component_path = "text_encoder_2/"
n_layers = 24
can_shard = false
safetensors_index_filename = "model.safetensors.index.json"
[components.storage_size]
in_bytes = 9524621312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 57
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 11901408320
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,45 @@
model_id = "exolabs/FLUX.1-dev"
n_layers = 57
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
[storage_size]
in_bytes = 33327437952
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 0
[[components]]
component_name = "text_encoder_2"
component_path = "text_encoder_2/"
n_layers = 24
can_shard = false
safetensors_index_filename = "model.safetensors.index.json"
[components.storage_size]
in_bytes = 9524621312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 57
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 23802816640
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,45 @@
model_id = "exolabs/FLUX.1-schnell-4bit"
n_layers = 57
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
[storage_size]
in_bytes = 15470210592
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 0
[[components]]
component_name = "text_encoder_2"
component_path = "text_encoder_2/"
n_layers = 24
can_shard = false
safetensors_index_filename = "model.safetensors.index.json"
[components.storage_size]
in_bytes = 9524621312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 57
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 5945589280
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,45 @@
model_id = "exolabs/FLUX.1-schnell-8bit"
n_layers = 57
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
[storage_size]
in_bytes = 21415799872
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 0
[[components]]
component_name = "text_encoder_2"
component_path = "text_encoder_2/"
n_layers = 24
can_shard = false
safetensors_index_filename = "model.safetensors.index.json"
[components.storage_size]
in_bytes = 9524621312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 57
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 11891178560
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,45 @@
model_id = "exolabs/FLUX.1-schnell"
n_layers = 57
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
[storage_size]
in_bytes = 33306978432
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 0
[[components]]
component_name = "text_encoder_2"
component_path = "text_encoder_2/"
n_layers = 24
can_shard = false
safetensors_index_filename = "model.safetensors.index.json"
[components.storage_size]
in_bytes = 9524621312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 57
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 23782357120
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,36 @@
model_id = "exolabs/Qwen-Image-4bit"
n_layers = 60
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
uses_cfg = true
[storage_size]
in_bytes = 26799533856
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 16584333312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 60
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 10215200544
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,36 @@
model_id = "exolabs/Qwen-Image-8bit"
n_layers = 60
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
uses_cfg = true
[storage_size]
in_bytes = 37014734400
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 16584333312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 60
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 20430401088
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,36 @@
model_id = "exolabs/Qwen-Image-Edit-2509-4bit"
n_layers = 60
hidden_size = 1
supports_tensor = false
tasks = ["ImageToImage"]
uses_cfg = true
[storage_size]
in_bytes = 26799533856
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 16584333312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 60
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 10215200544
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,36 @@
model_id = "exolabs/Qwen-Image-Edit-2509-8bit"
n_layers = 60
hidden_size = 1
supports_tensor = false
tasks = ["ImageToImage"]
uses_cfg = true
[storage_size]
in_bytes = 37014734400
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 16584333312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 60
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 20430401088
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,36 @@
model_id = "exolabs/Qwen-Image-Edit-2509"
n_layers = 60
hidden_size = 1
supports_tensor = false
tasks = ["ImageToImage"]
uses_cfg = true
[storage_size]
in_bytes = 57445135488
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 16584333312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 60
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 40860802176
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,36 @@
model_id = "exolabs/Qwen-Image"
n_layers = 60
hidden_size = 1
supports_tensor = false
tasks = ["TextToImage"]
uses_cfg = true
[storage_size]
in_bytes = 57445135488
[[components]]
component_name = "text_encoder"
component_path = "text_encoder/"
n_layers = 12
can_shard = false
[components.storage_size]
in_bytes = 16584333312
[[components]]
component_name = "transformer"
component_path = "transformer/"
n_layers = 60
can_shard = true
safetensors_index_filename = "diffusion_pytorch_model.safetensors.index.json"
[components.storage_size]
in_bytes = 40860802176
[[components]]
component_name = "vae"
component_path = "vae/"
can_shard = false
[components.storage_size]
in_bytes = 0

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/DeepSeek-V3.1-4bit"
n_layers = 61
hidden_size = 7168
supports_tensor = true
tasks = ["TextGeneration"]
family = "deepseek"
quantization = "4bit"
base_model = "DeepSeek V3.1"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 405874409472

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/DeepSeek-V3.1-8bit"
n_layers = 61
hidden_size = 7168
supports_tensor = true
tasks = ["TextGeneration"]
family = "deepseek"
quantization = "8bit"
base_model = "DeepSeek V3.1"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 765577920512

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/GLM-4.5-Air-8bit"
n_layers = 46
hidden_size = 4096
supports_tensor = false
tasks = ["TextGeneration"]
family = "glm"
quantization = "8bit"
base_model = "GLM 4.5 Air"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 122406567936

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/GLM-4.5-Air-bf16"
n_layers = 46
hidden_size = 4096
supports_tensor = true
tasks = ["TextGeneration"]
family = "glm"
quantization = "bf16"
base_model = "GLM 4.5 Air"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 229780750336

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/GLM-4.7-4bit"
n_layers = 91
hidden_size = 5120
supports_tensor = true
tasks = ["TextGeneration"]
family = "glm"
quantization = "4bit"
base_model = "GLM 4.7"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 198556925568

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/GLM-4.7-6bit"
n_layers = 91
hidden_size = 5120
supports_tensor = true
tasks = ["TextGeneration"]
family = "glm"
quantization = "6bit"
base_model = "GLM 4.7"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 286737579648

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/GLM-4.7-8bit-gs32"
n_layers = 91
hidden_size = 5120
supports_tensor = true
tasks = ["TextGeneration"]
family = "glm"
quantization = "8bit"
base_model = "GLM 4.7"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 396963397248

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/GLM-4.7-Flash-4bit"
n_layers = 47
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "glm"
quantization = "4bit"
base_model = "GLM 4.7 Flash"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 19327352832

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/GLM-4.7-Flash-5bit"
n_layers = 47
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "glm"
quantization = "5bit"
base_model = "GLM 4.7 Flash"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 22548578304

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/GLM-4.7-Flash-6bit"
n_layers = 47
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "glm"
quantization = "6bit"
base_model = "GLM 4.7 Flash"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 26843545600

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/GLM-4.7-Flash-8bit"
n_layers = 47
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "glm"
quantization = "8bit"
base_model = "GLM 4.7 Flash"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 34359738368

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Kimi-K2-Instruct-4bit"
n_layers = 61
hidden_size = 7168
supports_tensor = true
tasks = ["TextGeneration"]
family = "kimi"
quantization = "4bit"
base_model = "Kimi K2"
capabilities = ["text"]
[storage_size]
in_bytes = 620622774272

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Kimi-K2-Thinking"
n_layers = 61
hidden_size = 7168
supports_tensor = true
tasks = ["TextGeneration"]
family = "kimi"
quantization = ""
base_model = "Kimi K2"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 706522120192

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Kimi-K2.5"
n_layers = 61
hidden_size = 7168
supports_tensor = true
tasks = ["TextGeneration"]
family = "kimi"
quantization = ""
base_model = "Kimi K2.5"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 662498705408

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Llama-3.2-1B-Instruct-4bit"
n_layers = 16
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "llama"
quantization = "4bit"
base_model = "Llama 3.2 1B"
capabilities = ["text"]
[storage_size]
in_bytes = 729808896

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Llama-3.2-3B-Instruct-4bit"
n_layers = 28
hidden_size = 3072
supports_tensor = true
tasks = ["TextGeneration"]
family = "llama"
quantization = "4bit"
base_model = "Llama 3.2 3B"
capabilities = ["text"]
[storage_size]
in_bytes = 1863319552

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Llama-3.2-3B-Instruct-8bit"
n_layers = 28
hidden_size = 3072
supports_tensor = true
tasks = ["TextGeneration"]
family = "llama"
quantization = "8bit"
base_model = "Llama 3.2 3B"
capabilities = ["text"]
[storage_size]
in_bytes = 3501195264

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Llama-3.3-70B-Instruct-4bit"
n_layers = 80
hidden_size = 8192
supports_tensor = true
tasks = ["TextGeneration"]
family = "llama"
quantization = "4bit"
base_model = "Llama 3.3 70B"
capabilities = ["text"]
[storage_size]
in_bytes = 40652242944

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Llama-3.3-70B-Instruct-8bit"
n_layers = 80
hidden_size = 8192
supports_tensor = true
tasks = ["TextGeneration"]
family = "llama"
quantization = "8bit"
base_model = "Llama 3.3 70B"
capabilities = ["text"]
[storage_size]
in_bytes = 76799803392

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Meta-Llama-3.1-70B-Instruct-4bit"
n_layers = 80
hidden_size = 8192
supports_tensor = true
tasks = ["TextGeneration"]
family = "llama"
quantization = "4bit"
base_model = "Llama 3.1 70B"
capabilities = ["text"]
[storage_size]
in_bytes = 40652242944

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
n_layers = 32
hidden_size = 4096
supports_tensor = true
tasks = ["TextGeneration"]
family = "llama"
quantization = "4bit"
base_model = "Llama 3.1 8B"
capabilities = ["text"]
[storage_size]
in_bytes = 4637851648

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
n_layers = 32
hidden_size = 4096
supports_tensor = true
tasks = ["TextGeneration"]
family = "llama"
quantization = "8bit"
base_model = "Llama 3.1 8B"
capabilities = ["text"]
[storage_size]
in_bytes = 8954839040

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Meta-Llama-3.1-8B-Instruct-bf16"
n_layers = 32
hidden_size = 4096
supports_tensor = true
tasks = ["TextGeneration"]
family = "llama"
quantization = "bf16"
base_model = "Llama 3.1 8B"
capabilities = ["text"]
[storage_size]
in_bytes = 16882073600

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/MiniMax-M2.1-3bit"
n_layers = 61
hidden_size = 3072
supports_tensor = true
tasks = ["TextGeneration"]
family = "minimax"
quantization = "3bit"
base_model = "MiniMax M2.1"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 100086644736

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/MiniMax-M2.1-8bit"
n_layers = 61
hidden_size = 3072
supports_tensor = true
tasks = ["TextGeneration"]
family = "minimax"
quantization = "8bit"
base_model = "MiniMax M2.1"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 242986745856

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-0.6B-4bit"
n_layers = 28
hidden_size = 1024
supports_tensor = false
tasks = ["TextGeneration"]
family = "qwen"
quantization = "4bit"
base_model = "Qwen3 0.6B"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 342884352

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-0.6B-8bit"
n_layers = 28
hidden_size = 1024
supports_tensor = false
tasks = ["TextGeneration"]
family = "qwen"
quantization = "8bit"
base_model = "Qwen3 0.6B"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 698351616

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-235B-A22B-Instruct-2507-4bit"
n_layers = 94
hidden_size = 4096
supports_tensor = true
tasks = ["TextGeneration"]
family = "qwen"
quantization = "4bit"
base_model = "Qwen3 235B"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 141733920768

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-235B-A22B-Instruct-2507-8bit"
n_layers = 94
hidden_size = 4096
supports_tensor = true
tasks = ["TextGeneration"]
family = "qwen"
quantization = "8bit"
base_model = "Qwen3 235B"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 268435456000

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-30B-A3B-4bit"
n_layers = 48
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "qwen"
quantization = "4bit"
base_model = "Qwen3 30B"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 17612931072

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-30B-A3B-8bit"
n_layers = 48
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "qwen"
quantization = "8bit"
base_model = "Qwen3 30B"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 33279705088

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit"
n_layers = 62
hidden_size = 6144
supports_tensor = true
tasks = ["TextGeneration"]
family = "qwen"
quantization = "4bit"
base_model = "Qwen3 Coder 480B"
capabilities = ["text", "code"]
[storage_size]
in_bytes = 289910292480

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-Coder-480B-A35B-Instruct-8bit"
n_layers = 62
hidden_size = 6144
supports_tensor = true
tasks = ["TextGeneration"]
family = "qwen"
quantization = "8bit"
base_model = "Qwen3 Coder 480B"
capabilities = ["text", "code"]
[storage_size]
in_bytes = 579820584960

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit"
n_layers = 48
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "qwen"
quantization = "4bit"
base_model = "Qwen3 Next 80B"
capabilities = ["text"]
[storage_size]
in_bytes = 46976204800

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit"
n_layers = 48
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "qwen"
quantization = "8bit"
base_model = "Qwen3 Next 80B"
capabilities = ["text"]
[storage_size]
in_bytes = 88814387200

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-Next-80B-A3B-Thinking-4bit"
n_layers = 48
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "qwen"
quantization = "4bit"
base_model = "Qwen3 Next 80B"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 47080074240

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/Qwen3-Next-80B-A3B-Thinking-8bit"
n_layers = 48
hidden_size = 2048
supports_tensor = true
tasks = ["TextGeneration"]
family = "qwen"
quantization = "8bit"
base_model = "Qwen3 Next 80B"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 88814387200

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/gpt-oss-120b-MXFP4-Q8"
n_layers = 36
hidden_size = 2880
supports_tensor = true
tasks = ["TextGeneration"]
family = "gpt-oss"
quantization = "MXFP4-Q8"
base_model = "GPT-OSS 120B"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 70652212224

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/gpt-oss-20b-MXFP4-Q8"
n_layers = 24
hidden_size = 2880
supports_tensor = true
tasks = ["TextGeneration"]
family = "gpt-oss"
quantization = "MXFP4-Q8"
base_model = "GPT-OSS 20B"
capabilities = ["text", "thinking"]
[storage_size]
in_bytes = 12025908224

View File

@@ -0,0 +1,12 @@
model_id = "mlx-community/llama-3.3-70b-instruct-fp16"
n_layers = 80
hidden_size = 8192
supports_tensor = true
tasks = ["TextGeneration"]
family = "llama"
quantization = "fp16"
base_model = "Llama 3.3 70B"
capabilities = ["text"]
[storage_size]
in_bytes = 144383672320

View File

@@ -1,4 +1,5 @@
import asyncio
import socket
from dataclasses import dataclass, field
from typing import Iterator
@@ -60,10 +61,37 @@ class DownloadCoordinator:
async def run(self) -> None:
logger.info("Starting DownloadCoordinator")
self._test_internet_connection()
async with self._tg as tg:
tg.start_soon(self._command_processor)
tg.start_soon(self._forward_events)
tg.start_soon(self._emit_existing_download_progress)
tg.start_soon(self._check_internet_connection)
def _test_internet_connection(self) -> None:
try:
socket.create_connection(("1.1.1.1", 443), timeout=3).close()
self.shard_downloader.set_internet_connection(True)
except OSError:
self.shard_downloader.set_internet_connection(False)
logger.debug(
f"Internet connectivity: {self.shard_downloader.internet_connection}"
)
async def _check_internet_connection(self) -> None:
first_connection = True
while True:
await asyncio.sleep(10)
# Assume that internet connection is set to False on 443 errors.
if self.shard_downloader.internet_connection:
continue
self._test_internet_connection()
if first_connection and self.shard_downloader.internet_connection:
first_connection = False
self._tg.start_soon(self._emit_existing_download_progress)
def shutdown(self) -> None:
self._tg.cancel_scope.cancel()
@@ -241,7 +269,7 @@ class DownloadCoordinator:
async def _emit_existing_download_progress(self) -> None:
try:
while True:
logger.info(
logger.debug(
"DownloadCoordinator: Fetching and emitting existing download progress..."
)
async for (
@@ -274,10 +302,10 @@ class DownloadCoordinator:
await self.event_sender.send(
NodeDownloadProgress(download_progress=status)
)
logger.info(
logger.debug(
"DownloadCoordinator: Done emitting existing download progress."
)
await anyio.sleep(5 * 60) # 5 minutes
await anyio.sleep(60)
except Exception as e:
logger.error(
f"DownloadCoordinator: Error emitting existing download progress: {e}"

View File

@@ -49,6 +49,10 @@ class HuggingFaceAuthenticationError(Exception):
"""Raised when HuggingFace returns 401/403 for a model download."""
class HuggingFaceRateLimitError(Exception):
"""429 Huggingface code"""
async def _build_auth_error_message(status_code: int, model_id: ModelId) -> str:
token = await get_hf_token()
if status_code == 401 and token is None:
@@ -154,49 +158,76 @@ async def seed_models(seed_dir: str | Path):
logger.error(traceback.format_exc())
_fetched_file_lists_this_session: set[str] = set()
async def fetch_file_list_with_cache(
model_id: ModelId, revision: str = "main", recursive: bool = False
model_id: ModelId,
revision: str = "main",
recursive: bool = False,
skip_internet: bool = False,
on_connection_lost: Callable[[], None] = lambda: None,
) -> list[FileListEntry]:
target_dir = (await ensure_models_dir()) / "caches" / model_id.normalize()
await aios.makedirs(target_dir, exist_ok=True)
cache_file = target_dir / f"{model_id.normalize()}--{revision}--file_list.json"
cache_key = f"{model_id.normalize()}--{revision}"
if cache_key in _fetched_file_lists_this_session and await aios.path.exists(
cache_file
):
async with aiofiles.open(cache_file, "r") as f:
return TypeAdapter(list[FileListEntry]).validate_json(await f.read())
if skip_internet:
if await aios.path.exists(cache_file):
async with aiofiles.open(cache_file, "r") as f:
return TypeAdapter(list[FileListEntry]).validate_json(await f.read())
raise FileNotFoundError(
f"No internet connection and no cached file list for {model_id}"
)
# Always try fresh first
try:
file_list = await fetch_file_list_with_retry(
model_id, revision, recursive=recursive
model_id,
revision,
recursive=recursive,
on_connection_lost=on_connection_lost,
)
# Update cache with fresh data
async with aiofiles.open(cache_file, "w") as f:
await f.write(
TypeAdapter(list[FileListEntry]).dump_json(file_list).decode()
)
_fetched_file_lists_this_session.add(cache_key)
return file_list
except Exception as e:
# Fetch failed - try cache fallback
if await aios.path.exists(cache_file):
logger.warning(
f"Failed to fetch file list for {model_id}, using cached data: {e}"
)
async with aiofiles.open(cache_file, "r") as f:
return TypeAdapter(list[FileListEntry]).validate_json(await f.read())
# No cache available, propagate the error
raise
raise FileNotFoundError(f"Failed to fetch file list for {model_id}: {e}") from e
async def fetch_file_list_with_retry(
model_id: ModelId, revision: str = "main", path: str = "", recursive: bool = False
model_id: ModelId,
revision: str = "main",
path: str = "",
recursive: bool = False,
on_connection_lost: Callable[[], None] = lambda: None,
) -> list[FileListEntry]:
n_attempts = 30
n_attempts = 3
for attempt in range(n_attempts):
try:
return await _fetch_file_list(model_id, revision, path, recursive)
except HuggingFaceAuthenticationError:
raise
except Exception as e:
on_connection_lost()
if attempt == n_attempts - 1:
raise e
await asyncio.sleep(min(8, 0.1 * float(2.0 ** int(attempt))))
await asyncio.sleep(2.0**attempt)
raise Exception(
f"Failed to fetch file list for {model_id=} {revision=} {path=} {recursive=}"
)
@@ -216,7 +247,11 @@ async def _fetch_file_list(
if response.status in [401, 403]:
msg = await _build_auth_error_message(response.status, model_id)
raise HuggingFaceAuthenticationError(msg)
if response.status == 200:
elif response.status == 429:
raise HuggingFaceRateLimitError(
f"Couldn't download {model_id} because of HuggingFace rate limit."
)
elif response.status == 200:
data_json = await response.text()
data = TypeAdapter(list[FileListEntry]).validate_json(data_json)
files: list[FileListEntry] = []
@@ -249,7 +284,7 @@ def create_http_session(
else:
total_timeout = 1800
connect_timeout = 60
sock_read_timeout = 1800
sock_read_timeout = 60
sock_connect_timeout = 60
ssl_context = ssl.create_default_context(
@@ -324,8 +359,9 @@ async def download_file_with_retry(
path: str,
target_dir: Path,
on_progress: Callable[[int, int, bool], None] = lambda _, __, ___: None,
on_connection_lost: Callable[[], None] = lambda: None,
) -> Path:
n_attempts = 30
n_attempts = 3
for attempt in range(n_attempts):
try:
return await _download_file(
@@ -333,14 +369,19 @@ async def download_file_with_retry(
)
except HuggingFaceAuthenticationError:
raise
except Exception as e:
if isinstance(e, FileNotFoundError) or attempt == n_attempts - 1:
except HuggingFaceRateLimitError as e:
if attempt == n_attempts - 1:
raise e
logger.error(
f"Download error on attempt {attempt}/{n_attempts} for {model_id=} {revision=} {path=} {target_dir=}"
)
logger.error(traceback.format_exc())
await asyncio.sleep(min(8, 0.1 * (2.0**attempt)))
await asyncio.sleep(2.0**attempt)
except Exception as e:
on_connection_lost()
if attempt == n_attempts - 1:
raise e
break
raise Exception(
f"Failed to download file {model_id=} {revision=} {path=} {target_dir=}"
)
@@ -542,7 +583,9 @@ async def download_shard(
on_progress: Callable[[ShardMetadata, RepoDownloadProgress], Awaitable[None]],
max_parallel_downloads: int = 8,
skip_download: bool = False,
skip_internet: bool = False,
allow_patterns: list[str] | None = None,
on_connection_lost: Callable[[], None] = lambda: None,
) -> tuple[Path, RepoDownloadProgress]:
if not skip_download:
logger.debug(f"Downloading {shard.model_card.model_id=}")
@@ -562,7 +605,11 @@ async def download_shard(
all_start_time = time.time()
file_list = await fetch_file_list_with_cache(
shard.model_card.model_id, revision, recursive=True
shard.model_card.model_id,
revision,
recursive=True,
skip_internet=skip_internet,
on_connection_lost=on_connection_lost,
)
filtered_file_list = list(
filter_repo_objects(
@@ -672,6 +719,7 @@ async def download_shard(
lambda curr_bytes, total_bytes, is_renamed: schedule_progress(
file, curr_bytes, total_bytes, is_renamed
),
on_connection_lost=on_connection_lost,
)
if not skip_download:

View File

@@ -1,4 +1,5 @@
import asyncio
from asyncio import create_task
from collections.abc import Awaitable
from pathlib import Path
from typing import AsyncIterator, Callable
@@ -7,7 +8,7 @@ from loguru import logger
from exo.download.download_utils import RepoDownloadProgress, download_shard
from exo.download.shard_downloader import ShardDownloader
from exo.shared.models.model_cards import MODEL_CARDS, ModelCard, ModelId
from exo.shared.models.model_cards import ModelCard, ModelId, get_model_cards
from exo.shared.types.worker.shards import (
PipelineShardMetadata,
ShardMetadata,
@@ -49,6 +50,10 @@ class SingletonShardDownloader(ShardDownloader):
self.shard_downloader = shard_downloader
self.active_downloads: dict[ShardMetadata, asyncio.Task[Path]] = {}
def set_internet_connection(self, value: bool) -> None:
self.internet_connection = value
self.shard_downloader.set_internet_connection(value)
def on_progress(
self,
callback: Callable[[ShardMetadata, RepoDownloadProgress], Awaitable[None]],
@@ -85,6 +90,10 @@ class CachedShardDownloader(ShardDownloader):
self.shard_downloader = shard_downloader
self.cache: dict[tuple[str, ShardMetadata], Path] = {}
def set_internet_connection(self, value: bool) -> None:
self.internet_connection = value
self.shard_downloader.set_internet_connection(value)
def on_progress(
self,
callback: Callable[[ShardMetadata, RepoDownloadProgress], Awaitable[None]],
@@ -142,6 +151,8 @@ class ResumableShardDownloader(ShardDownloader):
self.on_progress_wrapper,
max_parallel_downloads=self.max_parallel_downloads,
allow_patterns=allow_patterns,
skip_internet=not self.internet_connection,
on_connection_lost=lambda: self.set_internet_connection(False),
)
return target_dir
@@ -154,13 +165,24 @@ class ResumableShardDownloader(ShardDownloader):
"""Helper coroutine that builds the shard for a model and gets its download status."""
shard = await build_full_shard(model_id)
return await download_shard(
shard, self.on_progress_wrapper, skip_download=True
shard,
self.on_progress_wrapper,
skip_download=True,
skip_internet=not self.internet_connection,
on_connection_lost=lambda: self.set_internet_connection(False),
)
# Kick off download status coroutines concurrently
semaphore = asyncio.Semaphore(self.max_parallel_downloads)
async def download_with_semaphore(
model_card: ModelCard,
) -> tuple[Path, RepoDownloadProgress]:
async with semaphore:
return await _status_for_model(model_card.model_id)
tasks = [
asyncio.create_task(_status_for_model(model_card.model_id))
for model_card in MODEL_CARDS.values()
create_task(download_with_semaphore(model_card))
for model_card in await get_model_cards()
]
for task in asyncio.as_completed(tasks):

View File

@@ -16,6 +16,11 @@ from exo.shared.types.worker.shards import (
# TODO: the PipelineShardMetadata getting reinstantiated is a bit messy. Should this be a classmethod?
class ShardDownloader(ABC):
internet_connection: bool = False
def set_internet_connection(self, value: bool) -> None:
self.internet_connection = value
@abstractmethod
async def ensure_shard(
self, shard: ShardMetadata, config_only: bool = False

View File

@@ -90,7 +90,6 @@ class Node:
worker = Worker(
node_id,
session_id,
connection_message_receiver=router.receiver(topics.CONNECTION_MESSAGES),
global_event_receiver=router.receiver(topics.GLOBAL_EVENTS),
local_event_sender=router.sender(topics.LOCAL_EVENTS),
command_sender=router.sender(topics.COMMANDS),
@@ -227,9 +226,6 @@ class Node:
self.worker = Worker(
self.node_id,
result.session_id,
connection_message_receiver=self.router.receiver(
topics.CONNECTION_MESSAGES
),
global_event_receiver=self.router.receiver(
topics.GLOBAL_EVENTS
),

View File

@@ -0,0 +1 @@
"""API adapters for different API formats (Claude, OpenAI Responses, etc.)."""

View File

@@ -0,0 +1,214 @@
"""OpenAI Chat Completions API adapter for converting requests/responses."""
import time
from collections.abc import AsyncGenerator
from typing import Any
from uuid import uuid4
from exo.shared.types.api import (
ChatCompletionChoice,
ChatCompletionMessage,
ChatCompletionMessageText,
ChatCompletionRequest,
ChatCompletionResponse,
ErrorInfo,
ErrorResponse,
FinishReason,
StreamingChoiceResponse,
ToolCall,
)
from exo.shared.types.chunks import ErrorChunk, TokenChunk, ToolCallChunk
from exo.shared.types.common import CommandId
from exo.shared.types.text_generation import InputMessage, TextGenerationTaskParams
def chat_request_to_text_generation(
request: ChatCompletionRequest,
) -> TextGenerationTaskParams:
instructions: str | None = None
input_messages: list[InputMessage] = []
chat_template_messages: list[dict[str, Any]] = []
for msg in request.messages:
# Normalize content to string
content: str
if msg.content is None:
content = ""
elif isinstance(msg.content, str):
content = msg.content
elif isinstance(msg.content, ChatCompletionMessageText):
content = msg.content.text
else:
# List of ChatCompletionMessageText
content = "\n".join(item.text for item in msg.content)
# Extract system message as instructions
if msg.role == "system":
if instructions is None:
instructions = content
else:
# Append additional system messages
instructions = f"{instructions}\n{content}"
chat_template_messages.append({"role": "system", "content": content})
else:
# Skip messages with no meaningful content
if msg.content is None and msg.thinking is None and msg.tool_calls is None:
continue
if msg.role in ("user", "assistant", "developer"):
input_messages.append(InputMessage(role=msg.role, content=content))
# Build full message dict for chat template (preserves tool_calls etc.)
# Normalize content for model_dump
msg_copy = msg.model_copy(update={"content": content})
dumped: dict[str, Any] = msg_copy.model_dump(exclude_none=True)
chat_template_messages.append(dumped)
return TextGenerationTaskParams(
model=request.model,
input=input_messages
if input_messages
else [InputMessage(role="user", content="")],
instructions=instructions,
max_output_tokens=request.max_tokens,
temperature=request.temperature,
top_p=request.top_p,
top_k=request.top_k,
stop=request.stop,
seed=request.seed,
stream=request.stream,
tools=request.tools,
chat_template_messages=chat_template_messages
if chat_template_messages
else None,
)
def chunk_to_response(
chunk: TokenChunk, command_id: CommandId
) -> ChatCompletionResponse:
"""Convert a TokenChunk to a streaming ChatCompletionResponse."""
return ChatCompletionResponse(
id=command_id,
created=int(time.time()),
model=chunk.model,
choices=[
StreamingChoiceResponse(
index=0,
delta=ChatCompletionMessage(role="assistant", content=chunk.text),
finish_reason=chunk.finish_reason,
)
],
)
async def generate_chat_stream(
command_id: CommandId,
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> AsyncGenerator[str, None]:
"""Generate Chat Completions API streaming events from chunks."""
async for chunk in chunk_stream:
if isinstance(chunk, ErrorChunk):
error_response = ErrorResponse(
error=ErrorInfo(
message=chunk.error_message or "Internal server error",
type="InternalServerError",
code=500,
)
)
yield f"data: {error_response.model_dump_json()}\n\n"
yield "data: [DONE]\n\n"
return
if isinstance(chunk, ToolCallChunk):
tool_call_deltas = [
ToolCall(
id=str(uuid4()),
index=i,
function=tool,
)
for i, tool in enumerate(chunk.tool_calls)
]
tool_response = ChatCompletionResponse(
id=command_id,
created=int(time.time()),
model=chunk.model,
choices=[
StreamingChoiceResponse(
index=0,
delta=ChatCompletionMessage(
role="assistant",
tool_calls=tool_call_deltas,
),
finish_reason="tool_calls",
)
],
)
yield f"data: {tool_response.model_dump_json()}\n\n"
yield "data: [DONE]\n\n"
return
chunk_response = chunk_to_response(chunk, command_id)
yield f"data: {chunk_response.model_dump_json()}\n\n"
if chunk.finish_reason is not None:
yield "data: [DONE]\n\n"
async def collect_chat_response(
command_id: CommandId,
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> ChatCompletionResponse:
"""Collect all token chunks and return a single ChatCompletionResponse."""
text_parts: list[str] = []
tool_calls: list[ToolCall] = []
model: str | None = None
finish_reason: FinishReason | None = None
error_message: str | None = None
async for chunk in chunk_stream:
if isinstance(chunk, ErrorChunk):
error_message = chunk.error_message or "Internal server error"
break
if model is None:
model = chunk.model
if isinstance(chunk, TokenChunk):
text_parts.append(chunk.text)
if isinstance(chunk, ToolCallChunk):
tool_calls.extend(
ToolCall(
id=str(uuid4()),
index=i,
function=tool,
)
for i, tool in enumerate(chunk.tool_calls)
)
if chunk.finish_reason is not None:
finish_reason = chunk.finish_reason
if error_message is not None:
raise ValueError(error_message)
combined_text = "".join(text_parts)
assert model is not None
return ChatCompletionResponse(
id=command_id,
created=int(time.time()),
model=model,
choices=[
ChatCompletionChoice(
index=0,
message=ChatCompletionMessage(
role="assistant",
content=combined_text,
tool_calls=tool_calls if tool_calls else None,
),
finish_reason=finish_reason,
)
],
)

View File

@@ -0,0 +1,323 @@
"""Claude Messages API adapter for converting requests/responses."""
import json
from collections.abc import AsyncGenerator
from typing import Any
from uuid import uuid4
from exo.shared.types.api import FinishReason
from exo.shared.types.chunks import ErrorChunk, TokenChunk, ToolCallChunk
from exo.shared.types.claude_api import (
ClaudeContentBlock,
ClaudeContentBlockDeltaEvent,
ClaudeContentBlockStartEvent,
ClaudeContentBlockStopEvent,
ClaudeInputJsonDelta,
ClaudeMessageDelta,
ClaudeMessageDeltaEvent,
ClaudeMessageDeltaUsage,
ClaudeMessagesRequest,
ClaudeMessagesResponse,
ClaudeMessageStart,
ClaudeMessageStartEvent,
ClaudeMessageStopEvent,
ClaudeStopReason,
ClaudeTextBlock,
ClaudeTextDelta,
ClaudeToolResultBlock,
ClaudeToolUseBlock,
ClaudeUsage,
)
from exo.shared.types.common import CommandId
from exo.shared.types.text_generation import InputMessage, TextGenerationTaskParams
def finish_reason_to_claude_stop_reason(
finish_reason: FinishReason | None,
) -> ClaudeStopReason | None:
"""Map OpenAI finish_reason to Claude stop_reason."""
if finish_reason is None:
return None
mapping: dict[FinishReason, ClaudeStopReason] = {
"stop": "end_turn",
"length": "max_tokens",
"tool_calls": "tool_use",
"content_filter": "end_turn",
"function_call": "tool_use",
}
return mapping.get(finish_reason, "end_turn")
def _extract_tool_result_text(block: ClaudeToolResultBlock) -> str:
"""Extract plain text from a tool_result content field."""
if block.content is None:
return ""
if isinstance(block.content, str):
return block.content
return "".join(sub_block.text for sub_block in block.content)
def claude_request_to_text_generation(
request: ClaudeMessagesRequest,
) -> TextGenerationTaskParams:
# Handle system message
instructions: str | None = None
chat_template_messages: list[dict[str, Any]] = []
if request.system:
if isinstance(request.system, str):
instructions = request.system
else:
instructions = "".join(block.text for block in request.system)
chat_template_messages.append({"role": "system", "content": instructions})
# Convert messages to input
input_messages: list[InputMessage] = []
for msg in request.messages:
if isinstance(msg.content, str):
input_messages.append(InputMessage(role=msg.role, content=msg.content))
chat_template_messages.append({"role": msg.role, "content": msg.content})
continue
# Process structured content blocks
text_parts: list[str] = []
tool_calls: list[dict[str, Any]] = []
tool_results: list[ClaudeToolResultBlock] = []
for block in msg.content:
if isinstance(block, ClaudeTextBlock):
text_parts.append(block.text)
elif isinstance(block, ClaudeToolUseBlock):
tool_calls.append(
{
"id": block.id,
"type": "function",
"function": {
"name": block.name,
"arguments": json.dumps(block.input),
},
}
)
elif isinstance(block, ClaudeToolResultBlock):
tool_results.append(block)
content = "".join(text_parts)
# Build InputMessage from text content
if msg.role in ("user", "assistant"):
input_messages.append(InputMessage(role=msg.role, content=content))
# Build chat_template_messages preserving tool structure
if tool_calls:
chat_template_messages.append(
{"role": "assistant", "content": content, "tool_calls": tool_calls}
)
elif tool_results:
for tr in tool_results:
chat_template_messages.append(
{
"role": "tool",
"tool_call_id": tr.tool_use_id,
"content": _extract_tool_result_text(tr),
}
)
else:
chat_template_messages.append({"role": msg.role, "content": content})
# Convert Claude tool definitions to OpenAI-style function tools
tools: list[dict[str, Any]] | None = None
if request.tools:
tools = [
{
"type": "function",
"function": {
"name": tool.name,
"description": tool.description or "",
"parameters": tool.input_schema,
},
}
for tool in request.tools
]
return TextGenerationTaskParams(
model=request.model,
input=input_messages
if input_messages
else [InputMessage(role="user", content="")],
instructions=instructions,
max_output_tokens=request.max_tokens,
temperature=request.temperature,
top_p=request.top_p,
top_k=request.top_k,
stop=request.stop_sequences,
stream=request.stream,
tools=tools,
chat_template_messages=chat_template_messages
if chat_template_messages
else None,
)
async def collect_claude_response(
command_id: CommandId,
model: str,
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> ClaudeMessagesResponse:
"""Collect all token chunks and return a single ClaudeMessagesResponse."""
text_parts: list[str] = []
tool_use_blocks: list[ClaudeToolUseBlock] = []
stop_reason: ClaudeStopReason | None = None
last_stats = None
error_message: str | None = None
async for chunk in chunk_stream:
if isinstance(chunk, ErrorChunk):
error_message = chunk.error_message or "Internal server error"
break
if isinstance(chunk, ToolCallChunk):
for tool in chunk.tool_calls:
tool_use_blocks.append(
ClaudeToolUseBlock(
id=f"toolu_{uuid4().hex[:24]}",
name=tool.name,
input=json.loads(tool.arguments), # pyright: ignore[reportAny]
)
)
last_stats = chunk.stats or last_stats
stop_reason = "tool_use"
continue
text_parts.append(chunk.text)
last_stats = chunk.stats or last_stats
if chunk.finish_reason is not None:
stop_reason = finish_reason_to_claude_stop_reason(chunk.finish_reason)
if error_message is not None:
raise ValueError(error_message)
combined_text = "".join(text_parts)
# Build content blocks
content: list[ClaudeContentBlock] = []
if combined_text:
content.append(ClaudeTextBlock(text=combined_text))
content.extend(tool_use_blocks)
# If no content at all, include empty text block
if not content:
content.append(ClaudeTextBlock(text=""))
# Use actual usage data from stats if available
input_tokens = last_stats.prompt_tokens if last_stats else 0
output_tokens = last_stats.generation_tokens if last_stats else 0
return ClaudeMessagesResponse(
id=f"msg_{command_id}",
model=model,
content=content,
stop_reason=stop_reason,
usage=ClaudeUsage(
input_tokens=input_tokens,
output_tokens=output_tokens,
),
)
async def generate_claude_stream(
command_id: CommandId,
model: str,
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> AsyncGenerator[str, None]:
"""Generate Claude Messages API streaming events from TokenChunks."""
# Initial message_start event
initial_message = ClaudeMessageStart(
id=f"msg_{command_id}",
model=model,
content=[],
stop_reason=None,
usage=ClaudeUsage(input_tokens=0, output_tokens=0),
)
start_event = ClaudeMessageStartEvent(message=initial_message)
yield f"event: message_start\ndata: {start_event.model_dump_json()}\n\n"
# content_block_start for text block at index 0
block_start = ClaudeContentBlockStartEvent(
index=0, content_block=ClaudeTextBlock(text="")
)
yield f"event: content_block_start\ndata: {block_start.model_dump_json()}\n\n"
output_tokens = 0
stop_reason: ClaudeStopReason | None = None
last_stats = None
next_block_index = 1 # text block is 0, tool blocks start at 1
async for chunk in chunk_stream:
if isinstance(chunk, ErrorChunk):
# Close text block and bail
break
if isinstance(chunk, ToolCallChunk):
last_stats = chunk.stats or last_stats
stop_reason = "tool_use"
# Emit tool_use content blocks
for tool in chunk.tool_calls:
tool_id = f"toolu_{uuid4().hex[:24]}"
tool_input_json = tool.arguments
# content_block_start for tool_use
tool_block_start = ClaudeContentBlockStartEvent(
index=next_block_index,
content_block=ClaudeToolUseBlock(
id=tool_id, name=tool.name, input={}
),
)
yield f"event: content_block_start\ndata: {tool_block_start.model_dump_json()}\n\n"
# content_block_delta with input_json_delta
tool_delta_event = ClaudeContentBlockDeltaEvent(
index=next_block_index,
delta=ClaudeInputJsonDelta(partial_json=tool_input_json),
)
yield f"event: content_block_delta\ndata: {tool_delta_event.model_dump_json()}\n\n"
# content_block_stop
tool_block_stop = ClaudeContentBlockStopEvent(index=next_block_index)
yield f"event: content_block_stop\ndata: {tool_block_stop.model_dump_json()}\n\n"
next_block_index += 1
continue
output_tokens += 1 # Count each chunk as one token
last_stats = chunk.stats or last_stats
# content_block_delta
delta_event = ClaudeContentBlockDeltaEvent(
index=0,
delta=ClaudeTextDelta(text=chunk.text),
)
yield f"event: content_block_delta\ndata: {delta_event.model_dump_json()}\n\n"
if chunk.finish_reason is not None:
stop_reason = finish_reason_to_claude_stop_reason(chunk.finish_reason)
# Use actual token count from stats if available
if last_stats is not None:
output_tokens = last_stats.generation_tokens
# content_block_stop for text block
block_stop = ClaudeContentBlockStopEvent(index=0)
yield f"event: content_block_stop\ndata: {block_stop.model_dump_json()}\n\n"
# message_delta
message_delta = ClaudeMessageDeltaEvent(
delta=ClaudeMessageDelta(stop_reason=stop_reason),
usage=ClaudeMessageDeltaUsage(output_tokens=output_tokens),
)
yield f"event: message_delta\ndata: {message_delta.model_dump_json()}\n\n"
# message_stop
message_stop = ClaudeMessageStopEvent()
yield f"event: message_stop\ndata: {message_stop.model_dump_json()}\n\n"

View File

@@ -0,0 +1,373 @@
"""OpenAI Responses API adapter for converting requests/responses."""
from collections.abc import AsyncGenerator
from itertools import count
from typing import Any
from uuid import uuid4
from exo.shared.types.chunks import ErrorChunk, TokenChunk, ToolCallChunk
from exo.shared.types.common import CommandId
from exo.shared.types.openai_responses import (
FunctionCallInputItem,
ResponseCompletedEvent,
ResponseContentPart,
ResponseContentPartAddedEvent,
ResponseContentPartDoneEvent,
ResponseCreatedEvent,
ResponseFunctionCallArgumentsDeltaEvent,
ResponseFunctionCallArgumentsDoneEvent,
ResponseFunctionCallItem,
ResponseInProgressEvent,
ResponseInputMessage,
ResponseItem,
ResponseMessageItem,
ResponseOutputItemAddedEvent,
ResponseOutputItemDoneEvent,
ResponseOutputText,
ResponsesRequest,
ResponsesResponse,
ResponseTextDeltaEvent,
ResponseTextDoneEvent,
ResponseUsage,
)
from exo.shared.types.text_generation import InputMessage, TextGenerationTaskParams
def _extract_content(content: str | list[ResponseContentPart]) -> str:
"""Extract plain text from a content field that may be a string or list of parts."""
if isinstance(content, str):
return content
return "".join(part.text for part in content)
def responses_request_to_text_generation(
request: ResponsesRequest,
) -> TextGenerationTaskParams:
input_value: list[InputMessage]
built_chat_template: list[dict[str, Any]] | None = None
if isinstance(request.input, str):
input_value = [InputMessage(role="user", content=request.input)]
else:
input_messages: list[InputMessage] = []
chat_template_messages: list[dict[str, Any]] = []
if request.instructions is not None:
chat_template_messages.append(
{"role": "system", "content": request.instructions}
)
for item in request.input:
if isinstance(item, ResponseInputMessage):
content = _extract_content(item.content)
if item.role in ("user", "assistant", "developer"):
input_messages.append(InputMessage(role=item.role, content=content))
if item.role == "system":
chat_template_messages.append(
{"role": "system", "content": content}
)
else:
chat_template_messages.append(
{"role": item.role, "content": content}
)
elif isinstance(item, FunctionCallInputItem):
chat_template_messages.append(
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": item.call_id,
"type": "function",
"function": {
"name": item.name,
"arguments": item.arguments,
},
}
],
}
)
else:
chat_template_messages.append(
{
"role": "tool",
"tool_call_id": item.call_id,
"content": item.output,
}
)
input_value = (
input_messages
if input_messages
else [InputMessage(role="user", content="")]
)
built_chat_template = chat_template_messages if chat_template_messages else None
return TextGenerationTaskParams(
model=request.model,
input=input_value,
instructions=request.instructions,
max_output_tokens=request.max_output_tokens,
temperature=request.temperature,
top_p=request.top_p,
stream=request.stream,
tools=request.tools,
top_k=request.top_k,
stop=request.stop,
seed=request.seed,
chat_template_messages=built_chat_template or request.chat_template_messages,
)
async def collect_responses_response(
command_id: CommandId,
model: str,
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> ResponsesResponse:
"""Collect all token chunks and return a single ResponsesResponse."""
response_id = f"resp_{command_id}"
item_id = f"item_{command_id}"
accumulated_text = ""
function_call_items: list[ResponseFunctionCallItem] = []
last_stats = None
error_message: str | None = None
async for chunk in chunk_stream:
if isinstance(chunk, ErrorChunk):
error_message = chunk.error_message or "Internal server error"
break
if isinstance(chunk, ToolCallChunk):
for tool in chunk.tool_calls:
function_call_items.append(
ResponseFunctionCallItem(
id=f"fc_{uuid4().hex[:24]}",
call_id=f"call_{uuid4().hex[:24]}",
name=tool.name,
arguments=tool.arguments,
)
)
last_stats = chunk.stats or last_stats
continue
accumulated_text += chunk.text
last_stats = chunk.stats or last_stats
if error_message is not None:
raise ValueError(error_message)
# Create usage from stats if available
usage = None
if last_stats is not None:
usage = ResponseUsage(
input_tokens=last_stats.prompt_tokens,
output_tokens=last_stats.generation_tokens,
total_tokens=last_stats.prompt_tokens + last_stats.generation_tokens,
)
output: list[ResponseItem] = [
ResponseMessageItem(
id=item_id,
content=[ResponseOutputText(text=accumulated_text)],
status="completed",
)
]
output.extend(function_call_items)
return ResponsesResponse(
id=response_id,
model=model,
status="completed",
output=output,
output_text=accumulated_text,
usage=usage,
)
async def generate_responses_stream(
command_id: CommandId,
model: str,
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> AsyncGenerator[str, None]:
"""Generate OpenAI Responses API streaming events from TokenChunks."""
response_id = f"resp_{command_id}"
item_id = f"item_{command_id}"
seq = count(1)
# response.created
initial_response = ResponsesResponse(
id=response_id,
model=model,
status="in_progress",
output=[],
output_text="",
)
created_event = ResponseCreatedEvent(
sequence_number=next(seq), response=initial_response
)
yield f"event: response.created\ndata: {created_event.model_dump_json()}\n\n"
# response.in_progress
in_progress_event = ResponseInProgressEvent(
sequence_number=next(seq), response=initial_response
)
yield f"event: response.in_progress\ndata: {in_progress_event.model_dump_json()}\n\n"
# response.output_item.added
initial_item = ResponseMessageItem(
id=item_id,
content=[ResponseOutputText(text="")],
status="in_progress",
)
item_added = ResponseOutputItemAddedEvent(
sequence_number=next(seq), output_index=0, item=initial_item
)
yield f"event: response.output_item.added\ndata: {item_added.model_dump_json()}\n\n"
# response.content_part.added
initial_part = ResponseOutputText(text="")
part_added = ResponseContentPartAddedEvent(
sequence_number=next(seq),
item_id=item_id,
output_index=0,
content_index=0,
part=initial_part,
)
yield f"event: response.content_part.added\ndata: {part_added.model_dump_json()}\n\n"
accumulated_text = ""
function_call_items: list[ResponseFunctionCallItem] = []
last_stats = None
next_output_index = 1 # message item is at 0
async for chunk in chunk_stream:
if isinstance(chunk, ErrorChunk):
break
if isinstance(chunk, ToolCallChunk):
last_stats = chunk.stats or last_stats
for tool in chunk.tool_calls:
fc_id = f"fc_{uuid4().hex[:24]}"
call_id = f"call_{uuid4().hex[:24]}"
# response.output_item.added for function_call
fc_item = ResponseFunctionCallItem(
id=fc_id,
call_id=call_id,
name=tool.name,
arguments="",
status="in_progress",
)
fc_added = ResponseOutputItemAddedEvent(
sequence_number=next(seq),
output_index=next_output_index,
item=fc_item,
)
yield f"event: response.output_item.added\ndata: {fc_added.model_dump_json()}\n\n"
# response.function_call_arguments.delta
args_delta = ResponseFunctionCallArgumentsDeltaEvent(
sequence_number=next(seq),
item_id=fc_id,
output_index=next_output_index,
delta=tool.arguments,
)
yield f"event: response.function_call_arguments.delta\ndata: {args_delta.model_dump_json()}\n\n"
# response.function_call_arguments.done
args_done = ResponseFunctionCallArgumentsDoneEvent(
sequence_number=next(seq),
item_id=fc_id,
output_index=next_output_index,
name=tool.name,
arguments=tool.arguments,
)
yield f"event: response.function_call_arguments.done\ndata: {args_done.model_dump_json()}\n\n"
# response.output_item.done
fc_done_item = ResponseFunctionCallItem(
id=fc_id,
call_id=call_id,
name=tool.name,
arguments=tool.arguments,
status="completed",
)
fc_item_done = ResponseOutputItemDoneEvent(
sequence_number=next(seq),
output_index=next_output_index,
item=fc_done_item,
)
yield f"event: response.output_item.done\ndata: {fc_item_done.model_dump_json()}\n\n"
function_call_items.append(fc_done_item)
next_output_index += 1
continue
accumulated_text += chunk.text
last_stats = chunk.stats or last_stats
# response.output_text.delta
delta_event = ResponseTextDeltaEvent(
sequence_number=next(seq),
item_id=item_id,
output_index=0,
content_index=0,
delta=chunk.text,
)
yield f"event: response.output_text.delta\ndata: {delta_event.model_dump_json()}\n\n"
# response.output_text.done
text_done = ResponseTextDoneEvent(
sequence_number=next(seq),
item_id=item_id,
output_index=0,
content_index=0,
text=accumulated_text,
)
yield f"event: response.output_text.done\ndata: {text_done.model_dump_json()}\n\n"
# response.content_part.done
final_part = ResponseOutputText(text=accumulated_text)
part_done = ResponseContentPartDoneEvent(
sequence_number=next(seq),
item_id=item_id,
output_index=0,
content_index=0,
part=final_part,
)
yield f"event: response.content_part.done\ndata: {part_done.model_dump_json()}\n\n"
# response.output_item.done
final_message_item = ResponseMessageItem(
id=item_id,
content=[ResponseOutputText(text=accumulated_text)],
status="completed",
)
item_done = ResponseOutputItemDoneEvent(
sequence_number=next(seq), output_index=0, item=final_message_item
)
yield f"event: response.output_item.done\ndata: {item_done.model_dump_json()}\n\n"
# Create usage from stats if available
usage = None
if last_stats is not None:
usage = ResponseUsage(
input_tokens=last_stats.prompt_tokens,
output_tokens=last_stats.generation_tokens,
total_tokens=last_stats.prompt_tokens + last_stats.generation_tokens,
)
# response.completed
output: list[ResponseItem] = [final_message_item]
output.extend(function_call_items)
final_response = ResponsesResponse(
id=response_id,
model=model,
status="completed",
output=output,
output_text=accumulated_text,
usage=usage,
)
completed_event = ResponseCompletedEvent(
sequence_number=next(seq), response=final_response
)
yield f"event: response.completed\ndata: {completed_event.model_dump_json()}\n\n"

View File

@@ -1,9 +1,12 @@
import base64
import contextlib
import json
import random
import time
from collections.abc import AsyncGenerator
from collections.abc import AsyncGenerator, Awaitable, Callable
from datetime import datetime, timezone
from http import HTTPStatus
from pathlib import Path
from typing import Annotated, Literal, cast
from uuid import uuid4
@@ -19,28 +22,50 @@ from hypercorn.config import Config
from hypercorn.typing import ASGIFramework
from loguru import logger
from exo.master.adapters.chat_completions import (
chat_request_to_text_generation,
collect_chat_response,
generate_chat_stream,
)
from exo.master.adapters.claude import (
claude_request_to_text_generation,
collect_claude_response,
generate_claude_stream,
)
from exo.master.adapters.responses import (
collect_responses_response,
generate_responses_stream,
responses_request_to_text_generation,
)
from exo.master.image_store import ImageStore
from exo.master.placement import place_instance as get_instance_placements
from exo.shared.apply import apply
from exo.shared.constants import (
DASHBOARD_DIR,
EXO_IMAGE_CACHE_DIR,
EXO_MAX_CHUNK_SIZE,
EXO_TRACING_CACHE_DIR,
)
from exo.shared.election import ElectionMessage
from exo.shared.logging import InterceptLogger
from exo.shared.models.model_cards import (
MODEL_CARDS,
ModelCard,
ModelId,
delete_custom_card,
get_model_cards,
is_custom_card,
)
from exo.shared.tracing import TraceEvent, compute_stats, export_trace, load_trace_file
from exo.shared.types.api import (
AddCustomModelParams,
AdvancedImageParams,
BenchChatCompletionRequest,
BenchChatCompletionResponse,
BenchChatCompletionTaskParams,
BenchImageGenerationResponse,
BenchImageGenerationTaskParams,
ChatCompletionChoice,
ChatCompletionMessage,
ChatCompletionRequest,
ChatCompletionResponse,
CreateInstanceParams,
CreateInstanceResponse,
@@ -50,8 +75,9 @@ from exo.shared.types.api import (
ErrorResponse,
FinishReason,
GenerationStats,
HuggingFaceSearchResult,
ImageData,
ImageEditsInternalParams,
ImageEditsTaskParams,
ImageGenerationResponse,
ImageGenerationStats,
ImageGenerationTaskParams,
@@ -64,10 +90,14 @@ from exo.shared.types.api import (
PlacementPreviewResponse,
StartDownloadParams,
StartDownloadResponse,
StreamingChoiceResponse,
StreamOptions,
ToolCall,
Usage,
TraceCategoryStats,
TraceEventResponse,
TraceListItem,
TraceListResponse,
TraceRankStats,
TraceResponse,
TraceStatsResponse,
)
from exo.shared.types.chunks import (
ErrorChunk,
@@ -76,8 +106,11 @@ from exo.shared.types.chunks import (
TokenChunk,
ToolCallChunk,
)
from exo.shared.types.claude_api import (
ClaudeMessagesRequest,
ClaudeMessagesResponse,
)
from exo.shared.types.commands import (
ChatCompletion,
Command,
CreateInstance,
DeleteDownload,
@@ -91,6 +124,7 @@ from exo.shared.types.commands import (
SendInputChunk,
StartDownload,
TaskFinished,
TextGeneration,
)
from exo.shared.types.common import CommandId, Id, NodeId, SessionId
from exo.shared.types.events import (
@@ -98,15 +132,18 @@ from exo.shared.types.events import (
Event,
ForwarderEvent,
IndexedEvent,
TracesMerged,
)
from exo.shared.types.memory import Memory
from exo.shared.types.openai_responses import (
ResponsesRequest,
ResponsesResponse,
)
from exo.shared.types.state import State
from exo.shared.types.tasks import ChatCompletionTaskParams
from exo.shared.types.worker.instances import Instance, InstanceId, InstanceMeta
from exo.shared.types.worker.shards import Sharding
from exo.utils.banner import print_startup_banner
from exo.utils.channels import Receiver, Sender, channel
from exo.utils.dashboard_path import find_dashboard
from exo.utils.event_buffer import OrderedBuffer
@@ -114,36 +151,13 @@ def _format_to_content_type(image_format: Literal["png", "jpeg", "webp"] | None)
return f"image/{image_format or 'png'}"
def chunk_to_response(
chunk: TokenChunk | ToolCallChunk,
command_id: CommandId,
usage: Usage | None,
) -> ChatCompletionResponse:
return ChatCompletionResponse(
id=command_id,
created=int(time.time()),
model=chunk.model,
choices=[
StreamingChoiceResponse(
index=0,
delta=ChatCompletionMessage(role="assistant", content=chunk.text)
if isinstance(chunk, TokenChunk)
else ChatCompletionMessage(
role="assistant",
tool_calls=[
ToolCall(
id=str(uuid4()),
index=i,
function=tool,
)
for i, tool in enumerate(chunk.tool_calls)
],
),
finish_reason=chunk.finish_reason,
)
],
usage=usage,
)
def _ensure_seed(params: AdvancedImageParams | None) -> AdvancedImageParams:
"""Ensure advanced params has a seed set for distributed consistency."""
if params is None:
return AdvancedImageParams(seed=random.randint(0, 2**32 - 1))
if params.seed is None:
return params.model_copy(update={"seed": random.randint(0, 2**32 - 1)})
return params
class API:
@@ -176,6 +190,15 @@ class API:
self.paused_ev: anyio.Event = anyio.Event()
self.app = FastAPI()
@self.app.middleware("http")
async def _log_requests( # pyright: ignore[reportUnusedFunction]
request: Request,
call_next: Callable[[Request], Awaitable[StreamingResponse]],
) -> StreamingResponse:
logger.debug(f"API request: {request.method} {request.url.path}")
return await call_next(request)
self._setup_exception_handlers()
self._setup_cors()
self._setup_routes()
@@ -183,13 +206,13 @@ class API:
self.app.mount(
"/",
StaticFiles(
directory=find_dashboard(),
directory=DASHBOARD_DIR,
html=True,
),
name="dashboard",
)
self._chat_completion_queues: dict[
self._text_generation_queues: dict[
CommandId, Sender[TokenChunk | ErrorChunk | ToolCallChunk]
] = {}
self._image_generation_queues: dict[
@@ -203,7 +226,7 @@ class API:
self.state = State()
self.session_id = new_session_id
self.event_buffer = OrderedBuffer[Event]()
self._chat_completion_queues = {}
self._text_generation_queues = {}
self._image_generation_queues = {}
self.unpause(result_clock)
@@ -248,6 +271,9 @@ class API:
self.app.delete("/instance/{instance_id}")(self.delete_instance)
self.app.get("/models")(self.get_models)
self.app.get("/v1/models")(self.get_models)
self.app.post("/models/add")(self.add_custom_model)
self.app.delete("/models/custom/{model_id:path}")(self.delete_custom_model)
self.app.get("/models/search")(self.search_models)
self.app.post("/v1/chat/completions", response_model=None)(
self.chat_completions
)
@@ -260,10 +286,16 @@ class API:
self.app.post("/bench/images/edits")(self.bench_image_edits)
self.app.get("/images")(self.list_images)
self.app.get("/images/{image_id}")(self.get_image)
self.app.post("/v1/messages", response_model=None)(self.claude_messages)
self.app.post("/v1/responses", response_model=None)(self.openai_responses)
self.app.get("/state")(lambda: self.state)
self.app.get("/events")(lambda: self._event_log)
self.app.post("/download/start")(self.start_download)
self.app.delete("/download/{node_id}/{model_id:path}")(self.delete_download)
self.app.get("/v1/traces")(self.list_traces)
self.app.get("/v1/traces/{task_id}")(self.get_trace)
self.app.get("/v1/traces/{task_id}/stats")(self.get_trace_stats)
self.app.get("/v1/traces/{task_id}/raw")(self.get_trace_raw)
async def place_instance(self, payload: PlaceInstanceParams):
command = PlaceInstance(
@@ -354,10 +386,7 @@ class API:
if len(list(self.state.topology.list_nodes())) == 0:
return PlacementPreviewResponse(previews=[])
cards = [card for card in MODEL_CARDS.values() if card.model_id == model_id]
if not cards:
raise HTTPException(status_code=404, detail=f"Model {model_id} not found")
model_card = await ModelCard.load(model_id)
instance_combinations: list[tuple[Sharding, InstanceMeta, int]] = []
for sharding in (Sharding.Pipeline, Sharding.Tensor):
for instance_meta in (InstanceMeta.MlxRing, InstanceMeta.MlxJaccl):
@@ -372,96 +401,93 @@ class API:
# TODO: PDD
# instance_combinations.append((Sharding.PrefillDecodeDisaggregation, InstanceMeta.MlxRing, 1))
for model_card in cards:
for sharding, instance_meta, min_nodes in instance_combinations:
try:
placements = get_instance_placements(
PlaceInstance(
model_card=model_card,
sharding=sharding,
instance_meta=instance_meta,
min_nodes=min_nodes,
),
node_memory=self.state.node_memory,
node_network=self.state.node_network,
topology=self.state.topology,
current_instances=self.state.instances,
required_nodes=required_nodes,
)
except ValueError as exc:
if (model_card.model_id, sharding, instance_meta, 0) not in seen:
previews.append(
PlacementPreview(
model_id=model_card.model_id,
sharding=sharding,
instance_meta=instance_meta,
instance=None,
error=str(exc),
)
)
seen.add((model_card.model_id, sharding, instance_meta, 0))
continue
current_ids = set(self.state.instances.keys())
new_instances = [
instance
for instance_id, instance in placements.items()
if instance_id not in current_ids
]
if len(new_instances) != 1:
if (model_card.model_id, sharding, instance_meta, 0) not in seen:
previews.append(
PlacementPreview(
model_id=model_card.model_id,
sharding=sharding,
instance_meta=instance_meta,
instance=None,
error="Expected exactly one new instance from placement",
)
)
seen.add((model_card.model_id, sharding, instance_meta, 0))
continue
instance = new_instances[0]
shard_assignments = instance.shard_assignments
placement_node_ids = list(shard_assignments.node_to_runner.keys())
memory_delta_by_node: dict[str, int] = {}
if placement_node_ids:
total_bytes = model_card.storage_size.in_bytes
per_node = total_bytes // len(placement_node_ids)
remainder = total_bytes % len(placement_node_ids)
for index, node_id in enumerate(
sorted(placement_node_ids, key=str)
):
extra = 1 if index < remainder else 0
memory_delta_by_node[str(node_id)] = per_node + extra
if (
model_card.model_id,
sharding,
instance_meta,
len(placement_node_ids),
) not in seen:
for sharding, instance_meta, min_nodes in instance_combinations:
try:
placements = get_instance_placements(
PlaceInstance(
model_card=model_card,
sharding=sharding,
instance_meta=instance_meta,
min_nodes=min_nodes,
),
node_memory=self.state.node_memory,
node_network=self.state.node_network,
topology=self.state.topology,
current_instances=self.state.instances,
required_nodes=required_nodes,
)
except ValueError as exc:
if (model_card.model_id, sharding, instance_meta, 0) not in seen:
previews.append(
PlacementPreview(
model_id=model_card.model_id,
sharding=sharding,
instance_meta=instance_meta,
instance=instance,
memory_delta_by_node=memory_delta_by_node or None,
error=None,
instance=None,
error=str(exc),
)
)
seen.add(
(
model_card.model_id,
sharding,
instance_meta,
len(placement_node_ids),
seen.add((model_card.model_id, sharding, instance_meta, 0))
continue
current_ids = set(self.state.instances.keys())
new_instances = [
instance
for instance_id, instance in placements.items()
if instance_id not in current_ids
]
if len(new_instances) != 1:
if (model_card.model_id, sharding, instance_meta, 0) not in seen:
previews.append(
PlacementPreview(
model_id=model_card.model_id,
sharding=sharding,
instance_meta=instance_meta,
instance=None,
error="Expected exactly one new instance from placement",
)
)
seen.add((model_card.model_id, sharding, instance_meta, 0))
continue
instance = new_instances[0]
shard_assignments = instance.shard_assignments
placement_node_ids = list(shard_assignments.node_to_runner.keys())
memory_delta_by_node: dict[str, int] = {}
if placement_node_ids:
total_bytes = model_card.storage_size.in_bytes
per_node = total_bytes // len(placement_node_ids)
remainder = total_bytes % len(placement_node_ids)
for index, node_id in enumerate(sorted(placement_node_ids, key=str)):
extra = 1 if index < remainder else 0
memory_delta_by_node[str(node_id)] = per_node + extra
if (
model_card.model_id,
sharding,
instance_meta,
len(placement_node_ids),
) not in seen:
previews.append(
PlacementPreview(
model_id=model_card.model_id,
sharding=sharding,
instance_meta=instance_meta,
instance=instance,
memory_delta_by_node=memory_delta_by_node or None,
error=None,
)
)
seen.add(
(
model_card.model_id,
sharding,
instance_meta,
len(placement_node_ids),
)
)
return PlacementPreviewResponse(previews=previews)
@@ -484,13 +510,15 @@ class API:
instance_id=instance_id,
)
async def _chat_chunk_stream(
async def _token_chunk_stream(
self, command_id: CommandId
) -> AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None]:
"""Yield `TokenChunk`s for a given command until completion."""
"""Yield chunks for a given command until completion.
This is the internal low-level stream used by all API adapters.
"""
try:
self._chat_completion_queues[command_id], recv = channel[
self._text_generation_queues[command_id], recv = channel[
ErrorChunk | ToolCallChunk | TokenChunk
]()
@@ -511,103 +539,10 @@ class API:
finally:
command = TaskFinished(finished_command_id=command_id)
await self._send(command)
if command_id in self._chat_completion_queues:
del self._chat_completion_queues[command_id]
if command_id in self._text_generation_queues:
del self._text_generation_queues[command_id]
async def _generate_chat_stream(
self, command_id: CommandId, stream_options: StreamOptions | None = None
) -> AsyncGenerator[str, None]:
"""Generate chat completion stream as JSON strings."""
include_usage = stream_options.include_usage if stream_options else False
async for chunk in self._chat_chunk_stream(command_id):
assert not isinstance(chunk, ImageChunk)
if chunk.finish_reason == "error":
error_response = ErrorResponse(
error=ErrorInfo(
message=chunk.error_message or "Internal server error",
type="InternalServerError",
code=500,
)
)
yield f"data: {error_response.model_dump_json()}\n\n"
yield "data: [DONE]\n\n"
return
usage = chunk.usage if include_usage else None
chunk_response: ChatCompletionResponse = chunk_to_response(
chunk, command_id, usage=usage
)
logger.debug(f"chunk_response: {chunk_response}")
yield f"data: {chunk_response.model_dump_json()}\n\n"
if chunk.finish_reason is not None:
yield "data: [DONE]\n\n"
async def _collect_chat_completion(
self, command_id: CommandId
) -> ChatCompletionResponse:
"""Collect all token chunks for a chat completion and return a single response."""
text_parts: list[str] = []
tool_calls: list[ToolCall] = []
model: ModelId | None = None
finish_reason: FinishReason | None = None
usage: Usage | None = None
async for chunk in self._chat_chunk_stream(command_id):
if isinstance(chunk, ErrorChunk):
raise HTTPException(
status_code=500,
detail=chunk.error_message or "Internal server error",
)
if model is None:
model = chunk.model
if isinstance(chunk, TokenChunk):
text_parts.append(chunk.text)
if isinstance(chunk, ToolCallChunk):
tool_calls.extend(
ToolCall(
id=str(uuid4()),
index=i,
function=tool,
)
for i, tool in enumerate(chunk.tool_calls)
)
if chunk.usage is not None:
usage = chunk.usage
if chunk.finish_reason is not None:
finish_reason = chunk.finish_reason
combined_text = "".join(text_parts)
assert model is not None
return ChatCompletionResponse(
id=command_id,
created=int(time.time()),
model=model,
choices=[
ChatCompletionChoice(
index=0,
message=ChatCompletionMessage(
role="assistant",
content=combined_text,
tool_calls=tool_calls,
),
finish_reason=finish_reason,
)
],
usage=usage,
)
async def _collect_chat_completion_with_stats(
async def _collect_text_generation_with_stats(
self, command_id: CommandId
) -> BenchChatCompletionResponse:
text_parts: list[str] = []
@@ -617,7 +552,7 @@ class API:
stats: GenerationStats | None = None
async for chunk in self._chat_chunk_stream(command_id):
async for chunk in self._token_chunk_stream(command_id):
if chunk.finish_reason == "error":
raise HTTPException(
status_code=500,
@@ -656,7 +591,9 @@ class API:
ChatCompletionChoice(
index=0,
message=ChatCompletionMessage(
role="assistant", content=combined_text, tool_calls=tool_calls
role="assistant",
content=combined_text,
tool_calls=tool_calls if tool_calls else None,
),
finish_reason=finish_reason,
)
@@ -671,56 +608,65 @@ class API:
)
async def chat_completions(
self, payload: ChatCompletionTaskParams
self, payload: ChatCompletionRequest
) -> ChatCompletionResponse | StreamingResponse:
"""Handle chat completions, supporting both streaming and non-streaming responses."""
model_card = await ModelCard.load(ModelId(payload.model))
payload.model = model_card.model_id
if not any(
instance.shard_assignments.model_id == payload.model
for instance in self.state.instances.values()
):
await self._trigger_notify_user_to_download_model(payload.model)
raise HTTPException(
status_code=404, detail=f"No instance found for model {payload.model}"
)
command = ChatCompletion(
request_params=payload,
"""OpenAI Chat Completions API - adapter."""
task_params = chat_request_to_text_generation(payload)
resolved_model = await self._resolve_and_validate_text_model(
ModelId(task_params.model)
)
task_params = task_params.model_copy(update={"model": resolved_model})
command = TextGeneration(task_params=task_params)
await self._send(command)
if payload.stream:
return StreamingResponse(
self._generate_chat_stream(command.command_id, payload.stream_options),
generate_chat_stream(
command.command_id,
self._token_chunk_stream(command.command_id),
),
media_type="text/event-stream",
)
return await self._collect_chat_completion(command.command_id)
return await collect_chat_response(
command.command_id,
self._token_chunk_stream(command.command_id),
)
async def bench_chat_completions(
self, payload: BenchChatCompletionTaskParams
self, payload: BenchChatCompletionRequest
) -> BenchChatCompletionResponse:
model_card = await ModelCard.load(ModelId(payload.model))
payload.model = model_card.model_id
task_params = chat_request_to_text_generation(payload)
resolved_model = await self._resolve_and_validate_text_model(
ModelId(task_params.model)
)
task_params = task_params.model_copy(update={"model": resolved_model})
if not any(
instance.shard_assignments.model_id == payload.model
for instance in self.state.instances.values()
):
await self._trigger_notify_user_to_download_model(payload.model)
raise HTTPException(
status_code=404, detail=f"No instance found for model {payload.model}"
)
task_params = task_params.model_copy(update={"stream": False, "bench": True})
payload.stream = False
command = ChatCompletion(request_params=payload)
command = TextGeneration(task_params=task_params)
await self._send(command)
response = await self._collect_chat_completion_with_stats(command.command_id)
response = await self._collect_text_generation_with_stats(command.command_id)
return response
async def _resolve_and_validate_text_model(self, model_id: ModelId) -> ModelId:
"""Validate a text model exists and return the resolved model ID.
Raises HTTPException 404 if no instance is found for the model.
"""
if not any(
instance.shard_assignments.model_id == model_id
for instance in self.state.instances.values()
):
await self._trigger_notify_user_to_download_model(model_id)
raise HTTPException(
status_code=404,
detail=f"No instance found for model {model_id}",
)
return model_id
async def _validate_image_model(self, model: ModelId) -> ModelId:
"""Validate model exists and return resolved model ID.
@@ -773,9 +719,12 @@ class API:
with SSE-formatted events for partial and final images.
"""
payload.model = await self._validate_image_model(ModelId(payload.model))
payload = payload.model_copy(
update={"advanced_params": _ensure_seed(payload.advanced_params)}
)
command = ImageGeneration(
request_params=payload,
task_params=payload,
)
await self._send(command)
@@ -1021,9 +970,12 @@ class API:
payload.stream = False
payload.partial_images = 0
payload = payload.model_copy(
update={"advanced_params": _ensure_seed(payload.advanced_params)}
)
command = ImageGeneration(
request_params=payload,
task_params=payload,
)
await self._send(command)
@@ -1052,6 +1004,7 @@ class API:
) -> ImageEdits:
"""Prepare and send an image edits command with chunked image upload."""
resolved_model = await self._validate_image_model(model)
advanced_params = _ensure_seed(advanced_params)
image_content = await image.read()
image_data = base64.b64encode(image_content).decode("utf-8")
@@ -1065,7 +1018,7 @@ class API:
total_chunks = len(data_chunks)
command = ImageEdits(
request_params=ImageEditsInternalParams(
task_params=ImageEditsTaskParams(
image_data="",
total_input_chunks=total_chunks,
prompt=prompt,
@@ -1209,6 +1162,62 @@ class API:
response_format=response_format,
)
async def claude_messages(
self, payload: ClaudeMessagesRequest
) -> ClaudeMessagesResponse | StreamingResponse:
"""Claude Messages API - adapter."""
task_params = claude_request_to_text_generation(payload)
resolved_model = await self._resolve_and_validate_text_model(
ModelId(task_params.model)
)
task_params = task_params.model_copy(update={"model": resolved_model})
command = TextGeneration(task_params=task_params)
await self._send(command)
if payload.stream:
return StreamingResponse(
generate_claude_stream(
command.command_id,
payload.model,
self._token_chunk_stream(command.command_id),
),
media_type="text/event-stream",
)
return await collect_claude_response(
command.command_id,
payload.model,
self._token_chunk_stream(command.command_id),
)
async def openai_responses(
self, payload: ResponsesRequest
) -> ResponsesResponse | StreamingResponse:
"""OpenAI Responses API."""
task_params = responses_request_to_text_generation(payload)
resolved_model = await self._resolve_and_validate_text_model(task_params.model)
task_params = task_params.model_copy(update={"model": resolved_model})
command = TextGeneration(task_params=task_params)
await self._send(command)
if payload.stream:
return StreamingResponse(
generate_responses_stream(
command.command_id,
payload.model,
self._token_chunk_stream(command.command_id),
),
media_type="text/event-stream",
)
return await collect_responses_response(
command.command_id,
payload.model,
self._token_chunk_stream(command.command_id),
)
def _calculate_total_available_memory(self) -> Memory:
"""Calculate total available memory across all nodes in bytes."""
total_available = Memory()
@@ -1231,11 +1240,70 @@ class API:
storage_size_megabytes=int(card.storage_size.in_mb),
supports_tensor=card.supports_tensor,
tasks=[task.value for task in card.tasks],
is_custom=is_custom_card(card.model_id),
family=card.family,
quantization=card.quantization,
base_model=card.base_model,
capabilities=card.capabilities,
)
for card in MODEL_CARDS.values()
for card in await get_model_cards()
]
)
async def add_custom_model(self, payload: AddCustomModelParams) -> ModelListModel:
"""Fetch a model from HuggingFace and save as a custom model card."""
try:
card = await ModelCard.fetch_from_hf(payload.model_id)
except Exception as exc:
raise HTTPException(
status_code=400, detail=f"Failed to fetch model: {exc}"
) from exc
return ModelListModel(
id=card.model_id,
hugging_face_id=card.model_id,
name=card.model_id.short(),
description="",
tags=[],
storage_size_megabytes=int(card.storage_size.in_mb),
supports_tensor=card.supports_tensor,
tasks=[task.value for task in card.tasks],
is_custom=True,
)
async def delete_custom_model(self, model_id: ModelId) -> JSONResponse:
"""Delete a user-added custom model card."""
deleted = await delete_custom_card(model_id)
if not deleted:
raise HTTPException(status_code=404, detail="Custom model card not found")
return JSONResponse(
{"message": "Model card deleted", "model_id": str(model_id)}
)
async def search_models(
self, query: str = "", limit: int = 20
) -> list[HuggingFaceSearchResult]:
"""Search HuggingFace Hub for mlx-community models."""
from huggingface_hub import list_models
results = list_models(
search=query or None,
author="mlx-community",
sort="downloads",
limit=limit,
)
return [
HuggingFaceSearchResult(
id=m.id,
author=m.author or "",
downloads=m.downloads or 0,
likes=m.likes or 0,
last_modified=str(m.last_modified or ""),
tags=list(m.tags or []),
)
for m in results
]
async def run(self):
cfg = Config()
cfg.bind = f"0.0.0.0:{self.port}"
@@ -1281,14 +1349,32 @@ class API:
self._image_generation_queues.pop(
event.command_id, None
)
if queue := self._chat_completion_queues.get(
if queue := self._text_generation_queues.get(
event.command_id, None
):
assert not isinstance(event.chunk, ImageChunk)
try:
await queue.send(event.chunk)
except BrokenResourceError:
self._chat_completion_queues.pop(event.command_id, None)
self._text_generation_queues.pop(event.command_id, None)
if isinstance(event, TracesMerged):
self._save_merged_trace(event)
def _save_merged_trace(self, event: TracesMerged) -> None:
traces = [
TraceEvent(
name=t.name,
start_us=t.start_us,
duration_us=t.duration_us,
rank=t.rank,
category=t.category,
)
for t in event.traces
]
output_path = EXO_TRACING_CACHE_DIR / f"trace_{event.task_id}.json"
export_trace(traces, output_path)
logger.debug(f"Saved merged trace to {output_path}")
async def _pause_on_new_election(self):
with self.election_receiver as ems:
@@ -1336,3 +1422,103 @@ class API:
)
await self._send_download(command)
return DeleteDownloadResponse(command_id=command.command_id)
def _get_trace_path(self, task_id: str) -> Path:
return EXO_TRACING_CACHE_DIR / f"trace_{task_id}.json"
async def list_traces(self) -> TraceListResponse:
traces: list[TraceListItem] = []
for trace_file in sorted(
EXO_TRACING_CACHE_DIR.glob("trace_*.json"),
key=lambda p: p.stat().st_mtime,
reverse=True,
):
# Extract task_id from filename (trace_{task_id}.json)
task_id = trace_file.stem.removeprefix("trace_")
stat = trace_file.stat()
created_at = datetime.fromtimestamp(
stat.st_mtime, tz=timezone.utc
).isoformat()
traces.append(
TraceListItem(
task_id=task_id,
created_at=created_at,
file_size=stat.st_size,
)
)
return TraceListResponse(traces=traces)
async def get_trace(self, task_id: str) -> TraceResponse:
trace_path = self._get_trace_path(task_id)
if not trace_path.exists():
raise HTTPException(status_code=404, detail=f"Trace not found: {task_id}")
trace_events = load_trace_file(trace_path)
return TraceResponse(
task_id=task_id,
traces=[
TraceEventResponse(
name=event.name,
start_us=event.start_us,
duration_us=event.duration_us,
rank=event.rank,
category=event.category,
)
for event in trace_events
],
)
async def get_trace_stats(self, task_id: str) -> TraceStatsResponse:
trace_path = self._get_trace_path(task_id)
if not trace_path.exists():
raise HTTPException(status_code=404, detail=f"Trace not found: {task_id}")
trace_events = load_trace_file(trace_path)
stats = compute_stats(trace_events)
return TraceStatsResponse(
task_id=task_id,
total_wall_time_us=stats.total_wall_time_us,
by_category={
category: TraceCategoryStats(
total_us=cat_stats.total_us,
count=cat_stats.count,
min_us=cat_stats.min_us,
max_us=cat_stats.max_us,
avg_us=cat_stats.avg_us,
)
for category, cat_stats in stats.by_category.items()
},
by_rank={
rank: TraceRankStats(
by_category={
category: TraceCategoryStats(
total_us=cat_stats.total_us,
count=cat_stats.count,
min_us=cat_stats.min_us,
max_us=cat_stats.max_us,
avg_us=cat_stats.avg_us,
)
for category, cat_stats in rank_stats.items()
}
)
for rank, rank_stats in stats.by_rank.items()
},
)
async def get_trace_raw(self, task_id: str) -> FileResponse:
trace_path = self._get_trace_path(task_id)
if not trace_path.exists():
raise HTTPException(status_code=404, detail=f"Trace not found: {task_id}")
return FileResponse(
path=trace_path,
media_type="application/json",
filename=f"trace_{task_id}.json",
)

View File

@@ -11,8 +11,8 @@ from exo.master.placement import (
place_instance,
)
from exo.shared.apply import apply
from exo.shared.constants import EXO_TRACING_ENABLED
from exo.shared.types.commands import (
ChatCompletion,
CreateInstance,
DeleteInstance,
ForwarderCommand,
@@ -23,6 +23,7 @@ from exo.shared.types.commands import (
SendInputChunk,
TaskFinished,
TestCommand,
TextGeneration,
)
from exo.shared.types.common import CommandId, NodeId, SessionId
from exo.shared.types.events import (
@@ -35,11 +36,11 @@ from exo.shared.types.events import (
NodeTimedOut,
TaskCreated,
TaskDeleted,
TraceEventData,
TracesCollected,
TracesMerged,
)
from exo.shared.types.state import State
from exo.shared.types.tasks import (
ChatCompletion as ChatCompletionTask,
)
from exo.shared.types.tasks import (
ImageEdits as ImageEditsTask,
)
@@ -50,6 +51,9 @@ from exo.shared.types.tasks import (
TaskId,
TaskStatus,
)
from exo.shared.types.tasks import (
TextGeneration as TextGenerationTask,
)
from exo.shared.types.worker.instances import InstanceId
from exo.utils.channels import Receiver, Sender, channel
from exo.utils.event_buffer import MultiSourceBuffer
@@ -86,6 +90,8 @@ class Master:
self._multi_buffer = MultiSourceBuffer[NodeId, Event]()
# TODO: not have this
self._event_log: list[Event] = []
self._pending_traces: dict[TaskId, dict[int, list[TraceEventData]]] = {}
self._expected_ranks: dict[TaskId, set[int]] = {}
async def run(self):
logger.info("Starting Master")
@@ -117,11 +123,11 @@ class Master:
match command:
case TestCommand():
pass
case ChatCompletion():
case TextGeneration():
for instance in self.state.instances.values():
if (
instance.shard_assignments.model_id
== command.request_params.model
== command.task_params.model
):
task_count = sum(
1
@@ -134,7 +140,7 @@ class Master:
if not instance_task_counts:
raise ValueError(
f"No instance found for model {command.request_params.model}"
f"No instance found for model {command.task_params.model}"
)
available_instance_ids = sorted(
@@ -148,12 +154,12 @@ class Master:
generated_events.append(
TaskCreated(
task_id=task_id,
task=ChatCompletionTask(
task=TextGenerationTask(
task_id=task_id,
command_id=command.command_id,
instance_id=available_instance_ids[0],
task_status=TaskStatus.Pending,
task_params=command.request_params,
task_params=command.task_params,
),
)
)
@@ -163,7 +169,7 @@ class Master:
for instance in self.state.instances.values():
if (
instance.shard_assignments.model_id
== command.request_params.model
== command.task_params.model
):
task_count = sum(
1
@@ -176,7 +182,7 @@ class Master:
if not instance_task_counts:
raise ValueError(
f"No instance found for model {command.request_params.model}"
f"No instance found for model {command.task_params.model}"
)
available_instance_ids = sorted(
@@ -187,25 +193,37 @@ class Master:
)
task_id = TaskId()
selected_instance_id = available_instance_ids[0]
generated_events.append(
TaskCreated(
task_id=task_id,
task=ImageGenerationTask(
task_id=task_id,
command_id=command.command_id,
instance_id=available_instance_ids[0],
instance_id=selected_instance_id,
task_status=TaskStatus.Pending,
task_params=command.request_params,
task_params=command.task_params,
),
)
)
self.command_task_mapping[command.command_id] = task_id
if EXO_TRACING_ENABLED:
selected_instance = self.state.instances.get(
selected_instance_id
)
if selected_instance:
ranks = set(
shard.device_rank
for shard in selected_instance.shard_assignments.runner_to_shard.values()
)
self._expected_ranks[task_id] = ranks
case ImageEdits():
for instance in self.state.instances.values():
if (
instance.shard_assignments.model_id
== command.request_params.model
== command.task_params.model
):
task_count = sum(
1
@@ -218,7 +236,7 @@ class Master:
if not instance_task_counts:
raise ValueError(
f"No instance found for model {command.request_params.model}"
f"No instance found for model {command.task_params.model}"
)
available_instance_ids = sorted(
@@ -229,20 +247,32 @@ class Master:
)
task_id = TaskId()
selected_instance_id = available_instance_ids[0]
generated_events.append(
TaskCreated(
task_id=task_id,
task=ImageEditsTask(
task_id=task_id,
command_id=command.command_id,
instance_id=available_instance_ids[0],
instance_id=selected_instance_id,
task_status=TaskStatus.Pending,
task_params=command.request_params,
task_params=command.task_params,
),
)
)
self.command_task_mapping[command.command_id] = task_id
if EXO_TRACING_ENABLED:
selected_instance = self.state.instances.get(
selected_instance_id
)
if selected_instance:
ranks = set(
shard.device_rank
for shard in selected_instance.shard_assignments.runner_to_shard.values()
)
self._expected_ranks[task_id] = ranks
case DeleteInstance():
placement = delete_instance(command, self.state.instances)
transition_events = get_transition_events(
@@ -335,6 +365,10 @@ class Master:
local_event.origin,
)
for event in self._multi_buffer.drain():
if isinstance(event, TracesCollected):
await self._handle_traces_collected(event)
continue
logger.debug(f"Master indexing event: {str(event)[:100]}")
indexed = IndexedEvent(event=event, idx=len(self._event_log))
self.state = apply(self.state, indexed)
@@ -373,3 +407,29 @@ class Master:
event=event.event,
)
)
async def _handle_traces_collected(self, event: TracesCollected) -> None:
task_id = event.task_id
if task_id not in self._pending_traces:
self._pending_traces[task_id] = {}
self._pending_traces[task_id][event.rank] = event.traces
if (
task_id in self._expected_ranks
and set(self._pending_traces[task_id].keys())
>= self._expected_ranks[task_id]
):
await self._merge_and_save_traces(task_id)
async def _merge_and_save_traces(self, task_id: TaskId) -> None:
all_trace_data: list[TraceEventData] = []
for trace_data in self._pending_traces[task_id].values():
all_trace_data.extend(trace_data)
await self.event_sender.send(
TracesMerged(task_id=task_id, traces=all_trace_data)
)
del self._pending_traces[task_id]
if task_id in self._expected_ranks:
del self._expected_ranks[task_id]

View File

@@ -10,6 +10,7 @@ from exo.shared.types.profiling import MemoryUsage, NodeNetworkInfo
from exo.shared.types.topology import Cycle, RDMAConnection, SocketConnection
from exo.shared.types.worker.runners import RunnerId, ShardAssignments
from exo.shared.types.worker.shards import (
CfgShardMetadata,
PipelineShardMetadata,
Sharding,
ShardMetadata,
@@ -74,40 +75,43 @@ def allocate_layers_proportionally(
return result
def get_shard_assignments_for_pipeline_parallel(
model_card: ModelCard,
cycle: Cycle,
node_memory: Mapping[NodeId, MemoryUsage],
):
def _validate_cycle(cycle: Cycle) -> None:
if not cycle.node_ids:
raise ValueError("Cannot create shard assignments for empty node cycle")
cycle_memory = sum(
(node_memory[node_id].ram_available for node_id in cycle.node_ids),
def _compute_total_memory(
node_ids: list[NodeId],
node_memory: Mapping[NodeId, MemoryUsage],
) -> Memory:
total_memory = sum(
(node_memory[node_id].ram_available for node_id in node_ids),
start=Memory(),
)
if cycle_memory.in_bytes == 0:
if total_memory.in_bytes == 0:
raise ValueError("Cannot create shard assignments: total available memory is 0")
return total_memory
total_layers = model_card.n_layers
world_size = len(cycle)
runner_to_shard: dict[RunnerId, ShardMetadata] = {}
node_to_runner: dict[NodeId, RunnerId] = {}
def _allocate_and_validate_layers(
node_ids: list[NodeId],
node_memory: Mapping[NodeId, MemoryUsage],
total_memory: Memory,
model_card: ModelCard,
) -> list[int]:
layer_allocations = allocate_layers_proportionally(
total_layers=total_layers,
total_layers=model_card.n_layers,
memory_fractions=[
node_memory[node_id].ram_available.in_bytes / cycle_memory.in_bytes
for node_id in cycle.node_ids
node_memory[node_id].ram_available.in_bytes / total_memory.in_bytes
for node_id in node_ids
],
)
# Validate each node has sufficient memory for its assigned layers
memory_per_layer = model_card.storage_size.in_bytes / total_layers
for i, (node_id, node_layers) in enumerate(
zip(cycle.node_ids, layer_allocations, strict=True)
):
required_memory = node_layers * memory_per_layer
total_storage_bytes = model_card.storage_size.in_bytes
total_layers = model_card.n_layers
for i, node_id in enumerate(node_ids):
node_layers = layer_allocations[i]
required_memory = (total_storage_bytes * node_layers) // total_layers
available_memory = node_memory[node_id].ram_available.in_bytes
if required_memory > available_memory:
raise ValueError(
@@ -116,32 +120,125 @@ def get_shard_assignments_for_pipeline_parallel(
f"but only has {available_memory / (1024**3):.2f} GB available"
)
layers_assigned = 0
for i, (node_id, node_layers) in enumerate(
zip(cycle.node_ids, layer_allocations, strict=True)
):
runner_id = RunnerId()
return layer_allocations
shard = PipelineShardMetadata(
def get_shard_assignments_for_pipeline_parallel(
model_card: ModelCard,
cycle: Cycle,
node_memory: Mapping[NodeId, MemoryUsage],
) -> ShardAssignments:
"""Create shard assignments for pipeline parallel execution."""
world_size = len(cycle)
use_cfg_parallel = model_card.uses_cfg and world_size >= 2 and world_size % 2 == 0
if use_cfg_parallel:
return _get_shard_assignments_for_cfg_parallel(model_card, cycle, node_memory)
else:
return _get_shard_assignments_for_pure_pipeline(model_card, cycle, node_memory)
def _get_shard_assignments_for_cfg_parallel(
model_card: ModelCard,
cycle: Cycle,
node_memory: Mapping[NodeId, MemoryUsage],
) -> ShardAssignments:
"""Create shard assignments for CFG parallel execution.
CFG parallel runs two independent pipelines. Group 0 processes the positive
prompt, group 1 processes the negative prompt. The ring topology places
group 1's ranks in reverse order so both "last stages" are neighbors for
efficient CFG exchange.
"""
_validate_cycle(cycle)
world_size = len(cycle)
cfg_world_size = 2
pipeline_world_size = world_size // cfg_world_size
# Allocate layers for one pipeline group (both groups run the same layers)
pipeline_node_ids = cycle.node_ids[:pipeline_world_size]
pipeline_memory = _compute_total_memory(pipeline_node_ids, node_memory)
layer_allocations = _allocate_and_validate_layers(
pipeline_node_ids, node_memory, pipeline_memory, model_card
)
# Ring topology: group 0 ascending [0,1,2,...], group 1 descending [...,2,1,0]
# This places both last stages as neighbors for CFG exchange.
position_to_cfg_pipeline = [(0, r) for r in range(pipeline_world_size)] + [
(1, r) for r in reversed(range(pipeline_world_size))
]
runner_to_shard: dict[RunnerId, ShardMetadata] = {}
node_to_runner: dict[NodeId, RunnerId] = {}
for device_rank, node_id in enumerate(cycle.node_ids):
cfg_rank, pipeline_rank = position_to_cfg_pipeline[device_rank]
layers_before = sum(layer_allocations[:pipeline_rank])
node_layers = layer_allocations[pipeline_rank]
shard = CfgShardMetadata(
model_card=model_card,
device_rank=i,
device_rank=device_rank,
world_size=world_size,
start_layer=layers_assigned,
end_layer=layers_assigned + node_layers,
n_layers=total_layers,
start_layer=layers_before,
end_layer=layers_before + node_layers,
n_layers=model_card.n_layers,
cfg_rank=cfg_rank,
cfg_world_size=cfg_world_size,
pipeline_rank=pipeline_rank,
pipeline_world_size=pipeline_world_size,
)
runner_id = RunnerId()
runner_to_shard[runner_id] = shard
node_to_runner[node_id] = runner_id
layers_assigned += node_layers
shard_assignments = ShardAssignments(
return ShardAssignments(
model_id=model_card.model_id,
runner_to_shard=runner_to_shard,
node_to_runner=node_to_runner,
)
return shard_assignments
def _get_shard_assignments_for_pure_pipeline(
model_card: ModelCard,
cycle: Cycle,
node_memory: Mapping[NodeId, MemoryUsage],
) -> ShardAssignments:
"""Create shard assignments for pure pipeline execution."""
_validate_cycle(cycle)
total_memory = _compute_total_memory(cycle.node_ids, node_memory)
layer_allocations = _allocate_and_validate_layers(
cycle.node_ids, node_memory, total_memory, model_card
)
runner_to_shard: dict[RunnerId, ShardMetadata] = {}
node_to_runner: dict[NodeId, RunnerId] = {}
for pipeline_rank, node_id in enumerate(cycle.node_ids):
layers_before = sum(layer_allocations[:pipeline_rank])
node_layers = layer_allocations[pipeline_rank]
shard = PipelineShardMetadata(
model_card=model_card,
device_rank=pipeline_rank,
world_size=len(cycle),
start_layer=layers_before,
end_layer=layers_before + node_layers,
n_layers=model_card.n_layers,
)
runner_id = RunnerId()
runner_to_shard[runner_id] = shard
node_to_runner[node_id] = runner_id
return ShardAssignments(
model_id=model_card.model_id,
runner_to_shard=runner_to_shard,
node_to_runner=node_to_runner,
)
def get_shard_assignments_for_tensor_parallel(

View File

@@ -0,0 +1,182 @@
"""Tests for Claude Messages API conversion functions and types."""
import pydantic
import pytest
from exo.master.adapters.claude import (
claude_request_to_text_generation,
finish_reason_to_claude_stop_reason,
)
from exo.shared.types.claude_api import (
ClaudeMessage,
ClaudeMessagesRequest,
ClaudeTextBlock,
)
from exo.shared.types.common import ModelId
class TestFinishReasonToClaudeStopReason:
"""Tests for finish_reason to Claude stop_reason mapping."""
def test_stop_maps_to_end_turn(self):
assert finish_reason_to_claude_stop_reason("stop") == "end_turn"
def test_length_maps_to_max_tokens(self):
assert finish_reason_to_claude_stop_reason("length") == "max_tokens"
def test_tool_calls_maps_to_tool_use(self):
assert finish_reason_to_claude_stop_reason("tool_calls") == "tool_use"
def test_function_call_maps_to_tool_use(self):
assert finish_reason_to_claude_stop_reason("function_call") == "tool_use"
def test_content_filter_maps_to_end_turn(self):
assert finish_reason_to_claude_stop_reason("content_filter") == "end_turn"
def test_none_returns_none(self):
assert finish_reason_to_claude_stop_reason(None) is None
class TestClaudeRequestToInternal:
"""Tests for converting Claude Messages API requests to TextGenerationTaskParams."""
def test_basic_request_conversion(self):
request = ClaudeMessagesRequest(
model=ModelId("claude-3-opus"),
max_tokens=100,
messages=[
ClaudeMessage(role="user", content="Hello"),
],
)
params = claude_request_to_text_generation(request)
assert params.model == "claude-3-opus"
assert params.max_output_tokens == 100
assert isinstance(params.input, list)
assert len(params.input) == 1
assert params.input[0].role == "user"
assert params.input[0].content == "Hello"
assert params.instructions is None
def test_request_with_system_string(self):
request = ClaudeMessagesRequest(
model=ModelId("claude-3-opus"),
max_tokens=100,
system="You are a helpful assistant.",
messages=[
ClaudeMessage(role="user", content="Hello"),
],
)
params = claude_request_to_text_generation(request)
assert params.instructions == "You are a helpful assistant."
assert isinstance(params.input, list)
assert len(params.input) == 1
assert params.input[0].role == "user"
assert params.input[0].content == "Hello"
def test_request_with_system_text_blocks(self):
request = ClaudeMessagesRequest(
model=ModelId("claude-3-opus"),
max_tokens=100,
system=[
ClaudeTextBlock(text="You are helpful. "),
ClaudeTextBlock(text="Be concise."),
],
messages=[
ClaudeMessage(role="user", content="Hello"),
],
)
params = claude_request_to_text_generation(request)
assert params.instructions == "You are helpful. Be concise."
assert isinstance(params.input, list)
assert len(params.input) == 1
def test_request_with_content_blocks(self):
request = ClaudeMessagesRequest(
model=ModelId("claude-3-opus"),
max_tokens=100,
messages=[
ClaudeMessage(
role="user",
content=[
ClaudeTextBlock(text="First part. "),
ClaudeTextBlock(text="Second part."),
],
),
],
)
params = claude_request_to_text_generation(request)
assert isinstance(params.input, list)
assert len(params.input) == 1
assert params.input[0].content == "First part. Second part."
def test_request_with_multi_turn_conversation(self):
request = ClaudeMessagesRequest(
model=ModelId("claude-3-opus"),
max_tokens=100,
messages=[
ClaudeMessage(role="user", content="Hello"),
ClaudeMessage(role="assistant", content="Hi there!"),
ClaudeMessage(role="user", content="How are you?"),
],
)
params = claude_request_to_text_generation(request)
assert isinstance(params.input, list)
assert len(params.input) == 3
assert params.input[0].role == "user"
assert params.input[1].role == "assistant"
assert params.input[2].role == "user"
def test_request_with_optional_parameters(self):
request = ClaudeMessagesRequest(
model=ModelId("claude-3-opus"),
max_tokens=100,
messages=[ClaudeMessage(role="user", content="Hello")],
temperature=0.7,
top_p=0.9,
top_k=40,
stop_sequences=["STOP", "END"],
stream=True,
)
params = claude_request_to_text_generation(request)
assert params.temperature == 0.7
assert params.top_p == 0.9
assert params.top_k == 40
assert params.stop == ["STOP", "END"]
assert params.stream is True
class TestClaudeMessagesRequestValidation:
"""Tests for Claude Messages API request validation."""
def test_request_requires_model(self):
with pytest.raises(pydantic.ValidationError):
ClaudeMessagesRequest.model_validate(
{
"max_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}],
}
)
def test_request_requires_max_tokens(self):
with pytest.raises(pydantic.ValidationError):
ClaudeMessagesRequest.model_validate(
{
"model": "claude-3-opus",
"messages": [{"role": "user", "content": "Hello"}],
}
)
def test_request_requires_messages(self):
with pytest.raises(pydantic.ValidationError):
ClaudeMessagesRequest.model_validate(
{
"model": "claude-3-opus",
"max_tokens": 100,
}
)

View File

@@ -0,0 +1,265 @@
"""Tests for Claude Messages API tool_use support in the adapter."""
import json
from collections.abc import AsyncGenerator
from typing import Any, cast
from exo.master.adapters.claude import collect_claude_response, generate_claude_stream
from exo.shared.types.api import ToolCallItem
from exo.shared.types.chunks import ErrorChunk, TokenChunk, ToolCallChunk
from exo.shared.types.common import CommandId, ModelId
async def _chunks_to_stream(
chunks: list[ErrorChunk | ToolCallChunk | TokenChunk],
) -> AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None]:
for chunk in chunks:
yield chunk
MODEL = ModelId("test-model")
COMMAND_ID = CommandId("cmd_test123")
def _parse_sse_events(events: list[str]) -> list[dict[str, Any]]:
"""Parse SSE event strings into JSON dicts."""
parsed: list[dict[str, Any]] = []
for event_str in events:
for line in event_str.strip().split("\n"):
if line.startswith("data: "):
parsed.append(cast(dict[str, Any], json.loads(line[6:])))
return parsed
class TestCollectClaudeResponseToolUse:
"""Tests for non-streaming tool_use response collection."""
async def test_tool_call_chunk_produces_tool_use_blocks(self):
chunks: list[ErrorChunk | ToolCallChunk | TokenChunk] = [
ToolCallChunk(
model=MODEL,
usage=None,
tool_calls=[
ToolCallItem(
name="get_weather",
arguments='{"location": "San Francisco"}',
)
],
),
]
response = await collect_claude_response(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
)
assert response.stop_reason == "tool_use"
tool_blocks = [b for b in response.content if b.type == "tool_use"]
assert len(tool_blocks) == 1
block = tool_blocks[0]
assert block.type == "tool_use"
assert block.name == "get_weather"
assert block.input == {"location": "San Francisco"}
assert block.id.startswith("toolu_")
async def test_multiple_tool_calls(self):
chunks: list[ErrorChunk | ToolCallChunk | TokenChunk] = [
ToolCallChunk(
model=MODEL,
usage=None,
tool_calls=[
ToolCallItem(
name="get_weather",
arguments='{"location": "SF"}',
),
ToolCallItem(
name="get_time",
arguments='{"timezone": "PST"}',
),
],
),
]
response = await collect_claude_response(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
)
assert response.stop_reason == "tool_use"
tool_blocks = [b for b in response.content if b.type == "tool_use"]
assert len(tool_blocks) == 2
assert tool_blocks[0].name == "get_weather"
assert tool_blocks[1].name == "get_time"
async def test_mixed_text_and_tool_use(self):
chunks: list[ErrorChunk | ToolCallChunk | TokenChunk] = [
TokenChunk(model=MODEL, text="Let me check ", token_id=1, usage=None),
TokenChunk(model=MODEL, text="the weather.", token_id=2, usage=None),
ToolCallChunk(
model=MODEL,
usage=None,
tool_calls=[
ToolCallItem(
name="get_weather",
arguments='{"location": "NYC"}',
)
],
),
]
response = await collect_claude_response(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
)
assert response.stop_reason == "tool_use"
text_blocks = [b for b in response.content if b.type == "text"]
tool_blocks = [b for b in response.content if b.type == "tool_use"]
assert len(text_blocks) == 1
assert text_blocks[0].text == "Let me check the weather."
assert len(tool_blocks) == 1
assert tool_blocks[0].name == "get_weather"
async def test_no_content_produces_empty_text_block(self):
chunks: list[ErrorChunk | ToolCallChunk | TokenChunk] = []
response = await collect_claude_response(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
)
assert len(response.content) == 1
assert response.content[0].type == "text"
class TestGenerateClaudeStreamToolUse:
"""Tests for streaming tool_use event generation."""
async def test_tool_call_emits_tool_use_events(self):
chunks: list[ErrorChunk | ToolCallChunk | TokenChunk] = [
ToolCallChunk(
model=MODEL,
usage=None,
tool_calls=[
ToolCallItem(
name="get_weather",
arguments='{"location": "SF"}',
)
],
),
]
events: list[str] = []
async for event in generate_claude_stream(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
):
events.append(event)
parsed = _parse_sse_events(events)
# Find tool_use content_block_start
tool_starts = [
e
for e in parsed
if e.get("type") == "content_block_start"
and cast(dict[str, Any], e.get("content_block", {})).get("type")
== "tool_use"
]
assert len(tool_starts) == 1
content_block = cast(dict[str, Any], tool_starts[0]["content_block"])
assert content_block["name"] == "get_weather"
assert content_block["input"] == {}
assert cast(str, content_block["id"]).startswith("toolu_")
# Find input_json_delta
json_deltas = [
e
for e in parsed
if e.get("type") == "content_block_delta"
and cast(dict[str, Any], e.get("delta", {})).get("type")
== "input_json_delta"
]
assert len(json_deltas) == 1
delta = cast(dict[str, Any], json_deltas[0]["delta"])
assert json.loads(cast(str, delta["partial_json"])) == {"location": "SF"}
# Find message_delta with tool_use stop reason
msg_deltas = [e for e in parsed if e.get("type") == "message_delta"]
assert len(msg_deltas) == 1
assert cast(dict[str, Any], msg_deltas[0]["delta"])["stop_reason"] == "tool_use"
async def test_streaming_mixed_text_and_tool_use(self):
chunks: list[ErrorChunk | ToolCallChunk | TokenChunk] = [
TokenChunk(model=MODEL, text="Hello ", token_id=1, usage=None),
ToolCallChunk(
model=MODEL,
usage=None,
tool_calls=[
ToolCallItem(
name="search",
arguments='{"query": "test"}',
)
],
),
]
events: list[str] = []
async for event in generate_claude_stream(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
):
events.append(event)
parsed = _parse_sse_events(events)
# Should have text delta at index 0
text_deltas = [
e
for e in parsed
if e.get("type") == "content_block_delta"
and cast(dict[str, Any], e.get("delta", {})).get("type") == "text_delta"
]
assert len(text_deltas) == 1
assert text_deltas[0]["index"] == 0
assert cast(dict[str, Any], text_deltas[0]["delta"])["text"] == "Hello "
# Tool block at index 1
tool_starts = [
e
for e in parsed
if e.get("type") == "content_block_start"
and cast(dict[str, Any], e.get("content_block", {})).get("type")
== "tool_use"
]
assert len(tool_starts) == 1
assert tool_starts[0]["index"] == 1
# Stop reason should be tool_use
msg_deltas = [e for e in parsed if e.get("type") == "message_delta"]
assert cast(dict[str, Any], msg_deltas[0]["delta"])["stop_reason"] == "tool_use"
async def test_streaming_tool_block_stop_events(self):
chunks: list[ErrorChunk | ToolCallChunk | TokenChunk] = [
ToolCallChunk(
model=MODEL,
usage=None,
tool_calls=[
ToolCallItem(name="fn1", arguments="{}"),
ToolCallItem(name="fn2", arguments='{"a": 1}'),
],
),
]
events: list[str] = []
async for event in generate_claude_stream(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
):
events.append(event)
parsed = _parse_sse_events(events)
# Two tool block starts (at indices 1 and 2)
tool_starts = [
e
for e in parsed
if e.get("type") == "content_block_start"
and cast(dict[str, Any], e.get("content_block", {})).get("type")
== "tool_use"
]
assert len(tool_starts) == 2
assert tool_starts[0]["index"] == 1
assert tool_starts[1]["index"] == 2
# Two tool block stops (at indices 1 and 2), plus text block stop at 0
block_stops = [e for e in parsed if e.get("type") == "content_block_stop"]
stop_indices = [e["index"] for e in block_stops]
assert 0 in stop_indices
assert 1 in stop_indices
assert 2 in stop_indices

View File

@@ -7,15 +7,14 @@ from loguru import logger
from exo.master.main import Master
from exo.routing.router import get_node_id_keypair
from exo.shared.models.model_cards import ModelCard, ModelId, ModelTask
from exo.shared.types.api import ChatCompletionMessage, ChatCompletionTaskParams
from exo.shared.models.model_cards import ModelCard, ModelTask
from exo.shared.types.commands import (
ChatCompletion,
CommandId,
ForwarderCommand,
PlaceInstance,
TextGeneration,
)
from exo.shared.types.common import NodeId, SessionId
from exo.shared.types.common import ModelId, NodeId, SessionId
from exo.shared.types.events import (
ForwarderEvent,
IndexedEvent,
@@ -27,8 +26,9 @@ from exo.shared.types.memory import Memory
from exo.shared.types.profiling import (
MemoryUsage,
)
from exo.shared.types.tasks import ChatCompletion as ChatCompletionTask
from exo.shared.types.tasks import TaskStatus
from exo.shared.types.tasks import TextGeneration as TextGenerationTask
from exo.shared.types.text_generation import InputMessage, TextGenerationTaskParams
from exo.shared.types.worker.instances import (
InstanceMeta,
MlxRingInstance,
@@ -127,19 +127,17 @@ async def test_master():
logger.info("wait for an instance")
while len(master.state.instances.keys()) == 0:
await anyio.sleep(0.001)
logger.info("inject a ChatCompletion Command")
logger.info("inject a TextGeneration Command")
await command_sender.send(
ForwarderCommand(
origin=node_id,
command=(
ChatCompletion(
TextGeneration(
command_id=CommandId(),
request_params=ChatCompletionTaskParams(
model="llama-3.2-1b",
messages=[
ChatCompletionMessage(
role="user", content="Hello, how are you?"
)
task_params=TextGenerationTaskParams(
model=ModelId("llama-3.2-1b"),
input=[
InputMessage(role="user", content="Hello, how are you?")
],
),
)
@@ -190,12 +188,10 @@ async def test_master():
assert created_instance.ephemeral_port > 0
assert isinstance(events[2].event, TaskCreated)
assert events[2].event.task.task_status == TaskStatus.Pending
assert isinstance(events[2].event.task, ChatCompletionTask)
assert events[2].event.task.task_params == ChatCompletionTaskParams(
model="llama-3.2-1b",
messages=[
ChatCompletionMessage(role="user", content="Hello, how are you?")
],
assert isinstance(events[2].event.task, TextGenerationTask)
assert events[2].event.task.task_params == TextGenerationTaskParams(
model=ModelId("llama-3.2-1b"),
input=[InputMessage(role="user", content="Hello, how are you?")],
)
await master.shutdown()

View File

@@ -0,0 +1,48 @@
"""Tests for OpenAI Responses API wire types.
ResponsesRequest is the API wire type for the Responses endpoint.
The responses adapter converts it to TextGenerationTaskParams for the pipeline.
"""
import pydantic
import pytest
from exo.shared.types.common import ModelId
from exo.shared.types.openai_responses import (
ResponseInputMessage,
ResponsesRequest,
)
class TestResponsesRequestValidation:
"""Tests for OpenAI Responses API request validation."""
def test_request_requires_model(self):
with pytest.raises(pydantic.ValidationError):
ResponsesRequest.model_validate(
{
"input": "Hello",
}
)
def test_request_requires_input(self):
with pytest.raises(pydantic.ValidationError):
ResponsesRequest.model_validate(
{
"model": "gpt-4o",
}
)
def test_request_accepts_string_input(self):
request = ResponsesRequest(
model=ModelId("gpt-4o"),
input="Hello",
)
assert request.input == "Hello"
def test_request_accepts_message_array_input(self):
request = ResponsesRequest(
model=ModelId("gpt-4o"),
input=[ResponseInputMessage(role="user", content="Hello")],
)
assert len(request.input) == 1

View File

@@ -5,6 +5,7 @@ from exo.master.placement_utils import (
filter_cycles_by_memory,
get_mlx_jaccl_coordinators,
get_shard_assignments,
get_shard_assignments_for_pipeline_parallel,
get_smallest_cycles,
)
from exo.master.tests.conftest import (
@@ -20,7 +21,11 @@ from exo.shared.types.profiling import (
NodeNetworkInfo,
)
from exo.shared.types.topology import Connection, SocketConnection
from exo.shared.types.worker.shards import Sharding
from exo.shared.types.worker.shards import (
CfgShardMetadata,
PipelineShardMetadata,
Sharding,
)
def test_filter_cycles_by_memory():
@@ -487,3 +492,193 @@ def test_get_shard_assignments_insufficient_memory_raises():
get_shard_assignments(
model_card, selected_cycle, Sharding.Pipeline, node_memory
)
class TestCfgParallelPlacement:
def _create_ring_topology(self, node_ids: list[NodeId]) -> Topology:
topology = Topology()
for node_id in node_ids:
topology.add_node(node_id)
for i, node_id in enumerate(node_ids):
next_node = node_ids[(i + 1) % len(node_ids)]
conn = Connection(
source=node_id,
sink=next_node,
edge=create_socket_connection(i + 1),
)
topology.add_connection(conn)
return topology
def test_two_nodes_cfg_model_uses_cfg_parallel(self):
"""Two nodes with CFG model should use CFG parallel (no pipeline)."""
node_a = NodeId()
node_b = NodeId()
topology = self._create_ring_topology([node_a, node_b])
cycles = [c for c in topology.get_cycles() if len(c) == 2]
cycle = cycles[0]
node_memory = {
node_a: create_node_memory(1000 * 1024),
node_b: create_node_memory(1000 * 1024),
}
model_card = ModelCard(
model_id=ModelId("qwen-image-test"),
n_layers=60,
storage_size=Memory.from_kb(1000),
hidden_size=1,
supports_tensor=False,
uses_cfg=True,
tasks=[ModelTask.TextToImage],
)
assignments = get_shard_assignments_for_pipeline_parallel(
model_card, cycle, node_memory
)
shards = list(assignments.runner_to_shard.values())
assert len(shards) == 2
# CFG models should get CfgShardMetadata
for shard in shards:
assert isinstance(shard, CfgShardMetadata)
# Both nodes should have all layers (no pipeline split)
assert shard.start_layer == 0
assert shard.end_layer == 60
assert shard.cfg_world_size == 2
# Each node is the only stage in its pipeline group
assert shard.pipeline_world_size == 1
assert shard.pipeline_rank == 0
cfg_ranks = sorted(
s.cfg_rank for s in shards if isinstance(s, CfgShardMetadata)
)
assert cfg_ranks == [0, 1]
def test_four_nodes_cfg_model_uses_hybrid(self):
"""Four nodes with CFG model should use 2 CFG groups x 2 pipeline stages."""
nodes = [NodeId() for _ in range(4)]
topology = self._create_ring_topology(nodes)
cycles = [c for c in topology.get_cycles() if len(c) == 4]
cycle = cycles[0]
node_memory = {n: create_node_memory(1000 * 1024) for n in nodes}
model_card = ModelCard(
model_id=ModelId("qwen-image-test"),
n_layers=60,
storage_size=Memory.from_kb(1000),
hidden_size=1,
supports_tensor=False,
uses_cfg=True,
tasks=[ModelTask.TextToImage],
)
assignments = get_shard_assignments_for_pipeline_parallel(
model_card, cycle, node_memory
)
shards = list(assignments.runner_to_shard.values())
assert len(shards) == 4
# CFG models should get CfgShardMetadata
for shard in shards:
assert isinstance(shard, CfgShardMetadata)
assert shard.cfg_world_size == 2
assert shard.pipeline_world_size == 2
assert shard.pipeline_rank in [0, 1]
# Check we have 2 nodes in each CFG group
cfg_0_shards = [
s for s in shards if isinstance(s, CfgShardMetadata) and s.cfg_rank == 0
]
cfg_1_shards = [
s for s in shards if isinstance(s, CfgShardMetadata) and s.cfg_rank == 1
]
assert len(cfg_0_shards) == 2
assert len(cfg_1_shards) == 2
# Both CFG groups should have the same layer assignments
cfg_0_layers = [(s.start_layer, s.end_layer) for s in cfg_0_shards]
cfg_1_layers = [(s.start_layer, s.end_layer) for s in cfg_1_shards]
assert sorted(cfg_0_layers) == sorted(cfg_1_layers)
def test_three_nodes_cfg_model_uses_sequential_cfg(self):
"""Three nodes (odd) with CFG model should use sequential CFG (PipelineShardMetadata)."""
nodes = [NodeId() for _ in range(3)]
topology = self._create_ring_topology(nodes)
cycles = [c for c in topology.get_cycles() if len(c) == 3]
cycle = cycles[0]
node_memory = {n: create_node_memory(1000 * 1024) for n in nodes}
model_card = ModelCard(
model_id=ModelId("qwen-image-test"),
n_layers=60,
storage_size=Memory.from_kb(1000),
hidden_size=1,
supports_tensor=False,
uses_cfg=True,
tasks=[ModelTask.TextToImage],
)
assignments = get_shard_assignments_for_pipeline_parallel(
model_card, cycle, node_memory
)
shards = list(assignments.runner_to_shard.values())
assert len(shards) == 3
# Odd node count with CFG model falls back to PipelineShardMetadata (sequential CFG)
for shard in shards:
assert isinstance(shard, PipelineShardMetadata)
def test_two_nodes_non_cfg_model_uses_pipeline(self):
"""Two nodes with non-CFG model should use pure pipeline (PipelineShardMetadata)."""
node_a = NodeId()
node_b = NodeId()
topology = self._create_ring_topology([node_a, node_b])
cycles = [c for c in topology.get_cycles() if len(c) == 2]
cycle = cycles[0]
node_memory = {
node_a: create_node_memory(1000 * 1024),
node_b: create_node_memory(1000 * 1024),
}
model_card = ModelCard(
model_id=ModelId("flux-test"),
n_layers=57,
storage_size=Memory.from_kb(1000),
hidden_size=1,
supports_tensor=False,
uses_cfg=False, # Non-CFG model
tasks=[ModelTask.TextToImage],
)
assignments = get_shard_assignments_for_pipeline_parallel(
model_card, cycle, node_memory
)
shards = list(assignments.runner_to_shard.values())
assert len(shards) == 2
# Non-CFG models should get PipelineShardMetadata
for shard in shards:
assert isinstance(shard, PipelineShardMetadata)
# Should have actual layer sharding (pipeline)
layer_ranges = sorted(
(s.start_layer, s.end_layer)
for s in shards
if isinstance(s, PipelineShardMetadata)
)
# First shard starts at 0, last shard ends at 57
assert layer_ranges[0][0] == 0
assert layer_ranges[-1][1] == 57

View File

@@ -25,6 +25,8 @@ from exo.shared.types.events import (
TestEvent,
TopologyEdgeCreated,
TopologyEdgeDeleted,
TracesCollected,
TracesMerged,
)
from exo.shared.types.profiling import (
NodeIdentity,
@@ -55,7 +57,12 @@ def event_apply(event: Event, state: State) -> State:
"""Apply an event to state."""
match event:
case (
TestEvent() | ChunkGenerated() | TaskAcknowledged() | InputChunkReceived()
TestEvent()
| ChunkGenerated()
| TaskAcknowledged()
| InputChunkReceived()
| TracesCollected()
| TracesMerged()
): # Pass-through events that don't modify state
return state
case InstanceCreated():

View File

@@ -2,6 +2,8 @@ import os
import sys
from pathlib import Path
from exo.utils.dashboard_path import find_dashboard, find_resources
_EXO_HOME_ENV = os.environ.get("EXO_HOME", None)
@@ -31,6 +33,14 @@ EXO_MODELS_DIR = (
if _EXO_MODELS_DIR_ENV is None
else Path.home() / _EXO_MODELS_DIR_ENV
)
_RESOURCES_DIR_ENV = os.environ.get("EXO_RESOURCES_DIR", None)
RESOURCES_DIR = (
find_resources() if _RESOURCES_DIR_ENV is None else Path.home() / _RESOURCES_DIR_ENV
)
_DASHBOARD_DIR_ENV = os.environ.get("EXO_DASHBOARD_DIR", None)
DASHBOARD_DIR = (
find_dashboard() if _DASHBOARD_DIR_ENV is None else Path.home() / _DASHBOARD_DIR_ENV
)
# Log files (data/logs or cache)
EXO_LOG = EXO_CACHE_HOME / "exo.log"
@@ -48,8 +58,13 @@ LIBP2P_COMMANDS_TOPIC = "commands"
EXO_MAX_CHUNK_SIZE = 512 * 1024
EXO_CUSTOM_MODEL_CARDS_DIR = EXO_DATA_HOME / "custom_model_cards"
EXO_IMAGE_CACHE_DIR = EXO_CACHE_HOME / "images"
EXO_TRACING_CACHE_DIR = EXO_CACHE_HOME / "traces"
EXO_ENABLE_IMAGE_MODELS = (
os.getenv("EXO_ENABLE_IMAGE_MODELS", "false").lower() == "true"
)
EXO_TRACING_ENABLED = os.getenv("EXO_TRACING_ENABLED", "false").lower() == "true"

View File

@@ -12,16 +12,47 @@ from pydantic import (
BaseModel,
Field,
PositiveInt,
ValidationError,
field_validator,
model_validator,
)
from tomlkit.exceptions import TOMLKitError
from exo.shared.constants import EXO_ENABLE_IMAGE_MODELS
from exo.shared.constants import (
EXO_CUSTOM_MODEL_CARDS_DIR,
EXO_ENABLE_IMAGE_MODELS,
RESOURCES_DIR,
)
from exo.shared.types.common import ModelId
from exo.shared.types.memory import Memory
from exo.utils.pydantic_ext import CamelCaseModel
_card_cache: dict[str, "ModelCard"] = {}
# kinda ugly...
# TODO: load search path from config.toml
_custom_cards_dir = Path(str(EXO_CUSTOM_MODEL_CARDS_DIR))
_csp = [Path(RESOURCES_DIR) / "inference_model_cards", _custom_cards_dir]
if EXO_ENABLE_IMAGE_MODELS:
_csp.append(Path(RESOURCES_DIR) / "image_model_cards")
CARD_SEARCH_PATH = _csp
_card_cache: dict[ModelId, "ModelCard"] = {}
async def _refresh_card_cache():
for path in CARD_SEARCH_PATH:
async for toml_file in path.rglob("*.toml"):
try:
card = await ModelCard.load_from_path(toml_file)
_card_cache[card.model_id] = card
except (ValidationError, TOMLKitError):
pass
async def get_model_cards() -> list["ModelCard"]:
if len(_card_cache) == 0:
await _refresh_card_cache()
return list(_card_cache.values())
class ModelTask(str, Enum):
@@ -34,9 +65,9 @@ class ComponentInfo(CamelCaseModel):
component_name: str
component_path: str
storage_size: Memory
n_layers: PositiveInt | None
n_layers: PositiveInt | None = None
can_shard: bool
safetensors_index_filename: str | None
safetensors_index_filename: str | None = None
class ModelCard(CamelCaseModel):
@@ -47,6 +78,11 @@ class ModelCard(CamelCaseModel):
supports_tensor: bool
tasks: list[ModelTask]
components: list[ComponentInfo] | None = None
family: str = ""
quantization: str = ""
base_model: str = ""
capabilities: list[str] = []
uses_cfg: bool = False
@field_validator("tasks", mode="before")
@classmethod
@@ -55,31 +91,37 @@ class ModelCard(CamelCaseModel):
async def save(self, path: Path) -> None:
async with await open_file(path, "w") as f:
py = self.model_dump()
py = self.model_dump(exclude_none=True)
data = tomlkit.dumps(py) # pyright: ignore[reportUnknownMemberType]
await f.write(data)
async def save_to_custom_dir(self) -> None:
await aios.makedirs(str(_custom_cards_dir), exist_ok=True)
await self.save(_custom_cards_dir / (self.model_id.normalize() + ".toml"))
@staticmethod
async def load_from_path(path: Path) -> "ModelCard":
async with await open_file(path, "r") as f:
py = tomlkit.loads(await f.read())
return ModelCard.model_validate(py)
# Is it okay that model card.load defaults to network access if the card doesn't exist? do we want to be more explicit here?
@staticmethod
async def load(model_id: ModelId) -> "ModelCard":
for card in MODEL_CARDS.values():
if card.model_id == model_id:
return card
return await ModelCard.from_hf(model_id)
@staticmethod
async def from_hf(model_id: ModelId) -> "ModelCard":
"""Fetches storage size and number of layers for a Hugging Face model, returns Pydantic ModelMeta."""
if model_id not in _card_cache:
await _refresh_card_cache()
if (mc := _card_cache.get(model_id)) is not None:
return mc
config_data = await get_config_data(model_id)
return await ModelCard.fetch_from_hf(model_id)
@staticmethod
async def fetch_from_hf(model_id: ModelId) -> "ModelCard":
"""Fetches storage size and number of layers for a Hugging Face model, returns Pydantic ModelMeta."""
# TODO: failure if files do not exist
config_data = await fetch_config_data(model_id)
num_layers = config_data.layer_count
mem_size_bytes = await get_safetensors_size(model_id)
mem_size_bytes = await fetch_safetensors_size(model_id)
mc = ModelCard(
model_id=ModelId(model_id),
@@ -89,630 +131,29 @@ class ModelCard(CamelCaseModel):
supports_tensor=config_data.supports_tensor,
tasks=[ModelTask.TextGeneration],
)
await mc.save_to_custom_dir()
_card_cache[model_id] = mc
return mc
MODEL_CARDS: dict[str, ModelCard] = {
# deepseek v3
"deepseek-v3.1-4bit": ModelCard(
model_id=ModelId("mlx-community/DeepSeek-V3.1-4bit"),
storage_size=Memory.from_gb(378),
n_layers=61,
hidden_size=7168,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"deepseek-v3.1-8bit": ModelCard(
model_id=ModelId("mlx-community/DeepSeek-V3.1-8bit"),
storage_size=Memory.from_gb(713),
n_layers=61,
hidden_size=7168,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
# kimi k2
"kimi-k2-instruct-4bit": ModelCard(
model_id=ModelId("mlx-community/Kimi-K2-Instruct-4bit"),
storage_size=Memory.from_gb(578),
n_layers=61,
hidden_size=7168,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"kimi-k2-thinking": ModelCard(
model_id=ModelId("mlx-community/Kimi-K2-Thinking"),
storage_size=Memory.from_gb(658),
n_layers=61,
hidden_size=7168,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"kimi-k2.5": ModelCard(
model_id=ModelId("mlx-community/Kimi-K2.5"),
storage_size=Memory.from_gb(617),
n_layers=61,
hidden_size=7168,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
# llama-3.1
"llama-3.1-8b": ModelCard(
model_id=ModelId("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"),
storage_size=Memory.from_mb(4423),
n_layers=32,
hidden_size=4096,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"llama-3.1-8b-8bit": ModelCard(
model_id=ModelId("mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"),
storage_size=Memory.from_mb(8540),
n_layers=32,
hidden_size=4096,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"llama-3.1-8b-bf16": ModelCard(
model_id=ModelId("mlx-community/Meta-Llama-3.1-8B-Instruct-bf16"),
storage_size=Memory.from_mb(16100),
n_layers=32,
hidden_size=4096,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"llama-3.1-70b": ModelCard(
model_id=ModelId("mlx-community/Meta-Llama-3.1-70B-Instruct-4bit"),
storage_size=Memory.from_mb(38769),
n_layers=80,
hidden_size=8192,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
# llama-3.2
"llama-3.2-1b": ModelCard(
model_id=ModelId("mlx-community/Llama-3.2-1B-Instruct-4bit"),
storage_size=Memory.from_mb(696),
n_layers=16,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"llama-3.2-3b": ModelCard(
model_id=ModelId("mlx-community/Llama-3.2-3B-Instruct-4bit"),
storage_size=Memory.from_mb(1777),
n_layers=28,
hidden_size=3072,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"llama-3.2-3b-8bit": ModelCard(
model_id=ModelId("mlx-community/Llama-3.2-3B-Instruct-8bit"),
storage_size=Memory.from_mb(3339),
n_layers=28,
hidden_size=3072,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
# llama-3.3
"llama-3.3-70b": ModelCard(
model_id=ModelId("mlx-community/Llama-3.3-70B-Instruct-4bit"),
storage_size=Memory.from_mb(38769),
n_layers=80,
hidden_size=8192,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"llama-3.3-70b-8bit": ModelCard(
model_id=ModelId("mlx-community/Llama-3.3-70B-Instruct-8bit"),
storage_size=Memory.from_mb(73242),
n_layers=80,
hidden_size=8192,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"llama-3.3-70b-fp16": ModelCard(
model_id=ModelId("mlx-community/llama-3.3-70b-instruct-fp16"),
storage_size=Memory.from_mb(137695),
n_layers=80,
hidden_size=8192,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
# qwen3
"qwen3-0.6b": ModelCard(
model_id=ModelId("mlx-community/Qwen3-0.6B-4bit"),
storage_size=Memory.from_mb(327),
n_layers=28,
hidden_size=1024,
supports_tensor=False,
tasks=[ModelTask.TextGeneration],
),
"qwen3-0.6b-8bit": ModelCard(
model_id=ModelId("mlx-community/Qwen3-0.6B-8bit"),
storage_size=Memory.from_mb(666),
n_layers=28,
hidden_size=1024,
supports_tensor=False,
tasks=[ModelTask.TextGeneration],
),
"qwen3-30b": ModelCard(
model_id=ModelId("mlx-community/Qwen3-30B-A3B-4bit"),
storage_size=Memory.from_mb(16797),
n_layers=48,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"qwen3-30b-8bit": ModelCard(
model_id=ModelId("mlx-community/Qwen3-30B-A3B-8bit"),
storage_size=Memory.from_mb(31738),
n_layers=48,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"qwen3-80b-a3B-4bit": ModelCard(
model_id=ModelId("mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit"),
storage_size=Memory.from_mb(44800),
n_layers=48,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"qwen3-80b-a3B-8bit": ModelCard(
model_id=ModelId("mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit"),
storage_size=Memory.from_mb(84700),
n_layers=48,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"qwen3-80b-a3B-thinking-4bit": ModelCard(
model_id=ModelId("mlx-community/Qwen3-Next-80B-A3B-Thinking-4bit"),
storage_size=Memory.from_mb(84700),
n_layers=48,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"qwen3-80b-a3B-thinking-8bit": ModelCard(
model_id=ModelId("mlx-community/Qwen3-Next-80B-A3B-Thinking-8bit"),
storage_size=Memory.from_mb(84700),
n_layers=48,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"qwen3-235b-a22b-4bit": ModelCard(
model_id=ModelId("mlx-community/Qwen3-235B-A22B-Instruct-2507-4bit"),
storage_size=Memory.from_gb(132),
n_layers=94,
hidden_size=4096,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"qwen3-235b-a22b-8bit": ModelCard(
model_id=ModelId("mlx-community/Qwen3-235B-A22B-Instruct-2507-8bit"),
storage_size=Memory.from_gb(250),
n_layers=94,
hidden_size=4096,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"qwen3-coder-480b-a35b-4bit": ModelCard(
model_id=ModelId("mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit"),
storage_size=Memory.from_gb(270),
n_layers=62,
hidden_size=6144,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"qwen3-coder-480b-a35b-8bit": ModelCard(
model_id=ModelId("mlx-community/Qwen3-Coder-480B-A35B-Instruct-8bit"),
storage_size=Memory.from_gb(540),
n_layers=62,
hidden_size=6144,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
# gpt-oss
"gpt-oss-120b-MXFP4-Q8": ModelCard(
model_id=ModelId("mlx-community/gpt-oss-120b-MXFP4-Q8"),
storage_size=Memory.from_kb(68_996_301),
n_layers=36,
hidden_size=2880,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"gpt-oss-20b-MXFP4-Q8": ModelCard(
model_id=ModelId("mlx-community/gpt-oss-20b-MXFP4-Q8"),
storage_size=Memory.from_kb(11_744_051),
n_layers=24,
hidden_size=2880,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
# glm 4.5
"glm-4.5-air-8bit": ModelCard(
# Needs to be quantized g32 or g16 to work with tensor parallel
model_id=ModelId("mlx-community/GLM-4.5-Air-8bit"),
storage_size=Memory.from_gb(114),
n_layers=46,
hidden_size=4096,
supports_tensor=False,
tasks=[ModelTask.TextGeneration],
),
"glm-4.5-air-bf16": ModelCard(
model_id=ModelId("mlx-community/GLM-4.5-Air-bf16"),
storage_size=Memory.from_gb(214),
n_layers=46,
hidden_size=4096,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
# glm 4.7
"glm-4.7-4bit": ModelCard(
model_id=ModelId("mlx-community/GLM-4.7-4bit"),
storage_size=Memory.from_bytes(198556925568),
n_layers=91,
hidden_size=5120,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"glm-4.7-6bit": ModelCard(
model_id=ModelId("mlx-community/GLM-4.7-6bit"),
storage_size=Memory.from_bytes(286737579648),
n_layers=91,
hidden_size=5120,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"glm-4.7-8bit-gs32": ModelCard(
model_id=ModelId("mlx-community/GLM-4.7-8bit-gs32"),
storage_size=Memory.from_bytes(396963397248),
n_layers=91,
hidden_size=5120,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
# glm 4.7 flash
"glm-4.7-flash-4bit": ModelCard(
model_id=ModelId("mlx-community/GLM-4.7-Flash-4bit"),
storage_size=Memory.from_gb(18),
n_layers=47,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"glm-4.7-flash-5bit": ModelCard(
model_id=ModelId("mlx-community/GLM-4.7-Flash-5bit"),
storage_size=Memory.from_gb(21),
n_layers=47,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"glm-4.7-flash-6bit": ModelCard(
model_id=ModelId("mlx-community/GLM-4.7-Flash-6bit"),
storage_size=Memory.from_gb(25),
n_layers=47,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"glm-4.7-flash-8bit": ModelCard(
model_id=ModelId("mlx-community/GLM-4.7-Flash-8bit"),
storage_size=Memory.from_gb(32),
n_layers=47,
hidden_size=2048,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
# minimax-m2
"minimax-m2.1-8bit": ModelCard(
model_id=ModelId("mlx-community/MiniMax-M2.1-8bit"),
storage_size=Memory.from_bytes(242986745856),
n_layers=61,
hidden_size=3072,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
"minimax-m2.1-3bit": ModelCard(
model_id=ModelId("mlx-community/MiniMax-M2.1-3bit"),
storage_size=Memory.from_bytes(100086644736),
n_layers=61,
hidden_size=3072,
supports_tensor=True,
tasks=[ModelTask.TextGeneration],
),
}
_IMAGE_BASE_MODEL_CARDS: dict[str, ModelCard] = {
"flux1-schnell": ModelCard(
model_id=ModelId("exolabs/FLUX.1-schnell"),
storage_size=Memory.from_bytes(23782357120 + 9524621312),
n_layers=57,
hidden_size=1,
supports_tensor=False,
tasks=[ModelTask.TextToImage],
components=[
ComponentInfo(
component_name="text_encoder",
component_path="text_encoder/",
storage_size=Memory.from_kb(0),
n_layers=12,
can_shard=False,
safetensors_index_filename=None,
),
ComponentInfo(
component_name="text_encoder_2",
component_path="text_encoder_2/",
storage_size=Memory.from_bytes(9524621312),
n_layers=24,
can_shard=False,
safetensors_index_filename="model.safetensors.index.json",
),
ComponentInfo(
component_name="transformer",
component_path="transformer/",
storage_size=Memory.from_bytes(23782357120),
n_layers=57,
can_shard=True,
safetensors_index_filename="diffusion_pytorch_model.safetensors.index.json",
),
ComponentInfo(
component_name="vae",
component_path="vae/",
storage_size=Memory.from_kb(0),
n_layers=None,
can_shard=False,
safetensors_index_filename=None,
),
],
),
"flux1-dev": ModelCard(
model_id=ModelId("exolabs/FLUX.1-dev"),
storage_size=Memory.from_bytes(23782357120 + 9524621312),
n_layers=57,
hidden_size=1,
supports_tensor=False,
tasks=[ModelTask.TextToImage],
components=[
ComponentInfo(
component_name="text_encoder",
component_path="text_encoder/",
storage_size=Memory.from_kb(0),
n_layers=12,
can_shard=False,
safetensors_index_filename=None,
),
ComponentInfo(
component_name="text_encoder_2",
component_path="text_encoder_2/",
storage_size=Memory.from_bytes(9524621312),
n_layers=24,
can_shard=False,
safetensors_index_filename="model.safetensors.index.json",
),
ComponentInfo(
component_name="transformer",
component_path="transformer/",
storage_size=Memory.from_bytes(23802816640),
n_layers=57,
can_shard=True,
safetensors_index_filename="diffusion_pytorch_model.safetensors.index.json",
),
ComponentInfo(
component_name="vae",
component_path="vae/",
storage_size=Memory.from_kb(0),
n_layers=None,
can_shard=False,
safetensors_index_filename=None,
),
],
),
"flux1-krea-dev": ModelCard(
model_id=ModelId("exolabs/FLUX.1-Krea-dev"),
storage_size=Memory.from_bytes(23802816640 + 9524621312), # Same as dev
n_layers=57,
hidden_size=1,
supports_tensor=False,
tasks=[ModelTask.TextToImage],
components=[
ComponentInfo(
component_name="text_encoder",
component_path="text_encoder/",
storage_size=Memory.from_kb(0),
n_layers=12,
can_shard=False,
safetensors_index_filename=None,
),
ComponentInfo(
component_name="text_encoder_2",
component_path="text_encoder_2/",
storage_size=Memory.from_bytes(9524621312),
n_layers=24,
can_shard=False,
safetensors_index_filename="model.safetensors.index.json",
),
ComponentInfo(
component_name="transformer",
component_path="transformer/",
storage_size=Memory.from_bytes(23802816640),
n_layers=57,
can_shard=True,
safetensors_index_filename="diffusion_pytorch_model.safetensors.index.json",
),
ComponentInfo(
component_name="vae",
component_path="vae/",
storage_size=Memory.from_kb(0),
n_layers=None,
can_shard=False,
safetensors_index_filename=None,
),
],
),
"qwen-image": ModelCard(
model_id=ModelId("exolabs/Qwen-Image"),
storage_size=Memory.from_bytes(16584333312 + 40860802176),
n_layers=60,
hidden_size=1,
supports_tensor=False,
tasks=[ModelTask.TextToImage],
components=[
ComponentInfo(
component_name="text_encoder",
component_path="text_encoder/",
storage_size=Memory.from_bytes(16584333312),
n_layers=12,
can_shard=False,
safetensors_index_filename=None,
),
ComponentInfo(
component_name="transformer",
component_path="transformer/",
storage_size=Memory.from_bytes(40860802176),
n_layers=60,
can_shard=True,
safetensors_index_filename="diffusion_pytorch_model.safetensors.index.json",
),
ComponentInfo(
component_name="vae",
component_path="vae/",
storage_size=Memory.from_kb(0),
n_layers=None,
can_shard=False,
safetensors_index_filename=None,
),
],
),
"qwen-image-edit-2509": ModelCard(
model_id=ModelId("exolabs/Qwen-Image-Edit-2509"),
storage_size=Memory.from_bytes(16584333312 + 40860802176),
n_layers=60,
hidden_size=1,
supports_tensor=False,
tasks=[ModelTask.ImageToImage],
components=[
ComponentInfo(
component_name="text_encoder",
component_path="text_encoder/",
storage_size=Memory.from_bytes(16584333312),
n_layers=12,
can_shard=False,
safetensors_index_filename=None,
),
ComponentInfo(
component_name="transformer",
component_path="transformer/",
storage_size=Memory.from_bytes(40860802176),
n_layers=60,
can_shard=True,
safetensors_index_filename="diffusion_pytorch_model.safetensors.index.json",
),
ComponentInfo(
component_name="vae",
component_path="vae/",
storage_size=Memory.from_kb(0),
n_layers=None,
can_shard=False,
safetensors_index_filename=None,
),
],
),
}
async def delete_custom_card(model_id: ModelId) -> bool:
"""Delete a user-added custom model card. Returns True if deleted."""
card_path = _custom_cards_dir / (ModelId(model_id).normalize() + ".toml")
if await card_path.exists():
await card_path.unlink()
_card_cache.pop(model_id, None)
return True
return False
def _generate_image_model_quant_variants(
base_name: str,
base_card: ModelCard,
) -> dict[str, ModelCard]:
"""Create quantized variants of an image model card.
def is_custom_card(model_id: ModelId) -> bool:
"""Check if a model card exists in the custom cards directory."""
import os
Only the transformer component is quantized; text encoders stay at bf16.
Sizes are calculated exactly from the base card's component sizes.
"""
if base_card.components is None:
raise ValueError(f"Image model {base_name} must have components defined")
# quantizations = [8, 6, 5, 4, 3]
quantizations = [8, 4]
num_transformer_bytes = next(
c.storage_size.in_bytes
for c in base_card.components
if c.component_name == "transformer"
card_path = Path(str(EXO_CUSTOM_MODEL_CARDS_DIR)) / (
ModelId(model_id).normalize() + ".toml"
)
transformer_bytes = Memory.from_bytes(num_transformer_bytes)
remaining_bytes = Memory.from_bytes(
sum(
c.storage_size.in_bytes
for c in base_card.components
if c.component_name != "transformer"
)
)
def with_transformer_size(new_size: Memory) -> list[ComponentInfo]:
assert base_card.components is not None
return [
ComponentInfo(
component_name=c.component_name,
component_path=c.component_path,
storage_size=new_size
if c.component_name == "transformer"
else c.storage_size,
n_layers=c.n_layers,
can_shard=c.can_shard,
safetensors_index_filename=c.safetensors_index_filename,
)
for c in base_card.components
]
variants = {
base_name: ModelCard(
model_id=base_card.model_id,
storage_size=transformer_bytes + remaining_bytes,
n_layers=base_card.n_layers,
hidden_size=base_card.hidden_size,
supports_tensor=base_card.supports_tensor,
tasks=base_card.tasks,
components=with_transformer_size(transformer_bytes),
)
}
for quant in quantizations:
quant_transformer_bytes = Memory.from_bytes(
(num_transformer_bytes * quant) // 16
)
total_bytes = remaining_bytes + quant_transformer_bytes
model_id = ModelId(base_card.model_id + f"-{quant}bit")
variants[f"{base_name}-{quant}bit"] = ModelCard(
model_id=model_id,
storage_size=total_bytes,
n_layers=base_card.n_layers,
hidden_size=base_card.hidden_size,
supports_tensor=base_card.supports_tensor,
tasks=base_card.tasks,
components=with_transformer_size(quant_transformer_bytes),
)
return variants
_image_model_cards: dict[str, ModelCard] = {}
for _base_name, _base_card in _IMAGE_BASE_MODEL_CARDS.items():
_image_model_cards |= _generate_image_model_quant_variants(_base_name, _base_card)
_IMAGE_MODEL_CARDS = _image_model_cards
if EXO_ENABLE_IMAGE_MODELS:
MODEL_CARDS.update(_IMAGE_MODEL_CARDS)
return os.path.isfile(str(card_path))
class ConfigData(BaseModel):
@@ -767,7 +208,7 @@ class ConfigData(BaseModel):
return data
async def get_config_data(model_id: ModelId) -> ConfigData:
async def fetch_config_data(model_id: ModelId) -> ConfigData:
"""Downloads and parses config.json for a model."""
from exo.download.download_utils import (
download_file_with_retry,
@@ -789,7 +230,7 @@ async def get_config_data(model_id: ModelId) -> ConfigData:
return ConfigData.model_validate_json(await f.read())
async def get_safetensors_size(model_id: ModelId) -> Memory:
async def fetch_safetensors_size(model_id: ModelId) -> Memory:
"""Gets model size from safetensors index or falls back to HF API."""
from exo.download.download_utils import (
download_file_with_retry,

238
src/exo/shared/tracing.py Normal file
View File

@@ -0,0 +1,238 @@
from __future__ import annotations
import json
import time
from collections import defaultdict
from collections.abc import Generator
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from pathlib import Path
from typing import cast, final
from exo.shared.constants import EXO_TRACING_ENABLED
from exo.worker.runner.bootstrap import logger
# Context variable to track the current trace category for hierarchical nesting
_current_category: ContextVar[str | None] = ContextVar("current_category", default=None)
@final
@dataclass(frozen=True)
class TraceEvent:
name: str
start_us: int
duration_us: int
rank: int
category: str
@final
@dataclass
class CategoryStats:
total_us: int = 0
count: int = 0
min_us: int = 0
max_us: int = 0
def add(self, duration_us: int) -> None:
if self.count == 0:
self.min_us = duration_us
self.max_us = duration_us
else:
self.min_us = min(self.min_us, duration_us)
self.max_us = max(self.max_us, duration_us)
self.total_us += duration_us
self.count += 1
@property
def avg_us(self) -> float:
return self.total_us / self.count if self.count > 0 else 0.0
@final
@dataclass
class TraceStats:
total_wall_time_us: int = 0
by_category: dict[str, CategoryStats] = field(default_factory=dict)
by_rank: dict[int, dict[str, CategoryStats]] = field(default_factory=dict)
# Global trace buffer - each rank accumulates traces here
_trace_buffer: list[TraceEvent] = []
def _record_span(
name: str, start_us: int, duration_us: int, rank: int, category: str
) -> None:
_trace_buffer.append(
TraceEvent(
name=name,
start_us=start_us,
duration_us=duration_us,
rank=rank,
category=category,
)
)
@contextmanager
def trace(
name: str,
rank: int,
category: str = "compute",
) -> Generator[None, None, None]:
"""Context manager to trace any operation.
Nested traces automatically inherit the parent category, creating hierarchical
categories like "sync/compute" or "async/comms".
Args:
name: Name of the operation (e.g., "recv 0", "send 1", "joint_blocks")
rank: This rank's ID
category: Category for grouping in trace viewer ("comm", "compute", "step")
Example:
with trace(f"sync {t}", rank, "sync"):
with trace("joint_blocks", rank, "compute"):
# Recorded with category "sync/compute"
hidden_states = some_computation(...)
"""
if not EXO_TRACING_ENABLED:
yield
return
# Combine with parent category if nested
parent = _current_category.get()
full_category = f"{parent}/{category}" if parent else category
# Set as current for nested traces
token = _current_category.set(full_category)
try:
start_us = int(time.time() * 1_000_000)
start_perf = time.perf_counter()
yield
duration_us = int((time.perf_counter() - start_perf) * 1_000_000)
_record_span(name, start_us, duration_us, rank, full_category)
finally:
_current_category.reset(token)
def get_trace_buffer() -> list[TraceEvent]:
return list(_trace_buffer)
def clear_trace_buffer() -> None:
_trace_buffer.clear()
def export_trace(traces: list[TraceEvent], output_path: Path) -> None:
trace_events: list[dict[str, object]] = []
for event in traces:
# Chrome trace format uses "X" for complete events (with duration)
chrome_event: dict[str, object] = {
"name": event.name,
"cat": event.category,
"ph": "X",
"ts": event.start_us,
"dur": event.duration_us,
"pid": 0,
"tid": event.rank,
"args": {"rank": event.rank},
}
trace_events.append(chrome_event)
ranks_seen = set(t.rank for t in traces)
for rank in ranks_seen:
trace_events.append(
{
"name": "thread_name",
"ph": "M", # Metadata event
"pid": 0,
"tid": rank,
"args": {"name": f"Rank {rank}"},
}
)
chrome_trace = {"traceEvents": trace_events}
try:
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w") as f:
json.dump(chrome_trace, f, indent=2)
except OSError as e:
logger.warning("Failed to export trace to %s: %s", output_path, e)
def load_trace_file(path: Path) -> list[TraceEvent]:
with open(path) as f:
data = cast(dict[str, list[dict[str, object]]], json.load(f))
events = data.get("traceEvents", [])
traces: list[TraceEvent] = []
for event in events:
# Skip metadata events
if event.get("ph") == "M":
continue
name = str(event.get("name", ""))
category = str(event.get("cat", ""))
ts_value = event.get("ts", 0)
dur_value = event.get("dur", 0)
tid_value = event.get("tid", 0)
start_us = int(ts_value) if isinstance(ts_value, (int, float, str)) else 0
duration_us = int(dur_value) if isinstance(dur_value, (int, float, str)) else 0
# Get rank from tid or args
rank = int(tid_value) if isinstance(tid_value, (int, float, str)) else 0
args = event.get("args")
if isinstance(args, dict):
args_dict = cast(dict[str, object], args)
rank_from_args = args_dict.get("rank")
if isinstance(rank_from_args, (int, float, str)):
rank = int(rank_from_args)
traces.append(
TraceEvent(
name=name,
start_us=start_us,
duration_us=duration_us,
rank=rank,
category=category,
)
)
return traces
def compute_stats(traces: list[TraceEvent]) -> TraceStats:
stats = TraceStats()
if not traces:
return stats
# Calculate wall time from earliest start to latest end
min_start = min(t.start_us for t in traces)
max_end = max(t.start_us + t.duration_us for t in traces)
stats.total_wall_time_us = max_end - min_start
# Initialize nested dicts
by_category: dict[str, CategoryStats] = defaultdict(CategoryStats)
by_rank: dict[int, dict[str, CategoryStats]] = defaultdict(
lambda: defaultdict(CategoryStats)
)
for event in traces:
# By category
by_category[event.category].add(event.duration_us)
# By rank and category
by_rank[event.rank][event.category].add(event.duration_us)
stats.by_category = dict(by_category)
stats.by_rank = {k: dict(v) for k, v in by_rank.items()}
return stats

View File

@@ -2,7 +2,6 @@ import time
from collections.abc import Generator
from typing import Annotated, Any, Literal
from fastapi import UploadFile
from pydantic import BaseModel, Field, field_validator
from pydantic_core import PydanticUseDefault
@@ -11,7 +10,7 @@ from exo.shared.types.common import CommandId, NodeId
from exo.shared.types.memory import Memory
from exo.shared.types.worker.instances import Instance, InstanceId, InstanceMeta
from exo.shared.types.worker.shards import Sharding, ShardMetadata
from exo.utils.pydantic_ext import CamelCaseModel, ConfigDict, TaggedModel
from exo.utils.pydantic_ext import CamelCaseModel
FinishReason = Literal[
"stop", "length", "tool_calls", "content_filter", "function_call", "error"
@@ -43,6 +42,11 @@ class ModelListModel(BaseModel):
storage_size_megabytes: int = Field(default=0)
supports_tensor: bool = Field(default=False)
tasks: list[str] = Field(default=[])
is_custom: bool = Field(default=False)
family: str = Field(default="")
quantization: str = Field(default="")
base_model: str = Field(default="")
capabilities: list[str] = Field(default_factory=list)
class ModelList(BaseModel):
@@ -174,10 +178,8 @@ class StreamOptions(BaseModel):
include_usage: bool = False
class ChatCompletionTaskParams(TaggedModel):
model_config = ConfigDict(extra="ignore")
model: str
class ChatCompletionRequest(BaseModel):
model: ModelId
frequency_penalty: float | None = None
messages: list[ChatCompletionMessage]
logit_bias: dict[str, int] | None = None
@@ -193,16 +195,30 @@ class ChatCompletionTaskParams(TaggedModel):
stream_options: StreamOptions | None = None
temperature: float | None = None
top_p: float | None = None
top_k: int | None = None
tools: list[dict[str, Any]] | None = None
tool_choice: str | dict[str, Any] | None = None
parallel_tool_calls: bool | None = None
user: str | None = None
class BenchChatCompletionTaskParams(ChatCompletionTaskParams):
class BenchChatCompletionRequest(ChatCompletionRequest):
pass
class AddCustomModelParams(BaseModel):
model_id: ModelId
class HuggingFaceSearchResult(BaseModel):
id: str
author: str = ""
downloads: int = 0
likes: int = 0
last_modified: str = ""
tags: list[str] = Field(default_factory=list)
class PlaceInstanceParams(BaseModel):
model_id: ModelId
sharding: Sharding = Sharding.Pipeline
@@ -283,28 +299,7 @@ class BenchImageGenerationTaskParams(ImageGenerationTaskParams):
class ImageEditsTaskParams(BaseModel):
image: UploadFile
prompt: str
background: str | None = None
input_fidelity: float | None = None
mask: UploadFile | None = None
model: str
n: int | None = 1
output_compression: int | None = None
output_format: Literal["png", "jpeg", "webp"] = "png"
partial_images: int | None = 0
quality: Literal["high", "medium", "low"] | None = "medium"
response_format: Literal["url", "b64_json"] | None = "b64_json"
size: str | None = "1024x1024"
stream: bool | None = False
user: str | None = None
advanced_params: AdvancedImageParams | None = None
# Internal flag for benchmark mode - set by API, preserved through serialization
bench: bool = False
class ImageEditsInternalParams(BaseModel):
"""Serializable version of ImageEditsTaskParams for distributed task execution."""
"""Internal task params for image-editing requests."""
image_data: str = "" # Base64-encoded image (empty when using chunked transfer)
total_input_chunks: int = 0
@@ -373,3 +368,45 @@ class StartDownloadResponse(CamelCaseModel):
class DeleteDownloadResponse(CamelCaseModel):
command_id: CommandId
class TraceEventResponse(CamelCaseModel):
name: str
start_us: int
duration_us: int
rank: int
category: str
class TraceResponse(CamelCaseModel):
task_id: str
traces: list[TraceEventResponse]
class TraceCategoryStats(CamelCaseModel):
total_us: int
count: int
min_us: int
max_us: int
avg_us: float
class TraceRankStats(CamelCaseModel):
by_category: dict[str, TraceCategoryStats]
class TraceStatsResponse(CamelCaseModel):
task_id: str
total_wall_time_us: int
by_category: dict[str, TraceCategoryStats]
by_rank: dict[int, TraceRankStats]
class TraceListItem(CamelCaseModel):
task_id: str
created_at: str
file_size: int
class TraceListResponse(CamelCaseModel):
traces: list[TraceListItem]

View File

@@ -0,0 +1,214 @@
"""Claude Messages API types for request/response conversion."""
from typing import Any, Literal
from pydantic import BaseModel, Field
from exo.shared.types.common import ModelId
# Tool definition types
ClaudeToolInputSchema = dict[str, Any]
class ClaudeToolDefinition(BaseModel, frozen=True):
"""Tool definition in Claude Messages API request."""
name: str
description: str | None = None
input_schema: ClaudeToolInputSchema
# Type aliases
ClaudeRole = Literal["user", "assistant"]
ClaudeStopReason = Literal["end_turn", "max_tokens", "stop_sequence", "tool_use"]
# Content block types
class ClaudeTextBlock(BaseModel, frozen=True):
"""Text content block in Claude Messages API."""
type: Literal["text"] = "text"
text: str
class ClaudeImageSource(BaseModel, frozen=True):
"""Image source for Claude image blocks."""
type: Literal["base64", "url"]
media_type: str | None = None
data: str | None = None
url: str | None = None
class ClaudeImageBlock(BaseModel, frozen=True):
"""Image content block in Claude Messages API."""
type: Literal["image"] = "image"
source: ClaudeImageSource
class ClaudeToolUseBlock(BaseModel, frozen=True):
"""Tool use content block in Claude Messages API."""
type: Literal["tool_use"] = "tool_use"
id: str
name: str
input: dict[str, Any]
class ClaudeToolResultBlock(BaseModel, frozen=True):
"""Tool result content block in Claude Messages API request."""
type: Literal["tool_result"] = "tool_result"
tool_use_id: str
content: str | list[ClaudeTextBlock] | None = None
is_error: bool | None = None
cache_control: dict[str, str] | None = None
ClaudeContentBlock = ClaudeTextBlock | ClaudeImageBlock | ClaudeToolUseBlock
# Input content blocks can also include tool_result (sent by user after tool_use)
ClaudeInputContentBlock = (
ClaudeTextBlock | ClaudeImageBlock | ClaudeToolUseBlock | ClaudeToolResultBlock
)
# Request types
class ClaudeMessage(BaseModel, frozen=True):
"""Message in Claude Messages API request."""
role: ClaudeRole
content: str | list[ClaudeInputContentBlock]
class ClaudeMessagesRequest(BaseModel):
"""Request body for Claude Messages API."""
model: ModelId
max_tokens: int
messages: list[ClaudeMessage]
system: str | list[ClaudeTextBlock] | None = None
stop_sequences: list[str] | None = None
stream: bool = False
temperature: float | None = None
top_p: float | None = None
top_k: int | None = None
tools: list[ClaudeToolDefinition] | None = None
metadata: dict[str, str] | None = None
# Response types
class ClaudeUsage(BaseModel, frozen=True):
"""Token usage in Claude Messages API response."""
input_tokens: int
output_tokens: int
class ClaudeMessagesResponse(BaseModel, frozen=True):
"""Response body for Claude Messages API."""
id: str
type: Literal["message"] = "message"
role: Literal["assistant"] = "assistant"
content: list[ClaudeContentBlock]
model: str
stop_reason: ClaudeStopReason | None = None
stop_sequence: str | None = None
usage: ClaudeUsage
# Streaming event types
class ClaudeMessageStart(BaseModel, frozen=True):
"""Partial message in message_start event."""
id: str
type: Literal["message"] = "message"
role: Literal["assistant"] = "assistant"
content: list[ClaudeTextBlock] = Field(default_factory=list)
model: str
stop_reason: ClaudeStopReason | None = None
stop_sequence: str | None = None
usage: ClaudeUsage
class ClaudeMessageStartEvent(BaseModel, frozen=True):
"""Event sent at start of message stream."""
type: Literal["message_start"] = "message_start"
message: ClaudeMessageStart
class ClaudeContentBlockStartEvent(BaseModel, frozen=True):
"""Event sent at start of a content block."""
type: Literal["content_block_start"] = "content_block_start"
index: int
content_block: ClaudeTextBlock | ClaudeToolUseBlock
class ClaudeTextDelta(BaseModel, frozen=True):
"""Delta for text content block."""
type: Literal["text_delta"] = "text_delta"
text: str
class ClaudeInputJsonDelta(BaseModel, frozen=True):
"""Delta for tool use input JSON content block."""
type: Literal["input_json_delta"] = "input_json_delta"
partial_json: str
class ClaudeContentBlockDeltaEvent(BaseModel, frozen=True):
"""Event sent for content block delta."""
type: Literal["content_block_delta"] = "content_block_delta"
index: int
delta: ClaudeTextDelta | ClaudeInputJsonDelta
class ClaudeContentBlockStopEvent(BaseModel, frozen=True):
"""Event sent at end of a content block."""
type: Literal["content_block_stop"] = "content_block_stop"
index: int
class ClaudeMessageDeltaUsage(BaseModel, frozen=True):
"""Usage in message_delta event."""
output_tokens: int
class ClaudeMessageDelta(BaseModel, frozen=True):
"""Delta in message_delta event."""
stop_reason: ClaudeStopReason | None = None
stop_sequence: str | None = None
class ClaudeMessageDeltaEvent(BaseModel, frozen=True):
"""Event sent with final message delta."""
type: Literal["message_delta"] = "message_delta"
delta: ClaudeMessageDelta
usage: ClaudeMessageDeltaUsage
class ClaudeMessageStopEvent(BaseModel, frozen=True):
"""Event sent at end of message stream."""
type: Literal["message_stop"] = "message_stop"
ClaudeStreamEvent = (
ClaudeMessageStartEvent
| ClaudeContentBlockStartEvent
| ClaudeContentBlockDeltaEvent
| ClaudeContentBlockStopEvent
| ClaudeMessageDeltaEvent
| ClaudeMessageStopEvent
)

View File

@@ -2,13 +2,12 @@ from pydantic import Field
from exo.shared.models.model_cards import ModelCard, ModelId
from exo.shared.types.api import (
BenchChatCompletionTaskParams,
ChatCompletionTaskParams,
ImageEditsInternalParams,
ImageEditsTaskParams,
ImageGenerationTaskParams,
)
from exo.shared.types.chunks import InputImageChunk
from exo.shared.types.common import CommandId, NodeId
from exo.shared.types.text_generation import TextGenerationTaskParams
from exo.shared.types.worker.instances import Instance, InstanceId, InstanceMeta
from exo.shared.types.worker.shards import Sharding, ShardMetadata
from exo.utils.pydantic_ext import CamelCaseModel, TaggedModel
@@ -22,16 +21,16 @@ class TestCommand(BaseCommand):
__test__ = False
class ChatCompletion(BaseCommand):
request_params: ChatCompletionTaskParams | BenchChatCompletionTaskParams
class TextGeneration(BaseCommand):
task_params: TextGenerationTaskParams
class ImageGeneration(BaseCommand):
request_params: ImageGenerationTaskParams
task_params: ImageGenerationTaskParams
class ImageEdits(BaseCommand):
request_params: ImageEditsInternalParams
task_params: ImageEditsTaskParams
class PlaceInstance(BaseCommand):
@@ -79,7 +78,7 @@ DownloadCommand = StartDownload | DeleteDownload
Command = (
TestCommand
| RequestEventLog
| ChatCompletion
| TextGeneration
| ImageGeneration
| ImageEdits
| PlaceInstance

View File

@@ -1,4 +1,5 @@
from datetime import datetime
from typing import final
from pydantic import Field
@@ -10,7 +11,7 @@ from exo.shared.types.worker.downloads import DownloadProgress
from exo.shared.types.worker.instances import Instance, InstanceId
from exo.shared.types.worker.runners import RunnerId, RunnerStatus
from exo.utils.info_gatherer.info_gatherer import GatheredInfo
from exo.utils.pydantic_ext import CamelCaseModel, TaggedModel
from exo.utils.pydantic_ext import CamelCaseModel, FrozenModel, TaggedModel
class EventId(Id):
@@ -109,6 +110,28 @@ class TopologyEdgeDeleted(BaseEvent):
conn: Connection
@final
class TraceEventData(FrozenModel):
name: str
start_us: int
duration_us: int
rank: int
category: str
@final
class TracesCollected(BaseEvent):
task_id: TaskId
rank: int
traces: list[TraceEventData]
@final
class TracesMerged(BaseEvent):
task_id: TaskId
traces: list[TraceEventData]
Event = (
TestEvent
| TaskCreated
@@ -127,6 +150,8 @@ Event = (
| InputChunkReceived
| TopologyEdgeCreated
| TopologyEdgeDeleted
| TracesCollected
| TracesMerged
)

Some files were not shown because too many files have changed in this diff Show More