- Wire up KVPrefixCache to runner and generate
- Fix exact match to return deepcopy (was returning reference)
- Fix trim_prompt_cache argument (was using wrong calculation)
- Fix token slicing to use best_snapshot_length (not index)
- Add _cache_length() using .offset for compatibility with older mlx_lm
- Fix prefill() to use max_tokens=1 with trim (workaround for mlx_lm bug)
- Add clear() method for single-cache behavior
- Remove KEEP_KV_SIZE limit from prefix matching
- Add minimal logging for cache hits/misses
Fix type errors and KV cache implementation
Type fixes for CI:
- Add KVCacheType alias matching make_kv_cache return type
- Update function signatures to use consistent cache types
- Add explicit type annotations
KV cache fixes to actually reduce TTFT:
- get_kv_cache now prefills internally and returns only the last token
- stream_generate receives 1 token on a cache hit instead of the full prompt
- Extract encode_prompt as standalone function for reuse
Refactor KV cache: move prefill to generate.py, add shared KVCacheType
Address PR feedback:
- Move KVCacheType to shared/types/mlx.py for reuse across codebase
- Move prefill logic from cache.py to generate.py
- get_kv_cache now only returns cache + remaining tokens (no prefill)
- Caller (mlx_generate) is responsible for prefilling
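A minimal sketch of the resulting split (the prefix-cache container and helper signatures here are assumptions, not exo's actual ones):
```python
from copy import deepcopy
from typing import Any

KVCacheType = list[Any]  # alias mirroring make_kv_cache's return type

def get_kv_cache(
    prefix_cache: dict[tuple[int, ...], KVCacheType],
    prompt_tokens: list[int],
) -> tuple[KVCacheType | None, list[int]]:
    """Return (reusable KV state, tokens the caller still has to prefill)."""
    best: tuple[int, ...] | None = None
    for cached_prefix in prefix_cache:
        n = len(cached_prefix)
        if tuple(prompt_tokens[:n]) == cached_prefix and (best is None or n > len(best)):
            best = cached_prefix
    if best is None:
        return None, prompt_tokens
    # Copy so generation doesn't mutate the stored snapshot in place.
    return deepcopy(prefix_cache[best]), prompt_tokens[len(best):]
```
The caller (`mlx_generate`) then prefills the remaining tokens into the returned cache before streaming.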
Fix types: regenerate mlx stubs, remove type ignores
- Regenerate cache.pyi and tokenizer_utils.pyi stubs for latest mlx_lm
- Remove # type: ignore from cache.py (now fully typed)
- Remove unnecessary type ignores from generate.py
- Use mx.equal() instead of == for proper array typing
Fix encode_prompt to not add special tokens for chat-templated prompts
Chat templates (like Kimi-K2's <|im_user|>, <|im_middle|>, etc.) already
include their own structure markers. Adding BOS/EOS tokens on top of this
corrupts the prompt structure and can slow down prefill.
Use add_special_tokens=False since the chat template defines its own structure.
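A minimal sketch of the intent, assuming a HuggingFace-style tokenizer that accepts `add_special_tokens`:
```python
def encode_prompt(tokenizer, prompt: str) -> list[int]:
    # The chat template already provides its own structure markers,
    # so the tokenizer must not add BOS/EOS on top of them.
    return tokenizer.encode(prompt, add_special_tokens=False)
```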
Add prefill logging with progress callbacks and timing stats
The prettier-svelte formatter depended on the full dashboard build
(dashboardFull), causing the devshell to rebuild whenever any dashboard
source file changed.
Created a deps-only dream2nix derivation (deps.nix) that uses a stub
source containing only package.json, package-lock.json, and minimal
files for vite to succeed. Updated prettier-svelte to use this
derivation instead of dashboardFull.
The stub source is constant unless lockfiles change, so prettier-svelte
and the devshell no longer rebuild when dashboard source files are
modified.
Test plan:
- nix flake check passed
- nix fmt successfully formatted svelte files
## Motivation
Image editing was missing UI controls for quality, output format, and
advanced parameters that text-to-image generation already supported.
## Changes
- Added quality, output_format, and advanced_params to image edit API
endpoints
- Extended isImageModel check to include image editing models
## Why It Works
The API now accepts and forwards these settings for image edits, and the
UI displays the appropriate controls for image editing models.
## Test Plan
### Manual Testing
Verified parameters can be set in the UI and that they propagate through to
model inference
## Motivation
Support OpenAI API `n` setting
## Changes
- Users can select `n` to generate more than one image with the same
prompt
- Each image uses a different seed, so results differ
- `stream` and `partial_images` settings can be overridden in the UI
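A minimal sketch of the per-image seed idea (names are illustrative, not exo's actual API):
```python
import random

def seeds_for_batch(n: int, base_seed: int | None = None) -> list[int]:
    """Derive a distinct seed per image so the same prompt yields n different results."""
    rng = random.Random(base_seed)
    return [rng.randrange(2**32) for _ in range(n)]
```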
## Motivation
Fixes tool calling crashes with GLM-4.7-Flash and Kimi-K2 models.
Related: #1254
Two distinct issues were causing crashes:
1. **Tool parser crashes** - The upstream GLM47 and Kimi tool parsers
call `.group()` on regex matches without checking for `None`, causing
`AttributeError` when the model outputs malformed tool calls
2. **Chat template crashes** - GLM's chat template expects
`tool_calls[].function.arguments` to be a dict, but OpenAI format
provides it as a JSON string, causing `'str object' has no attribute
'items'`
## Changes
**`src/exo/worker/runner/runner.py`:**
- Add `patch_glm_tokenizer()` - fixed version of mlx_lm's glm47 parser
with None checks
- Fix `patch_kimi_tokenizer()` - add None checks before calling
`.group()` on regex matches
- Add `ValueError` and `AttributeError` to exception handling in
`parse_tool_calls()`
**`src/exo/worker/engines/mlx/utils_mlx.py`:**
- Add `_normalize_tool_calls()` - parses
`tool_calls[].function.arguments` from JSON string to dict for templates
that expect dicts (like GLM-4.7-Flash)
## Why It Works
1. **Parser fixes**: By checking if regex matches are `None` before
calling `.group()`, we can raise a proper `ValueError` instead of
crashing with `AttributeError`
2. **Template fix**: The GLM-4.7-Flash chat template iterates over
arguments with `.items()`:
```jinja2
{% set _args = tc.arguments %}{% for k, v in _args.items() %}
```
OpenAI format has `arguments` as a JSON string.
`_normalize_tool_calls()` parses this to a dict before passing to the
template.
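A minimal sketch of the normalization step (the real `_normalize_tool_calls()` in `utils_mlx.py` operates on exo's message types, so details will differ):
```python
import json

def _normalize_tool_calls(messages: list[dict]) -> list[dict]:
    """Parse tool_calls[].function.arguments from a JSON string into a dict."""
    for message in messages:
        for tool_call in message.get("tool_calls") or []:
            function = tool_call.get("function", {})
            if isinstance(function.get("arguments"), str):
                function["arguments"] = json.loads(function["arguments"])
    return messages
```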
## Test Plan
### Manual Testing
- Hardware: Mac with GLM-4.7-Flash-4bit model
- Tested tool calling with GLM model - no longer crashes
### Automated Testing
- Existing tests pass (`uv run pytest`)
- Type checking passes (`uv run basedpyright`)
- Linting passes (`uv run ruff check`)
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Motivation
When selecting a model for placement, users often want to see placements
that utilize specific nodes in their cluster. Currently there's no way
to filter the placement previews to focus on configurations that include
particular machines.
## Changes
- **Backend**: Added `node_ids` query parameter to the
`/placement-previews` API endpoint. When provided, the endpoint filters
the topology to only include the specified nodes before generating
placements using the new `Topology.filter_to_nodes()` method.
- **Topology class**: Added `filter_to_nodes(node_ids)` method that
creates a new topology containing only the specified nodes and edges
between them.
- **App store**: Added `previewNodeFilter` state to track selected
nodes, with methods to toggle/clear the filter. Automatically cleans up
filter when nodes are removed from the cluster and re-fetches previews
when topology changes.
- **TopologyGraph component**: Added click handlers to toggle node
filter selection, hover effects to indicate clickable nodes, and visual
styling (yellow highlight for selected, dimmed for filtered-out nodes).
- **Main page**: Added filter indicator in top-right corner of topology
showing active filter count with a clear button.
## Why It Works
The filtering happens at the backend/placement generation level rather
than just filtering the results. This ensures we see all valid placement
combinations for the selected nodes, not just a subset that happened to
be generated for the full topology.
The visual feedback uses the same rendering approach as the existing
highlight system - state is tracked in Svelte and applied during render,
so it persists across data updates without flickering.
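A minimal sketch of `filter_to_nodes()`, assuming a set-of-nodes plus adjacency-map representation (the real Topology class may store nodes and edges differently):
```python
from dataclasses import dataclass, field

@dataclass
class Topology:
    nodes: set[str] = field(default_factory=set)
    edges: dict[str, set[str]] = field(default_factory=dict)  # adjacency map

    def filter_to_nodes(self, node_ids: set[str]) -> "Topology":
        """Return a new topology with only the given nodes and the edges between them."""
        kept = self.nodes & node_ids
        return Topology(
            nodes=kept,
            edges={n: (nbrs & kept) for n, nbrs in self.edges.items() if n in kept},
        )
```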
## Test Plan
### Manual Testing
- Click a node in topology → should show yellow highlight and filter
indicator
- Click another node → indicator shows "2 nodes", previews update to
show only placements using both
- Hover over nodes → subtle yellow highlight indicates they're clickable
- Click X on filter indicator → clears filter, shows all placements
again
- Disconnect a node while it's in filter → filter auto-removes that node
### Automated Testing
- Existing tests cover the Topology class; the new `filter_to_nodes`
method follows the same patterns
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
The previous approach installed a LaunchDaemon plist that ran
periodically to disable Thunderbolt Bridge. This required full admin
privileges upfront and ran regardless of whether a problematic loop
existed.
This change replaces that with dynamic detection - only prompting the
user when an actual TB bridge loop with 3+ machines is detected, and
using fine-grained SCPreferences authorization instead of full admin.
## Changes
**Backend (Python):**
- Added `ThunderboltBridgeStatus` model to track bridge enabled/exists
state per node
- Added `node_thunderbolt_bridge` and `thunderbolt_bridge_cycles` fields
to State
- Added `get_thunderbolt_bridge_cycles()` method to Topology class
- **Robust TB bridge detection:**
- Finds bridge network services from `-listnetworkserviceorder` (not
`-listallhardwareports` which can miss bridges)
- Checks each bridge's member interfaces via `ifconfig` to verify it
contains Thunderbolt interfaces
- Handles varying service names (e.g., "TB Bridge", "Thunderbolt
Bridge", "Bridge (bridge0)")
- Includes `service_name` in status for correct disable commands
- Added warning logs for all error cases in detection
- Updated `apply.py` to handle the new event type and recompute cycles
on node timeout
**Swift App:**
- New `ThunderboltBridgeService` that monitors for cycles from cluster
state
- Shows NSAlert when a cycle with >2 machines is detected
- Uses `SCPreferencesCreateWithAuthorization` with
`system.services.systemconfiguration.network` right for targeted
permissions
- **Auto-cleanup of legacy LaunchDaemon:** On app startup, checks for
and removes old plist/scripts (non-fatal if user cancels)
- **Periodic local network checking:** Re-checks every 10s so the
warning disappears when user grants permission
- **Fixed ClusterState model:** Updated to decode new granular state
fields (`nodeIdentities`, `nodeMemory`, `nodeSystem`,
`nodeThunderboltBridge`) with computed `nodeProfiles` property for
backwards compatibility
- **Fixed Topology model:** Updated to match actual JSON structure where
`nodes` is an array of strings (not objects) and `connections` is a
nested map (not flat array)
- Cleaned up `NetworkSetupHelper` by removing daemon installation code
(now only handles uninstall)
**Dashboard:**
- Added yellow warning badge on topology when TB bridge cycle detected
- On hover: highlights affected nodes in yellow on the topology graph
- Shows which machines are in the cycle with friendly names
- Provides copy-paste terminal command with the correct service name:
```
sudo networksetup -setnetworkserviceenabled "<service-name>" off
```
- Warning appears in all topology views (full, welcome, and minimized
chat sidebar)
- **Debug mode:** Shows "TB:ON" or "TB:OFF" status next to each node in
the topology
## Why It Works
- Cycle detection happens on the backend where we have full topology
information
- Only cycles with 3+ machines are flagged (2-node connections are fine); see the sketch after this list
- TB bridge detection is robust:
- Uses `-listnetworkserviceorder` to find bridges (works on all machines
tested)
- Verifies bridge membership via `ifconfig` to confirm Thunderbolt
interfaces
- Handles different service names across machines
- The Swift app reacts to detected cycles and prompts the user once per
cycle
- The dashboard provides visual feedback and actionable instructions
- `SCPreferencesCreateWithAuthorization` provides the minimal
permissions needed to modify network service state
- Legacy LaunchDaemon is automatically cleaned up on first launch with
this version
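A minimal sketch of the 3+-machine cycle check over the TB-bridge links (graph representation is an assumption; the real logic is `Topology.get_thunderbolt_bridge_cycles()`):
```python
def bridge_cycle_components(adj: dict[str, set[str]]) -> list[set[str]]:
    """Return connected components that contain a cycle.

    In a simple undirected graph, a component has a cycle iff it has at least
    as many edges as nodes, and any such cycle involves 3+ machines.
    """
    seen: set[str] = set()
    flagged: list[set[str]] = []
    for start in adj:
        if start in seen:
            continue
        component: set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj.get(node, set()))
        seen |= component
        edges = sum(len(adj.get(n, set()) & component) for n in component) // 2
        if len(component) >= 3 and edges >= len(component):
            flagged.append(component)
    return flagged
```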
## Test Plan
### Manual Testing
Here EXO detected a TB bridge cycle:
#### Dashboard:
<img width="1363" height="884" alt="Screenshot 2026-01-21 at 10 07
30 PM"
src="https://github.com/user-attachments/assets/7da9c621-0c91-42c4-898e-4952188a1f61"
/>
#### Hovering the warning:
<img width="359" height="279" alt="Screenshot 2026-01-21 at 16 30 57"
src="https://github.com/user-attachments/assets/05501dcf-3d4a-4704-9f38-257748c05a53"
/>
#### macOS app warning popup:
<img width="270" height="410" alt="Screenshot 2026-01-21 at 16 29 08"
src="https://github.com/user-attachments/assets/45714427-08c3-4fb4-9e61-144925c51adf"
/>
#### Which then asks for the user's password:
<img width="263" height="372" alt="Screenshot 2026-01-21 at 16 29 28"
src="https://github.com/user-attachments/assets/7502e591-596d-4128-8cf5-6a12674e27bc"
/>
Which, when entered, successfully disables the bridge; the dashboard no longer
shows the warning.
#### When it fails it shows the error message:
<img width="263" height="234" alt="Screenshot 2026-01-21 at 14 45 38"
src="https://github.com/user-attachments/assets/2d10b3d5-69d7-46ea-b631-d52d8651ab41"
/>
### Automated Testing
- Type checker: 0 errors (`uv run basedpyright`)
- Linter: All checks passed (`uv run ruff check`)
- Tests: 118 passed (`uv run pytest`)
- Dashboard: Builds successfully (`npm run build`)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
black-forest-labs models require hf auth and signup to download. We
don't handle this gracefully.
https://github.com/exo-explore/exo/issues/1242
## Changes
- Handle auth errors
- Surface error to UI and suggest resolution
- Support using the HF_TOKEN env variable for authentication
- Hide image functionality behind `EXO_ENABLE_IMAGE_MODELS=true` for now
## Why It Works
Users are presented with actionable feedback when the issue occurs
## Test Plan
### Manual Testing
Confirmed that loading a black-forest-labs model surfaces the issue in the UI.
Confirmed both `hf auth login` and setting `HF_TOKEN` resolve the issue
## Motivation
<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
The image generation feature is not stable and is causing issues for users.
Fixes #1242
## Changes
- Commented out image model cards (flux1-schnell, flux1-dev, qwen-image,
qwen-image-edit-2509) in `src/exo/shared/models/model_cards.py`
- Added reference to issue #1242 in the comment explaining why they are
disabled
## Why It Works
By commenting out the model cards, these image models will no longer
appear in the model list, preventing users from attempting to use the
unstable feature until it is stabilized.
## Test Plan
### Manual Testing
- Run exo and verify image models no longer appear in the model list
### Automated Testing
- No changes to automated tests needed - this simply removes models from
the available list
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
The downloads page showed "0B / 0B" for models that hadn't started
downloading yet, because the download progress data only gets populated
after the file list is fetched from HuggingFace.
Added a fetch to the /models API endpoint on page mount and created a
helper function that falls back to storage_size_megabytes when the
download's totalBytes is 0.
This allows users to see the actual model size (e.g., "0 / 25GB")
before a download begins, which is helpful for a future feature that
lets you download models explicitly.
Test plan:
- Deployed to a cluster; entries that previously showed 0B now show sensible values.
## Motivation
When `get_shard_download_status()` runs, it iterates over all models in
`MODEL_CARDS` and calls `build_full_shard()` → `build_base_shard()` →
`ModelCard.from_hf()`. This unconditionally tried to download
`config.json` from HuggingFace, but image models (FLUX, Qwen-Image)
don't have a root-level config.json file, causing errors:
```
Error downloading shard: File not found: https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/Qwen/Qwen-Image/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/Qwen/Qwen-Image-Edit-2509/resolve/main/config.json
```
## Changes
### ModelCard.load() fix
- `build_base_shard()` now uses `ModelCard.load()` instead of
`ModelCard.from_hf()`
- `ModelCard.load()` iterates through `MODEL_CARDS.values()` to find a
match by `model_id`
### exo-bench fixes
- Use `name` field instead of `id` for model resolution
- Pass `full_model_id` to `/instance/previews` endpoint
- Make model name matching case-insensitive
- Update README example model name
## Why It Works
`MODEL_CARDS` uses short names as keys (e.g., `"flux1-schnell"`) but the
`model_id` values are HuggingFace paths (e.g.,
`"black-forest-labs/FLUX.1-schnell"`). When `ModelCard.load()` was
called with the HF path, it didn't match any key and fell back to
`from_hf()` which tried to download config.json.
The fix iterates through `MODEL_CARDS.values()` to find a match by
`model_id`, ensuring predefined models (including image models) use
their registry entries directly without network calls. A key lookup is
unnecessary since `load()` is always called with HF paths which don't
match the short-name keys.
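A minimal sketch of the lookup order (field names and registry contents are simplified from the description above):
```python
from dataclasses import dataclass

@dataclass
class ModelCard:
    model_id: str  # HuggingFace path, e.g. "black-forest-labs/FLUX.1-schnell"

# Keys are short names; values carry the HF path as model_id.
MODEL_CARDS: dict[str, ModelCard] = {
    "flux1-schnell": ModelCard(model_id="black-forest-labs/FLUX.1-schnell"),
}

def load_model_card(model_id: str) -> ModelCard | None:
    """Match predefined cards by model_id without any network call."""
    for card in MODEL_CARDS.values():
        if card.model_id == model_id:
            return card
    return None  # unknown ids presumably still fall back to ModelCard.from_hf()
```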
## Test Plan
### Manual Testing
- Run exo and verify no more "Error downloading shard: File not found:
.../config.json" errors for image models
- Run exo-bench and verify model resolution works correctly
### Automated Testing
- `uv run basedpyright` - passes with 0 errors
- `uv run pytest` - all tests pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Way too much spam. Some logs were also obsolete, leading to users
thinking there was something wrong during expected behaviour.
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
Package prettier with Svelte support and add it to treefmt-nix to format
the dashboard. This change is brutal; I spent a long time trying to make it
nicer, but there doesn't seem to be a good way to keep it minimal.
Sorry for the noise!
This will make it easier for new contributors to get the formatting
right the first time.
Also removes the `.prettierrc` because it turns out treefmt-nix was
ignoring it.
Test plan:
- CI
## Motivation
Enable distributed image generation across exo clusters
## Changes
- Added OpenAI-compatible /v1/images/generations and /v1/images/edits API endpoints
- Added /bench/images/generations and /bench/images/edits endpoints that return generation statistics (timing, throughput metrics)
- Implemented PipeFusion distributed inference for diffusion models,
enabling patch-based parallelism across nodes
- Added model adapters for Flux (schnell, dev) and Qwen image models
## Why It Works
See the PipeFusion paper: https://arxiv.org/abs/2405.14430
## Test Plan
### Manual Testing
- Generate images using /v1/images/generations endpoint with single and
multi-node clusters
- Test image editing via /v1/images/edits with source images
- Verify streaming partial images appear progressively in the dashboard
- Use /bench/images/generations to measure generation performance
- Test both Flux and Qwen model families
---------
Co-authored-by: Sami Khan <smsak99@gmail.com>
## Motivation
Test failed due to circular import
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
Tried importing and calling the functions, worked fine.
### Automated Testing
Tests pass again
## Motivation
MLX LM has given GPT OSS a shard method, but MLX does not have an update
to match.
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
Kimi K2 Thinking Pipeline RDMA was broken before.
## Why It Works
No clue tbh
## Test Plan
### Manual Testing
Kimi K2 Thinking and GPT OSS work at the same time on Pipeline RDMA.
Needs exo bench to check more thoroughly
### Automated Testing
Layer composition tests still pass.
This change uses the stronger-typed ModelId and introduces some
convenience methods. It also cleans up some code left over from #1204.
## Changes
`model_id: str -> model_id: ModelId`
`repo_id: str -> model_id: ModelId`
Introduces methods on ModelId, in particular ModelId.normalize() to
replace `/` with `--`.
This PR did introduce some circular imports, so some code has been moved
around to limit them.
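A minimal sketch of `normalize()` (the real ModelId may not be a plain str subclass):
```python
class ModelId(str):
    def normalize(self) -> str:
        """Filesystem-safe form: replace '/' with '--'."""
        return self.replace("/", "--")

assert ModelId("mlx-community/Llama-3.2-3B-Instruct-4bit").normalize() == (
    "mlx-community--Llama-3.2-3B-Instruct-4bit"
)
```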
## Test Plan
Tests still pass, types still check. As this is about metadata, I
haven't tested inference.
## Motivation
The `download_progress_for_local_path` function and the "Handle local
paths" code block in `download_shard` are dead code that cannot be
reached in normal usage. The code checks if `model_id` (e.g.,
"mlx-community/Llama-3.2-3B-Instruct-4bit") exists as a filesystem path,
but model IDs are constrained to HuggingFace repo format and there's no
API pathway to pass local paths.
## Changes
- Removed `download_progress_for_local_path()` function (45 lines)
- Removed the "Handle local paths" block in `download_shard()` (7 lines)
## Why It Works
This code was added in PR #669 as part of a "feature-local-models"
branch, but the feature was never fully integrated. The check
`aios.path.exists(str(shard.model_card.model_id))` would only return
true if a directory literally named
"mlx-community/Llama-3.2-3B-Instruct-4bit" existed in the cwd, which
doesn't happen in practice. Offline caching is already handled by
`fetch_file_list_with_cache`.
## Test Plan
### Manual Testing
- Run exo normally and verify downloads still work
### Automated Testing
- Existing tests pass (this code had no test coverage)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
When `skip_download=True`, exo was logging a lot of unnecessary messages during periodic download status checks. This resulted in spammy logs that made it hard to see important messages.
## Changes
- Only log "Downloading ... with allow_patterns=..." when actually downloading (not when skip_download is true)
- Changed periodic download progress check logs from INFO to DEBUG level
## Why It Works
The `skip_download=True` parameter is used when checking download status without actually downloading. By guarding the log behind `if not skip_download:`, we avoid logging on every status check. Changing the periodic emitting logs to DEBUG level reduces noise while still keeping them available for debugging.
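A minimal sketch of the guard (function and parameter names here are illustrative):
```python
import logging

logger = logging.getLogger(__name__)

def report_download(model_id: str, allow_patterns: list[str], skip_download: bool) -> None:
    if not skip_download:
        # Only announce when we actually intend to download.
        logger.info("Downloading %s with allow_patterns=%s", model_id, allow_patterns)
    else:
        # Periodic status checks log at DEBUG so -v/-vv can still surface them.
        logger.debug("Checking download status for %s", model_id)
```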
## Test Plan
### Manual Testing
- Run exo and observe that logs are less spammy during normal operation
- Use -v or -vv flags to see DEBUG logs when needed
### Automated Testing
- Existing tests cover this code path
## Motivation
<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
We have a lot of unneeded data in the model card - let's just keep the
necessary stuff and add back more data when we need it.
## Test Plan
EXO still runs! (pipeline on 2)
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
Add support for GLM-4.7-Flash, a lighter variant of GLM-4.7 with the
`glm4_moe_lite` architecture. These models are smaller and faster while
maintaining good performance.
## Changes
1. **Added 4 new model cards** for GLM-4.7-Flash variants:
- `glm-4.7-flash-4bit` (~18 GB)
- `glm-4.7-flash-5bit` (~21 GB)
- `glm-4.7-flash-6bit` (~25 GB)
- `glm-4.7-flash-8bit` (~32 GB)
All variants have:
- `n_layers`: 47 (vs 91 in GLM-4.7)
- `hidden_size`: 2048 (vs 5120 in GLM-4.7)
- `supports_tensor`: True (native `shard()` method)
2. **Bumped mlx from 0.30.1 to 0.30.3** - required by mlx-lm 0.30.4
3. **Updated mlx-lm from 0.30.2 to 0.30.4** - adds `glm4_moe_lite`
architecture support
4. **Added type ignores** in `auto_parallel.py` for stricter type
annotations in new mlx-lm
5. **Fixed EOS token IDs** for GLM-4.7-Flash - uses different tokenizer
with IDs `[154820, 154827, 154829]` vs other GLM models' `[151336,
151329, 151338]`
6. **Renamed `MLX_IBV_DEVICES` to `MLX_JACCL_DEVICES`** - env var name
changed in new mlx
## Why It Works
The model cards follow the same pattern as existing GLM-4.7 models.
Tensor parallel support is enabled because GLM-4.7-Flash implements the
native `shard()` method in mlx-lm 0.30.4, which is automatically
detected in `auto_parallel.py`.
GLM-4.7-Flash uses a new tokenizer with different special token IDs.
Without the correct EOS tokens, generation wouldn't stop properly.
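For reference, a sketch of one of the new cards using the values above (field names are illustrative, not the exact ModelCard schema):
```python
glm_4_7_flash_4bit = {
    "name": "glm-4.7-flash-4bit",
    "n_layers": 47,           # vs 91 in GLM-4.7
    "hidden_size": 2048,      # vs 5120 in GLM-4.7
    "supports_tensor": True,  # native shard() method in mlx-lm 0.30.4
    "eos_token_ids": [154820, 154827, 154829],
}
```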
## Test Plan
### Manual Testing
Tested generation with GLM-4.7-Flash-4bit - now correctly stops at EOS
tokens.
### Automated Testing
- `basedpyright`: 0 errors
- `ruff check`: All checks passed
- `pytest`: 162/162 tests pass (excluding pre-existing
`test_distributed_fix.py` timeout failures)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Certain models hang at model loading in tensor parallel.
Hopefully closes #1205
## Changes
- Load layer by layer for tensor parallel sharding
- Move eval_with_timeout to auto_parallel.py to resolve circular import.
## Why It Works
The naive fix is to load the model with `lazy=False` and then shard for
tensor parallel. However, this requires the entire model to be loaded into
memory.
Instead, we can load layer by layer and shard after loading. This adds a
small memory overhead, but it is negligible.
I tried loading layer by layer after the sharding, and this allowed
model loading but got stuck at warming up.
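A minimal sketch of the layer-by-layer approach (assuming the model exposes `.layers` and a per-layer shard function; the real code lives in `auto_parallel.py`):
```python
import mlx.core as mx

def load_and_shard_layerwise(model, shard_layer) -> None:
    """Materialize one layer's weights at a time, then shard it."""
    for layer in model.layers:
        mx.eval(layer.parameters())  # load just this layer into memory
        shard_layer(layer)           # split it across the tensor-parallel group
```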
## Test Plan
### Manual Testing
GPT OSS loads with TP and FAST SYNCH. Kimi does too.
### Automated Testing
We need to run a suite of exo_bench before merging this!
## Motivation
For thinking models like GLM-4.7, the `<think>` tag is inserted by the
tokenizer's `apply_chat_template()` into the **prompt** (input). The
model generates tokens starting *after* this tag, so `<think>` never
appears in the streamed output. The frontend expects
`<think>...</think>` tags to extract and display thinking content.
**Log evidence:**
```
[gMASK]<sop><|system|>...<|user|>...<|assistant|><think>
```
The prompt ends with `<think>`, so the model generates content after it,
never returning the opening tag.
## Changes
- Added `detect_thinking_prompt_suffix()` helper function in
`utils_mlx.py` to detect if a prompt ends with `<think>` tag
- Added `parse_thinking_models()` generator wrapper in `runner.py` that
prepends the thinking tag to the output stream
- Modified the main generation loop to use the thinking wrapper for
non-GptOssModel models when a thinking prefix is detected
- Updated test mocks to handle the new `apply_chat_template` call
## Why It Works
The solution follows the same pattern as `parse_gpt_oss()` - a generator
wrapper that transforms the output stream. When the chat template ends
with `<think>`, we prepend this tag to the first generated token so the
frontend receives the complete `<think>...</think>` structure it
expects.
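A minimal sketch of the wrapper idea (the real `parse_thinking_models()` in `runner.py` operates on exo's response chunks, not plain strings):
```python
from collections.abc import Iterator

def prepend_thinking_tag(chunks: Iterator[str], prefix: str = "<think>") -> Iterator[str]:
    """Re-emit the tag that the chat template consumed, then pass the stream through."""
    first = True
    for chunk in chunks:
        if first:
            yield prefix + chunk
            first = False
        else:
            yield chunk
```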
## Test Plan
### Manual Testing
- Run exo: `uv run exo`
- Send a chat request to GLM-4.7:
```bash
curl http://localhost:52415/v1/chat/completions -H "Content-Type:
application/json" -d '{
"model": "mlx-community/GLM-4.7-8bit-gs32",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"stream": true
}'
```
- Verify the streamed response starts with `<think>` tag
- Verify the frontend dashboard correctly shows the thinking section
collapsed
### Automated Testing
- All 72 worker tests pass: `uv run pytest src/exo/worker/`
- Type checker passes: `uv run basedpyright`
- Linter passes: `uv run ruff check`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
## Motivation
The current `NodePerformanceProfile` is a monolithic object where every
update (even 1-second memory updates) replaces the entire profile,
touching unrelated data. Different fields update at vastly different
frequencies:
| Data | Update Frequency |
|------|------------------|
| Memory, System | 1 second |
| Thunderbolt | 5 seconds |
| Network interfaces | 10 seconds |
| Friendly name | 60 seconds |
| Model/Chip ID | Once at startup |
## Changes
Split into separate state mappings so each data type updates
independently:
- `node_identities`: Static and slow-changing data (model_id, chip_id,
friendly_name)
- `node_memory`: RAM and swap usage
- `node_system`: GPU usage, temperature, power, CPU metrics
- `node_network`: Network interface information
- `node_thunderbolt`: Thunderbolt interface identifiers
Added a backwards-compatible `node_profiles` property that reconstructs
`NodePerformanceProfile` from the granular mappings for dashboard
compatibility.
**Files modified:**
- `src/exo/shared/types/profiling.py` - Added `NodeIdentity`,
`NodeNetworkInfo`, `NodeThunderboltInfo` types
- `src/exo/shared/types/state.py` - Added 5 new mappings +
`node_profiles` property
- `src/exo/shared/apply.py` - Updated `apply_node_gathered_info` and
`apply_node_timed_out`
## Why It Works
Each info type now writes only to its specific mapping, avoiding
unnecessary updates to unrelated data. The `MacThunderboltConnections`
handler reads from `node_thunderbolt` instead of the old `node_profiles`
for RDMA connection mapping. The backwards-compatible property ensures
the dashboard continues to work unchanged.
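A minimal sketch of the backwards-compatible property, assuming plain dict mappings keyed by node id (the real State and NodePerformanceProfile fields differ):
```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class State:
    node_identities: dict[str, Any] = field(default_factory=dict)
    node_memory: dict[str, Any] = field(default_factory=dict)
    node_system: dict[str, Any] = field(default_factory=dict)
    node_network: dict[str, Any] = field(default_factory=dict)
    node_thunderbolt: dict[str, Any] = field(default_factory=dict)

    @property
    def node_profiles(self) -> dict[str, dict[str, Any]]:
        """Reconstruct a monolithic per-node profile from the granular mappings."""
        return {
            node_id: {
                "identity": self.node_identities.get(node_id),
                "memory": self.node_memory.get(node_id),
                "system": self.node_system.get(node_id),
                "network": self.node_network.get(node_id),
                "thunderbolt": self.node_thunderbolt.get(node_id),
            }
            for node_id in self.node_identities
        }
```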
## Test Plan
### Manual Testing
- Start exo and verify dashboard shows node info
- Verify memory/GPU updates stream correctly
- Check that node timeout properly cleans up all mappings
### Automated Testing
- All 162 existing tests pass
- basedpyright: 0 errors
- ruff check: All checks passed
- nix fmt: Applied
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
GPU timeouts often occur when the prompt size exceeds prefill_step_size. It
also happens for seemingly random models.
## Changes
Add mx.depends to make the cache depend on the logits.
All-gather at the model level rather than the layer level, reducing the
amount of data sent.
## Why It Works
mlx_lm's prefill loop only evaluates cache state, not logits.
When prompt > prefill_step_size, the all_gather is never evaluated,
causing GPU timeout.
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
Added failing test cases and then resolved them.
The daemonAlreadyInstalled() function checked that the startup script
file existed and validated plist properties, but did not compare the
actual script content. If the setupScript constant was updated in a new
app version, the stale on-disk script would not be detected or replaced.
Added a guard clause that reads the installed script from disk and
compares it against the expected setupScript content (with whitespace
normalization). When content differs, the function returns false,
triggering the reinstallation flow with an admin privileges prompt.
Test plan:
- Installed on a cluster that had the previous network config. Got the
popup asking for permissions. After accepting I could run Kimi K2
Thinking Tensor RDMA on all 4 nodes.
## Motivation
Information gathering is tightly coupled to MacMon - we should start
generalizing our information sources so we can add more in the future.
## Changes
Added a new system to gather any information. Currently, it is attached
to the Worker - though this is mostly to keep the data processing logic
simple. It could be made independent quite easily.
I also refactored topology to include different kinds of connections as
we can gather RDMA connections without having a pre-existing socket
connection, and made the relevant placement updates. We should no longer
need the network locations script in the app.
Other sources of information now include:
- static node information like "model" and "chip" (macos, "Unknown"
fallback)
- device friendly name (macos, falls back to device hostname)
- network interfaces + ips (cross platform)
- thunderbolt interfaces (macos)
- thunderbolt connections (macos)
- RAM usage (cross platform)
- per-device configuration written to EXO_HOME/config.toml
## Limitations
Model and Chip are not cross platform concepts.
We do not differentiate between unified and non-unified memory systems.
A lot of this data collection is based on simple timers. Watching the SC
store on macOS is the correct way to gather some of this information,
but requires a detour into Rust.
## Why It Works
The InfoGatherer is a generic subsystem which returns a union of metric
datatypes. It writes them to an event, which is applied to state. It is
currently re-spawned with the worker so each cluster receives the
correct information.
As for topology, macOS identifies TB ports with a uuid in
SPThunderboltDataType, and also stores remote uuids if it can find them.
These changes read that data with the system_profiler, hopefully not so
often as to cause notable performance impacts (though this should be
tuned) but frequently enough for moderate responsiveness.
As we can identify TB connections between devices without needing ips
attached to each interface, we can remove the network setup script
(almost) completely.
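A minimal sketch of the gatherer's output shape (metric types here are placeholders; the real union and event plumbing live in exo's worker and shared code):
```python
from dataclasses import dataclass

@dataclass
class MemoryUsage:
    ram_used_bytes: int
    ram_total_bytes: int

@dataclass
class FriendlyName:
    name: str

GatheredInfo = MemoryUsage | FriendlyName  # union of metric datatypes

def apply_gathered_info(
    state: dict[str, dict[str, GatheredInfo]], node_id: str, info: GatheredInfo
) -> None:
    """Each metric type lands in its own slice of state."""
    if isinstance(info, MemoryUsage):
        state.setdefault("node_memory", {})[node_id] = info
    else:
        state.setdefault("node_identities", {})[node_id] = info
```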
## Test Plan
### Manual Testing
Spawn RDMA instances without enabling DHCP on the RDMA interfaces.
### Automated Testing
Updated the current master and shared tests to cover the topology
refactor and new events.
---------
Co-authored-by: Sami Khan <smsak99@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Updated README.md with documentation for four new features:
- added a "Benchmarking" section documenting the exo-bench tool for
measuring model performance across different placement configurations
- documented the custom namespace feature for cluster isolation in the
macOS app section
- added a "Configuration Options" subsection explaining the --no-worker
CLI flag for coordinator-only nodes
- added a "File Locations (Linux)" subsection documenting XDG Base
Directory Specification compliance on Linux systems
Issue #930
The on_progress callback was synchronous but always invoked from async
contexts, forcing the use of send_nowait() which could raise WouldBlock
if the channel buffer was full, potentially dropping progress updates.
Changed the callback type from `Callable[[ShardMetadata,
RepoDownloadProgress], None]` to return a coroutine, updated all
implementations to be async, and replaced send_nowait() with await
send() in the worker's download progress handler.
This allows proper backpressure handling when sending download progress
events through the channel, eliminating the "Footgun!" that was
previously documented in the code.
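A minimal sketch of the new callback shape (types and channel names are simplified; the real types are ShardMetadata and RepoDownloadProgress):
```python
from collections.abc import Awaitable, Callable
from typing import Any

from anyio.abc import ObjectSendStream

# The callback now returns a coroutine, so callers can await it.
OnProgress = Callable[[Any, Any], Awaitable[None]]

def make_on_progress(send: ObjectSendStream[Any]) -> OnProgress:
    async def on_progress(shard: Any, progress: Any) -> None:
        # await send() applies backpressure instead of raising WouldBlock.
        await send.send((shard, progress))
    return on_progress
```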
Test plan:
- Built a DMG and ran it on one node. All existing models showed as
downloaded.
- Downloaded a new model. The progress bar on the download page worked.
- Downloaded another new model. The progress bar on the home page
worked.
## Motivation
When models output LaTeX-formatted math proofs, the dashboard was not
rendering them correctly. Issues included:
- `\documentclass`, `\begin{document}`, `\usepackage` showing as raw
text
- `$...$` inline math with complex expressions (like `\frac`, `\ldots`)
not rendering due to markdown escaping backslashes
- `\begin{align*}...\end{align*}` and other math environments showing as
raw text
- `\emph{...}`, `\textbf{...}` LaTeX formatting commands not being
converted
- `$\require{...}$` (MathJax-specific) causing KaTeX errors
- `\begin{proof}...\end{proof}` showing as raw text
## Changes
Enhanced `MarkdownContent.svelte` with comprehensive LaTeX support:
**Math extraction before markdown processing:**
- Extract `$...$`, `$$...$$`, `\(...\)`, `\[...\]` into placeholders
before markdown processes the text
- Use alphanumeric placeholders (`MATHPLACEHOLDERINLINE0END`) that won't
be interpreted as HTML tags
- Restore and render with KaTeX after markdown processing
**LaTeX document command removal:**
- Strip `\documentclass{...}`, `\usepackage{...}`, `\begin{document}`,
`\end{document}`
- Strip `\maketitle`, `\title{...}`, `\author{...}`, `\date{...}`
- Strip `\require{...}` (MathJax-specific, not KaTeX)
- Replace `tikzpicture` environments with `[diagram]` placeholder
- Strip `\label{...}` cross-reference commands
**LaTeX math environments:**
- Convert `\begin{align*}`, `\begin{equation}`, `\begin{gather}`, etc.
to display math blocks
**LaTeX text formatting:**
- `\emph{...}` and `\textit{...}` → `<em>...</em>`
- `\textbf{...}` → `<strong>...</strong>`
- `\texttt{...}` → `<code>...</code>`
- `\underline{...}` → `<u>...</u>`
**LaTeX environments styling:**
- `\begin{proof}...\end{proof}` → styled proof block with QED symbol
- `\begin{theorem}`, `\begin{lemma}`, etc. → styled theorem blocks
**Display math enhancements:**
- Wrapped in styled container with subtle gold border
- "LaTeX" label and copy button appear on hover
- Dark theme KaTeX color overrides for better readability
- Custom scrollbar for overflow
## Why It Works
The key insight is that markdown processing was escaping backslashes in
LaTeX before KaTeX could see them. By extracting all math expressions
into alphanumeric placeholders *before* markdown runs, then restoring
them *after*, the LaTeX content passes through to KaTeX unmodified.
Using purely alphanumeric placeholders like `MATHPLACEHOLDERINLINE0END`
instead of `<<MATH_INLINE_0>>` prevents markdown from interpreting them
as HTML tags and stripping them.
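The same trick, illustrated in Python for brevity (the actual implementation is in `MarkdownContent.svelte`):
```python
import re

INLINE_MATH = re.compile(r"\$(?!\$)(.+?)\$", re.DOTALL)  # simplified: inline $...$ only

def extract_math(text: str) -> tuple[str, list[str]]:
    stored: list[str] = []

    def stash(match: re.Match[str]) -> str:
        stored.append(match.group(0))
        return f"MATHPLACEHOLDERINLINE{len(stored) - 1}END"

    return INLINE_MATH.sub(stash, text), stored

def restore_math(rendered: str, stored: list[str]) -> str:
    for i, expr in enumerate(stored):
        # In the component, KaTeX renders `expr` here instead of echoing it back.
        rendered = rendered.replace(f"MATHPLACEHOLDERINLINE{i}END", expr)
    return rendered
```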
## Test Plan
### Manual Testing
- Hardware: Any machine with the dashboard
- What you did:
- Ask model to "write a proof in latex"
- Verify inline math like `$x \in S$` renders correctly
- Verify display math like `\begin{align*}...\end{align*}` renders as
block
- Verify `\documentclass`, `\begin{document}` are stripped (not shown)
- Verify `\emph{...}` converts to italics
- Verify copy button works on display math blocks
- Test edge cases: `$5` (currency) stays as text, `\$50` (escaped)
becomes `$50`
Before:
<img width="799" height="637" alt="Screenshot 2026-01-19 at 11 51 22 AM"
src="https://github.com/user-attachments/assets/62a705b8-b3c2-47b8-afd0-5d0c1b240e44"
/>
After:
<img width="809" height="642" alt="Screenshot 2026-01-19 at 11 46 58 AM"
src="https://github.com/user-attachments/assets/4f35fa1d-333c-4285-bc68-58a50f8f148e"
/>
### Automated Testing
- Dashboard builds successfully with `npm run build`
- Existing functionality preserved
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>