## Motivation
When `get_shard_download_status()` runs, it iterates over all models in
`MODEL_CARDS` and calls `build_full_shard()` → `build_base_shard()` →
`ModelCard.from_hf()`, which unconditionally tries to download
`config.json` from HuggingFace. Image models (FLUX, Qwen-Image) don't
have a root-level `config.json`, which caused errors:
```
Error downloading shard: File not found: https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/Qwen/Qwen-Image/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/Qwen/Qwen-Image-Edit-2509/resolve/main/config.json
```
## Changes
### ModelCard.load() fix
- `build_base_shard()` now uses `ModelCard.load()` instead of
`ModelCard.from_hf()`
- `ModelCard.load()` iterates through `MODEL_CARDS.values()` to find a
match by `model_id`
### exo-bench fixes
- Use `name` field instead of `id` for model resolution
- Pass `full_model_id` to `/instance/previews` endpoint
- Make model name matching case-insensitive
- Update README example model name
## Why It Works
`MODEL_CARDS` uses short names as keys (e.g., `"flux1-schnell"`) but the
`model_id` values are HuggingFace paths (e.g.,
`"black-forest-labs/FLUX.1-schnell"`). When `ModelCard.load()` was
called with the HF path, it didn't match any key and fell back to
`from_hf()` which tried to download config.json.
The fix iterates through `MODEL_CARDS.values()` to find a match by
`model_id`, ensuring predefined models (including image models) use
their registry entries directly without network calls. A key lookup is
unnecessary since `load()` is always called with HF paths which don't
match the short-name keys.
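A minimal sketch of the new lookup. `ModelCard`, `MODEL_CARDS`, `model_id`, `load()`, and `from_hf()` are the names the PR uses; the class layout and the unknown-model fallback are assumptions, not taken from the diff:
```python
from dataclasses import dataclass


@dataclass
class ModelCard:
    model_id: str  # HuggingFace path, e.g. "black-forest-labs/FLUX.1-schnell"

    @classmethod
    def load(cls, model_id: str) -> "ModelCard":
        # Match registry entries by their HF path; the short-name keys of
        # MODEL_CARDS (e.g. "flux1-schnell") never match the argument, so a
        # key lookup would always miss.
        for card in MODEL_CARDS.values():
            if card.model_id == model_id:
                return card
        # Assumed fallback for models outside the registry; predefined models
        # (including image models) never reach this, so no config.json fetch.
        return cls.from_hf(model_id)

    @classmethod
    def from_hf(cls, model_id: str) -> "ModelCard":
        raise NotImplementedError("fetches metadata (config.json) from HuggingFace")


# Keys are short names, values carry the HF path.
MODEL_CARDS: dict[str, ModelCard] = {
    "flux1-schnell": ModelCard(model_id="black-forest-labs/FLUX.1-schnell"),
}
```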
## Test Plan
### Manual Testing
- Run exo and verify no more "Error downloading shard: File not found:
.../config.json" errors for image models
- Run exo-bench and verify model resolution works correctly
### Automated Testing
- `uv run basedpyright` - passes with 0 errors
- `uv run pytest` - all tests pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
GPU timeouts often occur when the prompt size exceeds
`prefill_step_size`. They also happen for seemingly random models.
## Changes
- Add `mx.depends` so the cache depends on the logits.
- All-gather at the model level rather than the layer level, reducing
the amount of data sent.
## Why It Works
mlx_lm's prefill loop only evaluates the cache state, not the logits.
When the prompt is longer than `prefill_step_size`, the all-gather is
therefore never evaluated, causing a GPU timeout. Tying the cache to the
logits with `mx.depends` forces the all-gather to run whenever the cache
is evaluated.
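For context, a sketch of the failure mode; the loop below is an approximation of mlx_lm's chunked prefill, not its actual code:
```python
import mlx.core as mx


def chunked_prefill(model, prompt: mx.array, cache: list, prefill_step_size: int) -> mx.array:
    """Approximation of mlx_lm's prefill: the prompt is consumed in chunks and
    only the cache state is forced; the returned logits stay lazy."""
    while prompt.size > prefill_step_size:
        _logits = model(prompt[:prefill_step_size][None], cache=cache)
        # Only the cache is evaluated. If the model-level all-gather feeds
        # only the logits, it is never run here, other ranks block on the
        # collective, and the GPU times out.
        mx.eval([c.state for c in cache])
        prompt = prompt[prefill_step_size:]
    return prompt
```
With the cache made dependent on the logits via `mx.depends`, the `mx.eval` on the cache also forces the logits and therefore the all-gather, so every rank reaches the collective during prefill.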
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
Added failing test cases and then resolved them.
- Error chunks
- Use error handling in exo_bench.py
## Motivation
Return when an error occurs so that generation stops. Adding timeouts
for model loading and chat completions is a separate TODO.
## Changes
- Return HTTP exceptions as JSON responses in an OpenAI-compatible
format.
- Add a context manager around generation to catch and return error
messages (see the sketch after this list).
- Use error handling in exo_bench.py.
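A minimal, framework-agnostic sketch of the error path. The helper names and the callback-based context manager are illustrative; only the OpenAI error envelope shape is the real convention:
```python
import json
from contextlib import contextmanager
from typing import Callable, Iterator


def openai_error_json(message: str, code: int, err_type: str = "server_error") -> str:
    # OpenAI-compatible error envelope: {"error": {"message", "type", "code"}}.
    return json.dumps({"error": {"message": message, "type": err_type, "code": code}})


@contextmanager
def generation_errors(send: Callable[[str], None]) -> Iterator[None]:
    """Wrap the generation loop so a failure is reported to the client as an
    error payload/chunk instead of the stream silently stopping."""
    try:
        yield
    except Exception as exc:  # report all generation failures, then stop
        send(openai_error_json(str(exc), 500))
```
exo_bench can then treat any response carrying an `error` object as a failed run and return instead of waiting on a stream that never finishes.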
## Test Plan
### Manual Testing
Manually tested that exo_bench returns on failures both within and
outside generation.
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
The Prompt Sizer was broken because transformers 5.x tokenizers return a
`BatchEncoding`, which is essentially a dictionary of
`{"input_ids": [...]}`, instead of a plain list of input ids.
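A minimal sketch of the normalization this implies; the helper name is illustrative and the nesting check is an assumption about how the tokenizer is called:
```python
from typing import Any


def extract_input_ids(encoded: Any) -> list[int]:
    # transformers may hand back a BatchEncoding (dict-like, with an
    # "input_ids" entry) or a bare list of token ids; normalize to a flat list.
    if hasattr(encoded, "keys"):
        ids = encoded["input_ids"]
        if ids and isinstance(ids[0], list):  # a batched call nests one level
            ids = ids[0]
        return list(ids)
    return list(encoded)
```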
## Test Plan
### Manual Testing
Tested that exo bench runs as expected.
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
This PR implements benchmarking in the style of llama-bench. The main
difficulty is that exo is not a library; it exposes an endpoint, so
benchmarking numbers would be inaccurate if measured through the API.
The solution assumes nodes are already running via `uv run exo` (or via
the app), then hits the new `/bench/chat/completions` endpoint to
retrieve generation statistics directly from mlx_lm.
This will allow us to release benchmarks for models and perform
regression tests.
TODO: Performance benchmarking.
## Changes
- Adds a `/bench/chat/completions` endpoint
- Adds `BenchChatCompletion`/`Response`
- Adds a logits processor to prevent the response from ending early (see
the sketch after this list)
- Adds a "Prompt Sizer" which downloads the tokenizer and dynamically
adjusts the prompt of "a" to fit the desired prompt size
- Reduces the prefill step size to 2048 for now (in the future, this
value should be adjusted dynamically)
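A minimal sketch of the anti-early-stop logits processor, assuming mlx_lm's `(tokens, logits) -> logits` processor convention; the factory name and the masking approach are illustrative:
```python
import mlx.core as mx


def make_no_eos_processor(eos_token_id: int):
    """Suppress the EOS token so benchmark generations always run to max_tokens."""

    def processor(tokens: mx.array, logits: mx.array) -> mx.array:
        vocab_size = logits.shape[-1]
        # Functional update: push the EOS logit to -inf instead of assigning
        # in place, so the op composes with lazy evaluation.
        is_eos = mx.arange(vocab_size) == eos_token_id
        return mx.where(is_eos, mx.array(-float("inf"), dtype=logits.dtype), logits)

    return processor
```
With EOS suppressed, the generation length is controlled purely by the requested token budget, which keeps the measured generation phase a fixed size.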
## Test Plan
### Manual Testing
Benchmarked Llama, Qwen, DeepSeek, and Kimi models. Running consistently
on all configurations will require several fixes (to be done in the
future).
Manually tested the normal API to verify chat requests complete as
expected.
### Automated Testing
End-to-end automated testing is not really feasible here; the type
checker passes.