exo/TODO.md at 16f724e24c58bd81b0335b0a5d08919b41a25312

mirror of https://github.com/exo-explore/exo.git synced 2025-12-23 22:27:50 -05:00

rltakashige 16f724e24c Update staging 14

Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: David Munha Canas Correia <dmunha@MacBook-David.local>
Co-authored-by: github-actions bot <github-actions@users.noreply.github.com>

2025-11-05 01:44:24 +00:00

2.8 KiB

Currently EXO often fails to start cleanly. I see two kinds of issues: b. EXO starts, but after creating an instance, that instance never loads (it gets stuck in either Loading or Inactive).
Currently a lot of requests from the API are timing out, but we still process those requests internally. If an API request times out, we should cancel all tasks corresponding to that API request (why process a request with nobody listening?).
I'd like to see profiled network latency / bandwidth.
I'd like to see how much bandwidth each link is using.
We should handle the case where one machine doesn't have the model downloaded and then other machines are waiting on it. In this case we get loads of timeout errors because the others are waiting for the one that needs to download the model.
Solve the continuous batching problem where a new prompt coming in blocks decode of the current batch until its prefill is complete.
We want people to be able to copy models over to a new device without ever connecting EXO to the internet. Right now EXO requires an internet connection once to cache some files used to check whether a download is complete. Instead, we should simply check whether there is a non-empty model folder locally with no .partial files. This indicates a fully downloaded model that can be loaded.
More granular control over how to deploy instances.
Nix is great but installing it is a pain and we have ended up in a lot of cases having PATH issues or installation issues. For example, after rebooting mike it seemed to no longer have a nix installation and needed reinstalling. It has a bunch of broken symlinks left over from nix that caused ssh to fail, making it even harder to debug. We need consistent environments (perhaps MDM) so we can guarantee nix is installed properly on each machine.
Use memory pressure instead of memory used.
Show the type of each connection (TB5, Ethernet, etc.) in the UI. Refer to old exo: 56f783b38d/exo/helpers.py (L251)
Prioritise certain connection types (or by latency). TB5 > Ethernet > WiFi. Refer to old exo: 56f783b38d/exo/helpers.py (L251)
Dynamically switch to higher priority connection when it becomes available. Probably bring back InstanceReplacedAtomically.
Faster model loads by streaming model from other devices in cluster.
Add support for specifying the type of network connection to use in a test. Depends on 15/16.
Fix mx.distributed.Group typing.
Add chat completion cancellations (e.g. OpenWebUI has something for cancelling an ongoing request).
Make these two separate choices: tensor or pipeline, and ring or ibv.
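
The offline model check described in the list above (a non-empty local model folder with no .partial files) could look roughly like the sketch below. The function name `is_model_complete` and the recursive scan are assumptions made for illustration, not EXO's actual download code.

```python
from pathlib import Path


def is_model_complete(model_dir: Path) -> bool:
    """Return True if model_dir looks like a fully downloaded model.

    Heuristic from the TODO item: the folder exists, contains at least
    one file, and no file anywhere under it ends in ".partial".
    (Hypothetical helper; not taken from the EXO codebase.)
    """
    if not model_dir.is_dir():
        return False
    # Collect regular files at any depth; subdirectories alone do not count.
    files = [p for p in model_dir.rglob("*") if p.is_file()]
    if not files:
        return False
    return not any(p.name.endswith(".partial") for p in files)
```

A check like this needs no network access, so a model folder copied over by hand would be accepted as long as the copy finished and left no .partial files behind.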

Potential refactors:

Make ForwarderEvent typed
Topology can be simplified
Get rid of InstanceReplacedAtomically
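
For the API timeout item in the main list, one possible shape for "cancel the work when the request times out" is sketched below using `asyncio.wait_for`, which cancels the awaited coroutine when the timeout expires. `run_with_timeout`, `generate`, and `demo` are hypothetical names; this only shows cancellation of a local asyncio task and assumes nothing about how EXO would propagate cancellation to other nodes.

```python
import asyncio


async def run_with_timeout(coro, timeout_s: float):
    """Await coro, cancelling it if timeout_s elapses first.

    asyncio.wait_for cancels the wrapped awaitable on timeout and then
    raises a timeout error, so the work does not keep running with
    nobody listening.
    """
    return await asyncio.wait_for(coro, timeout=timeout_s)


async def generate(cancelled: list) -> str:
    # Hypothetical stand-in for the internal work behind one API request.
    try:
        await asyncio.sleep(10)
        return "done"
    except asyncio.CancelledError:
        cancelled.append(True)  # record that the task saw the cancellation
        raise


async def demo() -> bool:
    cancelled: list = []
    try:
        await run_with_timeout(generate(cancelled), timeout_s=0.05)
    except asyncio.TimeoutError:
        pass  # the API layer would return a timeout response here
    return bool(cancelled)
```

Running `asyncio.run(demo())` exercises the timeout path. In a cluster, the same cancellation would presumably also need to be forwarded to whichever workers are processing the request.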
