## Motivation

Information gathering is tightly coupled to MacMon; we should start generalizing our information sources so we can add more in the future.

## Changes

Added a new system to gather arbitrary information. Currently it is attached to the Worker, though this is mostly to keep the data processing logic simple; it could be made independent quite easily. I also refactored topology to include different kinds of connections, since we can gather RDMA connections without a pre-existing socket connection, and made the relevant placement updates. We should no longer need the network locations script in the app.

Other sources of information now include:

- static node information like "model" and "chip" (macOS, "Unknown" fallback)
- device friendly name (macOS, falls back to device hostname)
- network interfaces + IPs (cross platform)
- Thunderbolt interfaces (macOS)
- Thunderbolt connections (macOS)
- RAM usage (cross platform)
- per-device configuration written to EXO_HOME/config.toml

## Limitations

Model and chip are not cross-platform concepts. We do not differentiate between unified and non-unified memory systems. A lot of this data collection is based on simple timers; watching the SC store is the correct way to gather some of this information on macOS, but requires a detour into Rust.

## Why It Works

The InfoGatherer is a generic subsystem which returns a union of metric datatypes. It writes them to an event, which is applied to state. It is currently re-spawned with the worker, so each cluster receives the correct information.

As for topology, macOS identifies Thunderbolt ports with a UUID in SPThunderboltDataType, and also stores remote UUIDs when it can find them. These changes read that data with system_profiler, hopefully not so often as to cause a notable performance impact (though this should be tuned) but frequently enough for moderate responsiveness. Since we can identify Thunderbolt connections between devices without needing IPs attached to each interface, we can remove the network setup script (almost) completely.

## Test Plan

### Manual Testing

Spawn RDMA instances without enabling DHCP on the RDMA interfaces.

### Automated Testing

Updated the current master and shared tests to cover the topology refactor and new events.

---------

Co-authored-by: Sami Khan <smsak99@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Jake Hillion <jake@hillion.co.uk>
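For illustration, here is a minimal sketch of the Thunderbolt discovery described under "Why It Works": it shells out to `system_profiler SPThunderboltDataType -json` (the command and flag exist on recent macOS) and walks the output for UUID-bearing fields. This is not the PR's actual implementation; the top-level JSON key name and the approach of matching keys containing "uuid" are assumptions.

```python
import json
import subprocess
from typing import Iterator, Tuple


def thunderbolt_uuid_fields() -> Iterator[Tuple[str, str]]:
    """Yield (key, value) pairs for UUID-looking fields in SPThunderboltDataType output."""
    raw = subprocess.run(
        ["system_profiler", "SPThunderboltDataType", "-json"],
        capture_output=True, text=True, check=True,
    ).stdout
    data = json.loads(raw)

    def walk(node):
        # Recursively search dicts/lists for string values whose key mentions "uuid".
        if isinstance(node, dict):
            for key, value in node.items():
                if isinstance(value, str) and "uuid" in key.lower():
                    yield key, value
                else:
                    yield from walk(value)
        elif isinstance(node, list):
            for item in node:
                yield from walk(item)

    # Assumed top-level key: system_profiler -json keys its output by data type name.
    yield from walk(data.get("SPThunderboltDataType", []))


if __name__ == "__main__":
    for key, value in thunderbolt_uuid_fields():
        print(f"{key}: {value}")
```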
- Currently a lot of requests from the API time out, but we still process them internally. If an API request times out, we should cancel all tasks corresponding to that request (there is no point processing a request nobody is listening to).
- Task cancellation: when an API HTTP request is cancelled, the corresponding task should be cancelled.
- I'd like to see profiled network latency / bandwidth.
- I'd like to see how much bandwidth each link is using.
- We should handle the case where one machine doesn't have the model downloaded and the other machines are waiting on it; currently this produces loads of timeout errors while the others wait for the one that still needs to download the model.
- In continuous batching, a newly arrived prompt blocks decode of the current batch until its prefill is complete; solve this.
- We want people to be able to copy models over to a new device without ever connecting EXO to the internet. Right now EXO requires an internet connection once, to cache some files used to check whether a download is complete. Instead, we should simply check whether there is a non-empty local model folder with no .partial files; that indicates a fully downloaded model that can be loaded (a rough sketch of this check appears after this list).
- More granular control over how to deploy instances.
- Nix is great, but installing it is a pain and we have repeatedly ended up with PATH or installation issues. For example, after rebooting, mike seemed to no longer have a Nix installation and needed reinstalling; it also had a bunch of broken symlinks left over from Nix that caused SSH to fail, making it even harder to debug. We need consistent environments (perhaps MDM) so we can guarantee Nix is installed properly on each machine.
- Memory pressure instead of memory used.
- Show the type of each connection (TB5, Ethernet, etc.) in the UI. Refer to old exo: 56f783b38d/exo/helpers.py (L251)
- Prioritise certain connection types (or by latency): TB5 > Ethernet > WiFi (a rough sketch appears after this list). Refer to old exo: 56f783b38d/exo/helpers.py (L251)
- Dynamically switch to a higher-priority connection when it becomes available. Probably bring back InstanceReplacedAtomically.
- Faster model loads by streaming model from other devices in cluster.
- Add support for specifying the type of network connection to use in a test. Depends on 15/16.
- Add chat completion cancellations (e.g. OpenWebUI has a mechanism for cancelling an ongoing request).
- Do we need cache_limit? We went back and forth on that a lot because we thought it might be causing issues. One problem is that it is set relative to model size, so if you have multiple models loaded it will use the most recently loaded model's size for the cache_limit. This is problematic if you launch DeepSeek then Llama, for example.
- Further OpenAI / LM Studio API compatibility.
- Rethink retry logic
- Log cleanup - per-module log filters and default to DEBUG log levels
- Validate RDMA connections with ibv_devinfo in the info gatherer
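A minimal sketch of the offline model-completeness check proposed above (the "copy models over" item): a model folder counts as complete if it exists, is non-empty, and contains no `.partial` files. The path layout and function name are hypothetical, not exo's actual API.

```python
from pathlib import Path


def model_looks_complete(model_dir: Path) -> bool:
    """Heuristic: a non-empty model folder with no .partial files is fully downloaded."""
    if not model_dir.is_dir():
        return False
    files = [p for p in model_dir.rglob("*") if p.is_file()]
    # Empty folder: nothing was copied over yet.
    if not files:
        return False
    # Any leftover .partial file means a download or copy was interrupted.
    return not any(p.suffix == ".partial" for p in files)


# Example (hypothetical path): model_looks_complete(Path.home() / ".exo" / "models" / "llama-3.1-8b")
```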
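And a sketch of the connection-type prioritisation item (TB5 > Ethernet > WiFi): a static ranking that link selection could use, with lower values preferred. The enum members, function name, and example interface names are illustrative assumptions only.

```python
from enum import IntEnum
from typing import Optional


class LinkType(IntEnum):
    # Lower value = higher priority.
    THUNDERBOLT = 0   # e.g. TB5
    ETHERNET = 1
    WIFI = 2
    UNKNOWN = 3


def preferred_link(links: dict[str, LinkType]) -> Optional[str]:
    """Return the name of the highest-priority link, or None if there are none."""
    if not links:
        return None
    return min(links, key=lambda name: links[name])


# Example: preferred_link({"en0": LinkType.WIFI, "bridge0": LinkType.THUNDERBOLT}) -> "bridge0"
```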
Potential refactors:
- Topology can be simplified
Random errors we've run into: