Commit Graph

1954 Commits

Author SHA1 Message Date
Evan
ca321cfcc2 mapping of conns 2026-01-18 21:14:03 +00:00
Evan
bce4cb458d fix tests 2026-01-18 21:14:03 +00:00
Evan
80a8e83348 review response 2026-01-18 21:14:03 +00:00
Evan
2645beea42 i hate that test 2026-01-18 21:14:03 +00:00
Evan
b8842e8081 rebase lint fmt 2026-01-18 21:14:03 +00:00
Evan
74a50d71af think that was the bug 2026-01-18 21:14:03 +00:00
Evan
652528b32c update log message + assertion 2026-01-18 21:14:03 +00:00
Evan
729c4ccaa2 add a test to gather TB connectivity data 2026-01-18 21:14:03 +00:00
Alex Cheema
7b6d49448b fix: dashboard TypeScript errors and friendly name showing "Unknown"
Dashboard fixes (TypeScript errors from `npm run check`):
- TopologyGraph.svelte: remove reference to deleted sendBackMultiaddr
  property, fix type inference for debug edge labels
- ModelCard.svelte: add missing topoWidth/topoHeight to early return
- +page.svelte: fix nested property access for deviceRank

Backend fix:
- info_gatherer.py: send initial MiscData on startup so friendly name
  appears immediately instead of showing "Unknown" until it changes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-18 21:14:03 +00:00
Evan
5f31c7884c lint fmt 2026-01-18 21:14:03 +00:00
Evan
46637e8ca9 bug 2026-01-18 21:14:03 +00:00
Evan
52d9ef17b2 still use ibv_devices 2026-01-18 21:14:03 +00:00
Evan
566a1688bd fix the dashboard 2026-01-18 21:14:03 +00:00
Evan
912b8303ec forgot how weird this platform is 2026-01-18 21:14:03 +00:00
Evan
aa4b0ede9f fmt ts 2026-01-18 21:14:03 +00:00
Evan
64fab01822 remove the old network script functionality 2026-01-18 21:14:03 +00:00
Evan
9c5467aa35 add to the test server 2026-01-18 21:14:03 +00:00
Evan
d6951feac3 lint fmt 2026-01-18 21:14:03 +00:00
Evan
b714bc4562 switch from sequence to map of connections 2026-01-18 21:14:03 +00:00
Evan
bce23eac3f pydantic types are now coherent 2026-01-18 21:14:03 +00:00
Sami Khan
5acb026c1e parsing api fix 2026-01-18 21:14:03 +00:00
Evan
a7bba1e29b code review followup 2026-01-18 21:14:03 +00:00
Evan
8add86fdd4 rename channel test 2026-01-18 21:14:03 +00:00
Evan
c289702ca4 move macmon test 2026-01-18 21:14:03 +00:00
Evan
63bd024d48 cleanup after rebase 2026-01-18 21:14:03 +00:00
Evan
741450987d dedup connections 2026-01-18 21:14:03 +00:00
Evan
ae38c594e7 freeze those models 2026-01-18 21:14:03 +00:00
Evan
4a8a8fd296 format 2026-01-18 21:14:03 +00:00
Evan
9d552cdc38 tidy 2026-01-18 21:14:03 +00:00
Evan
695708ae27 all master tests pass 2026-01-18 21:14:03 +00:00
Evan
a282478951 ibv -> jaccl 2026-01-18 21:14:03 +00:00
Evan
9e4a0049f9 tidying some horrible logic 2026-01-18 21:14:02 +00:00
Evan
84780eb538 fix download test 2026-01-18 21:14:02 +00:00
Evan
7ba8217e64 fix all master tests except rdma placement 2026-01-18 21:14:02 +00:00
Evan
3159a2d038 fix topology tests 2026-01-18 21:14:02 +00:00
Evan
2b5a368977 actually update the topology 2026-01-18 21:14:02 +00:00
Evan
272e36345c incorrect log 2026-01-18 21:14:02 +00:00
Evan
77062e4fef handle an error 2026-01-18 21:14:02 +00:00
Evan
19c6758a87 fix pydantic validation 2026-01-18 21:14:02 +00:00
Evan
37ca32dc33 type checks outside of tests, time to test 2026-01-18 21:14:02 +00:00
Evan
e30c24aac8 wuff 2026-01-18 21:14:02 +00:00
Evan
fc5acf8cfb rework topology 2026-01-18 21:14:02 +00:00
Evan
0fcbbfabac update placement 2026-01-18 21:14:02 +00:00
Evan
00b15ce20d mvp 2026-01-18 21:14:02 +00:00
Evan
287e03daa3 tidy config 2026-01-18 21:14:02 +00:00
rltakashige
618cee5223 Resolve test event ordering flakiness (#1194)
## Motivation

mp sender occasionally does not have time to flush its events before
collect() is called, making the event ordering test fail.

## Changes

- Replace mp_channel with simple collector for event ordering test
- Also suppress the warning `<frozen importlib._bootstrap>:488:
DeprecationWarning: builtin type SwigPyObject has no __module__
attribute`
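
A minimal sketch of how a warning like the one named above could be silenced with the standard `warnings` module. The commit does not say which mechanism it uses (it may well be a pytest setting instead), so the helper names here are hypothetical and only the message text comes from the commit.

```python
import warnings

# Message text taken from the warning quoted in the commit. filterwarnings
# treats it as a regex matched against the start of the warning message.
SWIG_WARNING = "builtin type SwigPyObject has no __module__ attribute"


def suppress_swig_warning() -> None:
    """Ignore only the SwigPyObject DeprecationWarning (hypothetical helper)."""
    warnings.filterwarnings(
        "ignore",
        message=SWIG_WARNING,
        category=DeprecationWarning,
    )


def count_surviving_warnings() -> int:
    """Emit the SWIG warning plus an unrelated one; count what gets through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        suppress_swig_warning()
        warnings.warn(SWIG_WARNING, DeprecationWarning)
        warnings.warn("some other deprecation", DeprecationWarning)
    return len(caught)
```

Filtering on the message rather than on the whole `DeprecationWarning` category keeps unrelated deprecations visible.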


## Test Plan

### Automated Testing
Ran the test 100 times without it failing.
2026-01-18 20:33:20 +00:00
Antonio Lujano Luna
9c29eb7d48 Add proxy and custom SSL certificate support for corporate networks (#1189)
Support HTTPS_PROXY/HTTP_PROXY environment variables for proxy
configuration and SSL_CERT_FILE for custom CA certificates, enabling use
in corporate environments with SSL inspection.

## Motivation
Users in corporate environments often need to route traffic through HTTP
proxies and use custom CA certificates for SSL inspection. Without this
support, exo cannot download models in these network configurations.

## Changes
- Added `HTTPS_PROXY`/`HTTP_PROXY` environment variable support to
`create_http_session()` in `download_utils.py`
- Added `SSL_CERT_FILE` environment variable support for custom CA
certificate bundles, falling back to certifi's default bundle

## Why It Works
- `aiohttp.ClientSession` natively supports the `proxy` parameter for
routing requests through HTTP proxies
- `ssl.create_default_context(cafile=...)` accepts a custom CA bundle
path, allowing corporate CAs to be trusted
- Using environment variables is consistent with the codebase's existing
configuration patterns (e.g., `EXO_HOME`, `HF_ENDPOINT`)
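
The environment handling described above might look roughly like the sketch below. It is not the code from `download_utils.py`: the function names are hypothetical, the precedence of `HTTPS_PROXY` over `HTTP_PROXY` is an assumption, and the fallback uses the system trust store rather than certifi so that the example needs only the standard library.

```python
import os
import ssl
from typing import Mapping, Optional


def resolve_proxy(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the proxy URL from the environment, or None if unset.

    Assumption: HTTPS_PROXY wins over HTTP_PROXY when both are set.
    """
    return env.get("HTTPS_PROXY") or env.get("HTTP_PROXY") or None


def build_ssl_context(env: Mapping[str, str] = os.environ) -> ssl.SSLContext:
    """Trust the CA bundle named by SSL_CERT_FILE, else the default store.

    The commit falls back to certifi's bundle; this sketch uses the system
    default instead so that it stays stdlib-only.
    """
    cafile = env.get("SSL_CERT_FILE") or None
    return ssl.create_default_context(cafile=cafile)
```

Both helpers return "nothing special" when the variables are unset, which matches the stated goal of leaving existing behavior unchanged in that case.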

## Test Plan
### Manual Testing
- Set `HTTPS_PROXY` environment variable and verified model downloads
route through proxy
- Set `SSL_CERT_FILE` to custom CA bundle and verified SSL verification
succeeds with corporate SSL inspection

### Automated Testing
- No automated tests added; this change is configuration-only and does
not alter existing behavior when environment variables are unset
2026-01-18 12:05:50 +00:00
Alex Cheema
c5158bee53 Add pre-commit checks documentation to AGENTS.md (#1184)
## Motivation

CI failures can be avoided by running checks locally before committing.
This adds clear documentation to AGENTS.md so that AI agents (and
humans) know exactly which checks must pass before pushing code.

## Changes

Added a new "Pre-Commit Checks (REQUIRED)" section to AGENTS.md that:
- Lists all 4 required checks (basedpyright, ruff, nix fmt, pytest)
- Provides a one-liner to run all checks in sequence
- Notes that `nix fmt` changes must be staged before committing
- Explains that CI runs `nix flake check` which verifies everything

## Why It Works

Clear documentation prevents CI failures by ensuring contributors run
checks locally first. The one-liner command makes it easy to run all
checks before committing.

## Test Plan

### Manual Testing
- Verified the documented commands work correctly

### Automated Testing
- N/A - documentation only change

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 21:50:24 +00:00
rltakashige
5c8a237940 Handle model timeouts (#1177)
- Add eval with a timeout.
- Add fast synch flag

## Motivation

Because the FAST SYNCH flag is experimental, some models may not work
with it. This PR catches when this occurs and allows users to specify a
run without fast synch.

## Changes

- Adds a flag to enable or disable fast synch (--fast-synch and
--no-fast-synch)
- Adds a heuristic timeout
- Reduces exo_bench default timeout to 10 minutes.

## Why It Works

The heuristic timeout assumes normal loading times on Mac devices:
60 + (model size in GB) / 5 seconds. For example, DeepSeek takes up to
120 seconds to load in tensor parallel, so its timeout is set to
60 + 120 = 180s.

We could raise this value if necessary.
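
The heuristic above can be written out as a short function. This is only a sketch of the stated formula: the names are hypothetical, and the 600 GB in the usage note is an illustrative size chosen to reproduce the 120-second term, not a figure from the commit.

```python
BASE_TIMEOUT_S = 60.0  # fixed allowance from the formula above
SIZE_DIVISOR = 5.0     # "model size in GB / 5" from the formula above


def load_timeout_seconds(model_size_gb: float) -> float:
    """Heuristic load timeout in seconds: 60 + model size in GB / 5."""
    return BASE_TIMEOUT_S + model_size_gb / SIZE_DIVISOR
```

For a hypothetical 600 GB model, `load_timeout_seconds(600)` gives 60 + 120 = 180 seconds, matching the example in the text. Raising the value later would mean changing one of the two constants.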

## Test Plan

### Manual Testing
The timeout catches that GPT OSS fails to load in Tensor RDMA.
GPT OSS can be launched with the --no-fast-synch flag.

**GPT OSS 20B**
TP with fast synch
<img width="3064" height="456" alt="image"
src="https://github.com/user-attachments/assets/f6e25cd8-8621-4e99-99fe-292ee05c4035"
/>

TP without fast synch
<img width="3098" height="496" alt="image"
src="https://github.com/user-attachments/assets/d36453d9-6686-4cfe-aa7c-a7d458369d4d"
/>
[Note: the performance is really not great as fast synch is off]

(As a sanity check)
PP with fast synch
<img width="3124" height="496" alt="image"
src="https://github.com/user-attachments/assets/e97d4547-c6fa-483d-badb-4b371b900b4c"
/>

PP without fast synch
<img width="3078" height="508" alt="image"
src="https://github.com/user-attachments/assets/b2e20dfd-4b0e-4295-8a92-417dfe745c28"
/>

PP without RDMA
<img width="3070" height="498" alt="image"
src="https://github.com/user-attachments/assets/a8509d68-0aef-4cda-bca5-a67d39a0801e"
/>

TP without RDMA
<img width="3068" height="496" alt="image"
src="https://github.com/user-attachments/assets/b5691429-89f4-4369-bcf2-8fde2ad7154a"
/>
2026-01-16 20:25:12 +00:00
rltakashige
745343c705 Return error responses for Chat Completions (#1173)
- Error chunks
- Use error handling in exo_bench.py

## Motivation

Return when an error occurs so that generation stops. Adding timeouts is
a separate TODO for model loading and chat completions.

## Changes

- Return HTTP exceptions as JSON responses in an OpenAI compatible
format.
- Context manager for generation to catch and return error messages.
- Use error handling in exo_bench.py.
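
As a rough illustration of the first two changes, the sketch below builds an error body in the shape OpenAI-style clients generally expect (`{"error": {"message", "type", "param", "code"}}`) and wraps generation in a context manager that turns an exception into a final error chunk. The function names, the default `type` string, and the use of a plain list for chunks are all assumptions, not the PR's actual code.

```python
import json
from contextlib import contextmanager
from typing import Iterator, List, Optional


def openai_error_body(
    message: str,
    error_type: str = "server_error",  # placeholder type string (assumption)
    code: Optional[str] = None,
) -> dict:
    """Build an error payload in an OpenAI-style shape."""
    return {
        "error": {
            "message": message,
            "type": error_type,
            "param": None,
            "code": code,
        }
    }


@contextmanager
def capture_generation_errors(chunks: List[str]) -> Iterator[None]:
    """Append a JSON error chunk instead of propagating a generation failure."""
    try:
        yield
    except Exception as exc:
        chunks.append(json.dumps(openai_error_body(str(exc))))
```

A caller would wrap its generation loop in `with capture_generation_errors(chunks):` so that a failure mid-stream still leaves the client with a well-formed final message and generation stops.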

## Test Plan

### Manual Testing
Manually tested that exo_bench returns on failures within and outside
generation

2026-01-16 19:24:37 +00:00