Compare commits


8 Commits

Author SHA1 Message Date
Alex Cheema
5e28664c41 Fix draft release detection (attempt 3) (#1176)
## Motivation

The previous fix still failed in CI. Suspected cause: a permissions issue
where GITHUB_TOKEN cannot see draft releases via the API.

## Changes

1. Add explicit `permissions: contents: write` to the job
2. Use `gh release list` first to check whether the draft exists (this uses
a different code path that may work better)
3. Add debug echo statements

## Test Plan

Delete v1.0.63 tag and re-push after merging.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 17:26:06 +00:00
Alex Cheema
ae0a804ccb Fix draft release detection query (#1175)
## Motivation

Fixes the draft release detection that failed on the v1.0.63 release
attempt.

## Changes

The jq query was piped to `head -1`, which truncated the multi-line JSON
output to just `{`, causing the empty check to fail.

Changed to use `first // empty` in jq instead.

## Test Plan

Tested locally:
```bash
GITHUB_REF_NAME="v1.0.63"
gh api repos/exo-explore/exo/releases --jq "[.[] | select(.draft == true) | select(.name == \"$GITHUB_REF_NAME\")] | first // empty"
# Returns the full draft release JSON (2711 chars)
```

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 17:05:24 +00:00
Alex Cheema
07cf2c1aa1 Add GitHub releases with Sparkle release notes integration (#1172)
## Motivation

Closes #1140

Currently, releases are uploaded to S3 for Sparkle updates, but no GitHub
Release is created and Sparkle update dialogs don't show release notes.
Users have no visibility into what changed.

## Changes

- Added release workflow documentation comment at top of `build-app.yml`
- Added "Fetch release notes for Sparkle" step that converts markdown
from draft GitHub release to HTML
- Added "Inject release notes into appcast" step that embeds HTML in
appcast.xml with CDATA
- Added "Publish GitHub Release" step that attaches DMG and publishes
the draft

## Why It Works

- Sparkle's `<description>` tag supports HTML wrapped in CDATA for
rendering in update dialogs
- GitHub's markdown API (`/markdown`) converts the release notes to HTML
with proper formatting (see the sketch after this list)
- Draft releases allow writing polished notes before the build; the
workflow then publishes them automatically
- The workflow fails if no draft release exists, ensuring release notes
are always provided
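
As a rough illustration of the two GitHub API calls the bullets above rely on, here is a Python sketch. The workflow itself does this with `gh` and `jq`; the repo slug, token handling, and function name here are illustrative assumptions, not the shipped code:

```python
import os

import httpx  # assumed available; any HTTP client works

REPO = "exo-explore/exo"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def draft_release_notes_as_html(tag: str) -> str | None:
    """Find the draft release named after the tag and render its markdown notes to HTML."""
    releases = httpx.get(f"https://api.github.com/repos/{REPO}/releases", headers=HEADERS).json()
    draft = next((r for r in releases if r.get("draft") and r.get("name") == tag), None)
    if draft is None or not draft.get("body"):
        return None  # the real workflow fails here for non-alpha tags
    # GitHub's /markdown endpoint renders the notes to HTML, ready to be
    # wrapped in CDATA inside the appcast <description> tag.
    resp = httpx.post(
        "https://api.github.com/markdown",
        headers=HEADERS,
        json={"text": draft["body"], "mode": "gfm", "context": REPO},
    )
    return resp.text
```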

## Test Plan

### Manual Testing
1. Create a draft GitHub release for a new tag with markdown release
notes
2. Push the tag to trigger the workflow
3. Verify the GitHub release is published with DMG attached
4. Download appcast.xml from S3 and verify
`<description><![CDATA[...]]></description>` contains HTML
5. Test Sparkle update dialog on macOS to confirm release notes appear

### Automated Testing
No automated tests added - this is CI workflow configuration.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 16:47:33 +00:00
Evan
83c5285a80 reduce logs
previous commits' logs were too verbose; this tones them down a bit
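
The gist of the change, as a standalone Python sketch (the logger names match the logging-setup diff below):

```python
import logging

# httpx and httpcore log every request/connection at INFO; capping them at
# WARNING keeps exo's output readable while still surfacing real problems.
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("httpcore").setLevel(logging.WARNING)
```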
2026-01-16 14:05:47 +00:00
Evan Quiney
39ee2bf7bd switch from synchronous threaded pinging to an async implementation (#1170)
still seeing churn in our networking - let's properly rate limit it

## changes

added a persistent httpx AsyncClient with a limit on max connections
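
A minimal sketch of the pattern, with illustrative names (not the exo code itself): one persistent `AsyncClient` with connection limits, shared by all concurrent reachability probes.

```python
import anyio
import httpx

async def probe_all(urls: list[str]) -> dict[str, bool]:
    """Probe peers concurrently over a single rate-limited client."""
    results: dict[str, bool] = {}
    # one client = one connection pool; the limits cap how hard we hit the network
    limits = httpx.Limits(max_connections=100, max_keepalive_connections=20)
    async with httpx.AsyncClient(timeout=httpx.Timeout(5.0), limits=limits) as client:
        async with anyio.create_task_group() as tg:
            for url in urls:
                tg.start_soon(_probe, client, url, results)
    return results

async def _probe(client: httpx.AsyncClient, url: str, out: dict[str, bool]) -> None:
    try:
        r = await client.get(url)
        out[url] = r.status_code == 200
    except httpx.HTTPError:
        out[url] = False
```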

## testing

deployed on a cluster; discovery is VASTLY more stable (the only deleted
edges were those discovered by mDNS)
2026-01-16 13:20:03 +00:00
Sami Khan
991adfbd6f fix local network warning (#1136)
## Motivation

The local network warning banner was showing on fresh install even though
mDNS was working: the check would fail before the user had a chance to
grant permission via the macOS prompt.

## Changes

- Added `hasWorkedBefore` flag persisted in UserDefaults
- Only show warning if permission previously worked but now doesn't

## Why It Works

On fresh install, the check may fail (no permission yet), but
`hasWorkedBefore` is false so no warning shows. Once the user grants
permission and a check succeeds, we record it. Future failures (zombie
permission after restart) will show the warning since `hasWorkedBefore`
is now true.

## Test Plan

### Manual Testing
Run locally

### Automated Testing
N/A
2026-01-16 13:10:50 +00:00
rltakashige
4b3de6b984 Fix exo bench for transformers 5.x (#1168)
## Motivation
Prompt Sizer was broken because transformers 5.x tokenizers return a
BatchEncoding, which is essentially a dictionary of {input_ids: [...]},
instead of a plain list of input IDs.
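
A minimal sketch of the compatibility shim (the actual one-line fix appears in the PromptSizer diff below; the function name here is illustrative):

```python
def token_count(ids) -> int:
    # transformers 5.x returns a BatchEncoding ({"input_ids": [...]});
    # older versions return a plain list of token ids.
    if hasattr(ids, "input_ids"):
        ids = ids.input_ids
    return len(ids)
```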

## Test Plan

### Manual Testing
Tested that exo bench runs as expected.

### Automated Testing
N/A
2026-01-16 12:39:22 +00:00
Evan
c8de3b90ea quiet rust logs
rust logs were too verbose - now only warnings propagate to python

entirely happy not to merge this and to clean up rust logging instead,
but this felt saner for now
2026-01-16 12:34:28 +00:00
15 changed files with 262 additions and 5433 deletions

View File

@@ -1,5 +1,16 @@
name: Build EXO macOS DMG
# Release workflow:
# 1. Create a draft GitHub Release with the tag name (e.g. v1.0.0) and write release notes in markdown
# 2. Push the tag: git tag v1.0.0 && git push origin v1.0.0
# 3. This workflow builds, signs, and notarizes the DMG
# 4. Release notes are embedded in appcast.xml for Sparkle (rendered as markdown)
# 5. DMG and appcast.xml are uploaded to S3
# 6. The draft GitHub Release is published with the DMG attached
#
# For alpha releases (e.g. v1.0.0-alpha.1): draft release and notes are optional.
# If no draft exists, a release is auto-created with generated notes.
on:
workflow_dispatch:
push:
@@ -11,8 +22,10 @@ on:
jobs:
build-macos-app:
runs-on: "macos-26"
permissions:
contents: write
env:
SPARKLE_VERSION: 2.8.1
SPARKLE_VERSION: 2.9.0-beta.1
SPARKLE_DOWNLOAD_PREFIX: ${{ secrets.SPARKLE_DOWNLOAD_PREFIX }}
SPARKLE_FEED_URL: ${{ secrets.SPARKLE_FEED_URL }}
SPARKLE_ED25519_PUBLIC: ${{ secrets.SPARKLE_ED25519_PUBLIC }}
@@ -87,6 +100,52 @@ jobs:
exit 1
fi
- name: Fetch and validate release notes
if: github.ref_type == 'tag'
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
# Find draft release by name using gh release list (more reliable with default token)
echo "Looking for draft release named '$GITHUB_REF_NAME'..."
DRAFT_EXISTS=$(gh release list --json name,isDraft --jq ".[] | select(.isDraft == true) | select(.name == \"$GITHUB_REF_NAME\") | .name" 2>/dev/null || echo "")
if [[ -z "$DRAFT_EXISTS" ]]; then
if [[ "$IS_ALPHA" == "true" ]]; then
echo "No draft release found for alpha tag $GITHUB_REF_NAME (optional for alphas)"
echo "HAS_RELEASE_NOTES=false" >> $GITHUB_ENV
exit 0
fi
echo "ERROR: No draft release found for tag $GITHUB_REF_NAME"
echo "Please create a draft release with release notes before pushing the tag."
exit 1
fi
# Fetch full release details via API to get body and ID
echo "Found draft release, fetching details..."
RELEASE_JSON=$(gh api repos/${{ github.repository }}/releases --jq ".[] | select(.draft == true) | select(.name == \"$GITHUB_REF_NAME\")" 2>/dev/null || echo "")
# Extract release notes
NOTES=$(echo "$RELEASE_JSON" | jq -r '.body // ""')
if [[ -z "$NOTES" || "$NOTES" == "null" ]]; then
if [[ "$IS_ALPHA" == "true" ]]; then
echo "Draft release has no notes (optional for alphas)"
echo "HAS_RELEASE_NOTES=false" >> $GITHUB_ENV
exit 0
fi
echo "ERROR: Draft release exists but has no release notes"
echo "Please add release notes to the draft release before pushing the tag."
exit 1
fi
# Save release ID for later publishing
RELEASE_ID=$(echo "$RELEASE_JSON" | jq -r '.id')
echo "DRAFT_RELEASE_ID=$RELEASE_ID" >> $GITHUB_ENV
echo "HAS_RELEASE_NOTES=true" >> $GITHUB_ENV
echo "Found draft release (ID: $RELEASE_ID), saving release notes..."
echo "$NOTES" > /tmp/release_notes.md
echo "RELEASE_NOTES_FILE=/tmp/release_notes.md" >> $GITHUB_ENV
# ============================================================
# Install dependencies
# ============================================================
@@ -304,6 +363,28 @@ jobs:
$CHANNEL_FLAG \
.
- name: Inject release notes into appcast
if: github.ref_type == 'tag' && env.HAS_RELEASE_NOTES == 'true'
env:
RELEASE_VERSION: ${{ env.RELEASE_VERSION }}
run: |
# Inject markdown release notes with sparkle:format="markdown" (Sparkle 2.9+)
export NOTES=$(cat "$RELEASE_NOTES_FILE")
# Insert description after the enclosure tag for this version
awk '
/<enclosure[^>]*>/ && index($0, ENVIRON["RELEASE_VERSION"]) {
print
print " <description sparkle:format=\"markdown\"><![CDATA["
print ENVIRON["NOTES"]
print " ]]></description>"
next
}
{ print }
' output/appcast.xml > output/appcast.xml.tmp && mv output/appcast.xml.tmp output/appcast.xml
echo "Injected markdown release notes for version $RELEASE_VERSION"
# ============================================================
# Upload artifacts
# ============================================================
@@ -336,3 +417,26 @@ jobs:
aws s3 cp "$DMG_NAME" "s3://${SPARKLE_S3_BUCKET}/${PREFIX}EXO-latest.dmg"
aws s3 cp appcast.xml "s3://${SPARKLE_S3_BUCKET}/${PREFIX}appcast.xml" --content-type application/xml --cache-control no-cache
fi
- name: Publish GitHub Release
if: github.ref_type == 'tag'
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
DMG_PATH="output/EXO-${RELEASE_VERSION}.dmg"
if [[ "$HAS_RELEASE_NOTES" == "true" ]]; then
# Update the draft release with the tag and upload DMG
gh api --method PATCH "repos/${{ github.repository }}/releases/$DRAFT_RELEASE_ID" \
-f tag_name="$GITHUB_REF_NAME" \
-F draft=false
gh release upload "$GITHUB_REF_NAME" "$DMG_PATH" --clobber
echo "Published release $GITHUB_REF_NAME with DMG attached"
else
# Alpha without draft release - create one with auto-generated notes
gh release create "$GITHUB_REF_NAME" "$DMG_PATH" \
--title "$GITHUB_REF_NAME" \
--generate-notes \
--prerelease
echo "Created alpha release $GITHUB_REF_NAME with auto-generated notes"
fi

View File

@@ -585,7 +585,7 @@
repositoryURL = "https://github.com/sparkle-project/Sparkle.git";
requirement = {
kind = upToNextMajorVersion;
minimumVersion = 2.8.1;
minimumVersion = 2.9.0-beta.1;
};
};
/* End XCRemoteSwiftPackageReference section */

View File

@@ -6,8 +6,8 @@
"kind" : "remoteSourceControl",
"location" : "https://github.com/sparkle-project/Sparkle.git",
"state" : {
"revision" : "5581748cef2bae787496fe6d61139aebe0a451f6",
"version" : "2.8.1"
"revision" : "e641adb41915a8409895e2e30666aa64e487b637",
"version" : "2.9.0-beta.1"
}
}
],

View File

@@ -56,6 +56,11 @@ struct ContentView: View {
}
private var shouldShowLocalNetworkWarning: Bool {
// Show warning if local network is not working and EXO is running.
// The checker uses a longer timeout on first launch to allow time for
// the permission prompt, so this correctly handles both:
// 1. User denied permission on first launch
// 2. Permission broke after restart (macOS TCC bug)
if case .notWorking = localNetworkChecker.status {
return controller.status != .stopped
}

View File

@@ -5,8 +5,8 @@ import os.log
/// Checks if the app's local network permission is actually functional.
///
/// macOS local network permission can appear enabled in System Preferences but not
/// actually work after a restart. This service detects this by creating a UDP
/// connection to the mDNS multicast address (224.0.0.251:5353).
/// actually work after a restart. This service uses NWConnection to mDNS multicast
/// to verify actual connectivity.
@MainActor
final class LocalNetworkChecker: ObservableObject {
enum Status: Equatable {
@@ -35,30 +35,43 @@ final class LocalNetworkChecker: ObservableObject {
}
private static let logger = Logger(subsystem: "io.exo.EXO", category: "LocalNetworkChecker")
private static let hasCompletedInitialCheckKey = "LocalNetworkChecker.hasCompletedInitialCheck"
@Published private(set) var status: Status = .unknown
@Published private(set) var lastConnectionState: String = "none"
private var connection: NWConnection?
private var checkTask: Task<Void, Never>?
/// Whether we've completed at least one check (stored in UserDefaults)
private var hasCompletedInitialCheck: Bool {
get { UserDefaults.standard.bool(forKey: Self.hasCompletedInitialCheckKey) }
set { UserDefaults.standard.set(newValue, forKey: Self.hasCompletedInitialCheckKey) }
}
/// Checks if local network access is working.
func check() {
checkTask?.cancel()
status = .checking
lastConnectionState = "connecting"
// Use longer timeout on first launch to allow time for permission prompt
let isFirstCheck = !hasCompletedInitialCheck
let timeout: UInt64 = isFirstCheck ? 30_000_000_000 : 3_000_000_000
checkTask = Task { [weak self] in
guard let self else { return }
let result = await self.performCheck()
Self.logger.info("Checking local network connectivity (first check: \(isFirstCheck))")
let result = await self.checkConnectivity(timeout: timeout)
self.status = result
self.hasCompletedInitialCheck = true
Self.logger.info("Local network check complete: \(result.displayText)")
}
}
private func performCheck() async -> Status {
Self.logger.info("Checking local network access via UDP multicast")
/// Checks connectivity using NWConnection to mDNS multicast.
/// The connection attempt triggers the permission prompt if not yet shown.
private func checkConnectivity(timeout: UInt64) async -> Status {
connection?.cancel()
connection = nil
@@ -84,22 +97,7 @@ final class LocalNetworkChecker: ObservableObject {
continuation.resume(returning: status)
}
conn.stateUpdateHandler = { [weak self] state in
let stateStr: String
switch state {
case .setup: stateStr = "setup"
case .preparing: stateStr = "preparing"
case .ready: stateStr = "ready"
case .waiting(let e): stateStr = "waiting(\(e))"
case .failed(let e): stateStr = "failed(\(e))"
case .cancelled: stateStr = "cancelled"
@unknown default: stateStr = "unknown"
}
Task { @MainActor in
self?.lastConnectionState = stateStr
}
conn.stateUpdateHandler = { state in
switch state {
case .ready:
resumeOnce(.working)
@@ -108,6 +106,7 @@ final class LocalNetworkChecker: ObservableObject {
if errorStr.contains("54") || errorStr.contains("ECONNRESET") {
resumeOnce(.notWorking(reason: "Connection blocked"))
}
// Otherwise keep waiting - might be showing permission prompt
case .failed(let error):
let errorStr = "\(error)"
if errorStr.contains("65") || errorStr.contains("EHOSTUNREACH")
@@ -127,7 +126,7 @@ final class LocalNetworkChecker: ObservableObject {
conn.start(queue: .main)
Task {
try? await Task.sleep(nanoseconds: 3_000_000_000)
try? await Task.sleep(nanoseconds: timeout)
let state = conn.state
switch state {
case .ready:

View File

@@ -241,6 +241,9 @@ class PromptSizer:
ids = tokenizer.apply_chat_template(
messages, tokenize=True, add_generation_prompt=True
)
# Fix for transformers 5.x
if hasattr(ids, "input_ids"):
ids = ids.input_ids
return int(len(ids))
return count_fn

View File

@@ -1,378 +0,0 @@
#!/usr/bin/env python3
# pyright: reportAny=false, reportUnknownMemberType=false, reportUnknownVariableType=false, reportUnknownArgumentType=false, reportMissingTypeStubs=false
"""
exo-eval: Run SWE-bench evaluation against exo using OpenHands SDK (local, no Docker).
"""
from __future__ import annotations
import argparse
import json
import os
import subprocess
import tempfile
import time
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Any
from datasets import load_dataset
from loguru import logger
from openhands.sdk import LLM, Agent, Conversation, Tool
from openhands.tools.file_editor import FileEditorTool
from openhands.tools.terminal import TerminalTool
class EvalStatus(str, Enum):
Resolved = "Resolved"
Failed = "Failed"
Error = "Error"
Timeout = "Timeout"
@dataclass
class EvalResult:
instance_id: str
repo: str
status: EvalStatus
elapsed_seconds: float
tests_passed: list[str]
tests_failed: list[str]
error_message: str | None = None
def load_swe_bench(
split: str = "lite",
limit: int | None = None,
instance_ids: list[str] | None = None,
) -> list[dict[str, Any]]:
"""Load SWE-bench dataset from HuggingFace."""
# SWE-bench Lite is a curated 300-instance subset
dataset_name = (
"princeton-nlp/SWE-bench_Lite" if split == "lite" else "princeton-nlp/SWE-bench"
)
actual_split = "test" if split == "lite" else split
ds = load_dataset(dataset_name, split=actual_split)
instances = [dict(row) for row in ds]
if instance_ids:
instances = [i for i in instances if i["instance_id"] in instance_ids]
if limit:
instances = instances[:limit]
return instances
def clone_repo_at_commit(repo: str, commit: str, dest: Path) -> None:
"""Clone a repo at a specific commit."""
repo_url = f"https://github.com/{repo}.git"
subprocess.run(
["git", "clone", "--depth", "1", repo_url, str(dest)],
check=True,
capture_output=True,
)
subprocess.run(
["git", "fetch", "--depth", "1", "origin", commit],
cwd=dest,
check=True,
capture_output=True,
)
subprocess.run(
["git", "checkout", commit],
cwd=dest,
check=True,
capture_output=True,
)
def build_agent_prompt(instance: dict[str, Any]) -> str:
"""Build the prompt for the agent."""
return f"""You are a software engineer fixing a bug in the {instance['repo']} repository.
## Problem Statement
{instance['problem_statement']}
## Instructions
1. Explore the codebase to understand the issue
2. Identify the files that need to be modified
3. Make the necessary changes to fix the issue
4. The fix should be minimal and targeted
You have access to:
- terminal: Run shell commands (git, grep, python, etc.)
- file_editor: View and edit files
Start by exploring the repository structure to understand where the relevant code is.
"""
def parse_fail_to_pass(fail_to_pass_str: str) -> list[str]:
"""Parse the FAIL_TO_PASS field into a list of test names."""
try:
return json.loads(fail_to_pass_str)
except json.JSONDecodeError:
return [t.strip() for t in fail_to_pass_str.split(",") if t.strip()]
def run_tests(workspace: Path, tests: list[str]) -> tuple[list[str], list[str]]:
"""Run tests and return (passed, failed) lists."""
passed = []
failed = []
for test in tests:
try:
result = subprocess.run(
["python", "-m", "pytest", "-xvs", test],
cwd=workspace,
capture_output=True,
timeout=300,
)
if result.returncode == 0:
passed.append(test)
else:
failed.append(test)
except subprocess.TimeoutExpired:
failed.append(test)
return passed, failed
def run_single_eval(
instance: dict[str, Any],
host: str,
port: int,
model: str,
max_turns: int = 30,
timeout: float = 600.0,
) -> EvalResult:
"""Evaluate a single SWE-bench instance."""
instance_id = instance["instance_id"]
repo = instance["repo"]
base_commit = instance["base_commit"]
fail_to_pass = parse_fail_to_pass(instance["FAIL_TO_PASS"])
start_time = time.perf_counter()
try:
with tempfile.TemporaryDirectory() as tmpdir:
workspace = Path(tmpdir) / "repo"
# Clone repo at base commit
logger.info(f"Cloning {repo} at {base_commit[:8]}...")
clone_repo_at_commit(repo, base_commit, workspace)
# Setup OpenHands agent
llm = LLM(
model=f"openai/{model}",
base_url=f"http://{host}:{port}/v1",
api_key="not-needed",
)
agent = Agent(
llm=llm,
tools=[
Tool(name=TerminalTool.name),
Tool(name=FileEditorTool.name),
],
)
# Run agent
conversation = Conversation(
agent=agent,
workspace=str(workspace),
)
logger.info(f"Running agent on {instance_id}...")
conversation.send_message(build_agent_prompt(instance))
for _turn in range(max_turns):
if time.perf_counter() - start_time > timeout:
return EvalResult(
instance_id=instance_id,
repo=repo,
status=EvalStatus.Timeout,
elapsed_seconds=time.perf_counter() - start_time,
tests_passed=[],
tests_failed=fail_to_pass,
)
result = conversation.run(max_turns=1)
if result.done:
break
# Run tests to verify
logger.info(f"Running tests for {instance_id}...")
passed, failed = run_tests(workspace, fail_to_pass)
elapsed = time.perf_counter() - start_time
status = EvalStatus.Resolved if not failed else EvalStatus.Failed
return EvalResult(
instance_id=instance_id,
repo=repo,
status=status,
elapsed_seconds=elapsed,
tests_passed=passed,
tests_failed=failed,
)
except Exception as e:
return EvalResult(
instance_id=instance_id,
repo=repo,
status=EvalStatus.Error,
elapsed_seconds=time.perf_counter() - start_time,
tests_passed=[],
tests_failed=[],
error_message=str(e),
)
def verify_exo_running(host: str, port: int, model: str) -> str:
"""Verify exo is running and return full model ID."""
import http.client
conn = http.client.HTTPConnection(host, port, timeout=10)
conn.request("GET", "/models")
resp = conn.getresponse()
if resp.status != 200:
raise RuntimeError(f"exo not responding at {host}:{port}")
data = json.loads(resp.read())
for m in data.get("data", []):
if m.get("id") == model or m.get("hugging_face_id") == model:
return m.get("hugging_face_id") or m.get("id")
raise ValueError(f"Model '{model}' not found in exo")
def main() -> int:
ap = argparse.ArgumentParser(
prog="exo-eval",
description="Run SWE-bench evaluation against exo (local, no Docker).",
)
ap.add_argument("--host", default=os.environ.get("EXO_HOST", "localhost"))
ap.add_argument(
"--port", type=int, default=int(os.environ.get("EXO_PORT", "52415"))
)
ap.add_argument("--model", required=True, help="exo model ID")
ap.add_argument(
"--split", default="lite", choices=["lite", "dev", "test", "train"]
)
ap.add_argument("--limit", type=int, default=10, help="Max instances")
ap.add_argument("--instance-ids", nargs="+", help="Specific instance IDs")
ap.add_argument("--max-turns", type=int, default=30)
ap.add_argument("--timeout", type=float, default=600.0)
ap.add_argument("--json-out", default="bench/eval_results.json")
ap.add_argument("-v", "--verbose", action="store_true")
ap.add_argument("--dry-run", action="store_true")
args = ap.parse_args()
# Load dataset first (doesn't require exo to be running)
logger.info(f"Loading SWE-bench {args.split} dataset...")
instances = load_swe_bench(
split=args.split,
limit=args.limit,
instance_ids=args.instance_ids,
)
logger.info(f"Loaded {len(instances)} instances")
if args.dry_run:
print(f"\nSWE-bench {args.split} instances ({len(instances)}):")
for inst in instances:
print(f" {inst['instance_id']} ({inst['repo']})")
return 0
# Verify exo is running
model_id = verify_exo_running(args.host, args.port, args.model)
logger.info(f"Using model: {model_id}")
# Run evaluation
results: list[EvalResult] = []
for i, instance in enumerate(instances):
logger.info(f"[{i+1}/{len(instances)}] {instance['instance_id']}")
result = run_single_eval(
instance=instance,
host=args.host,
port=args.port,
model=model_id,
max_turns=args.max_turns,
timeout=args.timeout,
)
results.append(result)
logger.info(f" Status: {result.status.value}")
if result.tests_passed:
logger.info(f" Passed: {len(result.tests_passed)} tests")
if result.tests_failed:
logger.info(f" Failed: {len(result.tests_failed)} tests")
if result.error_message:
logger.error(f" Error: {result.error_message}")
# Compute summary
total = len(results)
resolved = sum(1 for r in results if r.status == EvalStatus.Resolved)
failed = sum(1 for r in results if r.status == EvalStatus.Failed)
errors = sum(1 for r in results if r.status == EvalStatus.Error)
timeouts = sum(1 for r in results if r.status == EvalStatus.Timeout)
summary = {
"model": model_id,
"split": args.split,
"total": total,
"resolved": resolved,
"resolved_rate": resolved / total if total else 0,
"failed": failed,
"errors": errors,
"timeouts": timeouts,
}
output = {
"summary": summary,
"results": [
{
"instance_id": r.instance_id,
"repo": r.repo,
"status": r.status.value,
"elapsed_seconds": r.elapsed_seconds,
"tests_passed": r.tests_passed,
"tests_failed": r.tests_failed,
"error_message": r.error_message,
}
for r in results
],
}
Path(args.json_out).write_text(json.dumps(output, indent=2))
logger.info(f"Results written to {args.json_out}")
# Print summary
print("\n" + "=" * 60)
print("SWE-bench Evaluation Results")
print("=" * 60)
print(f"Model: {model_id}")
print(f"Split: {args.split}")
print(f"Total: {total}")
if total:
print(f"Resolved: {resolved} ({resolved/total*100:.1f}%)")
else:
print("Resolved: 0")
print(f"Failed: {failed}")
print(f"Errors: {errors}")
print(f"Timeouts: {timeouts}")
print("=" * 60)
return 0
if __name__ == "__main__":
raise SystemExit(main())

View File

@@ -23,9 +23,7 @@ dependencies = [
"tiktoken>=0.12.0", # required for kimi k2 tokenizer
"hypercorn>=0.18.0",
"openai-harmony>=0.0.8",
"openhands-sdk>=0.1.0", # for exo-eval SWE-bench evaluation
"openhands-tools>=0.1.0", # tools for openhands agents
"datasets>=3.0.0", # for loading SWE-bench from HuggingFace
"httpx>=0.28.1",
]
[project.scripts]

View File

@@ -1,6 +1,6 @@
import time
from collections.abc import AsyncGenerator
from typing import Any, cast
from typing import cast
import anyio
from anyio import create_task_group
@@ -72,13 +72,7 @@ def chunk_to_response(
choices=[
StreamingChoiceResponse(
index=0,
delta=ChatCompletionMessage(
role="assistant",
content=chunk.text if chunk.text else None,
tool_calls=[tc.model_dump() for tc in chunk.tool_calls]
if chunk.tool_calls
else None,
),
delta=ChatCompletionMessage(role="assistant", content=chunk.text),
finish_reason=chunk.finish_reason,
)
],
@@ -430,7 +424,6 @@ class API:
text_parts: list[str] = []
model: str | None = None
finish_reason: FinishReason | None = None
all_tool_calls: list[dict[str, Any]] = []
async for chunk in self._chat_chunk_stream(command_id):
if model is None:
@@ -438,20 +431,6 @@ class API:
text_parts.append(chunk.text)
# Collect tool calls
if chunk.tool_calls:
for tc in chunk.tool_calls:
all_tool_calls.append(
{
"id": tc.id,
"type": tc.type,
"function": {
"name": tc.function.name,
"arguments": tc.function.arguments,
},
}
)
if chunk.finish_reason is not None:
finish_reason = chunk.finish_reason
@@ -467,8 +446,7 @@ class API:
index=0,
message=ChatCompletionMessage(
role="assistant",
content=combined_text if combined_text else None,
tool_calls=all_tool_calls if all_tool_calls else None,
content=combined_text,
),
finish_reason=finish_reason,
)
@@ -481,7 +459,6 @@ class API:
text_parts: list[str] = []
model: str | None = None
finish_reason: FinishReason | None = None
all_tool_calls: list[dict[str, Any]] = []
stats: GenerationStats | None = None
@@ -492,20 +469,6 @@ class API:
text_parts.append(chunk.text)
stats = chunk.stats or stats
# Collect tool calls
if chunk.tool_calls:
for tc in chunk.tool_calls:
all_tool_calls.append(
{
"id": tc.id,
"type": tc.type,
"function": {
"name": tc.function.name,
"arguments": tc.function.arguments,
},
}
)
if chunk.finish_reason is not None:
finish_reason = chunk.finish_reason
@@ -520,9 +483,7 @@ class API:
ChatCompletionChoice(
index=0,
message=ChatCompletionMessage(
role="assistant",
content=combined_text if combined_text else None,
tool_calls=all_tool_calls if all_tool_calls else None,
role="assistant", content=combined_text
),
finish_reason=finish_reason,
)

View File

@@ -29,6 +29,11 @@ class _InterceptHandler(logging.Handler):
def logger_setup(log_file: Path | None, verbosity: int = 0):
"""Set up logging for this process - formatting, file handles, verbosity and output"""
logging.getLogger("exo_pyo3_bindings").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("httpcore").setLevel(logging.WARNING)
logger.remove()
# replace all stdlib loggers with _InterceptHandlers that log to loguru

View File

@@ -1,7 +1,4 @@
from enum import Enum
from typing import Literal
from pydantic import BaseModel
from exo.shared.types.api import GenerationStats
from exo.utils.pydantic_ext import TaggedModel
@@ -15,17 +12,6 @@ class ChunkType(str, Enum):
Image = "Image"
class ToolCallFunction(BaseModel, frozen=True):
name: str
arguments: str
class ToolCall(BaseModel, frozen=True):
id: str
type: Literal["function"] = "function"
function: ToolCallFunction
class BaseChunk(TaggedModel):
idx: int
model: ModelId
@@ -36,7 +22,6 @@ class TokenChunk(BaseChunk):
token_id: int
finish_reason: FinishReason | None = None
stats: GenerationStats | None = None
tool_calls: list[ToolCall] | None = None
class ImageChunk(BaseChunk):

View File

@@ -1,5 +1,4 @@
from exo.shared.types.api import FinishReason, GenerationStats
from exo.shared.types.chunks import ToolCall
from exo.utils.pydantic_ext import TaggedModel
@@ -17,7 +16,6 @@ class GenerationResponse(BaseRunnerResponse):
# logprobs: list[float] | None = None # too big. we can change to be top-k
finish_reason: FinishReason | None = None
stats: GenerationStats | None = None
tool_calls: list[ToolCall] | None = None
class FinishedResponse(BaseRunnerResponse):

View File

@@ -1,13 +1,9 @@
import json
import time
from collections.abc import Generator
from functools import cache
from typing import Any
from uuid import uuid4
import mlx.core as mx
from mlx_lm.models.gpt_oss import Model as GptOssModel
from mlx_lm.tokenizer_utils import TokenizerWrapper
from openai_harmony import ( # pyright: ignore[reportMissingTypeStubs]
HarmonyEncodingName,
Role,
@@ -16,7 +12,7 @@ from openai_harmony import ( # pyright: ignore[reportMissingTypeStubs]
)
from exo.shared.types.api import ChatCompletionMessageText
from exo.shared.types.chunks import TokenChunk, ToolCall, ToolCallFunction
from exo.shared.types.chunks import TokenChunk
from exo.shared.types.events import (
ChunkGenerated,
Event,
@@ -176,10 +172,7 @@ def main(
if isinstance(model, GptOssModel):
mlx_generator = parse_gpt_oss(mlx_generator)
# Parse tool calls to place them in the tool calls section
mlx_generator = parse_tool_calls(
mlx_generator, tokenizer, task_params.tools
)
# TODO: Add tool call parser here
for response in mlx_generator:
match response:
@@ -195,7 +188,6 @@ def main(
token_id=response.token,
finish_reason=response.finish_reason,
stats=response.stats,
tool_calls=response.tool_calls,
),
)
)
@@ -269,98 +261,6 @@ def parse_gpt_oss(
break
def _generate_tool_call_id() -> str:
return f"call_{uuid4().hex[:24]}"
def _parse_tool_call_content(
content: str,
tokenizer: TokenizerWrapper,
tools: list[dict[str, Any]] | None,
) -> ToolCall | None:
content = content.strip()
if not content:
return None
tool_parser: Any = getattr(tokenizer, "tool_parser", None)
if tool_parser is None:
logger.warning("No tool_parser available for tokenizer")
return None
try:
parsed: dict[str, Any] = tool_parser(content, tools) # pyright: ignore[reportAny]
if parsed and "name" in parsed:
arguments: Any = parsed.get("arguments", {}) # pyright: ignore[reportAny]
arguments_str: str = (
json.dumps(arguments)
if not isinstance(arguments, str)
else arguments
)
return ToolCall(
id=_generate_tool_call_id(),
type="function",
function=ToolCallFunction(
name=str(parsed["name"]), # pyright: ignore[reportAny]
arguments=arguments_str,
),
)
except Exception as e:
logger.warning(f"tool_parser failed: {e}")
return None
def parse_tool_calls(
responses: Generator[GenerationResponse],
tokenizer: TokenizerWrapper,
tools: list[dict[str, Any]] | None,
) -> Generator[GenerationResponse]:
has_tool_calling = getattr(tokenizer, "has_tool_calling", False)
if not has_tool_calling or tools is None:
yield from responses
return
tool_call_start: str | None = getattr(tokenizer, "tool_call_start", None)
tool_call_end: str | None = getattr(tokenizer, "tool_call_end", None)
if tool_call_start is None or tool_call_end is None:
yield from responses
return
in_tool_call = False
tool_call_buffer: list[str] = []
pending_tool_calls: list[ToolCall] = []
for response in responses:
if response.text == tool_call_start:
in_tool_call = True
tool_call_buffer = []
continue
if response.text == tool_call_end:
in_tool_call = False
parsed = _parse_tool_call_content(
"".join(tool_call_buffer), tokenizer, tools
)
if parsed is not None:
pending_tool_calls.append(parsed)
continue
if in_tool_call:
tool_call_buffer.append(response.text)
continue
if response.finish_reason is None or not pending_tool_calls:
yield response
else:
yield response.model_copy(
update={
"finish_reason": "tool_calls",
"tool_calls": pending_tool_calls if pending_tool_calls else None,
}
)
EXO_RUNNER_MUST_FAIL = "EXO RUNNER MUST FAIL"
EXO_RUNNER_MUST_OOM = "EXO RUNNER MUST OOM"
EXO_RUNNER_MUST_TIMEOUT = "EXO RUNNER MUST TIMEOUT"

View File

@@ -1,62 +1,64 @@
import http.client
import time
from anyio import create_task_group, to_thread
import anyio
import httpx
from anyio import create_task_group
from loguru import logger
from exo.shared.topology import Topology
from exo.shared.types.common import NodeId
BAD_STATUSLINE_ATTEMPTS = 3
REACHABILITY_ATTEMPTS = 3
async def check_reachability(
target_ip: str,
expected_node_id: NodeId,
self_node_id: NodeId,
out: dict[NodeId, set[str]],
client: httpx.AsyncClient,
) -> None:
"""Check if a node is reachable at the given IP and verify its identity."""
if ":" in target_ip:
# TODO: use real IpAddress types
target_ip = f"[{target_ip}]"
url = f"http://{target_ip}:52415/node_id"
# TODO: use an async http client
def _fetch_remote_node_id(*, attempt: int = 1) -> NodeId | None:
connection = http.client.HTTPConnection(target_ip, 52415, timeout=3)
remote_node_id = None
last_error = None
for _ in range(REACHABILITY_ATTEMPTS):
try:
connection.request("GET", "/node_id")
response = connection.getresponse()
if response.status != 200:
return None
r = await client.get(url)
if r.status_code != 200:
await anyio.sleep(1)
continue
body = response.read().decode("utf-8").strip()
body = r.text.strip().strip('"')
if not body:
await anyio.sleep(1)
continue
# Strip quotes if present (JSON string response)
if body.startswith('"') and body.endswith('"') and len(body) >= 2:
body = body[1:-1]
remote_node_id = NodeId(body)
break
return NodeId(body) or None
except OSError:
return None
except http.client.BadStatusLine:
if attempt >= BAD_STATUSLINE_ATTEMPTS:
logger.warning(
f"BadStatusLine from {target_ip}, after {attempt} attempts, assuming connection to {expected_node_id} has dropped"
)
return None
time.sleep(1)
return _fetch_remote_node_id(attempt=attempt + 1)
except http.client.HTTPException as e:
logger.warning(f"HTTPException from {target_ip}: {type(e).__name__}: {e}")
return None
finally:
connection.close()
# expected failure cases
except (
httpx.TimeoutException,
httpx.NetworkError,
):
await anyio.sleep(1)
# other failures should be logged on last attempt
except httpx.HTTPError as e:
last_error = e
await anyio.sleep(1)
if last_error is not None:
logger.warning(
f"connect error {type(last_error).__name__} from {target_ip} after {REACHABILITY_ATTEMPTS} attempts; treating as down"
)
remote_node_id = await to_thread.run_sync(_fetch_remote_node_id)
if remote_node_id is None:
return
if remote_node_id == self_node_id:
return
if remote_node_id != expected_node_id:
logger.warning(
f"Discovered node with unexpected node_id; "
@@ -74,18 +76,33 @@ async def check_reachable(
topology: Topology, self_node_id: NodeId
) -> dict[NodeId, set[str]]:
"""Check which nodes are reachable and return their IPs."""
reachable: dict[NodeId, set[str]] = {}
async with create_task_group() as tg:
# these are intentionally httpx's defaults so we can tune them later
timeout = httpx.Timeout(timeout=5.0)
limits = httpx.Limits(
max_connections=100,
max_keepalive_connections=20,
keepalive_expiry=5,
)
async with (
httpx.AsyncClient(timeout=timeout, limits=limits) as client,
create_task_group() as tg,
):
for node in topology.list_nodes():
if not node.node_profile:
continue
if node.node_id == self_node_id:
continue
for iface in node.node_profile.network_interfaces:
tg.start_soon(
check_reachability,
iface.ip_address,
node.node_id,
self_node_id,
reachable,
client,
)
return reachable

uv.lock generated
View File

File diff suppressed because it is too large