mirror of
https://github.com/exo-explore/exo.git
synced 2026-02-06 12:11:22 -05:00
## Motivation

Enable parallel classifier-free guidance (CFG) for Qwen image models. CFG requires two forward passes (one for the positive prompt, one for the negative prompt); this change lets them run on separate nodes simultaneously, reducing latency.

## Changes

- Added a `uses_cfg` flag to `ModelCard` to identify CFG-based models
- Extended `PipelineShardMetadata` with CFG topology fields (`cfg_rank`, `cfg_world_size`, peer device info)
- Updated placement to create two CFG groups with reversed ordering (places CFG peers as ring neighbors)
- Refactored `DiffusionRunner` to process CFG branches separately, with the exchange happening at the last pipeline stage
- Added `get_cfg_branch_data()` to `PromptData` for single-branch embeddings
- Fixed seed handling in the API for distributed consistency
- Fixed image yield to only emit from CFG rank 0 at the last stage
- Increased `num_sync_steps_factor` from 0.125 to 0.25 for Qwen

## Why It Works

- 2 nodes + CFG: Both nodes run all layers and process different CFG branches in parallel
- 4+ even nodes + CFG: Hybrid, with 2 CFG groups × N/2 pipeline stages
- Odd nodes or non-CFG: Falls back to pure pipeline parallelism

Ring topology places CFG peers as neighbors to enable direct exchange.

## Test Plan

### Manual Testing

Verified a performance gain for Qwen-Image on 2-node and 4-node clusters. Non-CFG models still work.

### Automated Testing

Added tests in `test_placement_utils.py` covering 2-node CFG parallel, 4-node hybrid, odd-node fallback, and non-CFG pipeline modes.
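
### Placement sketch

The placement rules listed under "Why It Works" can be illustrated with a small standalone sketch. This is not the code from this change: `ShardPlan` and `plan_shards` are hypothetical names, the field names only loosely follow the `cfg_rank` / `cfg_world_size` fields mentioned above, and the choice to split the ring in half and reverse the second half is an assumption about how "reversed ordering" makes CFG peers ring neighbors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardPlan:
    """Hypothetical per-node placement record (illustration only)."""

    node: str
    cfg_rank: int             # which CFG branch this node serves (0 or 1)
    cfg_world_size: int       # 2 when CFG branches run in parallel, else 1
    pipeline_rank: int        # stage index within the node's pipeline
    pipeline_world_size: int  # number of stages in that pipeline


def plan_shards(ring: list[str], uses_cfg: bool) -> list[ShardPlan]:
    """Assign CFG and pipeline ranks to nodes given in ring order."""
    n = len(ring)
    if uses_cfg and n >= 2 and n % 2 == 0:
        half = n // 2
        # Group 0 takes the first half of the ring, stages in ring order.
        plans = [ShardPlan(node, 0, 2, i, half) for i, node in enumerate(ring[:half])]
        # Group 1 takes the second half with stages reversed, so the last
        # stage of group 0 sits next to the last stage of group 1.
        plans += [
            ShardPlan(node, 1, 2, half - 1 - j, half)
            for j, node in enumerate(ring[half:])
        ]
        return plans
    # Odd node count or non-CFG model: pure pipeline parallelism.
    return [ShardPlan(node, 0, 1, i, n) for i, node in enumerate(ring)]


if __name__ == "__main__":
    for plan in plan_shards(["a", "b", "c", "d"], uses_cfg=True):
        print(plan)
```

With four nodes in ring order `a, b, c, d`, this sketch puts the two last-stage nodes (`b` and `c`) next to each other, which is where the description says the CFG exchange happens. With two nodes, each node is a one-stage pipeline that runs all layers for its own branch.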