From fc589b3fadedb7a7c77fd0bf4957065d73cbe07c Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Sun, 21 Jun 2026 07:09:42 +0000 Subject: [PATCH] analysis: vLLM GB10 advantage is the SCHEDULER, not the kernel (pivot) Code-grounded vLLM v0.23.0 analysis + DGX measurement: vLLM single-stream W4A16 prefill ~800 t/s (~52 TFLOPS) is TIED with llama.cpp MMQ (718/47), using the exact XOR-swizzle + 4-stage cp.async Marlin we proved collapses GB10 occupancy. vLLM has no FP4 cubins on sm_121 (forced W4A16 fallback), so llama.cpp MXFP4 (1153) already beats vLLM single-stream. vLLM's ~24k headline is the aggregate decode multiplier (~56x) from paged KV + chunked prefill + continuous batching - a scheduler win. llama.cpp lacks paged KV + chunked prefill. Kernel work (W4A16 178 t/s, FP4-MMA) banked as not-the-lever; effort pivots to the scheduler. Detail in VLLM_DECOMPOSITION.md; W4A16 plan marked STOPPED. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- .../cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md | 53 +++++++++++++++++++ .../paged/W4A16_MARLIN_KERNEL_PLAN.md | 12 +++++ 2 files changed, 65 insertions(+) create mode 100644 backend/cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md diff --git a/backend/cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md b/backend/cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md new file mode 100644 index 000000000..fa4fbcfb9 --- /dev/null +++ b/backend/cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md @@ -0,0 +1,53 @@ +# What makes vLLM fast on GB10 — kernel vs scheduler (code-grounded, measured) + +Decisive analysis (vLLM v0.23.0, torch 2.11+cu130, sm_121, model `RedHatAI/Qwen3-32B-NVFP4A16`, source at tag +`v0.23.0`). **Answer: it's the scheduler, not the kernel.** This closes the kernel track and opens the +scheduler track. + +## The decomposition (measured on the DGX, prefix-cache OFF, unique prompts) + +| | vLLM W4A16 Marlin | llama.cpp | verdict | +|---|---|---|---| +| **single-stream prefill** | ~800 t/s (~52 TFLOPS) | 718 MMQ / **1153 MXFP4** | **tied; llama.cpp MXFP4 wins** | +| decode batch-1 | 11.8 t/s | ~similar | bandwidth-bound (≈190/273 GB/s); no kernel helps | +| **aggregate decode** | 328 (N32) / 569 (N64) / **667 (N128)** | the gap | **~56× multiplier = scheduler** | + +vLLM's single-stream Marlin is **not** at the roofline — it's in the same ~4×-under regime as MMQ. The 24k +headline is entirely the aggregate decode multiplier. + +## The kernel vLLM actually runs on sm_121 (W4A16, forced) + +Dispatch (vLLM v0.23.0): `compressed_tensors.py:704` (NVFP4 + no input-quant → `W4A4Fp4(use_a16=True)`) → +`compressed_tensors_w4a4_nvfp4.py:28` → `kernels/linear/__init__.py:894` (`if use_a16: force_kernel = +MarlinNvFp4LinearKernel`, **unconditional, no cc gate**) → `nvfp4/marlin.py` → `marlin_utils_fp4.py:182` +`ops.marlin_gemm(b_q_type=float4_e2m1f)`, activations FP16/BF16. csrc: `csrc/quantization/marlin/marlin.cu` ++ `marlin_template.h` + `marlin.cuh`. + +Techniques = **exactly the playbook we proved loses on GB10**: XOR shared swizzle (`marlin_template.h:722 +^ (row%8)`), 4-stage cp.async pipeline (`marlin.cu:396 stages=4`, `cp_async_wait`), ldmatrix+mma, +FP16/BF16 acts. Native FP4 (`FlashInferB12xNvFp4LinearKernel`) needs `Sm120BlockScaledDenseGemm` cubins absent +on GB10 → W4A4 hangs → forced W4A16 Marlin fallback. **Nothing to port; vLLM's kernel is occupancy-blocked too.** + +## The scheduler (the real multiplier) — what llama.cpp lacks + +- **Paged KV cache** (`vllm/v1/core/kv_cache_manager.py`, `block_pool.py`): block KV, no fragmentation → very + high concurrent batch. **llama.cpp: NO** (contiguous per-slot KV → fragmentation caps real concurrency). +- **Chunked prefill** (`config/scheduler.py:84 enable_chunked_prefill=True`, default ON): interleaves prefill + chunks with decode so decode batches stay full. **llama.cpp: NO** (a long prefill stalls the decode batch). +- **Continuous batching** (`v1/core/sched/scheduler.py`): per-step admit/evict. **llama.cpp: YES** (`n_parallel`, + rudimentary — we enabled VRAM-scaled slots in #10411). + +## Recommendation + +**Pivot to the scheduler; treat the GEMM kernel as good-enough / roofline-blocked on GB10.** +1. **Ship the MXFP4-dense win now** — 1153 t/s single-stream beats vLLM's 800; a Blackwell dense-quant + recommendation (requantize, no kernel work). Already documented in `BLACKWELL_KERNEL_GAPS.md` §6. +2. **Size the gap first:** measure llama.cpp aggregate decode at `n_parallel` = 32/64/128 vs vLLM's 328/569/667. + This tells us how much of the 56× the existing continuous batching already captures, and how much paged KV + + chunked prefill would add. +3. **Then the two missing scheduler features**, in ROI order from the measurement: **chunked prefill** (keep + decode batches saturated, avoid prefill stalls) and **paged KV** (sustain large concurrent batches without + fragmentation — the contested upstream PR #22569 / the vendored patches in `patches/`). + +Kernel tracks (W4A16 P3b at 178 t/s; FP4-MMA tuning) are **banked, not resumed** — they cannot move the +throughput needle on GB10 because the bottleneck is not the GEMM. diff --git a/backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md b/backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md index e46cc6712..3ae2ae30b 100644 --- a/backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md +++ b/backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md @@ -1,5 +1,17 @@ # W4A16 Marlin-style GEMM for ggml-cuda on Blackwell (sm_120/121) — implementation plan +> **STOPPED (2026-06-21): the kernel is NOT the lever — validated by a code-grounded vLLM analysis.** +> Measured on the DGX: vLLM's single-stream W4A16 prefill on GB10 = **~800 t/s (~52 TFLOPS), statistically TIED +> with llama.cpp MMQ (718/47)** — and vLLM uses the *exact* XOR-swizzle + 4-stage cp.async Marlin we proved +> collapses GB10 occupancy (vLLM even warns at load that Marlin "may degrade performance for compute-heavy +> workloads"). There is no kernel trick to port. Moreover llama.cpp's **MXFP4 path (1153 t/s) already BEATS +> vLLM single-stream (800)** — vLLM has no FP4 cubins on sm_121 and falls back to slower W4A16 Marlin, so +> llama.cpp is *ahead* on the kernel. **vLLM's entire 24k headline is the aggregate decode multiplier (~56×) +> from paged KV + chunked prefill + continuous batching — a SCHEDULER win.** llama.cpp lacks paged KV + +> chunked prefill. **Effort pivots to the scheduler** (see the paged-attention work). This kernel work is +> banked + resumable (178 t/s, P0/P1/P2/P3/P3b committed) but is not the throughput lever on GB10. Detail: +> `VLLM_DECOMPOSITION.md`. + The committed multi-week kernel. Goal: get 4-bit-weight dense matmul to the GB10 **BF16 ceiling (~213 TFLOP/s ≈ ~3,300 t/s prefill on Qwen3-32B)**, ~4.3× over today's 765. This is the *match-vLLM* path; vLLM's own GB10 dense throughput runs on W4A16 Marlin (its FP4 path is broken on sm_121).