Commit Graph

11 Commits

Author SHA1 Message Date
Ettore Di Giacinto
ae76d42a96 docs(paged): profile MTP graph reuse loss
Record Phase 16 nsys evidence that current MTP serving loses paged decode graph reuse and increases GPU work, explaining the Phase 15 serving regression.

Assisted-by: Codex:gpt-5
2026-07-01 02:32:49 +00:00
Ettore Di Giacinto
4d171e62bb docs(paged): reject MTP serving lever
Add the repeatable MTP serving A/B runner and record Phase 15 results showing current llama-server MTP regresses GB10 serving throughput despite passing inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 02:29:28 +00:00
Ettore Di Giacinto
70394364a3 docs(paged): gate MTP rollback safety
Record Phase 14 MTP rollback evidence, normalized greedy-prefix checks, and canonical inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 02:15:11 +00:00
Ettore Di Giacinto
2074b4fb5b docs(paged): reject GDN global Ai32 prototype
Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10.

Assisted-by: Codex:gpt-5
2026-07-01 01:51:53 +00:00
Ettore Di Giacinto
adabd11919 docs(paged): scope GDN global Ai32 prototype
Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.

Assisted-by: Codex:gpt-5
2026-07-01 01:38:51 +00:00
Ettore Di Giacinto
1b5ae227eb docs(paged): reject GDN M5 QS-early phase
Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact.

Assisted-by: Codex:gpt-5
2026-07-01 01:29:44 +00:00
Ettore Di Giacinto
3da3b169fb docs(paged): reject GDN C32 slab phase
Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch.

Assisted-by: Codex:gpt-5
2026-07-01 01:15:00 +00:00
Ettore Di Giacinto
34c4b5ce8d docs(paged): scope phase7 serving candidates
Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.

Assisted-by: Codex:gpt-5
2026-06-30 23:12:09 +00:00
Ettore Di Giacinto
de34cd5954 docs(paged): refresh parity handoff state
Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight.

Assisted-by: Codex:gpt-5
2026-06-30 15:27:44 +00:00
Ettore Di Giacinto
baf1025245 docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced)
The decode-serving section characterized the high-N gap as "BW-floored, vLLM
pays equally / 56-68%". A clean uncontended graph-node-traced profile
(dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling
artifact: decode runs as a replayed CUDA graph, and nsys without
--cuda-graph-trace=node collapses each replay into one opaque launch, so every
prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient")
was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16
difference method.

Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound. GDN
recurrent scan 553 us/tok (51%, linear in batch, dominant), NVFP4 expert GEMM
254 (23%), bf16 proj 73 (7%), elementwise 57, SSM conv 31. Gap reconciled:
vLLM-server 1177 -> vLLM true GPU-steady 1078 (chunked-prefill overlap inflates
its window ~8pt) -> llama GPU-steady 924 (= 86% of 1078) -> llama-server 718
(61%, the ~17pt S3-recoverable serving graph-reuse overhead). So vs vLLM's true
GPU-steady decode we are ~86%, not 56%. GDN is a shared BW floor where paged
leads (83% vs 79% of 273 GB/s peak; both 1.17-1.18x for 2x batch).

The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton
elementwise +10ms); both ggml fusions rejected: act-quant-into-MMQ -79.4%
(ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling),
norm+quant+silu infeasible via ggml_cuda_can_fuse. Added rejected levers:
Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from
hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block
GDN occupancy to -1.04% (wave-hidden).

Revised verdict: PREFILL genuinely capped (36-43%, not graph-replayed so real);
DECODE-SERVING near-parity ~86% of vLLM true GPU-steady (headline 56% was a
measurement/operating-point artifact). GB10-vs-datacenter framing kept.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-30 14:16:06 +00:00
Ettore Di Giacinto
6edbb56b06 docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED)
Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the
exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid
models. Captures the definitive same-session both-engine benchmark (prefill
S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for
both the MoE 35B-A3B and dense 27B models), the complete lever map (every
prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict
and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality,
GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the
shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on
GB10, not missing optimizations; the path to parity is datacenter Blackwell.

Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the
marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP,
DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are
marked estimated. Add a section-9 summary + link in the backend README.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-30 11:57:36 +00:00