LocalAI

mirror/LocalAI

Fork 0

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 12:57:02 -04:00

Commit Graph

Author	SHA1	Message	Date
Ettore Di Giacinto	de34cd5954	docs(paged): refresh parity handoff state Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight. Assisted-by: Codex:gpt-5	2026-06-30 15:27:44 +00:00
Ettore Di Giacinto	baf1025245	docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced) The decode-serving section characterized the high-N gap as "BW-floored, vLLM pays equally / 56-68%". A clean uncontended graph-node-traced profile (dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling artifact: decode runs as a replayed CUDA graph, and nsys without --cuda-graph-trace=node collapses each replay into one opaque launch, so every prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient") was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16 difference method. Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound. GDN recurrent scan 553 us/tok (51%, linear in batch, dominant), NVFP4 expert GEMM 254 (23%), bf16 proj 73 (7%), elementwise 57, SSM conv 31. Gap reconciled: vLLM-server 1177 -> vLLM true GPU-steady 1078 (chunked-prefill overlap inflates its window ~8pt) -> llama GPU-steady 924 (= 86% of 1078) -> llama-server 718 (61%, the ~17pt S3-recoverable serving graph-reuse overhead). So vs vLLM's true GPU-steady decode we are ~86%, not 56%. GDN is a shared BW floor where paged leads (83% vs 79% of 273 GB/s peak; both 1.17-1.18x for 2x batch). The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton elementwise +10ms); both ggml fusions rejected: act-quant-into-MMQ -79.4% (ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling), norm+quant+silu infeasible via ggml_cuda_can_fuse. Added rejected levers: Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block GDN occupancy to -1.04% (wave-hidden). Revised verdict: PREFILL genuinely capped (36-43%, not graph-replayed so real); DECODE-SERVING near-parity ~86% of vLLM true GPU-steady (headline 56% was a measurement/operating-point artifact). GB10-vs-datacenter framing kept. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:16:06 +00:00
Ettore Di Giacinto	6edbb56b06	docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED) Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid models. Captures the definitive same-session both-engine benchmark (prefill S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for both the MoE 35B-A3B and dense 27B models), the complete lever map (every prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality, GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on GB10, not missing optimizations; the path to parity is datacenter Blackwell. Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP, DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are marked estimated. Add a section-9 summary + link in the backend README. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 11:57:36 +00:00

Author

SHA1

Message

Date

Ettore Di Giacinto

de34cd5954

docs(paged): refresh parity handoff state

Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight.

Assisted-by: Codex:gpt-5

2026-06-30 15:27:44 +00:00

Ettore Di Giacinto

baf1025245

docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced)

The decode-serving section characterized the high-N gap as "BW-floored, vLLM
pays equally / 56-68%". A clean uncontended graph-node-traced profile
(dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling
artifact: decode runs as a replayed CUDA graph, and nsys without
--cuda-graph-trace=node collapses each replay into one opaque launch, so every
prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient")
was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16
difference method.

Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound. GDN
recurrent scan 553 us/tok (51%, linear in batch, dominant), NVFP4 expert GEMM
254 (23%), bf16 proj 73 (7%), elementwise 57, SSM conv 31. Gap reconciled:
vLLM-server 1177 -> vLLM true GPU-steady 1078 (chunked-prefill overlap inflates
its window ~8pt) -> llama GPU-steady 924 (= 86% of 1078) -> llama-server 718
(61%, the ~17pt S3-recoverable serving graph-reuse overhead). So vs vLLM's true
GPU-steady decode we are ~86%, not 56%. GDN is a shared BW floor where paged
leads (83% vs 79% of 273 GB/s peak; both 1.17-1.18x for 2x batch).

The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton
elementwise +10ms); both ggml fusions rejected: act-quant-into-MMQ -79.4%
(ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling),
norm+quant+silu infeasible via ggml_cuda_can_fuse. Added rejected levers:
Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from
hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block
GDN occupancy to -1.04% (wave-hidden).

Revised verdict: PREFILL genuinely capped (36-43%, not graph-replayed so real);
DECODE-SERVING near-parity ~86% of vLLM true GPU-steady (headline 56% was a
measurement/operating-point artifact). GB10-vs-datacenter framing kept.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-30 14:16:06 +00:00

Ettore Di Giacinto

6edbb56b06

docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED)

Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the
exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid
models. Captures the definitive same-session both-engine benchmark (prefill
S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for
both the MoE 35B-A3B and dense 27B models), the complete lever map (every
prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict
and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality,
GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the
shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on
GB10, not missing optimizations; the path to parity is datacenter Blackwell.

Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the
marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP,
DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are
marked estimated. Add a section-9 summary + link in the backend README.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-30 11:57:36 +00:00

3 Commits