Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the
exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid
models. Captures the definitive same-session both-engine benchmark (prefill
S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for
both the MoE 35B-A3B and dense 27B models), the complete lever map (every
prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict
and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality,
GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the
shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on
GB10, not missing optimizations; the path to parity is datacenter Blackwell.
Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the
marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP,
DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are
marked estimated. Add a section-9 summary + link in the backend README.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>