From 19742aee6414b00cc6b23682a2f11f9ed90b9039 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sat, 20 Jun 2026 03:59:27 +0000
Subject: [PATCH] bench(dense): FORCE_CUBLAS no-op for dense too (720.8 vs
 721.8) - every flag lever exhausted

Confirms that parity (dense+MoE, both phases) depends strictly on the FP4
tensor-core kernel; no config/flag shortcut remains.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 backend/cpp/llama-cpp/patches/BENCHMARKS.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/backend/cpp/llama-cpp/patches/BENCHMARKS.md b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
index e4cd79632..d4aaafc76 100644
--- a/backend/cpp/llama-cpp/patches/BENCHMARKS.md
+++ b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
@@ -73,6 +73,10 @@ weight-quant comparison; the difference is the compute kernel.
 3. **Scope decision (the reason for this benchmark): the Lever-3 kernel track must also deliver a NON-grouped
    block-scaled FP4 GEMM for dense**, not only the MoE grouped GEMM. The dense GEMM is the simpler of the two
    (a plain CUTLASS dense GEMM), so it's a good first kernel to land — and it benefits every dense model.
+   - **No cheap lever:** `GGML_CUDA_FORCE_CUBLAS` is a **no-op for dense too** (Q4_K pp512: 720.8 vs 721.8):
+     the dequant→cuBLAS-BF16 path either doesn't engage or is no faster than int8-MMQ on GB10. With ubatch
+     (saturates) and nwarps (static_assert) already ruled out for MoE, **every config/flag lever is now
+     exhausted** for both model classes. Closing the gap to parity requires the FP4 tensor-core kernel.
 4. **Aside:** full NVFP4 (W4A4) is currently unusable for dense on this vLLM/GB10 build — worth revisiting
    on a newer vLLM, and a point in llama.cpp's favor (its 4-bit dense path at least *runs*).