kernel(P0): record precise op-level baseline (q4_K n=512 = 47 TFLOPS, ~22% of ceiling)

test-backend-ops perf MUL_MAT m=4096 k=14336: q4_K prefill (n=512) = 47.1 TFLOPS,
q4_0 = 49.5; decode (n=1) = 761/817 GFLOPS (memory-bound). The prefill GEMM target
is 47 -> ~213 TFLOPS (~4.5x). Cleaner per-shape target than end-to-end for kernel
iteration.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-20 21:33:50 +00:00
parent dae2679c3b
commit d291e15114

View File

@@ -29,8 +29,17 @@ and **Stream-K** partitioning. Sources: IST-DASLab/marlin, arXiv 2408.11743, vLL
- **Perf baseline:** `llama-bench` dense Q4_K prefill = **~750 t/s (pp512 718 / pp2048 750) ≈ 46 TFLOP/s ≈ 21%
of the 213 BF16 ceiling**. The kernel must beat this toward ~3,300. (`test-backend-ops perf -o MUL_MAT` gives
per-shape GFLOPS too; build it once with the harness.)
- **Op-level baseline (the canonical kernel target), `test-backend-ops perf -o MUL_MAT`, m=4096 k=14336 (FFN):**
| n (tokens) | q4_0 | q4_K | regime |
|---|---|---|---|
| 1 | 817 GFLOPS | 761 GFLOPS | decode / mat-vec (memory-bound) |
| 8 | 5.77 TFLOPS | 4.11 TFLOPS | small-batch |
| **512** | **49.5 TFLOPS** | **47.1 TFLOPS** | **prefill GEMM — ~22% of the 213 ceiling** |
So the prefill GEMM target: lift q4_K n=512 from **47 → toward ~213 TFLOPS** (~4.5×). This per-shape number
is cleaner than end-to-end for kernel iteration.
- **Harness script:** `~/p0harness.sh` on the DGX (build test-backend-ops + correctness + perf). Reusable each
phase: `test-backend-ops test -o MUL_MAT -b CUDA0` must stay 1103/1103; `llama-bench` must climb from 750.
phase: `test-backend-ops test -o MUL_MAT -b CUDA0` must stay 1103/1103; the q4_K n=512 perf must climb from 47.
- test-backend-ops needed `-DLLAMA_BUILD_TESTS=ON`; now built in `~/llama.cpp-pr24423/build`.
### P1 — Dispatch seam (no behavior change)