mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 01:16:58 -04:00
feat(w4a16): grow tile to BN128/16w (q4_K +17%, pp512 148->178)

P3b-2 for the Blackwell W4A16 Marlin GEMM. The q4_K dequant wall is partly
cross-N-block-redundant: every N-block re-decodes the same weight strip, so
halving the N-block count (BN 64->128) halves that redundant 6-bit superblock
decode. A BN sweep showed this only pays off when BN is spread across more
warps (16 warps, 8 m16n8 C-tiles/warp) rather than across more fragments per
warp: the FN=8 / FM=4 variants (16 C-tiles/warp) regressed to ~6.6 TFLOPS
under register pressure. Shipping tile is now WM=4,WN=4,FM=2,FN=4 -> BM=128,
BN=128, 16 warps.
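
The tile arithmetic above can be checked with a small sketch. `TileConfig` and the constant names are hypothetical helpers, not part of the kernel; the formulas are the ones in the kernel comment (BM = WM*FM*16, BN = WN*FN*8). The commit does not give the WM/WN values of the FN=8 variant, so `kWideFN` assumes WN=2 to keep BN at 128.

```cpp
#include <cassert>

// Hypothetical helper (not in the source): derives block-tile geometry from
// the four constants, using BM = WM*FM*16 and BN = WN*FN*8.
struct TileConfig {
    int WM, WN, FM, FN;
    constexpr int bm() const { return WM * FM * 16; }            // rows per block tile
    constexpr int bn() const { return WN * FN * 8; }             // columns per block tile
    constexpr int warps() const { return WM * WN; }              // warps per block
    constexpr int c_tiles_per_warp() const { return FM * FN; }   // m16n8 accumulator tiles
};

constexpr TileConfig kPrev{4, 2, 2, 4};     // previous commit: BN=64, 8 warps
constexpr TileConfig kShipped{4, 4, 2, 4};  // this commit: BN=128, 16 warps
constexpr TileConfig kWideFN{4, 2, 2, 8};   // ASSUMED shape of the FN=8 variant

static_assert(kShipped.bm() == 128 && kShipped.bn() == 128, "shipping tile is 128x128");
static_assert(kShipped.c_tiles_per_warp() == 8, "8 C-tiles per warp");
static_assert(kWideFN.bn() == 128 && kWideFN.c_tiles_per_warp() == 16,
              "same BN, but twice the accumulator tiles per warp");
```

Under these assumptions, both BN=128 shapes cover the same output tile. The shipped one doubles the warp count and keeps 8 accumulator tiles per warp, while the FN=8 shape keeps 8 warps and doubles the per-warp accumulators, which is where the commit reports the register-pressure regression.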

Thermally-bracketed cold A/B (q4_K n=512 / q4_0 n=512 via test-backend-ops
perf; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M):
  BN64/8w   (prev): 8.50 / 10.56 TFLOPS, measured 8.45/10.51 again (bracket)
  BN128/16w (this): 9.92 / 11.68 TFLOPS, pp512 177.6, pp2048 185.0
  -> +17% q4_K, +11% q4_0, +20% pp512 vs the previous commit; +49% pp512 vs
     the original block-tiled kernel (119).

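
The percentages follow directly from the figures quoted in this message. A minimal sketch of the arithmetic, where `pct_gain` is a hypothetical helper and not part of any benchmark tool:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: percentage gain of `now` over `before`.
inline double pct_gain(double now, double before) {
    return (now / before - 1.0) * 100.0;
}

// Figures taken from the commit message:
//   q4_K  TFLOPS: 8.50  -> 9.92
//   q4_0  TFLOPS: 10.56 -> 11.68
//   pp512 t/s:    148 (previous commit), 119 (original kernel) -> 177.6
inline long q4k_gain()        { return std::lround(pct_gain(9.92, 8.50)); }
inline long q40_gain()        { return std::lround(pct_gain(11.68, 10.56)); }
inline long pp512_gain_prev() { return std::lround(pct_gain(177.6, 148.0)); }
inline long pp512_gain_orig() { return std::lround(pct_gain(177.6, 119.0)); }
```

Rounded to whole percent these give 17, 11, 20 and 49, matching the summary line.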
Parity gate: test-backend-ops MUL_MAT = 1103/1103 with GGML_CUDA_W4A16=1 set
and with the flag unset (byte-identical when unset). Still ~4.7x under MMQ
(47 TFLOPS), so this does NOT beat MMQ; BN growth divides the redundant decode
but cannot remove the per-k-step decode itself, so the offline weight prepack
remains the next unlock for q4_K. Plan doc P3 table + bottleneck notes updated.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -239,7 +239,7 @@ bool ggml_cuda_w4a16_mul_mat(
     cudaStream_t stream = ctx.stream();
 
     // Block tile config: WM*WN warps compute BM(=WM*FM*16) x BN(=WN*FN*8).
-    constexpr int WM = 4, WN = 2, FM = 2, FN = 4;   // BM=128, BN=64, 8 warps
+    constexpr int WM = 4, WN = 4, FM = 2, FN = 4;   // BM=128, BN=128, 16 warps
     constexpr int BM = WM*FM*16;
     constexpr int BN = WN*FN*8;
     const dim3 grid((unsigned)((M + BM - 1) / BM), (unsigned)((N + BN - 1) / BN), 1);
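
The launch grid in this hunk is a ceiling division of M and N by the block tile, so doubling BN halves the number of N-blocks that each re-decode the same weight strip. A host-only sketch of that effect follows; `Grid`, `ceil_div` and `make_grid` are hypothetical stand-ins (`Grid` replaces CUDA's `dim3` so the snippet builds without the CUDA toolkit), and the integer type of M and N is assumed.

```cpp
#include <cassert>

// Ceiling division, same form as (M + BM - 1) / BM in the hunk.
constexpr unsigned ceil_div(long long n, int tile) {
    return (unsigned)((n + tile - 1) / tile);
}

// Stand-in for CUDA's dim3 (assumption: only x/y/z matter here).
struct Grid { unsigned x, y, z; };

// Mirrors the grid expression: x walks M in BM steps, y walks N in BN steps.
constexpr Grid make_grid(long long M, long long N, int BM, int BN) {
    return Grid{ceil_div(M, BM), ceil_div(N, BN), 1u};
}

// M = 4096 is an arbitrary illustrative value; N = 512 matches the n=512 runs.
constexpr Grid kGridPrev = make_grid(4096, 512, 128, 64);   // BN=64
constexpr Grid kGridNew  = make_grid(4096, 512, 128, 128);  // BN=128

static_assert(kGridPrev.y == 8 && kGridNew.y == 4, "N-block count halves");
static_assert(kGridPrev.x == kGridNew.x, "M-block count is unchanged");
```

For n=512 the N-block count drops from 8 to 4, which is the halving of redundant decode the message describes. The per-k-step decode inside each block is untouched by this change.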