Files
LocalAI/backend/cpp/llama-cpp
Ettore Di Giacinto 9973fa995a feat(w4a16): P3 step 1 - block-tiled multi-warp Marlin GEMM (GB10)
Replace the P2 1-warp-per-16x8 W4A16 kernel with a block-tiled multi-warp
kernel: blockDim=(32, WM*WN) so threadIdx.x is the warp lane (required by
mma.cuh get_i/get_j) and threadIdx.y is the warp index. WM*WN warps compute a
BM(=WM*FM*16) x BN(=WN*FN*8) output tile, each warp owning an FM x FN grid of
m16n8k16 BF16 mma fragments accumulated in F32. The BM x 16 dequantized Q4
weight strip is staged once per k-step in a small (~4 KB) shared buffer and
reused across the block's whole BN span. Shipping config WM=2,WN=2,FM=2,FN=4.

The P2 launch put all threads on threadIdx.x; with >1 warp that drove the mma
tile get_j past the shared bound (out-of-bounds shared read, caught by
compute-sanitizer). The new (32, nwarps) layout matches mmf.cu and fixes it.

Parity gate holds 1103/1103 (test-backend-ops MUL_MAT CUDA0), flag set and
unset (byte-identical when GGML_CUDA_W4A16 is unset; the seam returns false).

Perf (q4_K m=4096 k=14336 n=512): ~2 TFLOPS (P2) -> ~7-9 TFLOPS (thermal
dependent); llama-bench Qwen3-32B-Q4_K_M pp512 31.75 -> ~118-142 t/s. Still
below the MMQ baseline (47 TFLOPS / 718 t/s): a tile sweep stayed flat and
q4_0 vs q4_K differ by only ~12%, so dequant compute is not the limiter - the
shared-load / mma-feed is. A naive double-buffered cp.async pipeline (32 KB
shared) regressed via occupancy collapse and an ldmatrix swap was neutral
(unswizzled layout bank-conflicts), both reverted. The path to >=150 TFLOPS is
the full Marlin machinery (XOR-swizzled shared layout + offline weight reshuffle
+ tuned async pipeline + Stream-K), deferred to P3 step 4. See
W4A16_MARLIN_KERNEL_PLAN.md for the per-step table and dead-end notes.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 23:36:58 +00:00
..
2026-04-12 08:51:30 +02:00