mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
Replace the P2 1-warp-per-16x8 W4A16 kernel with a block-tiled multi-warp kernel: blockDim=(32, WM*WN) so threadIdx.x is the warp lane (required by mma.cuh get_i/get_j) and threadIdx.y is the warp index. WM*WN warps compute a BM(=WM*FM*16) x BN(=WN*FN*8) output tile, each warp owning an FM x FN grid of m16n8k16 BF16 mma fragments accumulated in F32. The BM x 16 dequantized Q4 weight strip is staged once per k-step in a small (~4 KB) shared buffer and reused across the block's whole BN span. Shipping config WM=2,WN=2,FM=2,FN=4. The P2 launch put all threads on threadIdx.x; with >1 warp that drove the mma tile get_j past the shared bound (out-of-bounds shared read, caught by compute-sanitizer). The new (32, nwarps) layout matches mmf.cu and fixes it. Parity gate holds 1103/1103 (test-backend-ops MUL_MAT CUDA0), flag set and unset (byte-identical when GGML_CUDA_W4A16 is unset; the seam returns false). Perf (q4_K m=4096 k=14336 n=512): ~2 TFLOPS (P2) -> ~7-9 TFLOPS (thermal dependent); llama-bench Qwen3-32B-Q4_K_M pp512 31.75 -> ~118-142 t/s. Still below the MMQ baseline (47 TFLOPS / 718 t/s): a tile sweep stayed flat and q4_0 vs q4_K differ by only ~12%, so dequant compute is not the limiter - the shared-load / mma-feed is. A naive double-buffered cp.async pipeline (32 KB shared) regressed via occupancy collapse and an ldmatrix swap was neutral (unswizzled layout bank-conflicts), both reverted. The path to >=150 TFLOPS is the full Marlin machinery (XOR-swizzled shared layout + offline weight reshuffle + tuned async pipeline + Stream-K), deferred to P3 step 4. See W4A16_MARLIN_KERNEL_PLAN.md for the per-step table and dead-end notes. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>