Files
LocalAI/backend/cpp/llama-cpp/paged/kernel/w4a16
Ettore Di Giacinto 4de0c3b1b2 feat(cuda): W4A16 P2 correctness-first BF16 GEMM kernel
Replace the P1 dispatch-seam TODO in marlin-w4a16.cu with a real W4A16
GEMM for consumer Blackwell (sm_120/121). In-kernel dequant of Q4 weights
to BF16, mma.sync m16n8k16 f32.bf16.bf16.f32 tensor-core multiply against
BF16-converted f32 activations, f32 accumulate and write, reusing ggml's
mma.cuh tile abstractions.

Handles the contiguous 2D GEMM prefill path for Q4_0 and Q4_K (f32
activations, ne2==ne3==1); batched, broadcast, permuted, non-contiguous
and f16-activation cases return false and fall back to MMQ so the gate
stays green. M/N boundaries are zero-padded in-kernel.

Parity gate (GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT on GB10):
1103/1103 passed; default flag-off build stays byte-identical 1103/1103.
Model sanity: Qwen3-32B-Q4_K_M llama-bench pp512 31.75 t/s (slow is
expected for P2 - the naive single-warp kernel is the correctness
checkpoint; P3 adds the cp.async pipeline and weight reshuffle).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 22:09:12 +00:00
..