mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
Replace the P1 dispatch-seam TODO in marlin-w4a16.cu with a real W4A16 GEMM for consumer Blackwell (sm_120/121). In-kernel dequant of Q4 weights to BF16, mma.sync m16n8k16 f32.bf16.bf16.f32 tensor-core multiply against BF16-converted f32 activations, f32 accumulate and write, reusing ggml's mma.cuh tile abstractions. Handles the contiguous 2D GEMM prefill path for Q4_0 and Q4_K (f32 activations, ne2==ne3==1); batched, broadcast, permuted, non-contiguous and f16-activation cases return false and fall back to MMQ so the gate stays green. M/N boundaries are zero-padded in-kernel. Parity gate (GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT on GB10): 1103/1103 passed; default flag-off build stays byte-identical 1103/1103. Model sanity: Qwen3-32B-Q4_K_M llama-bench pp512 31.75 t/s (slow is expected for P2 - the naive single-warp kernel is the correctness checkpoint; P3 adds the cp.async pipeline and weight reshuffle). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>