LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto 4de0c3b1b2 feat(cuda): W4A16 P2 correctness-first BF16 GEMM kernel

Replace the P1 dispatch-seam TODO in marlin-w4a16.cu with a real W4A16
GEMM for consumer Blackwell (sm_120/121). In-kernel dequant of Q4 weights
to BF16, mma.sync m16n8k16 f32.bf16.bf16.f32 tensor-core multiply against
BF16-converted f32 activations, f32 accumulate and write, reusing ggml's
mma.cuh tile abstractions.
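
The dequant step above can be illustrated on the host. This is a minimal sketch, not the kernel in marlin-w4a16.cu: it assumes the standard ggml Q4_0 layout (32 weights per block, packed two per byte, decoded as (q - 8) * d, low nibbles first). `BlockQ4_0Sketch`, `f32_to_bf16`, `bf16_to_f32` and `dequant_q4_0_to_bf16` are hypothetical names, and the scale is held as a float here although ggml stores it as fp16.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for ggml's block_q4_0 (assumption: the real struct
// stores the scale d as fp16; a float is used here to keep the sketch short).
struct BlockQ4_0Sketch {
    float   d;       // per-block scale
    uint8_t qs[16];  // 32 4-bit quants, two per byte
};

// f32 -> bf16 with round-to-nearest-even. NaN handling is omitted.
inline uint16_t f32_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<uint16_t>(u >> 16);
}

// bf16 -> f32 is exact: the 16 bits become the upper half of the float.
inline float bf16_to_f32(uint16_t h) {
    uint32_t u = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Decode one block: low nibbles give elements 0..15, high nibbles 16..31.
inline std::array<uint16_t, 32> dequant_q4_0_to_bf16(const BlockQ4_0Sketch& b) {
    std::array<uint16_t, 32> out{};
    for (int j = 0; j < 16; ++j) {
        const int lo = (b.qs[j] & 0x0F) - 8;
        const int hi = (b.qs[j] >> 4) - 8;
        out[j]      = f32_to_bf16(static_cast<float>(lo) * b.d);
        out[j + 16] = f32_to_bf16(static_cast<float>(hi) * b.d);
    }
    return out;
}
```

In the real kernel this decode would run per thread on register fragments before the mma.sync call; the host version only shows the arithmetic.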

Handles the contiguous 2D GEMM prefill path for Q4_0 and Q4_K (f32
activations, ne2==ne3==1); batched, broadcast, permuted, non-contiguous
and f16-activation cases return false and fall back to MMQ so the gate
stays green. M/N boundaries are zero-padded in-kernel.
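
The gating described above amounts to a predicate that accepts only the supported case and rejects everything else so the caller falls back to MMQ. The sketch below is an assumption about its shape, not the code in marlin-w4a16.cu: `MulMatCaseSketch`, its fields and `w4a16_can_handle` are hypothetical names, and the real dispatch inspects ggml tensors rather than a flat struct.

```cpp
#include <cassert>
#include <cstdint>

enum class WeightType { Q4_0, Q4_K, Other };
enum class ActType { F32, F16 };

// Hypothetical flattened description of one MUL_MAT call.
struct MulMatCaseSketch {
    WeightType wtype;
    ActType    atype;
    int64_t    ne2;         // batch dimensions; both must be 1 for plain 2D GEMM
    int64_t    ne3;
    bool       contiguous;  // all operands contiguous in memory
    bool       permuted;    // any operand permuted
    bool       broadcast;   // weights broadcast across the batch
};

// Returns true only for the path the P2 commit message says it handles:
// contiguous 2D GEMM, Q4_0 or Q4_K weights, f32 activations.
// Returning false means "not handled here, use MMQ".
inline bool w4a16_can_handle(const MulMatCaseSketch& c) {
    const bool weight_ok = c.wtype == WeightType::Q4_0 || c.wtype == WeightType::Q4_K;
    const bool act_ok    = c.atype == ActType::F32;
    const bool is_2d     = c.ne2 == 1 && c.ne3 == 1;
    return weight_ok && act_ok && is_2d && c.contiguous && !c.permuted && !c.broadcast;
}
```

Keeping the check as a single conservative predicate means any case not explicitly listed takes the existing MMQ path unchanged.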

Parity gate (GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT on GB10):
1103/1103 passed; default flag-off build stays byte-identical 1103/1103.
Model sanity: Qwen3-32B-Q4_K_M llama-bench pp512 31.75 t/s. This is slow,
as expected for P2: the naive single-warp kernel is the correctness
checkpoint, and P3 adds the cp.async pipeline and weight reshuffle.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-20 22:09:12 +00:00

HOOK.md

kernel(P1): W4A16 dispatch seam (gated, byte-identical fallback to MMQ)

2026-06-20 21:46:38 +00:00

marlin-w4a16.cu

feat(cuda): W4A16 P2 correctness-first BF16 GEMM kernel

2026-06-20 22:09:12 +00:00

marlin-w4a16.cuh

kernel(P1): W4A16 dispatch seam (gated, byte-identical fallback to MMQ)

2026-06-20 21:46:38 +00:00

SUBAGENT_BRIEFS.md

kernel: written subagent dispatch briefs for P3/P4/P5

2026-06-20 22:01:18 +00:00