Commit Graph

4 Commits

Author SHA1 Message Date
Ettore Di Giacinto
9973fa995a feat(w4a16): P3 step 1 - block-tiled multi-warp Marlin GEMM (GB10)
Replace the P2 1-warp-per-16x8 W4A16 kernel with a block-tiled multi-warp
kernel: blockDim=(32, WM*WN) so threadIdx.x is the warp lane (required by
mma.cuh get_i/get_j) and threadIdx.y is the warp index. WM*WN warps compute a
BM(=WM*FM*16) x BN(=WN*FN*8) output tile, each warp owning an FM x FN grid of
m16n8k16 BF16 mma fragments accumulated in F32. The BM x 16 dequantized Q4
weight strip is staged once per k-step in a small (~4 KB) shared buffer and
reused across the block's whole BN span. Shipping config WM=2,WN=2,FM=2,FN=4.

The P2 launch put all threads on threadIdx.x; with >1 warp that drove the mma
tile get_j past the shared bound (out-of-bounds shared read, caught by
compute-sanitizer). The new (32, nwarps) layout matches mmf.cu and fixes it.

Parity gate holds 1103/1103 (test-backend-ops MUL_MAT CUDA0), flag set and
unset (byte-identical when GGML_CUDA_W4A16 is unset; the seam returns false).

Perf (q4_K m=4096 k=14336 n=512): ~2 TFLOPS (P2) -> ~7-9 TFLOPS (thermal
dependent); llama-bench Qwen3-32B-Q4_K_M pp512 31.75 -> ~118-142 t/s. Still
below the MMQ baseline (47 TFLOPS / 718 t/s): a tile sweep stayed flat and
q4_0 vs q4_K differ by only ~12%, so dequant compute is not the limiter - the
shared-load / mma-feed is. A naive double-buffered cp.async pipeline (32 KB
shared) regressed via occupancy collapse and an ldmatrix swap was neutral
(unswizzled layout bank-conflicts), both reverted. The path to >=150 TFLOPS is
the full Marlin machinery (XOR-swizzled shared layout + offline weight reshuffle
+ tuned async pipeline + Stream-K), deferred to P3 step 4. See
W4A16_MARLIN_KERNEL_PLAN.md for the per-step table and dead-end notes.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 23:36:58 +00:00
Ettore Di Giacinto
4de0c3b1b2 feat(cuda): W4A16 P2 correctness-first BF16 GEMM kernel
Replace the P1 dispatch-seam TODO in marlin-w4a16.cu with a real W4A16
GEMM for consumer Blackwell (sm_120/121). In-kernel dequant of Q4 weights
to BF16, mma.sync m16n8k16 f32.bf16.bf16.f32 tensor-core multiply against
BF16-converted f32 activations, f32 accumulate and write, reusing ggml's
mma.cuh tile abstractions.

Handles the contiguous 2D GEMM prefill path for Q4_0 and Q4_K (f32
activations, ne2==ne3==1); batched, broadcast, permuted, non-contiguous
and f16-activation cases return false and fall back to MMQ so the gate
stays green. M/N boundaries are zero-padded in-kernel.

Parity gate (GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT on GB10):
1103/1103 passed; default flag-off build stays byte-identical 1103/1103.
Model sanity: Qwen3-32B-Q4_K_M llama-bench pp512 31.75 t/s (slow is
expected for P2 - the naive single-warp kernel is the correctness
checkpoint; P3 adds the cp.async pipeline and weight reshuffle).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 22:09:12 +00:00
Ettore Di Giacinto
9a71e81fc4 kernel: written subagent dispatch briefs for P3/P4/P5
Same strategy as P2: one fresh Opus-4.8 subagent per phase, each handed a
complete zero-context brief, dispatched sequentially as each predecessor lands
(P3 pipeline needs P2's correct kernel, P4 tune needs P3, P5 enable needs P4).
Shared DGX/harness/commit boilerplate factored into a COMMON section; each phase
brief carries its goal, incremental steps, acceptance gate, and a splice note for
the prior phase's actual deliverable.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 22:01:18 +00:00
Ettore Di Giacinto
718b31d063 kernel(P1): W4A16 dispatch seam (gated, byte-identical fallback to MMQ)
marlin-w4a16.{cuh,cu} + a gated hook in ggml_cuda_mul_mat (dense path), behind
GGML_CUDA_W4A16 + sm_120/121 + Q4_0/Q4_K + f32. Returns false -> MMQ, so the
default build is byte-identical. Verified on GB10: clean build, test-backend-ops
MUL_MAT 1103/1103, llama-bench pp512 unchanged (717.77 default / 718.26 flagged),
and GGML_CUDA_W4A16=1 reaches the seam ([w4a16] P1 warning) before falling back.
Source + apply steps under kernel/w4a16/ (DGX checkout is volatile). The frame the
P2 correctness kernel + P3 Marlin pipeline fill.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 21:46:38 +00:00