LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Commit Graph

Author	SHA1	Message	Date

Ettore Di Giacinto

4de0c3b1b2

feat(cuda): W4A16 P2 correctness-first BF16 GEMM kernel

Replace the P1 dispatch-seam TODO in marlin-w4a16.cu with a real W4A16
GEMM for consumer Blackwell (sm_120/121). In-kernel dequant of Q4 weights
to BF16, mma.sync m16n8k16 f32.bf16.bf16.f32 tensor-core multiply against
BF16-converted f32 activations, f32 accumulate and write, reusing ggml's
mma.cuh tile abstractions.
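
The dequant step described above can be sketched on the host as a reference. This is a minimal illustration, not the kernel from this commit: it assumes the standard ggml Q4_0 layout (32 weights per block, one scale, 16 bytes of packed nibbles, value = (nibble - 8) * scale). The struct name `BlockQ4_0Ref` and the helper names are hypothetical, and the scale is held as a float here although ggml stores it as fp16.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

constexpr int QK4_0 = 32;  // weights per Q4_0 block

// Hypothetical host-side mirror of a Q4_0 block (scale kept as float for simplicity).
struct BlockQ4_0Ref {
    float   d;              // per-block scale
    uint8_t qs[QK4_0 / 2];  // two 4-bit weights per byte
};

// f32 -> bf16 with round-to-nearest-even; NaN inputs are not handled in this sketch.
inline uint16_t f32_to_bf16(float x) {
    assert(!std::isnan(x));
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>(bits >> 16);
}

// bf16 -> f32 is exact: the 16 bits become the upper half of the f32 word.
inline float bf16_to_f32(uint16_t h) {
    const uint32_t bits = static_cast<uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Dequantize one block: low nibbles fill the first 16 outputs, high nibbles the last 16.
inline std::array<uint16_t, QK4_0> dequant_q4_0_to_bf16(const BlockQ4_0Ref &b) {
    std::array<uint16_t, QK4_0> out{};
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = (b.qs[j] & 0x0F) - 8;
        const int hi = (b.qs[j] >> 4) - 8;
        out[j]             = f32_to_bf16(lo * b.d);
        out[j + QK4_0 / 2] = f32_to_bf16(hi * b.d);
    }
    return out;
}
```

A host reference like this is mainly useful for checking a device kernel's dequantized values element by element before the tensor-core multiply is involved.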

Handles the contiguous 2D GEMM prefill path for Q4_0 and Q4_K (f32
activations, ne2==ne3==1); batched, broadcast, permuted, non-contiguous
and f16-activation cases return false and fall back to MMQ so the gate
stays green. M/N boundaries are zero-padded in-kernel.
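
The boundary handling mentioned above can be illustrated with a CPU reference GEMM that pads partial tiles with zeros on load and masks the write-back. This is only a sketch under assumptions: the 16x8x16 tile shape is borrowed from the m16n8k16 instruction named in the message, the row-major layouts and the function names `load_or_zero` and `gemm_padded` are hypothetical, and K is assumed to be a multiple of the tile depth.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE_M = 16, TILE_N = 8, TILE_K = 16;

// Read element (r, c) of a row-major rows x cols matrix, or 0 when out of range.
inline float load_or_zero(const std::vector<float> &m, int rows, int cols, int r, int c) {
    return (r < rows && c < cols) ? m[static_cast<std::size_t>(r) * cols + c] : 0.0f;
}

// C[M x N] = A[M x K] * B[K x N], computed tile by tile with zero-padded edges.
inline std::vector<float> gemm_padded(const std::vector<float> &A,
                                      const std::vector<float> &B,
                                      int M, int N, int K) {
    assert(K % TILE_K == 0);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int m0 = 0; m0 < M; m0 += TILE_M) {
        for (int n0 = 0; n0 < N; n0 += TILE_N) {
            float acc[TILE_M][TILE_N] = {};
            for (int k0 = 0; k0 < K; k0 += TILE_K)
                for (int i = 0; i < TILE_M; ++i)
                    for (int j = 0; j < TILE_N; ++j)
                        for (int k = 0; k < TILE_K; ++k)
                            acc[i][j] += load_or_zero(A, M, K, m0 + i, k0 + k) *
                                         load_or_zero(B, K, N, k0 + k, n0 + j);
            // Padded rows and columns are computed but never written back.
            for (int i = 0; i < TILE_M; ++i)
                for (int j = 0; j < TILE_N; ++j)
                    if (m0 + i < M && n0 + j < N)
                        C[static_cast<std::size_t>(m0 + i) * N + (n0 + j)] = acc[i][j];
        }
    }
    return C;
}
```

Because the padded elements are zero, they contribute nothing to the accumulators, so the in-range outputs match an unpadded GEMM.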

Parity gate (GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT on GB10):
1103/1103 passed; default flag-off build stays byte-identical 1103/1103.
Model sanity: Qwen3-32B-Q4_K_M llama-bench pp512 31.75 t/s (low throughput
is expected for P2: the naive single-warp kernel is the correctness
checkpoint; P3 adds the cp.async pipeline and weight reshuffle).
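
A parity gate of this kind compares the new kernel's output against a reference result within a tolerance. The sketch below shows one common way to express that check, a normalized mean squared error. Whether the harness uses exactly this metric is an assumption here, and the names `nmse` and `parity_ok` and any threshold passed in are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error: sum((out - ref)^2) / sum(ref^2).
// Falls back to the raw squared error when the reference is all zeros.
inline double nmse(const std::vector<float> &ref, const std::vector<float> &out) {
    assert(ref.size() == out.size());
    double err = 0.0, norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(out[i]) - ref[i];
        err  += d * d;
        norm += static_cast<double>(ref[i]) * ref[i];
    }
    return norm > 0.0 ? err / norm : err;
}

// Hypothetical gate: pass when the error stays at or below a caller-chosen threshold.
inline bool parity_ok(const std::vector<float> &ref, const std::vector<float> &out,
                      double max_nmse) {
    return nmse(ref, out) <= max_nmse;
}
```

A relative metric is the usual choice for BF16 paths, since the reduced mantissa makes bit-exact agreement with an f32 reference unrealistic.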

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-20 22:09:12 +00:00

Ettore Di Giacinto

9a71e81fc4

kernel: write subagent dispatch briefs for P3/P4/P5

Same strategy as P2: one fresh Opus-4.8 subagent per phase, each handed a
complete zero-context brief, dispatched sequentially as each predecessor lands
(P3 pipeline needs P2's correct kernel, P4 tune needs P3, P5 enable needs P4).
Shared DGX/harness/commit boilerplate factored into a COMMON section; each phase
brief carries its goal, incremental steps, acceptance gate, and a splice note for
the prior phase's actual deliverable.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-20 22:01:18 +00:00

Ettore Di Giacinto

718b31d063

kernel(P1): W4A16 dispatch seam (gated, byte-identical fallback to MMQ)

marlin-w4a16.{cuh,cu} + a gated hook in ggml_cuda_mul_mat (dense path), behind
GGML_CUDA_W4A16 + sm_120/121 + Q4_0/Q4_K + f32. Returns false -> MMQ, so the
default build is byte-identical. Verified on GB10: clean build, test-backend-ops
MUL_MAT 1103/1103, llama-bench pp512 unchanged (717.77 default / 718.26 flagged),
and GGML_CUDA_W4A16=1 reaches the seam ([w4a16] P1 warning) before falling back.
Source + apply steps under kernel/w4a16/ (DGX checkout is volatile). This is the frame
that the P2 correctness kernel and the P3 Marlin pipeline fill in.
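
The gating logic of such a seam can be sketched as a pure function. This is an illustration under assumptions, not the code in marlin-w4a16.cu: the types `MulMatDesc`, `WeightType`, `ActType` and `Path` and the functions `w4a16_gate` and `dispatch` are hypothetical, and the flag is passed in as a bool because the message does not say how GGML_CUDA_W4A16 is read.

```cpp
#include <cassert>

enum class WeightType { Q4_0, Q4_K, Q8_0, F16 };
enum class ActType { F32, F16 };
enum class Path { MMQ, W4A16 };

// Hypothetical summary of the properties the gate inspects.
struct MulMatDesc {
    WeightType wtype;
    ActType    atype;
    int        compute_capability;  // e.g. 120 for sm_120
};

// True only when every gate condition listed in the message holds.
inline bool w4a16_gate(bool flag_enabled, const MulMatDesc &d) {
    assert(d.compute_capability > 0);  // sanity check on the descriptor
    if (!flag_enabled) return false;
    if (d.compute_capability != 120 && d.compute_capability != 121) return false;
    if (d.wtype != WeightType::Q4_0 && d.wtype != WeightType::Q4_K) return false;
    if (d.atype != ActType::F32) return false;
    return true;
}

// The seam: take the new path only if the gate passes and a kernel exists.
// In a P1-style state kernel_implemented is false, so every call falls back to MMQ.
inline Path dispatch(bool flag_enabled, const MulMatDesc &d, bool kernel_implemented) {
    return (w4a16_gate(flag_enabled, d) && kernel_implemented) ? Path::W4A16 : Path::MMQ;
}
```

Keeping the fallback as the default branch is what lets a flag-off build behave identically to the original: the new path is reachable only when every condition is met.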

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-20 21:46:38 +00:00

3 Commits