mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-25 09:09:07 -04:00
marlin-w4a16.{cuh,cu} + a gated hook in ggml_cuda_mul_mat (dense path), behind
GGML_CUDA_W4A16 + sm_120/121 + Q4_0/Q4_K + f32. Returns false -> MMQ, so the
default build is byte-identical. Verified on GB10: clean build, test-backend-ops
MUL_MAT 1103/1103, llama-bench pp512 unchanged (717.77 default / 718.26 flagged),
and GGML_CUDA_W4A16=1 reaches the seam ([w4a16] P1 warning) before falling back.
Source + apply steps are under kernel/w4a16/ (the DGX checkout is volatile). This is
the frame that the P2 correctness kernel + P3 Marlin pipeline will fill.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
15 lines
679 B
Plaintext
#pragma once

#include "common.cuh"

// W4A16 Marlin-style BF16 GEMM for NVIDIA Blackwell consumer GPUs (sm_120/121).
// Dense (non-MoE) 4-bit-weight matmul run on BF16 tensor cores, the path that
// reaches the GB10 BF16 ceiling where MMQ (int8, Ampere-tuned) and cuBLAS (sm_80
// fallback) both plateau at ~22% of it. Returns true if it handled the op; false
// to fall back to MMQ. Gated behind GGML_CUDA_W4A16 until correct + faster.
bool ggml_cuda_w4a16_mul_mat(
        ggml_backend_cuda_context & ctx,
        const ggml_tensor * src0,   // 4-bit weights (Q4_0/Q4_K)
        const ggml_tensor * src1,   // F32 activations
        ggml_tensor * dst);         // F32 output
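The gate described in the commit message (flag + sm_120/121 + Q4_0/Q4_K + f32, returning false to fall back to MMQ) can be sketched on the host side roughly as follows. This is a minimal illustration, not the contents of marlin-w4a16.cu: `TensorType`, `MatMulDesc`, `flag_is_set`, `w4a16_eligible` and `w4a16_mul_mat_stub` are hypothetical names that stand in for the real ggml types, the flag is assumed to be a runtime environment variable whose value "1" enables the path, and the warning text is made up for the example.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical stand-in for the ggml tensor type enum (illustration only).
enum class TensorType { Q4_0, Q4_K, Q8_0, F16, F32 };

// Hypothetical summary of the properties the gate inspects.
struct MatMulDesc {
    TensorType src0_type;  // weight type
    TensorType src1_type;  // activation type
    TensorType dst_type;   // output type
    int        sm_arch;    // e.g. 120 for sm_120, 121 for sm_121
};

// Assumption: the flag is read from the environment and "1" enables the path.
static bool flag_is_set(const char * value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// True only when every condition named in the commit message holds.
static bool w4a16_eligible(const MatMulDesc & op, const char * flag_value) {
    if (!flag_is_set(flag_value)) {
        return false;  // flag off: default path is untouched
    }
    if (op.sm_arch != 120 && op.sm_arch != 121) {
        return false;  // only sm_120/121
    }
    if (op.src0_type != TensorType::Q4_0 && op.src0_type != TensorType::Q4_K) {
        return false;  // only 4-bit weights
    }
    if (op.src1_type != TensorType::F32 || op.dst_type != TensorType::F32) {
        return false;  // only f32 activations and output
    }
    return true;
}

// P1-style seam: an eligible op reaches the hook, warns, and still returns
// false so the caller falls back to MMQ. The message text is illustrative.
static bool w4a16_mul_mat_stub(const MatMulDesc & op, const char * flag_value) {
    if (!w4a16_eligible(op, flag_value)) {
        return false;
    }
    std::fprintf(stderr, "[w4a16] P1: no kernel yet, falling back to MMQ\n");
    return false;
}
```

A caller would pass something like `std::getenv("GGML_CUDA_W4A16")` as `flag_value` and take the MMQ path whenever the stub returns false. Keeping every rejection as an early `return false` means a later kernel only has to replace the final return.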