mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Ettore Di Giacinto 718b31d063 kernel(P1): W4A16 dispatch seam (gated, byte-identical fallback to MMQ)

marlin-w4a16.{cuh,cu} + a gated hook in ggml_cuda_mul_mat (dense path), behind
GGML_CUDA_W4A16 + sm_120/121 + Q4_0/Q4_K + f32. Returns false -> MMQ, so the
default build is byte-identical. Verified on GB10: clean build, test-backend-ops
MUL_MAT 1103/1103, llama-bench pp512 unchanged (717.77 default / 718.26 flagged),
and GGML_CUDA_W4A16=1 reaches the seam ([w4a16] P1 warning) before falling back.
Source + apply steps under kernel/w4a16/ (DGX checkout is volatile). The frame the
P2 correctness kernel + P3 Marlin pipeline fill.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-20 21:46:38 +00:00

W4A16 seam — how to apply to a llama.cpp / ggml-cuda checkout

Applying the seam takes two new source files and two one-line edits to `ggml/src/ggml-cuda/ggml-cuda.cu`. No CMakeLists edit is needed: the existing `file(GLOB)` picks up the new `.cu` once the build is reconfigured with `cmake -S . -B build`.

Files (copy into `ggml/src/ggml-cuda/`)

marlin-w4a16.cuh
marlin-w4a16.cu

Edit `ggml/src/ggml-cuda/ggml-cuda.cu`

Include: add after the existing `#include "ggml-cuda/fp4-grouped-moe.cuh"` (sibling-header style):
```
#include "ggml-cuda/marlin-w4a16.cuh"
```
Dispatch hook: in `ggml_cuda_mul_mat(...)`, after `const int cc = ...` and immediately before the dense dispatch chain, i.e. before `if (!split && use_mul_mat_vec_f) {`:
```
if (!split && ggml_cuda_w4a16_mul_mat(ctx, src0, src1, dst)) { return; }
```
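The gating described in the commit message (opt-in via `GGML_CUDA_W4A16`, sm_120/121 only, Q4_0/Q4_K weights, f32 activations, and a `false` return that hands the call back to MMQ) can be sketched as standalone C++. This is an illustration only, not the contents of `marlin-w4a16.cu`: the types `WeightType`, `ActType`, `MulMatDesc` and the functions `w4a16_env_enabled`, `w4a16_gate`, `w4a16_mul_mat_p1` are hypothetical stand-ins for ggml's own definitions, the `sm` field uses a simplified encoding rather than ggml's internal `cc` value, and the way the environment variable is parsed is an assumption.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins; the real seam works on ggml tensors and ggml type enums.
enum class WeightType { Q4_0, Q4_K, Q8_0, F16 };
enum class ActType    { F32, F16 };

struct MulMatDesc {
    WeightType weight_type;  // quantization of src0 (weights)
    ActType    act_type;     // element type of src1 (activations)
    int        sm;           // SM version as written in "sm_121" (assumed encoding)
};

// Opt-in switch. Assumption: unset, empty, or "0" keeps the seam closed.
inline bool w4a16_env_enabled() {
    const char * v = std::getenv("GGML_CUDA_W4A16");
    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
}

// True only when every gating condition from the commit message is met.
inline bool w4a16_gate(const MulMatDesc & d, bool enabled) {
    if (!enabled)                                   return false;
    if (d.sm != 120 && d.sm != 121)                 return false;
    if (d.weight_type != WeightType::Q4_0 &&
        d.weight_type != WeightType::Q4_K)          return false;
    if (d.act_type != ActType::F32)                 return false;
    return true;
}

// P1 behaviour: the seam can be reached, but no kernel exists yet,
// so the function always reports "not handled" and the caller uses MMQ.
inline bool w4a16_mul_mat_p1(const MulMatDesc & d, bool enabled) {
    if (!w4a16_gate(d, enabled)) {
        return false;  // gate closed: caller continues down the normal dispatch chain
    }
    // Gate open: a P2/P3 kernel would launch here and return true.
    return false;      // P1: fall back to MMQ
}
```

Because every path returns `false` in P1, the hook line `if (!split && ggml_cuda_w4a16_mul_mat(...)) { return; }` never takes the early return, which is consistent with the default build staying byte-identical.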

Verify (P1 acceptance — met)

`cmake --build build --target test-backend-ops llama-bench` → builds clean.
`test-backend-ops test -o MUL_MAT -b CUDA0` → 1103/1103 (byte-identical default).
`llama-bench` dense Q4 pp512 → unchanged (~718, MMQ).
`GGML_CUDA_W4A16=1 llama-bench` → unchanged, plus stderr `[w4a16] ... P1 seam - using MMQ` (the seam is reached, gating passes on sm_121, and the call falls back to MMQ).

The kernel body (P2 correctness first, then the P3 Marlin pipeline) replaces the `TODO(P2/P3)` block in `marlin-w4a16.cu`. The function starts returning true only once parity holds.
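The source does not say how parity is measured beyond the `test-backend-ops` run above. As one possible reading, a P2 kernel's output could be compared against a reference result using a normalized mean squared error with a tolerance. The sketch below assumes that metric; `nmse`, `parity_holds`, and the `max_nmse` threshold are hypothetical names chosen for illustration, not part of the seam.

```cpp
#include <cstddef>
#include <vector>

// Normalized mean squared error: sum((c - r)^2) / sum(r^2).
// Precondition: candidate.size() == reference.size().
inline double nmse(const std::vector<float> & candidate,
                   const std::vector<float> & reference) {
    double err = 0.0;
    double ref = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double d = static_cast<double>(candidate[i]) - static_cast<double>(reference[i]);
        err += d * d;
        ref += static_cast<double>(reference[i]) * static_cast<double>(reference[i]);
    }
    // With an all-zero reference, fall back to the raw squared error.
    return ref > 0.0 ? err / ref : err;
}

// Hypothetical acceptance check: same shape and error within the chosen tolerance.
inline bool parity_holds(const std::vector<float> & candidate,
                         const std::vector<float> & reference,
                         double max_nmse) {
    return candidate.size() == reference.size() &&
           nmse(candidate, reference) <= max_nmse;
}
```

A check of this kind would sit in a test harness rather than in the dispatch path; the hook itself only needs to know whether the kernel handled the call.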