# W4A16 seam — how to apply to a llama.cpp / ggml-cuda checkout

Two source files + two one-line edits to `ggml/src/ggml-cuda/ggml-cuda.cu`. The build picks up the
new `.cu` via the existing `file(GLOB)` after a `cmake -S . -B build` reconfigure (no CMakeLists edit).

## Files (copy into `ggml/src/ggml-cuda/`)
- `marlin-w4a16.cuh`
- `marlin-w4a16.cu`

## Edit `ggml/src/ggml-cuda/ggml-cuda.cu`

1. **Include** — after the existing `#include "ggml-cuda/fp4-grouped-moe.cuh"` (sibling-header style):
   ```cpp
   #include "ggml-cuda/marlin-w4a16.cuh"
   ```

2. **Dispatch hook** — immediately before the dense dispatch chain, i.e. before
   `if (!split && use_mul_mat_vec_f) {` in `ggml_cuda_mul_mat(...)` (after `const int cc = ...`):
   ```cpp
   if (!split && ggml_cuda_w4a16_mul_mat(ctx, src0, src1, dst)) { return; }
   ```

## Verify (P1 acceptance — met)
- `cmake --build build --target test-backend-ops llama-bench` → builds clean.
- `test-backend-ops test -o MUL_MAT -b CUDA0` → **1103/1103** (byte-identical default).
- `llama-bench` dense Q4 pp512 → unchanged (~718, MMQ).
- `GGML_CUDA_W4A16=1 llama-bench` → unchanged + stderr `[w4a16] ... P1 seam - using MMQ` (seam reached,
  gating passes on sm_121, falls back).

The kernel body (P2 correctness → P3 Marlin pipeline) replaces the `TODO(P2/P3)` block in `marlin-w4a16.cu`
and returns `true` once parity holds.