From ef578866c88f0b04377b4efa0a170ec6bb7b2aa5 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 10:59:56 +0000 Subject: [PATCH] docs(paged): scope W4A16 direct activation experiment Assisted-by: Codex:gpt-5 --- .../docs/PARITY_HANDOFF.md | 9 + ...6-07-01-w4a16-direct-activation-phase61.md | 545 ++++++++++++++++++ 2 files changed, 554 insertions(+) create mode 100644 docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 2799b3c27..99e7cafef 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -733,6 +733,15 @@ work needs a larger redesign that improves the grouped kernel body and removes or fuses sorted activation movement. Near-term GB10 parity work should return to broader prefill/GDN/MoE design or hardware-pivot benchmarking. +Phase61 is scoped as that larger W4A16 kill-gate, not as a committed code +change: `docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md`. +It proposes a default-off `LLAMA_W4A16_DIRECT_A=1` experiment that consumes the +original activation tensor plus the existing `ids_to_sorted` map directly, +removing Phase60's sorted activation gather and separate cast kernels before any +grouped-kernel body rewrite. Keep it only if it improves forced W4A16 S_PP by at +least `+12%` and reaches at least `0.75x` default FP4-MMQ; otherwise reject and +do not continue W4A16 body tuning. + --- ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes) diff --git a/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md b/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md new file mode 100644 index 000000000..d40dd0490 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md @@ -0,0 +1,545 @@ +# W4A16 Direct-Activation Phase61 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Build and gate a default-off W4A16 grouped MoE prefill experiment that removes the measured sorted activation gather and separate f32-to-bf16 cast overhead from Phase60. + +**Architecture:** Keep the existing default path unchanged. Add a second W4A16 grouped kernel mode behind `LLAMA_W4A16_DIRECT_A=1` that consumes the original `src1` activation tensor and the existing `ids_to_sorted` map directly, converting f32 activations to bf16 while loading A into shared memory. This tests the measured Phase60 hypothesis before any larger grouped-kernel body rewrite. + +**Tech Stack:** llama.cpp fork (`/home/mudler/_git/llama.cpp`), ggml CUDA, CMake, `test-backend-ops`, DGX GB10 gates, LocalAI docs. + +--- + +## Evidence + +Phase60 artifact: + +- `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915` + +At MoE `npp=512`, forced W4A16 spent: + +- `4.142s` in `w4a16_grouped_kernel<32,128,1,4,2>` +- `1.094s` in `k_get_rows_float` sorted activation gathers +- `0.517s` in `w4a16_cast_act_f32_bf16` + +Default FP4-MMQ spent: + +- `2.712s` in `mul_mat_q` +- `0.314s` in `quantize_mmq_nvfp4` + +The direct-activation experiment can at most remove the `1.611s` gather+cast tax before kernel-body effects. It is a kill-gate experiment: if forced W4A16 remains far behind default MMQ, stop this branch and do not tune the W4A16 body again. + +## Files + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cuh` + - Add declarations for a direct-activation engagement helper and direct kernel launcher. +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + - Add `LLAMA_W4A16_DIRECT_A` parsing. + - Add a pure helper for direct-mode route tests. + - Add a direct-activation kernel variant that uses `ids_to_sorted` and original `src1` strides. +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` + - Pass `ids_to_sorted`, original `src1` data pointer, and source strides to the direct launcher. + - Skip `get_rows_cuda` and `w4a16_cast_act_f32_bf16` only in direct mode. +- Modify: `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt` + - Register a small policy unit test. +- Create: `/home/mudler/_git/llama.cpp/tests/test-cuda-w4a16-policy.cpp` + - Unit-test direct-mode engagement logic without requiring CUDA execution. +- Create after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result.md` + - Record gates, A/B, decision, and whether to keep or revert the fork patch. + +## Task 1: Add Red Policy Tests + +**Files:** + +- Create: `/home/mudler/_git/llama.cpp/tests/test-cuda-w4a16-policy.cpp` +- Modify: `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt` + +- [ ] **Step 1: Add the failing policy test file** + +Create `/home/mudler/_git/llama.cpp/tests/test-cuda-w4a16-policy.cpp`: + +```cpp +#include "ggml-cuda/w4a16-gemm.cuh" + +#include +#include + +static void test_direct_a_requires_master_w4a16() { + const bool ok = ggml_cuda_w4a16_direct_a_should_engage_params( + GGML_TYPE_NVFP4, GGML_TYPE_F32, GGML_TYPE_F32, + /*blackwell=*/true, + /*w4a16_prefill_m=*/0, + /*direct_a=*/true, + /*tokens=*/512, + /*k=*/2048, + /*n=*/512); + assert(!ok); +} + +static void test_direct_a_requires_direct_flag() { + const bool ok = ggml_cuda_w4a16_direct_a_should_engage_params( + GGML_TYPE_NVFP4, GGML_TYPE_F32, GGML_TYPE_F32, + /*blackwell=*/true, + /*w4a16_prefill_m=*/1, + /*direct_a=*/false, + /*tokens=*/512, + /*k=*/2048, + /*n=*/512); + assert(!ok); +} + +static void test_direct_a_engages_for_large_nvfp4_moe_prefill_shape() { + const bool ok = ggml_cuda_w4a16_direct_a_should_engage_params( + GGML_TYPE_NVFP4, GGML_TYPE_F32, GGML_TYPE_F32, + /*blackwell=*/true, + /*w4a16_prefill_m=*/1, + /*direct_a=*/true, + /*tokens=*/512, + /*k=*/2048, + /*n=*/512); + assert(ok); +} + +static void test_direct_a_rejects_decode_sized_shape() { + const bool ok = ggml_cuda_w4a16_direct_a_should_engage_params( + GGML_TYPE_NVFP4, GGML_TYPE_F32, GGML_TYPE_F32, + /*blackwell=*/true, + /*w4a16_prefill_m=*/128, + /*direct_a=*/true, + /*tokens=*/128, + /*k=*/2048, + /*n=*/512); + assert(!ok); +} + +int main() { + test_direct_a_requires_master_w4a16(); + test_direct_a_requires_direct_flag(); + test_direct_a_engages_for_large_nvfp4_moe_prefill_shape(); + test_direct_a_rejects_decode_sized_shape(); + std::puts("test-cuda-w4a16-policy: OK"); + return 0; +} +``` + +- [ ] **Step 2: Register the test** + +Add to `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt` near the other small C++ tests: + +```cmake +if (GGML_CUDA) + llama_test_executable(test-cuda-w4a16-policy test-cuda-w4a16-policy.cpp) + target_link_libraries(test-cuda-w4a16-policy PRIVATE ggml) +endif() +``` + +- [ ] **Step 3: Verify RED** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-cuda-w4a16-policy -j2 +``` + +Expected: build fails because `ggml_cuda_w4a16_direct_a_should_engage_params` is not declared. + +## Task 2: Add Direct-A Policy Helper + +**Files:** + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cuh` +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + +- [ ] **Step 1: Declare the helper** + +Add to `w4a16-gemm.cuh`: + +```cpp +bool ggml_cuda_w4a16_direct_a_enabled(); + +bool ggml_cuda_w4a16_direct_a_should_engage_params( + ggml_type src0_type, + ggml_type src1_type, + ggml_type dst_type, + bool blackwell, + int64_t w4a16_prefill_m, + bool direct_a, + int64_t tokens, + int64_t k, + int64_t n); +``` + +- [ ] **Step 2: Implement the helper** + +Add to `w4a16-gemm.cu` near `ggml_cuda_w4a16_prefill_enabled()`: + +```cpp +bool ggml_cuda_w4a16_direct_a_enabled() { + static const bool enabled = [] { + const char * e = getenv("LLAMA_W4A16_DIRECT_A"); + return e != nullptr && atoi(e) != 0; + }(); + return enabled; +} + +bool ggml_cuda_w4a16_direct_a_should_engage_params( + ggml_type src0_type, + ggml_type src1_type, + ggml_type dst_type, + bool blackwell, + int64_t w4a16_prefill_m, + bool direct_a, + int64_t tokens, + int64_t k, + int64_t n) { + if (!direct_a || w4a16_prefill_m <= 0) { + return false; + } + if (src0_type != GGML_TYPE_NVFP4 || src1_type != GGML_TYPE_F32 || dst_type != GGML_TYPE_F32) { + return false; + } + if (!blackwell || tokens <= w4a16_prefill_m) { + return false; + } + return k % 64 == 0 && n % 128 == 0; +} +``` + +- [ ] **Step 3: Verify GREEN for policy test** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-cuda-w4a16-policy -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Expected: + +```text +test-cuda-w4a16-policy: OK +``` + +## Task 3: Add Direct-A Kernel Launcher Skeleton + +**Files:** + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cuh` +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + +- [ ] **Step 1: Declare the direct launcher** + +Add to `w4a16-gemm.cuh`: + +```cpp +void ggml_cuda_mul_mat_id_w4a16_grouped_direct_a( + ggml_backend_cuda_context & ctx, + const ggml_tensor * src0, + const float * src1, + const int32_t * ids_to_sorted, + float * dst_sorted, + const int * tokens_per_expert, + int64_t n_experts, + int64_t n_expert_used, + int64_t k, + int64_t n, + size_t src1_nb1, + size_t src1_nb2, + cudaStream_t stream); +``` + +- [ ] **Step 2: Add a stub that preserves behavior** + +Add to `w4a16-gemm.cu` after `ggml_cuda_mul_mat_id_w4a16_grouped()`: + +```cpp +void ggml_cuda_mul_mat_id_w4a16_grouped_direct_a( + ggml_backend_cuda_context & ctx, + const ggml_tensor * src0, + const float * src1, + const int32_t * ids_to_sorted, + float * dst_sorted, + const int * tokens_per_expert, + int64_t n_experts, + int64_t n_expert_used, + int64_t k, + int64_t n, + size_t src1_nb1, + size_t src1_nb2, + cudaStream_t stream) { + GGML_UNUSED(ctx); + GGML_UNUSED(src0); + GGML_UNUSED(src1); + GGML_UNUSED(ids_to_sorted); + GGML_UNUSED(dst_sorted); + GGML_UNUSED(tokens_per_expert); + GGML_UNUSED(n_experts); + GGML_UNUSED(n_expert_used); + GGML_UNUSED(k); + GGML_UNUSED(n); + GGML_UNUSED(src1_nb1); + GGML_UNUSED(src1_nb2); + GGML_UNUSED(stream); + GGML_ABORT("LLAMA_W4A16_DIRECT_A selected before direct-A kernel implementation"); +} +``` + +- [ ] **Step 3: Verify build still passes** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-cuda-w4a16-policy llama-batched-bench -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Expected: test passes and `llama-batched-bench` builds. + +## Task 4: Route Direct-A Mode Without Touching Default Path + +**Files:** + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` + +- [ ] **Step 1: Add direct-mode branch** + +In `ggml_cuda_mul_mat_id`, after `ids_to_sorted` and `ids_from_sorted` are prepared, replace the W4A16 branch with this structure: + +```cpp + const bool use_w4a16_direct_a = ggml_cuda_w4a16_direct_a_should_engage_params( + src0->type, src1->type, dst->type, + blackwell_mma_available(cc), + ggml_cuda_w4a16_prefill_m(), + ggml_cuda_w4a16_direct_a_enabled(), + ne12, + ne10, + ne0); + + if (use_w4a16_direct_a) { + ggml_cuda_mul_mat_id_w4a16_grouped_direct_a(ctx, src0, + (const float *) src1->data, ids_to_sorted, (float *) dst_sorted.ptr, + tokens_per_expert.data(), ne02, n_expert_used, ne10, ne0, + nb11, nb12, stream); + } else { + get_rows_cuda(src1->data, src1->type, ids_to_sorted, src1_sorted.ptr, type_src1_sorted, + ne10, nb11, nb12, nb13, + ne_get_rows, 1, 1, sizeof(int32_t), ne_get_rows*sizeof(int32_t), ne_get_rows*sizeof(int32_t), + ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, stream); + CUDA_CHECK(cudaGetLastError()); + + if (ggml_cuda_w4a16_moe_grouped_should_engage(src0, src1, dst, cc)) { + ggml_cuda_mul_mat_id_w4a16_grouped(ctx, src0, + (const float *) src1_sorted.ptr, (float *) dst_sorted.ptr, + tokens_per_expert.data(), ne02, ne10, ne0, stream); + } else { + // existing per-expert loop remains here unchanged + } + } +``` + +Do not leave two `get_rows_cuda` calls in the direct path. + +- [ ] **Step 2: Verify default path** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-cuda-w4a16-policy llama-batched-bench -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Expected: build and policy test pass. Do not run `LLAMA_W4A16_DIRECT_A=1` yet; the stub must abort if selected. + +## Task 5: Implement Direct-A Kernel + +**Files:** + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + +- [ ] **Step 1: Add the direct kernel variant** + +Copy `w4a16_grouped_kernel` into a new template named `w4a16_grouped_direct_a_kernel`. Change only the A-load section: + +```cpp +const int32_t src_row = ids_to_sorted[row0 + r]; +const int64_t token = src_row / n_expert_used; +const int64_t slot = src_row - token * n_expert_used; +const char * src_base = ((const char *) src1) + token * src1_nb2 + slot * src1_nb1; +const float * src = (const float *) (src_base + (int64_t) kt * BK * sizeof(float) + c * 8 * sizeof(float)); +``` + +Load eight f32 values, convert to bf16, and store into `sA[st]`. This replaces the old `cp.async` A load because the source conversion is no longer a raw copy: + +```cpp +nv_bfloat16 tmp[8]; +#pragma unroll +for (int q = 0; q < 8; ++q) { + tmp[q] = __float2bfloat16(src[q]); +} +uint4 packed = *reinterpret_cast(tmp); +*reinterpret_cast(((char *) sA[st]) + (r*ASTR + c*4)*sizeof(uint32_t)) = packed; +``` + +Keep W `cp.async` unchanged. + +- [ ] **Step 2: Wire the direct launcher to the new kernel** + +Replace the stub body with a launcher that mirrors `ggml_cuda_mul_mat_id_w4a16_grouped_impl`, but: + +- does not allocate `Abf`; +- does not call `w4a16_cast_act_f32_bf16`; +- passes `src1`, `ids_to_sorted`, `n_expert_used`, `src1_nb1`, and `src1_nb2` into the direct kernel. + +- [ ] **Step 3: Build** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-cuda-w4a16-policy test-backend-ops llama-batched-bench llama-completion -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Expected: build succeeds and policy test passes. + +## Task 6: Local CUDA Correctness Gate + +**Files:** none. + +- [ ] **Step 1: Run forced W4A16 direct-A op gate** + +Run on a CUDA host: + +```bash +cd /home/mudler/_git/llama.cpp +LLAMA_W4A16_PREFILL_M=1 LLAMA_W4A16_DIRECT_A=1 ./build/bin/test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1 +``` + +Expected: `806/806 tests passed`. + +- [ ] **Step 2: Run default op gate** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +./build/bin/test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1 +``` + +Expected: `806/806 tests passed`. + +## Task 7: DGX Inference and Performance Gate + +**Files:** none. + +- [ ] **Step 1: Preflight DGX** + +Run: + +```bash +ssh dgx.casa 'echo docker=$(docker ps -q | wc -l); echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l); cat /tmp/localai-gpu.lock 2>/dev/null || true; pgrep -af "[l]ocal-ai-worker|[v]llm|[l]lama-server" || true' +``` + +Expected: Docker `0`, compute `0`, lock `FREE*`, and no worker/server process. + +- [ ] **Step 2: Apply patch to clean DGX mirror and build** + +Use the fork diff for this one patch only, apply it to `~/llama-phase6-source`, and build `build-cuda`. Do not leave the DGX mirror dirty after the phase. + +- [ ] **Step 3: Run pre gates** + +Run the canonical MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates: + +```bash +LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1 /tmp/phase61-w4a16-direct-a-rejected.diff +git restore ggml/src/ggml-cuda/w4a16-gemm.cuh ggml/src/ggml-cuda/w4a16-gemm.cu ggml/src/ggml-cuda/ggml-cuda.cu tests/CMakeLists.txt +rm -f tests/test-cuda-w4a16-policy.cpp +git status --short +``` + +- [ ] **Step 3: Update LocalAI docs** + +Create `docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result.md` with the artifact path, gate table, A/B table, and keep/reject decision. Update: + +- `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [ ] **Step 4: Commit LocalAI docs** + +Run: + +```bash +cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result.md +git commit -m "docs(paged): record W4A16 direct activation phase" -m "Assisted-by: Codex:gpt-5" +``` + +## Self-Review + +- Spec coverage: The plan addresses Phase60's measured W4A16 sorted-gather and cast overhead before any grouped-kernel-body rewrite. +- Placeholder scan: No `TBD`, `TODO`, or unspecified test commands remain. +- Type consistency: Helper names use the `ggml_cuda_w4a16_direct_a_*` prefix consistently across declaration, test, implementation, and route branch.