mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
patches(paged): mirror W4A16 packed metadata
Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off. Assisted-by: Codex:gpt-5
This commit is contained in:
@@ -128,6 +128,49 @@ First implementation target:
|
||||
many small ragged tile maps. The first fork-first experiment should remove or
|
||||
amortize that host-built tile-map path before retuning MMA tile shapes.
|
||||
|
||||
## W4A16 Metadata Phase 1
|
||||
|
||||
Fork commit: `4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17`
|
||||
(`feat(paged): pack W4A16 grouped tile metadata`).
|
||||
|
||||
LocalAI patch mirror: `0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch`.
|
||||
|
||||
Mirror invariant: applying the full LocalAI `patches/paged/*.patch` series to
|
||||
base pin `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD
|
||||
`4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17`.
|
||||
|
||||
Artifacts:
|
||||
|
||||
- Diff: `~/bench/w4a16_phase1/packed_desc.diff`
|
||||
- Build mtimes: `~/bench/w4a16_phase1/build_binary_mtimes.txt`
|
||||
- MoE gate: `~/bench/w4a16_phase1/gate_moe.md5`
|
||||
- Dense gate: `~/bench/w4a16_phase1/gate_dense.md5`
|
||||
- Default FP4-MMQ: `~/bench/w4a16_phase1/w4a16_off.txt`
|
||||
- Packed W4A16: `~/bench/w4a16_phase1/w4a16_on_thr64.txt`
|
||||
|
||||
Canonical gates:
|
||||
|
||||
- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0` (matched expected)
|
||||
- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439` (matched expected)
|
||||
|
||||
Packed descriptor A/B:
|
||||
|
||||
| Path | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|------|----|----|---|------|--------|----------|--------|----------|-----|-------|
|
||||
| FP4-MMQ | 512 | 4 | 32 | 16512 | 7.114 | 2303.07 | 0.323 | 396.55 | 7.437 | 2220.32 |
|
||||
| FP4-MMQ | 2048 | 4 | 32 | 65664 | 27.045 | 2423.23 | 0.331 | 387.14 | 27.376 | 2398.64 |
|
||||
| W4A16 packed | 512 | 4 | 32 | 16512 | 12.468 | 1314.08 | 0.322 | 397.97 | 12.790 | 1291.04 |
|
||||
| W4A16 packed | 2048 | 4 | 32 | 65664 | 48.930 | 1339.39 | 0.330 | 387.44 | 49.260 | 1333.00 |
|
||||
|
||||
Result:
|
||||
|
||||
- Packed descriptors improved forced W4A16 by `+0.39%` at `npp=512` and
|
||||
`+0.48%` at `npp=2048` versus the Phase 0 no-debug W4A16 baseline.
|
||||
- W4A16 remains `-42.9%` at `npp=512` and `-44.7%` at `npp=2048` versus
|
||||
same-run default FP4-MMQ.
|
||||
- Decision: keep patch `0048` as a small simplification, but pivot the next
|
||||
W4A16 iteration to the activation cast or MMA/dequant tile body.
|
||||
|
||||
## Clean Build
|
||||
|
||||
First clean build attempt:
|
||||
@@ -173,7 +216,8 @@ Second clean build attempt:
|
||||
- HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
|
||||
- Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
|
||||
- Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
|
||||
- LocalAI patch count: `38`
|
||||
- LocalAI patch count: `38` at Phase 0; current mirror count is `39` after
|
||||
patch `0048`.
|
||||
- LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork
|
||||
HEAD.
|
||||
- Tree hash after patch application: `a73d759350277532a14e853e1fe78f08bbb74ce8`
|
||||
|
||||
@@ -58,7 +58,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
|
||||
- **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
|
||||
- **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
|
||||
- The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
|
||||
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `51168c5ee` "patch 0044" is byte-identical to worktree `0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the f32-only M5 tensor-core scan is worktree patch `0047`.)
|
||||
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `4b0cc1163` is mirrored by the new worktree `0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch`; the f32-only M5 tensor-core scan is worktree patch `0047`.)
|
||||
|
||||
### 2.6 Bench hygiene gates
|
||||
- **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
|
||||
@@ -253,14 +253,14 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
|
||||
## 7. KEY FILE / ARTIFACT INDEX
|
||||
|
||||
### Fork (canonical source of truth)
|
||||
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, HEAD `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc` ("fused gated RMSNorm + SiLU gate-mul CUDA op (patch 0044)"). **Currently dirty** (uncommitted `M ggml/src/ggml-cuda/gated_delta_net.cu`).
|
||||
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17` ("pack W4A16 grouped tile metadata", patch `0048`). The DGX checkout itself may still be dirty and must not be treated as canonical.
|
||||
- `dgx:~/llama-paged-dev` (experimental dev/build tree), branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). **Dirty** + many untracked profiling artifacts. This tree's `build-cuda/bin/` produced the benchmarked binaries; `COMBINED_DEFINITIVE` recorded `GIT_HEAD=a7d439e` (the M8 bf16 dev config), NOT the fork HEAD. The dev tree carries bf16/hybrid M6/M7/M8 machinery deliberately EXCLUDED from the shipped f32-only series.
|
||||
|
||||
### LocalAI worktree
|
||||
- Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (199 ahead, 25 behind origin/master; the ahead count grows with each new commit).
|
||||
- Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
|
||||
- `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
|
||||
- `patches/paged/`: **38** `.patch` files spanning 0001-0047 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV).
|
||||
- `patches/paged/`: **39** `.patch` files spanning 0001-0048 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048.
|
||||
|
||||
### Bench artifacts (DGX)
|
||||
- `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - the definitive same-session both-engine run.
|
||||
@@ -276,7 +276,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
|
||||
|
||||
### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
|
||||
1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
|
||||
2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the fork HEAD (`51168c5ee`, patch 0044) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
|
||||
2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`4b0cc1163`, patch 0048) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
|
||||
3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`.
|
||||
4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
|
||||
5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
|
||||
|
||||
@@ -0,0 +1,121 @@
|
||||
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
|
||||
From: Ettore Di Giacinto <mudler@localai.io>
|
||||
Date: Tue, 30 Jun 2026 21:19:07 +0000
|
||||
Subject: [PATCH] feat(paged): pack W4A16 grouped tile metadata
|
||||
|
||||
Collapse the grouped W4A16 tile metadata from three host vectors and three device copies into one aligned descriptor array and one H2D copy. This keeps the path default-off while reducing launch-side metadata overhead for the experimental W4A16 MoE prefill path.
|
||||
|
||||
Assisted-by: Codex:gpt-5
|
||||
---
|
||||
ggml/src/ggml-cuda/w4a16-gemm.cu | 46 ++++++++++++++++++--------------
|
||||
1 file changed, 26 insertions(+), 20 deletions(-)
|
||||
|
||||
diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu
|
||||
index f348f31ad..899e1a23f 100644
|
||||
--- a/ggml/src/ggml-cuda/w4a16-gemm.cu
|
||||
+++ b/ggml/src/ggml-cuda/w4a16-gemm.cu
|
||||
@@ -18,6 +18,14 @@ typedef tile<16, 8, nv_bfloat162> tile_A; // A operand: M=16, K=16
|
||||
typedef tile< 8, 8, nv_bfloat162> tile_B; // B operand: N=8, K=16
|
||||
typedef tile<16, 8, float> tile_C; // accumulator: M=16, N=8
|
||||
|
||||
+struct alignas(16) w4a16_tile_desc {
|
||||
+ int expert;
|
||||
+ int row0;
|
||||
+ int rows;
|
||||
+ int pad;
|
||||
+};
|
||||
+static_assert(sizeof(w4a16_tile_desc) == 16, "w4a16 tile descriptors must stay 16 bytes");
|
||||
+
|
||||
#ifndef LLAMA_W4A16_PREFILL_M
|
||||
#define LLAMA_W4A16_PREFILL_M 0
|
||||
#endif // LLAMA_W4A16_PREFILL_M
|
||||
@@ -92,9 +100,7 @@ static __global__ void w4a16_grouped_kernel(
|
||||
const nv_bfloat16 * __restrict__ Abf, // [pad_rows, K] bf16
|
||||
const block_nvfp4 * __restrict__ W0, // src0 base (expert 0)
|
||||
float * __restrict__ C, // [total_rows, N] f32
|
||||
- const int * __restrict__ g_tile_expert,
|
||||
- const int * __restrict__ g_tile_row0,
|
||||
- const int * __restrict__ g_tile_rows,
|
||||
+ const w4a16_tile_desc * __restrict__ g_tiles,
|
||||
int N, int K, int64_t expert_stride_blocks) {
|
||||
#if defined(AMPERE_MMA_AVAILABLE) && defined(CP_ASYNC_AVAILABLE)
|
||||
constexpr int BK = 64; // one nvfp4 block
|
||||
@@ -129,9 +135,10 @@ static __global__ void w4a16_grouped_kernel(
|
||||
const int tid = warp*32 + lane; // linear id for the cp.async strided copies
|
||||
const int wrow = warp / WARPS_N, wcol = warp % WARPS_N;
|
||||
|
||||
- const int e = g_tile_expert[blockIdx.y];
|
||||
- const int row0 = g_tile_row0[blockIdx.y];
|
||||
- const int rcount = g_tile_rows[blockIdx.y];
|
||||
+ const w4a16_tile_desc tile_desc = g_tiles[blockIdx.y];
|
||||
+ const int e = tile_desc.expert;
|
||||
+ const int row0 = tile_desc.row0;
|
||||
+ const int rcount = tile_desc.rows;
|
||||
const int blockCol = blockIdx.x*BN;
|
||||
const int Kb = K/64;
|
||||
const block_nvfp4 * We = W0 + (int64_t) e*expert_stride_blocks; // expert e weight base
|
||||
@@ -238,7 +245,7 @@ static __global__ void w4a16_grouped_kernel(
|
||||
}
|
||||
#else
|
||||
GGML_UNUSED(Abf); GGML_UNUSED(W0); GGML_UNUSED(C);
|
||||
- GGML_UNUSED(g_tile_expert); GGML_UNUSED(g_tile_row0); GGML_UNUSED(g_tile_rows);
|
||||
+ GGML_UNUSED(g_tiles);
|
||||
GGML_UNUSED(N); GGML_UNUSED(K); GGML_UNUSED(expert_stride_blocks);
|
||||
NO_DEVICE_CODE;
|
||||
#endif // AMPERE_MMA_AVAILABLE && CP_ASYNC_AVAILABLE
|
||||
@@ -296,18 +303,21 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
|
||||
return;
|
||||
}
|
||||
|
||||
- std::vector<int32_t> h_tile_expert, h_tile_row0, h_tile_rows;
|
||||
+ std::vector<w4a16_tile_desc> h_tiles;
|
||||
int64_t row = 0;
|
||||
for (int64_t e = 0; e < n_experts; e++) {
|
||||
const int t = tokens_per_expert[e];
|
||||
for (int off = 0; off < t; off += BM) {
|
||||
- h_tile_expert.push_back((int32_t) e);
|
||||
- h_tile_row0.push_back((int32_t) (row + off));
|
||||
- h_tile_rows.push_back((int32_t) std::min(BM, t - off));
|
||||
+ h_tiles.push_back({
|
||||
+ (int) e,
|
||||
+ (int) (row + off),
|
||||
+ (int) std::min(BM, t - off),
|
||||
+ 0,
|
||||
+ });
|
||||
}
|
||||
row += t;
|
||||
}
|
||||
- const int n_tiles = (int) h_tile_expert.size();
|
||||
+ const int n_tiles = (int) h_tiles.size();
|
||||
|
||||
if (getenv("LLAMA_W4A16_DEBUG")) {
|
||||
int max_tpe = 0, multi = 0;
|
||||
@@ -319,13 +329,9 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
|
||||
(long long) total_rows, (long long) n_experts, (long long) K, (long long) N, n_tiles, max_tpe, multi);
|
||||
}
|
||||
|
||||
- // device: tile map
|
||||
- ggml_cuda_pool_alloc<int32_t> d_tile_expert(ctx.pool(), n_tiles);
|
||||
- ggml_cuda_pool_alloc<int32_t> d_tile_row0 (ctx.pool(), n_tiles);
|
||||
- ggml_cuda_pool_alloc<int32_t> d_tile_rows (ctx.pool(), n_tiles);
|
||||
- CUDA_CHECK(cudaMemcpyAsync(d_tile_expert.ptr, h_tile_expert.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
|
||||
- CUDA_CHECK(cudaMemcpyAsync(d_tile_row0.ptr, h_tile_row0.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
|
||||
- CUDA_CHECK(cudaMemcpyAsync(d_tile_rows.ptr, h_tile_rows.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
|
||||
+ // device: packed tile map; one pageable H2D copy instead of three tiny copies
|
||||
+ ggml_cuda_pool_alloc<w4a16_tile_desc> d_tiles(ctx.pool(), n_tiles);
|
||||
+ CUDA_CHECK(cudaMemcpyAsync(d_tiles.ptr, h_tiles.data(), n_tiles*sizeof(w4a16_tile_desc), cudaMemcpyHostToDevice, stream));
|
||||
|
||||
// activations: f32 -> bf16 (cheap cast, NO act-quant), zero-padded so every tile's BM-row read
|
||||
// stays in-bounds. A tile's row0 is generally NOT BM-aligned (experts start mid-buffer), and a
|
||||
@@ -353,7 +359,7 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
|
||||
dim3 block(32, WARPS_M*WARPS_N); // 2D: threadIdx.x = warp lane, threadIdx.y = warp
|
||||
kern<<<grid, block, smem_bytes, stream>>>(
|
||||
Abf.get(), (const block_nvfp4 *) src0->data, dst_sorted,
|
||||
- d_tile_expert.ptr, d_tile_row0.ptr, d_tile_rows.ptr,
|
||||
+ d_tiles.ptr,
|
||||
(int) N, (int) K, expert_stride_blocks);
|
||||
CUDA_CHECK(cudaGetLastError());
|
||||
}
|
||||
--
|
||||
2.43.0
|
||||
|
||||
38
docs/superpowers/plans/2026-06-30-w4a16-metadata-phase1.md
Normal file
38
docs/superpowers/plans/2026-06-30-w4a16-metadata-phase1.md
Normal file
@@ -0,0 +1,38 @@
|
||||
# W4A16 Metadata Phase 1 Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing.
|
||||
|
||||
**Goal:** Test the first W4A16 kill-gate target selected by Phase 0: reduce host-built tile metadata overhead in the grouped W4A16 MoE prefill path.
|
||||
|
||||
**Scope:** Fork-first in `/home/mudler/_git/llama.cpp`; LocalAI patch series is regenerated only after the fork commit is validated.
|
||||
|
||||
## Task 1: Packed Tile Descriptor
|
||||
|
||||
**Files:**
|
||||
- Modify fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu`
|
||||
- Modify fork-first if needed: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cuh`
|
||||
|
||||
- [x] Replace `h_tile_expert`, `h_tile_row0`, and `h_tile_rows` with one packed tile descriptor vector.
|
||||
- [x] Replace three device pool allocations and three `cudaMemcpyAsync` calls with one descriptor allocation and one H2D copy.
|
||||
- [x] Keep default-off behavior unchanged when `LLAMA_W4A16_PREFILL_M=0`.
|
||||
|
||||
## Task 2: Fork Build And Gates
|
||||
|
||||
- [x] Build the fork on DGX from a clean source snapshot.
|
||||
- [x] Run canonical MoE and dense md5 gates.
|
||||
- [x] Run W4A16 off/on prefill A/B at `npp=512,2048`.
|
||||
- [x] Record whether packed descriptors improve, regress, or do not materially change W4A16.
|
||||
|
||||
Result: packed descriptors passed md5 gates and improved forced W4A16 by only
|
||||
`+0.39%` at `npp=512` and `+0.48%` at `npp=2048`; W4A16 remains `-42.9%` and
|
||||
`-44.7%` behind default FP4-MMQ respectively.
|
||||
|
||||
## Task 3: Next Decision
|
||||
|
||||
- [x] If W4A16 improves materially, continue metadata work toward device-side/cached descriptor generation.
|
||||
- [x] If W4A16 does not improve materially, keep the patch only if it simplifies the path and choose the next target from the observed bottleneck.
|
||||
- [x] Commit fork-first result, regenerate LocalAI patches, verify mirror invariant, and update LocalAI results docs.
|
||||
|
||||
Decision: keep the packed descriptor patch as a simplification, but do not spend
|
||||
the next iteration on metadata alone. The remaining gap is dominated elsewhere;
|
||||
next target should be the activation cast or MMA/dequant tile body.
|
||||
Reference in New Issue
Block a user