docs(paged): record prefill bucket attribution phase

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 12:20:29 +00:00
parent 6a2618b6dc
commit 2e19e5c90f
4 changed files with 485 additions and 1 deletions


@@ -3507,3 +3507,49 @@ Decision:
- Do not tune `spec-draft-n-max` blindly. Phase15, Phase19, and Phase62 all
showed high acceptance with poor serving throughput, so the remaining question
is verify cost, not whether MTP can draft.
## Prefill Bucket Attribution Phase63 Result
Phase63 is recorded in
`docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`.
It was a measurement and decision phase, not a source patch phase.
Artifact:
- `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`
Pre/post inference gates passed:
| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
llama.cpp MoE prefill, `npl=32`, `ntg=4`:
| npp | S_PP (tok/s) | MoE/FFN-GEMM | GDN | bf16-proj | layout-copy | act-quant | MoE-dispatch | gather | FA |
|-----|------|--------------|-----|-----------|-------------|-----------|--------------|--------|----|
| 512 | `2248.20` | `40.48%` | `18.00%` | `10.19%` | `7.82%` | `4.47%` | `1.94%` | `1.26%` | `0.71%` |
| 2048 | `2385.22` | `41.06%` | `16.15%` | `9.97%` | `7.96%` | `4.61%` | `2.12%` | `1.36%` | `1.18%` |
vLLM MoE prefill, `NSEQ=32`, `GEN=1`, `NREP=3`, eager profile path:
| PT | S_PP (tok/s) | ew/glue | GDN | FA | bf16-proj | MoE-dispatch | top unclassified |
|----|------|---------|-----|----|-----------|--------------|------------------|
| 512 | `5315.6` | `32.97%` | `18.34%` | `0.73%` | `3.41%` | `1.37%` | Marlin MoE `1940.99ms`, FP8 projection `565.74ms` |
| 2048 | `5384.4` | `33.48%` | `18.00%` | `1.75%` | `1.06%` | `0.49%` | Marlin MoE `7745.84ms`, FP8 projection `3047.75ms` |
Decision:
- Reject a Phase63 paged FlashAttention mask/block-table source patch. llama.cpp
FA is only `1.18%` of prefill GPU kernel time at `npp=2048`, below the `5%`
reject threshold and far below the `8%` source-funding threshold.
- The `npp=2048` FA cost is about `4.9 us/tok` for llama.cpp and `3.1 us/tok`
for vLLM. The cross-engine FA delta, computed before rounding, is only about
`1.7 us/tok`, below the `15 us/tok` funding threshold.
- The dominant remaining llama.cpp buckets are still MoE/FFN GEMM, GDN,
bf16 projections, layout copies, and activation quantization. Phase63 did not
identify a new low-conflict source patch that can move GB10 parity without
reopening already-rejected W4A16/GDN/MTP/small-M work.
- No llama.cpp source files were modified. Default inferencing stayed green with
the canonical md5/op gates.


@@ -878,4 +878,24 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
---
*Status: investigation CLOSED. This handoff is procedure; `VLLM_PARITY_FINAL.md` is the record. The path to parity is datacenter Blackwell, not GB10 kernels.*
## 8. PHASE63 RESULT: PREFILL BUCKET ATTRIBUTION
Phase63 is complete as a measurement-only no-go. The plan is
`docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`; the
DGX artifact is `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`.
Pre/post gates stayed green:
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`;
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`;
- `MUL_MAT` `1146/1146`;
- `MUL_MAT_ID` `806/806`.
The candidate paged FlashAttention mask/block-table cleanup is rejected for now:
llama.cpp FA is only `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the
`npp=2048` cross-engine FA delta is about `1.7 us/tok`, not the `15 us/tok`
needed to fund source work. No llama.cpp source files were modified.
*Status: Phase63 closed. `VLLM_PARITY_FINAL.md` remains the GB10 shortcut record;
the remaining measured buckets are still MoE/FFN GEMM, GDN, bf16 projections,
layout copies, and activation quantization.*


@@ -85,6 +85,15 @@ The 10-16 full-attention layers' QK^T·softmax·PV is a separate kernel covered
## Bottom line
Two prefill levers (GEMM, GDN) are correctly ranked as the top two and account for most of the gap, but they are **not** the whole gap. The op-walk surfaces **MoE router+combine/scatter** and the **W4A4 activation-quant pass** as genuine, currently untracked prefill contributors on the MoE decision model (~8-14% combined), plus **FA prefill** as a context-dependent risk that the `npp=128` bench hides. Per the methodology, step 0 is an nsys prefill-only window that explicitly breaks out `argsort/add(combine)`, `quantize_mmq_nvfp4`, and `flash_attn` as separate rows to size these three before funding a kernel.
Phase63 executed that step-0 discipline after the W4A16 direct-A and MTP
rejections. It stayed profile-first and inference-gated: pre/post canonical md5
and backend-op gates wrapped same-shape llama.cpp/vLLM prefill profiles at
`npp/PT=512` and `2048`. Result: FA is not a source lever on GB10 right now.
llama.cpp FA was `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the
`npp=2048` cross-engine FA delta was about `1.7 us/tok`. The paged
FlashAttention mask/block-table cleanup remains a correctness/test gap worth
keeping in mind, but Phase63 rejects it as a parity patch.
Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).
## 2. Decode-serving compute hypotheses (ranked)


@@ -0,0 +1,409 @@
# Prefill Bucket Attribution Phase63 Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Re-profile current llama.cpp and vLLM MoE prefill on GB10 with inference gates before/after, then fund only a localized paged FlashAttention mask/block-table cleanup if the profile proves the bucket is material.
**Architecture:** Phase63 is measurement-first. It brackets all DGX work with canonical md5 and backend-op gates, captures same-shape Nsight Systems prefill profiles for llama.cpp and vLLM, reduces kernel rows into named buckets, and records a go/no-go decision before touching llama.cpp source. If the FA/mask bucket is too small, the phase closes as a documented rejection.
**Tech Stack:** LocalAI paged docs, llama.cpp CUDA backend, Nsight Systems, DGX `dgx.casa`, `/home/mudler/bench/bucket.py`, `llama-batched-bench`, vLLM offline profiling harness.
---
## Guardrails
- Do not edit llama.cpp source until Task 4 has a positive go decision.
- Do not regenerate the LocalAI patch series in this phase.
- Do not accept any md5 drift as benign without a separate KL decision.
- Canonical gates:
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT`: `1146/1146`
- `MUL_MAT_ID`: `806/806`
- DGX preflight must show `docker=0`, `local_ai_worker=0`, `compute=0`, and a free lock before starting a run.
## Files
- Create: `docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Read-only unless Task 4 is positive: `/home/mudler/_git/llama.cpp/src/paged-attn.cpp`
- Read-only unless Task 4 is positive: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn-vec.cuh`
- Read-only unless Task 4 is positive: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn-tile.cuh`
- Read-only unless Task 4 is positive: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn.cu`
- Test if Task 4 is positive: `/home/mudler/_git/llama.cpp/tests/test-backend-ops.cpp`
---
### Task 1: Acquire DGX and Run Pre-Gates
- [x] **Step 1: Verify DGX is idle and acquire the phase lock**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
docker_count=$(docker ps --format "{{.Names}}" | wc -l)
worker_count=$( (pgrep -af "[l]ocal-ai-worker" || true) | wc -l)
compute_count=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l)
lock_state=FREE
if [ -f /tmp/localai-gb10.lock ]; then lock_state=$(cat /tmp/localai-gb10.lock); fi
printf "docker=%s local_ai_worker=%s compute=%s lock=%s\n" "$docker_count" "$worker_count" "$compute_count" "$lock_state"
test "$docker_count" = 0
test "$worker_count" = 0
test "$compute_count" = 0
case "$lock_state" in FREE*|FREE-no-lock) : ;; *) exit 3 ;; esac
printf "codex-phase63-prefill-bucket %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
```
Expected: one line containing `docker=0 local_ai_worker=0 compute=0 lock=FREE...`, exit code `0`, and `/tmp/localai-gb10.lock` owned by `codex-phase63-prefill-bucket`.
Result: initial preflight showed `docker=0`, `compute=0`, and no real
`local-ai-worker` process. The first direct gate retry exposed a shell issue:
`pgrep` exits with status 1 when nothing matches, so under `set -euo pipefail`
the pipeline fails and the script aborts before printing. The execution command
therefore uses `(pgrep -af '[l]ocal-ai-worker' || true) | wc -l`.
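The shell issue above can be reproduced without a DGX. The following is a minimal sketch, not the phase's actual preflight script: `count_lines_or_zero` is a hypothetical helper name, and `grep` stands in for `pgrep` because both exit with status 1 when nothing matches.

```bash
#!/usr/bin/env bash
set -euo pipefail

# count_lines_or_zero CMD [ARGS...]   (hypothetical helper)
# Runs CMD and prints how many lines it wrote. A non-zero exit from CMD
# is treated as "zero matches" instead of failing the pipeline, which
# under `set -e -o pipefail` would otherwise abort the whole script.
count_lines_or_zero() {
  ("$@" || true) | wc -l | tr -d ' '
}

# Stand-in for `pgrep -af "[l]ocal-ai-worker"`: no line matches, grep exits 1,
# and the script still reaches the echo below.
worker_count=$(printf 'sshd\ncron\n' | count_lines_or_zero grep 'local-ai-worker')
echo "local_ai_worker=$worker_count"
```

One trade-off of `|| true` is that it also hides genuine failures, such as the command not being installed, so a stricter variant might inspect the exit status explicitly.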
- [x] **Step 2: Run canonical pre-gate**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged
ART=/home/mudler/bench/phase63_prefill_bucket/$(date +%Y%m%d_%H%M%S)
mkdir -p "$ART"
echo "$ART" > /tmp/phase63_artifact_dir
./scripts/paged-inference-gates.sh /home/mudler/llama-phase6-source/build-cuda/bin "$ART/pre_gate" | tee "$ART/pre_gate.log"'
```
Expected:
```text
moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0
dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439
1146/1146 tests passed
806/806 tests passed
paged inference gates OK
```
Result:
```text
docker=0 local_ai_worker=0 compute=0 lock=FREE-no-lock
pre moe_md5 8cb0ce23777bf55f92f63d0292c756b0 8cb0ce23777bf55f92f63d0292c756b0 ok
pre dense_md5 5951a5b4d624ce891e22ab5fca9bc439 5951a5b4d624ce891e22ab5fca9bc439 ok
pre MUL_MAT 1146/1146 1146/1146 ok
pre MUL_MAT_ID 806/806 806/806 ok
paged inference gates OK
```
Artifact: `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`.
---
### Task 2: Capture Current llama.cpp Prefill Profiles
- [x] **Step 1: Run `npp=512` and `npp=2048` llama.cpp prefill profiles**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART=$(cat /tmp/phase63_artifact_dir)
BIN=/home/mudler/llama-phase6-source/build-cuda/bin/llama-batched-bench
MODEL=/home/mudler/bench/q36-35b-a3b-nvfp4.gguf
for npp in 512 2048; do
REP="$ART/llama_moe_prefill_npp${npp}"
rm -f "$REP.nsys-rep" "$REP.sqlite" "$REP.log" "$REP.buckets.txt"
env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 GGML_CUDA_DISABLE_GRAPHS=1 \
nsys profile --trace=cuda --sample=none --cpuctxsw=none --force-overwrite true -o "$REP" \
"$BIN" -m "$MODEL" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \
-npp "$npp" -ntg 4 -npl 32 > "$REP.log" 2>&1
nsys stats --report cuda_gpu_kern_sum --format csv --force-export true -o "$REP.kern" "$REP.nsys-rep" >/dev/null
python3 /home/mudler/bench/bucket.py "$REP.nsys-rep" "phase63_llama_npp${npp}" > "$REP.buckets.txt"
grep -E "main:|pp|tg|llama_print_timings|error|failed|CUDA" "$REP.log" | tail -40 > "$REP.summary.txt" || true
done'
```
Expected:
- `llama_moe_prefill_npp512.nsys-rep`, `.kern_cuda_gpu_kern_sum.csv`, `.buckets.txt`, `.log`.
- `llama_moe_prefill_npp2048.nsys-rep`, `.kern_cuda_gpu_kern_sum.csv`, `.buckets.txt`, `.log`.
- Logs contain no `error`, `failed`, or CUDA runtime failure.
Result: both profiles completed under
`/home/mudler/bench/phase63_prefill_bucket/20260701_140127`.
- [x] **Step 2: Extract llama bucket rows for the decision table**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART=$(cat /tmp/phase63_artifact_dir)
for f in "$ART"/llama_moe_prefill_npp*.buckets.txt; do
echo "==== $f ===="
sed -n "/--- MACRO buckets ---/,/--- FINE buckets ---/p" "$f"
sed -n "/--- FINE buckets ---/,/--- top UNCLASSIFIED ---/p" "$f" | \
grep -E "mmq_nvfp4|act_quant|gdn_core|fa|argsort|mm_ids|gather_mmq|get_rows|copy_layout|concat_layout|convert_dtype" || true
done | tee "$ART/llama_bucket_extract.txt"'
```
Expected: extract includes rows for `MoE/FFN-GEMM`, `GDN`, `act-quant`, and `FA`; FA may be small.
Result:
| npp | MoE/FFN-GEMM | GDN | bf16-proj | layout-copy | act-quant | MoE-dispatch | gather | FA |
|-----|--------------|-----|-----------|-------------|-----------|--------------|--------|----|
| 512 | `40.48%` | `18.00%` | `10.19%` | `7.82%` | `4.47%` | `1.94%` | `1.26%` | `0.71%` |
| 2048 | `41.06%` | `16.15%` | `9.97%` | `7.96%` | `4.61%` | `2.12%` | `1.36%` | `1.18%` |
The FA bucket is below the Phase63 reject threshold before any source work.
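As a cross-check on a single bucket, the percentage can in principle be recomputed from the exported kernel summary CSV. This is only a sketch and does not reproduce `bucket.py`, whose classification rules are not shown here. It assumes the CSV has eight numeric columns followed by the kernel name, with total time in the second column; `kernel_share` and the `flash_attn` pattern are illustrative choices.

```bash
# kernel_share PATTERN < kern_sum.csv   (hypothetical helper)
# Prints the share of total GPU kernel time, in percent, spent in kernels
# whose name matches the awk regex PATTERN.
# Assumed layout: 8 numeric columns, then Name; column 2 is total time.
kernel_share() {
  awk -F, -v pat="$1" '
    NF < 9 || $2 !~ /^[0-9]+$/ { next }   # skip the header and blank lines
    {
      name = $9
      # Kernel names may contain commas; re-join the trailing fields.
      for (i = 10; i <= NF; i++) name = name "," $i
      total += $2
      if (name ~ pat) hit += $2
    }
    END {
      if (total > 0) printf "%.2f\n", 100 * hit / total
      else print "0.00"
    }
  '
}

# Example (path taken from the profile step above):
#   kernel_share flash_attn < "$REP.kern_cuda_gpu_kern_sum.csv"
```

A single regex is a much coarser classifier than named buckets, so a result from this sketch should only be expected to land near the `FA` column, not to match it exactly.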
---
### Task 3: Capture vLLM Same-Shape Prefill Profiles
- [x] **Step 1: Run vLLM `PT=512` and `PT=2048` prefill profiles**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART=$(cat /tmp/phase63_artifact_dir)
export PATH=$HOME/vllm-bench/bin:$PATH
export HF_HUB_OFFLINE=1
for pt in 512 2048; do
REP="$ART/vllm_moe_prefill_pt${pt}"
rm -f "$REP.nsys-rep" "$REP.sqlite" "$REP.log" "$REP.buckets.txt"
env NSEQ=32 PT="$pt" GEN=1 NREP=3 \
nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop \
--trace=cuda --sample=none --cpuctxsw=none --force-overwrite true -o "$REP" \
$HOME/vllm-bench/bin/python /home/mudler/bench/vllm_prefill_prof.py > "$REP.log" 2>&1
nsys stats --report cuda_gpu_kern_sum --format csv --force-export true -o "$REP.kern" "$REP.nsys-rep" >/dev/null
python3 /home/mudler/bench/bucket.py "$REP.nsys-rep" "phase63_vllm_pt${pt}" > "$REP.buckets.txt"
grep -E "TIMING|PROFILED|Error|Traceback|RuntimeError|CUDA" "$REP.log" | tail -40 > "$REP.summary.txt" || true
done'
```
Expected:
- `vllm_moe_prefill_pt512.nsys-rep`, `.kern_cuda_gpu_kern_sum.csv`, `.buckets.txt`, `.log`.
- `vllm_moe_prefill_pt2048.nsys-rep`, `.kern_cuda_gpu_kern_sum.csv`, `.buckets.txt`, `.log`.
- Logs contain `TIMING ... S_PP=...`, `PROFILED PREFILL START`, and `PROFILED END`.
Result: both vLLM profiles completed under
`/home/mudler/bench/phase63_prefill_bucket/20260701_140127`.
Timing:
| PT | S_PP |
|----|------|
| 512 | `5315.6 tok/s` |
| 2048 | `5384.4 tok/s` |
- [x] **Step 2: Extract vLLM bucket rows for the decision table**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART=$(cat /tmp/phase63_artifact_dir)
for f in "$ART"/vllm_moe_prefill_pt*.buckets.txt; do
echo "==== $f ===="
sed -n "/--- MACRO buckets ---/,/--- FINE buckets ---/p" "$f"
sed -n "/--- FINE buckets ---/,/--- top UNCLASSIFIED ---/p" "$f" | \
grep -E "vllm_fa|fla_gdn|vllm_dispatch|vllm_fp4_gemm|torch_ew|rmsnorm|triton|scaled|quant" || true
done | tee "$ART/vllm_bucket_extract.txt"'
```
Expected: extract includes vLLM rows for `MoE/FFN-GEMM`, `GDN`, `FA`, and dispatch/glue.
Result:
| PT | ew(misc) | GDN | FA | bf16-proj | MoE-dispatch | top `other` rows |
|----|----------|-----|----|-----------|--------------|------------------|
| 512 | `32.97%` | `18.34%` | `0.73%` | `3.41%` | `1.37%` | Marlin MoE `1940.99ms`, FP8 projection `565.74ms` |
| 2048 | `33.48%` | `18.00%` | `1.75%` | `1.06%` | `0.49%` | Marlin MoE `7745.84ms`, FP8 projection `3047.75ms` |
---
### Task 4: Decide Whether a Source Patch Is Funded
- [x] **Step 1: Apply the Phase63 decision gate**
Use these rules:
- Continue to a source patch only if llama.cpp FA or paged-mask-related work is at least `8%` of prefill GPU kernel time at `npp>=2048`, or it accounts for at least `15 us/tok` versus vLLM at the same shape.
- Reject source work if FA is below `5%` of llama.cpp prefill kernel time at `npp=2048`.
- Reject source work if the profile again points primarily at already-rejected GDN, W4A16, MTP, small-M MMQ, or gate-projection buckets.
- If continuing, keep the source target limited to physical mask/block-table indexing for paged FlashAttention and an explicit `FLASH_ATTN_EXT` block-table backend-op test.
Expected: write a short decision paragraph into `GB10_PARITY_PHASE0_RESULTS.md`.
Result: reject source work for Phase63. llama.cpp FA was `0.71%` at `npp=512`
and `1.18%` at `npp=2048`, below the `5%` source-work reject threshold. At
`npp=2048`, llama.cpp FA was `320.66ms` over `65536` prompt tokens, about
`4.9 us/tok`; vLLM FA was `618.02ms` over `196608` prompt tokens, about
`3.1 us/tok`. The FA delta, computed before rounding, is only about
`1.7 us/tok`, below the `15 us/tok` source-funding gate.
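The arithmetic behind this result can be sketched in a few lines of shell. The function names are hypothetical, and the precedence in `fa_gate` (the `5%` reject rule is checked before the funding rules) is one reading of the decision rules, not something the plan states explicitly.

```bash
# us_per_tok KERNEL_MS PROMPT_TOKENS -> kernel microseconds per prompt token
us_per_tok() {
  awk -v ms="$1" -v tok="$2" 'BEGIN { printf "%.1f\n", ms * 1000 / tok }'
}

# fa_delta_us A_MS A_TOK B_MS B_TOK -> per-token delta, from unrounded values
fa_delta_us() {
  awk -v a="$1" -v at="$2" -v b="$3" -v bt="$4" \
    'BEGIN { printf "%.1f\n", a * 1000 / at - b * 1000 / bt }'
}

# fa_gate FA_PERCENT DELTA_US_PER_TOK -> "fund" or "reject"
# Assumed precedence: reject below 5%, otherwise fund at >= 8% or >= 15 us/tok.
fa_gate() {
  awk -v pct="$1" -v delta="$2" 'BEGIN {
    if (pct < 5)                       print "reject"
    else if (pct >= 8 || delta >= 15)  print "fund"
    else                               print "reject"
  }'
}

us_per_tok 320.66 65536                  # llama.cpp FA at npp=2048
us_per_tok 618.02 196608                 # vLLM FA at PT=2048
fa_delta_us 320.66 65536 618.02 196608   # cross-engine delta
fa_gate 1.18 1.7                         # Phase63 inputs
```

Taking the delta from the unrounded per-token costs is why it comes out near `1.7` rather than the `1.8` that subtracting the rounded `4.9` and `3.1` would suggest.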
- [x] **Step 2: If the source gate is negative, skip directly to Task 6**
Expected: no source files modified.
Result: no llama.cpp source files were modified.
---
### Task 5: Optional Source Patch Only If Task 4 Is Positive
Skipped: Task 4 rejected source work.
- [ ] **Step 1: Add the missing block-table FlashAttention backend-op case first**
Modify `/home/mudler/_git/llama.cpp/tests/test-backend-ops.cpp` so `FLASH_ATTN_EXT` has a paged/block-table mask case that fails before any mask-indexing implementation.
Run:
```bash
ssh dgx.casa 'set -euo pipefail
cd /home/mudler/llama-phase6-source
cmake --build build-cuda --target test-backend-ops -j $(nproc)
./build-cuda/bin/test-backend-ops test -b CUDA0 -o FLASH_ATTN_EXT -j 1'
```
Expected before implementation: the new block-table case fails or is skipped with an explicit unsupported path that proves the gap.
- [ ] **Step 2: Implement physical mask indexing behind the existing block-table dispatch**
Modify only the narrow paged-FA files:
- `/home/mudler/_git/llama.cpp/src/paged-attn.cpp`
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn-vec.cuh`
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn-tile.cuh`
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn.cu`
The implementation must remove mask compaction only when a block table is present and the CUDA kernel is using the physical-mask path. Non-paged attention must keep the existing mask layout.
- [ ] **Step 3: Run correctness and inference gates**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
cd /home/mudler/llama-phase6-source
cmake --build build-cuda --target llama-completion llama-batched-bench test-backend-ops -j $(nproc)
ART=$(cat /tmp/phase63_artifact_dir)
./build-cuda/bin/test-backend-ops test -b CUDA0 -o FLASH_ATTN_EXT -j 1 | tee "$ART/flash_attn_ext_post.log"
cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged
./scripts/paged-inference-gates.sh /home/mudler/llama-phase6-source/build-cuda/bin "$ART/post_patch_gate" | tee "$ART/post_patch_gate.log"'
```
Expected: `FLASH_ATTN_EXT` passes, canonical md5s match, `MUL_MAT` is `1146/1146`, and `MUL_MAT_ID` is `806/806`.
- [ ] **Step 4: Run the A/B performance gate**
Run baseline and patched builds with:
```bash
env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 ./llama-batched-bench \
-m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \
-npp 128 -ntg 128 -npl 128,256
```
Keep only if the patch improves decode `S_TG` by at least `1.0%` at `npl=128` or `npl=256`, or reduces graph-node-traced decode wall by at least `0.5 ms/step`, with no md5/op drift.
---
### Task 6: Post-Gate, Release DGX, and Record Result
- [x] **Step 1: Run canonical post-gate**
Run:
```bash
ssh dgx.casa 'set -euo pipefail
ART=$(cat /tmp/phase63_artifact_dir)
cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged
./scripts/paged-inference-gates.sh /home/mudler/llama-phase6-source/build-cuda/bin "$ART/post_gate" | tee "$ART/post_gate.log"'
```
Expected:
```text
moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0
dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439
1146/1146 tests passed
806/806 tests passed
paged inference gates OK
```
Result:
```text
post moe_md5 8cb0ce23777bf55f92f63d0292c756b0 8cb0ce23777bf55f92f63d0292c756b0 ok
post dense_md5 5951a5b4d624ce891e22ab5fca9bc439 5951a5b4d624ce891e22ab5fca9bc439 ok
post MUL_MAT 1146/1146 1146/1146 ok
post MUL_MAT_ID 806/806 806/806 ok
post paged inference gates OK
```
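The result rows above lend themselves to a mechanical check. The sketch below assumes each gate row has the shape `<stage> <gate> <expected> <actual> <status>`; that column meaning is inferred from the output, and `gate_verdict` is a hypothetical helper rather than part of `paged-inference-gates.sh`.

```bash
# gate_verdict < gate_result.txt   (hypothetical helper)
# Prints "green" when all four canonical gate rows are present, each with
# matching expected/actual values and an "ok" status; prints "drift" otherwise.
gate_verdict() {
  awk '
    # Only the four canonical gates count; summary lines are ignored.
    $2 ~ /^(moe_md5|dense_md5|MUL_MAT|MUL_MAT_ID)$/ {
      rows++
      if (NF != 5 || $3 != $4 || $5 != "ok") bad++
    }
    END {
      if (rows == 4 && bad == 0) print "green"
      else print "drift"
    }
  '
}

# Example: gate_verdict < "$ART/post_gate.log"
```

Requiring exactly four rows means a truncated log reads as `drift` rather than silently passing.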
- [x] **Step 2: Release DGX lock**
Run:
```bash
ssh dgx.casa 'printf "FREE released-by-codex-phase63-prefill-bucket %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
```
Expected: `/tmp/localai-gb10.lock` starts with `FREE released-by-codex-phase63-prefill-bucket`.
Result: `/tmp/localai-gb10.lock` is
`FREE released-by-codex-phase63-prefill-bucket 1782908317`; Docker count `0`,
worker count `0`, and no compute-app rows.
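The lock convention used in Task 1 and here (a missing file or a first line starting with `FREE` means the DGX is available) can be captured in a small helper. This is a sketch under that assumption; `lock_state` and `lock_is_free` are hypothetical names.

```bash
# lock_state FILE -> first line of the lock file, or "FREE-no-lock" if absent
lock_state() {
  if [ -f "$1" ]; then
    head -n 1 "$1"
  else
    echo "FREE-no-lock"
  fi
}

# lock_is_free FILE -> exit status 0 when the recorded state starts with FREE
lock_is_free() {
  case "$(lock_state "$1")" in
    FREE*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Example:
#   lock_is_free /tmp/localai-gb10.lock || { echo "DGX busy" >&2; exit 3; }
```

Checking and then writing the file is not atomic, so two workers starting at the same moment could both see `FREE`; the sketch keeps the plan's convention rather than solving that race.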
- [x] **Step 3: Update LocalAI docs**
Modify:
- `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
Record:
- artifact directory,
- pre/post gate md5s and op counts,
- llama and vLLM bucket table,
- Task 4 decision,
- source patch commit if any, or explicit source-work rejection.
Result: completed in this commit.
- [x] **Step 4: Commit LocalAI tracking docs**
Run:
```bash
cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention
git add -f docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md
git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
git commit -m "docs(paged): record prefill bucket attribution phase" \
-m "Assisted-by: Codex:gpt-5"
```
Expected: commit succeeds without bypassing hooks.
Result: committed as `6fc2cfb54 docs(paged): record prefill bucket attribution
phase`, then amended to mark this final checklist item complete.
---
## Self-Review
- Spec coverage: The plan directly covers the user's inferencing-safety request with pre/post md5 and op gates, uses DGX only after idle preflight, scopes Phase63 as measurement before source work, and limits any source follow-up to a localized FA/mask candidate.
- Placeholder scan: No `TBD`, `TODO`, or undefined test command remains.
- Type/path consistency: Artifact path, gate command, model paths, and binary paths are consistent across tasks.