diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 2ab55514d..bb32b97fa 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -29,6 +29,46 @@ No baseline runs have been started yet.
 HEAD.
 - Tree hash after patch application: `a73d759350277532a14e853e1fe78f08bbb74ce8`
 
+## Existing Artifact Gap Review
+
+Read-only DGX artifact inspection was performed after confirming the machine was
+idle: `docker ps` returned no running containers,
+`nvidia-smi --query-compute-apps` returned no compute-app rows, and
+`~/gpu_bench_lock/owner` read
+`FREE released-by-claude-fp4norm-profile 1782828229`.
+
+Existing paged llama.cpp decode and prefill numbers are supported by
+`/home/mudler/bench/COMBINED_DEFINITIVE.txt`: MoE paged prefill lines 13-18,
+MoE paged serving decode lines 23-26, dense paged prefill lines 43-48, and
+dense paged serving decode lines 53-56. Supporting comparison artifacts are
+`/home/mudler/bench/STOCK3WAY.txt`, `/home/mudler/bench/PREFILL_KNOB.txt`,
+`/home/mudler/bench/DEFINITIVE_S3ab.txt`, and the adjacent raw logs.
+
+No self-contained vLLM `1078 t/s` GPU-steady `ntg16`/`ntg64`
+difference-method artifact was found. The available vLLM evidence is
+serving-run output in `/home/mudler/bench/COMBINED_DEFINITIVE.txt` plus
+nsys/run artifacts under `/home/mudler/bench/profgap/` and
+`/home/mudler/bench/postssm_decomp/`; these do not form a packaged
+`ntg16`/`ntg64` difference-method report.
+
+W4A16/Marlin evidence exists in `/home/mudler/bench/vllm_prefix.log`,
+`/home/mudler/bench/profgap/vllm_moe_decode.run.log`, and
+`/home/mudler/bench/marlin_gate/kl_marlin.log`.
+`/home/mudler/llama-paged-dev/LEVER3_ACTQUANT_FUSION_RESULTS.md` records the
+parity conclusion: W4A16/Marlin is a precision-change lever, not a bit-exact
+llama.cpp parity lever.
+
+GDN M5/M8 evidence exists in `/home/mudler/bench/COMBINED_DEFINITIVE.txt`
+(`GDN CONFIG C (M8)` and production defaults noting GDN M5),
+`/home/mudler/llama-paged-dev/LEVER1_GATHER_RESULTS.md`, and
+`/home/mudler/llama-paged-dev/CONV_STATE_FUSION_RESULTS.md`.
+
+S3 evidence exists in `/home/mudler/bench/DEFINITIVE_S3ab.txt`; that A/B shows
+S3-on was worse unless paired with `LLAMA_PAGED_PREFILL_PERIOD=1`, matching
+`/home/mudler/bench/COMBINED_DEFINITIVE.txt` where S3 is recorded as off by
+default. No separate self-contained adaptive-scheduling proof artifact was
+found beyond the S3 and prefill-knob artifacts.
+
 ## Open Items
 
 - Reproduce paged prefill and decode baselines.