diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md
index a536e041a..e06668d4e 100644
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -164,7 +164,7 @@ swept over serving width `npl` in {8, 32, 64, 128}.
 Plots: [`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png);
 raw data [`final_benchmark.csv`](docs/final_benchmark.csv).
 
-![NVFP4 decode throughput vs concurrency on GB10: llama.cpp standard vs vLLM vs LocalAI's llama.cpp patches](docs/qwen36_decode_overview.png)
+![NVFP4 decode throughput vs concurrency on GB10: llama.cpp standard vs vLLM vs LocalAI's llama.cpp patches, plus the opt-in bf16-tau ceiling](docs/qwen36_decode_overview.png)
 
 > **What was re-measured (2026-06-27).** The three llama columns - **stock**,
 > **patched**, and **patched+bf16-tau** - were all re-measured this session on one
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_decode_overview.png b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_decode_overview.png
index 7a5f2e809..bec4bbd41 100644
Binary files a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_decode_overview.png and b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_decode_overview.png differ
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png
index 0f40032d6..1dd5cf000 100644
Binary files a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png and b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png differ
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png
index d06ca0759..680fd10db 100644
Binary files a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png and b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png differ