PR #17004 is merged and already present in our pinned llama.cpp f3e1828.

Measured on DGX Spark (GB10, sm_121, Qwen3-32B-Q4_K_M):
- llama-batched-bench does no sampling (random tokens), so it cannot test the fix; its ~540 t/s plateau is not sampling-bound.
- Real-sampling A/B via llama-batched (CPU vs -bs GPU sampler): +25% at np=32, +3% at np=64, GGML_ASSERT(obj_new) graph-alloc crash at np>=128.
- nsys at np=64: GPU-busy time and kernel mix unchanged (392 vs 404 t/s); sampling kernels negligible. GPU utilization did not rise.

Clean negative: the fix does not break the plateau toward the ~2700 ceiling or past vLLM 667, and is unusable at the multi-user parallelism in question.

Adoption: code arrives via LLAMA_VERSION bump (prepare.sh vendors the modified upstream server-context.cpp), but grpc-server must set params.sampling.backend_sampling to enable it; grammar/tool-call/logprobs requests fall back to CPU. Defer adoption until #18547/#18550 stabilise it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
PR #17004 (backend / GPU sampling) evaluation on DGX Spark (GB10, sm_121)
Date: 2026-06-21. Hardware: NVIDIA GB10 (DGX Spark, sm_121), CUDA 13.0, cmake 3.28.
Model: Qwen3-32B-Q4_K_M.gguf. LocalAI pin: LLAMA_VERSION=f3e182816421c648188b5eab269853bf1531d950 (2026-06-17).
TL;DR (clean negative)
- PR #17004 is MERGED and is ALREADY present in our pinned llama.cpp `f3e1828`. There is nothing to apply / cherry-pick / patch. The `-bs`/`--backend-sampling` CLI arg, the `llama_set_sampler`/`llama_get_sampled_*` API, and the GPU argsort/top-k/cumsum/softmax kernels are all in the pin.
- The prescribed benchmark cannot test the fix. `llama-batched-bench` does ZERO sampling - it feeds random tokens (`std::rand() % n_vocab`). Its ~540 t/s plateau is therefore not sampling-bound, and enabling backend sampling does nothing to it. The valid tool is `llama-batched` (`examples/batched`), which the PR updated to drive per-sequence sampler chains and which actually exercises `-bs`.
- In a controlled real-sampling A/B (same `llama-batched` harness, CPU vs GPU sampler), GPU sampling gave only +25% at np=32, +3% at np=64, and CRASHED (`GGML_ASSERT(obj_new)`, graph-context alloc) at np=128 and np=256 - exactly the multi-user regime the investigation cares about.
- nsys at np=64: GPU kernel profile and GPU-busy time are essentially identical with and without the fix (CPU 392.5 t/s / GPU 404.2 t/s; total GPU kernel+memop time ~4.05 s in both). Sampling kernels do not even appear among the top GPU contributors. GPU utilization did not rise.
- Conclusion: PR #17004, in the state shipped by our pin, does NOT break the ~540 plateau and does not move decode aggregate toward the ~2700 GPU-bound ceiling or past vLLM's 667. It is modest at low parallelism and unusable (crash) at the high parallelism in question. The PR's own guidance ("recommended `--parallel 1`", "will take time to mature") matches what we measured.
1. What PR #17004 does + state
- Title: "sampling : add support for backend sampling". State: MERGED into `master` (PR head branch `gpu-sampling`). 44 files, +4133/-296. libllama: new `llama_context_params.samplers`/`n_samplers`, `llama_set_sampler`, `llama_get_sampled_*`, `llama_sampler_seq_config`, updated `llama_sampler_i`. Sampler chain can now run inside the compute graph on the backend (GPU) instead of on the CPU after `llama_decode`.
- CUDA: optimized/new `argsort`, `top-k`, `cumsum`, `softmax` kernels; CMake option `-DGGML_CUDA_CUB_3DOT2=ON` (builds a CCCL v3.2 prerelease for faster top-k).
- Tools: new `-bs, --backend-sampling` arg in `common/arg.cpp` (line 1921); server (`server-context.cpp`) per-slot wiring; `examples/batched/batched.cpp` updated.
- Supported backend samplers: `top-k`, `top-p`, `min-p`, `temp` (+ dist). Limitations (from the PR): not compatible with grammar sampling; single output per sequence per batch; no save/load of sampling state; recommended only with `--parallel 1` and CUB_3DOT2. Open follow-ups: #18547 (avoid graph reallocations), #18550 (skip inactive samplers in parallel decode).
- It DOES target the CPU-side per-sequence sampling stall we hypothesised - the mechanism is correct. Maturity is the problem.
Note: the GitHub API reports mergedAt: 2026-01-04, but the PR contains June 2026 upstream-merge commits and the feature is verified present in our 2026-06-17 pin, so treat the date field as a metadata quirk. What matters: the code is in f3e1828.
2/3. Apply + build
No apply needed (already in pin). Built from a clean git worktree at f3e1828 (~/llama-pr17004), to avoid disturbing the existing diffusion build:
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=121 -DLLAMA_MAX_SEQ=256 \
-DGGML_CUDA_CUB_3DOT2=ON -DLLAMA_CURL=OFF
cmake --build build --target llama-batched llama-batched-bench -j20
Build: SUCCESS (CUB_3DOT2=ON FetchContent fetched and compiled despite flaky net; sm_121; LLAMA_MAX_SEQ=256). -bs/--backend-sampling confirmed present in llama-batched --help.
4. Decode aggregate: fix vs baseline vs vLLM
4a. llama-batched-bench (NO sampling - reconfirms the plateau, unaffected by the fix)
-npp 16 -ntg 128 -npl 32,64,128,256 -c 40960 -b 2048 -ub 2048
| npl | S_TG t/s |
|---|---|
| 32 | 241.8 |
| 64 | 395.1 |
| 128 | 542.6 |
| 256 | 567.2 |
Reproduces the ~540 plateau. Because this tool never samples, -bs is irrelevant here - the plateau is decode/host-overhead-bound, not sampling-bound.
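To make the shape of the plateau explicit, the short script below recomputes per-sequence throughput and the gain per doubling of `npl` from the table above. It is only a sketch: the numbers are copied from the table, and `CEILING_TPS` takes the "~2700" ceiling quoted in this note as exactly 2700, which is an assumption.

```python
# S_TG (aggregate decode t/s) copied from the llama-batched-bench table above.
S_TG = {32: 241.8, 64: 395.1, 128: 542.6, 256: 567.2}

# Assumption: the "~2700" GPU-bound ceiling quoted in this note, taken as exactly 2700.
CEILING_TPS = 2700.0


def doubling_gains(s_tg):
    """Relative aggregate-throughput gain for each doubling of npl."""
    npls = sorted(s_tg)
    return {(a, b): s_tg[b] / s_tg[a] - 1.0 for a, b in zip(npls, npls[1:])}


if __name__ == "__main__":
    for npl, tps in sorted(S_TG.items()):
        print(f"npl={npl:3d}  per-seq={tps / npl:5.2f} t/s  "
              f"{tps / CEILING_TPS:5.1%} of assumed ceiling")
    for (a, b), gain in doubling_gains(S_TG).items():
        print(f"npl {a:3d} -> {b:3d}: {gain:+.1%}")
```

With these inputs the last doubling (128 -> 256) adds only about 4.5% aggregate throughput, and the best aggregate figure is about 21% of the assumed ceiling.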
4b. llama-batched real-sampling A/B (CPU sampler vs -bs GPU sampler, identical harness)
-kvu -n 128 -np {32,64,128,256} -c 40960 --seed 1 (samplers: top-k 40 / top-p 0.95 / temp 0.8)
| np | CPU sampling t/s | GPU -bs sampling t/s | delta |
|---|---|---|---|
| 32 | 174.1 | 217.5 | +25% |
| 64 | 390.5 | 403.4 | +3.3% |
| 128 | 497.9 | CRASH GGML_ASSERT(obj_new) ggml.c:1768 | - |
| 256 | 396.7 | CRASH GGML_ASSERT(obj_new) ggml.c:1768 | - |
(llama-batched absolute t/s is lower than batched-bench because it does real sampling plus per-token detokenize/string/stream work; the A/B within this harness isolates the sampler cost.)
Does the fix break the plateau? No. GPU sampling helps only at low parallelism and the gain shrinks as np rises (+25% -> +3%), then the path crashes at np>=128 - i.e. it fails in exactly the multi-user regime where the plateau matters. It does not approach the ~2700 ceiling and does not pass vLLM's 667. The CPU-sampling curve itself peaks at np=128 (498) and drops at np=256 (397), confirming CPU sampling is a scaling wall - but PR #17004 as shipped does not lift it because the GPU path is unstable there.
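The delta column and the CPU-curve drop can be reproduced from the raw figures in the A/B table. The sketch below does only that arithmetic; crashed runs are represented as `None`, which is a convention chosen here for illustration, not something the harness emits.

```python
# Throughput (t/s) copied from the llama-batched A/B table above.
CPU_TPS = {32: 174.1, 64: 390.5, 128: 497.9, 256: 396.7}
GPU_TPS = {32: 217.5, 64: 403.4, 128: None, 256: None}  # None = -bs run aborted


def delta_pct(cpu_tps, gpu_tps):
    """Percent change of the GPU sampler over the CPU sampler, or None if the GPU run crashed."""
    if gpu_tps is None:
        return None
    return (gpu_tps / cpu_tps - 1.0) * 100.0


if __name__ == "__main__":
    for np_ in sorted(CPU_TPS):
        d = delta_pct(CPU_TPS[np_], GPU_TPS[np_])
        shown = "crash" if d is None else f"{d:+.1f}%"
        print(f"np={np_:3d}  cpu={CPU_TPS[np_]:6.1f}  delta={shown}")

    # CPU-sampling curve: where it peaks, and how far it falls at np=256.
    peak_np = max(CPU_TPS, key=CPU_TPS.get)
    drop = (CPU_TPS[256] / CPU_TPS[peak_np] - 1.0) * 100.0
    print(f"CPU peak at np={peak_np}, change at np=256: {drop:+.1f}%")
```

The CPU-only curve loses roughly 20% between np=128 and np=256, which is the scaling wall referred to above.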
5. GPU-utilization mechanism (nsys, np=64, the highest np where -bs survives)
nsys profile -t cuda ... -n 96 -np 64
| mode | decode t/s | total GPU kernel+memop time | top GPU contributors |
|---|---|---|---|
| CPU sampling | 392.5 | ~4.07 s | mul_mat_q (55%+17%), flash_attn (5.7%), mul_mat_vec (2%) |
| GPU -bs | 404.2 | ~4.04 s | identical set; sampling kernels not in top contributors |
GPU-busy time and the kernel mix are essentially unchanged between modes. The argsort/top-k/cumsum/softmax sampling kernels are negligible in the timeline; the only visible difference is that H2D memcpy instances rise from 1,495 to 7,076 (pinned-memory sampler transfers) at roughly unchanged total memcpy time. GPU utilization did not rise. This directly refutes the idea that, at this workload, GPU idle time is dominated by CPU sampler arithmetic: moving the sampler onto the GPU barely changed throughput (+3%) and did not raise GPU occupancy. The ~80% idle measured elsewhere is dominated by something other than the sampler math (host-side batch construction / synchronization / detokenize), which PR #17004 does not address.
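For reference, the three ratios behind that reading of the nsys data can be computed directly from the figures quoted in this section. This is a minimal sketch; the GPU times are the rounded "~4.07 s" and "~4.04 s" values, so the second ratio is approximate.

```python
# Figures quoted in the nsys comparison above (np=64).
CPU_MODE = {"tps": 392.5, "gpu_time_s": 4.07, "h2d_memcpy_count": 1495}
GPU_MODE = {"tps": 404.2, "gpu_time_s": 4.04, "h2d_memcpy_count": 7076}


def pct_change(before, after):
    """Percent change from `before` to `after`."""
    return (after / before - 1.0) * 100.0


throughput_change = pct_change(CPU_MODE["tps"], GPU_MODE["tps"])
gpu_time_change = pct_change(CPU_MODE["gpu_time_s"], GPU_MODE["gpu_time_s"])
memcpy_ratio = GPU_MODE["h2d_memcpy_count"] / CPU_MODE["h2d_memcpy_count"]

if __name__ == "__main__":
    print(f"decode throughput:   {throughput_change:+.1f}%")
    print(f"GPU kernel+memop:    {gpu_time_change:+.1f}%")
    print(f"H2D memcpy instances: x{memcpy_ratio:.1f}")
```

Throughput moves by about 3% and total GPU time by under 1%, while the number of H2D copies grows by a factor of about 4.7.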
(np=256 nsys "with fix" could not be captured: -bs aborts there. Fixing the crash needs the unmerged follow-ups #18547/#18550, not in our pin.)
LocalAI adoption path
The code arrives transparently with a version bump; enabling it is not transparent.
- `backend/cpp/llama-cpp/prepare.sh` copies all of upstream `llama.cpp/tools/server/*` (including the #17004-modified `server-context.cpp`/`server-task.cpp`/`server-common.cpp`) into `tools/grpc-server/`, and `grpc-server.cpp` `#include`s them. So once `LLAMA_VERSION` points at a commit containing #17004 (our pin `f3e1828` already does), the backend-sampling machinery compiles into `grpc-server` automatically. No vendored patch in `patches/` is required for the code.
- The vendored `server-context.cpp` already does the per-slot wiring (around line 1615): `backend_sampling &= task.params.sampling.backend_sampling`, also disabled for speculative decode and for pre-sampling logits (`n_probs>0`), then `llama_set_sampler(ctx_tgt, slot.id, common_sampler_get(slot.smpl))`.
- But it is OFF unless `task.params.sampling.backend_sampling == true`. LocalAI's `grpc-server` builds `params` itself from the gRPC request and never sets this flag (and does not pass the upstream `--backend-sampling` CLI arg). So as-is, LocalAI compiles the feature but never uses it. A small grpc-server change is needed: read a LocalAI model option / env and set `params.sampling.backend_sampling = true` (global or per-request).
- For performant CUDA top-k, add `-DGGML_CUDA_CUB_3DOT2=ON` to the llama-cpp CUDA `CMAKE_ARGS` in the Makefile (optional; a non-CUB fallback exists).
- Caveats that blunt the benefit for LocalAI specifically: grammar-constrained requests (JSON-schema / tool calls - a large share of LocalAI traffic), `logprobs`/`n_probs>0`, and speculative decoding all fall back to CPU sampling by the gating above; and the GPU path crashes at np>=128 in this pin. So even after wiring the flag, the multi-user throughput case would not benefit (and would crash) until the follow-up PRs (#18547/#18550) land and stabilise high-parallelism backend sampling.
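The gating described in the list above can be summarised as a small decision function. The sketch below is a Python model of that behaviour for reasoning about which requests would actually reach the GPU sampler; it is not the upstream C++ code, and `Request`, its field names, `uses_gpu_sampler` and `MAX_TESTED_STABLE_NP` are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Highest np that completed with -bs in the runs above. np=128 crashed;
# values between 64 and 128 were not tested, so this bound is conservative.
MAX_TESTED_STABLE_NP = 64


@dataclass
class Request:
    """Hypothetical view of the request fields that matter for the gating."""
    backend_sampling: bool = False  # stands in for params.sampling.backend_sampling
    has_grammar: bool = False       # JSON-schema / tool-call constrained output
    n_probs: int = 0                # logprobs requested when > 0
    speculative: bool = False       # speculative decoding active


def uses_gpu_sampler(req: Request) -> bool:
    """True only if the flag is set and no CPU-fallback condition applies."""
    return (
        req.backend_sampling
        and not req.has_grammar
        and req.n_probs == 0
        and not req.speculative
    )


def safe_to_enable(req: Request, n_parallel: int) -> bool:
    """Hypothetical policy: also require a parallelism level that survived testing."""
    return uses_gpu_sampler(req) and n_parallel <= MAX_TESTED_STABLE_NP


if __name__ == "__main__":
    print(uses_gpu_sampler(Request()))                                         # flag never set
    print(uses_gpu_sampler(Request(backend_sampling=True)))                    # plain request
    print(uses_gpu_sampler(Request(backend_sampling=True, has_grammar=True)))  # tool call
    print(safe_to_enable(Request(backend_sampling=True), n_parallel=128))      # crash regime
```

Under this model, only plain sampling requests at modest parallelism benefit, which is why wiring the flag alone would not help the multi-user case.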
Recommendation
Do not adopt PR #17004 as the multi-user throughput fix yet. It is already in the tree but is immature at the parallelism that matters (crashes at np>=128, modest gains below). The measured bottleneck at this workload is not the sampler arithmetic (nsys shows GPU-busy unchanged when sampling moves to GPU). Re-evaluate after #18547/#18550 merge into a future pin; revisit the host-side decode/batch-construction overhead as the more likely real lever.