LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto 40ee9cdd13 docs(paged): evaluate llama.cpp PR #17004 (GPU/backend sampling) on GB10

PR #17004 is merged and already present in our pinned llama.cpp f3e1828.
Measured on DGX Spark (GB10, sm_121, Qwen3-32B-Q4_K_M):

- llama-batched-bench does no sampling (random tokens), so it cannot test
  the fix; its ~540 t/s plateau is not sampling-bound.
- Real-sampling A/B via llama-batched (CPU vs -bs GPU sampler): +25% at
  np=32, +3% at np=64, GGML_ASSERT(obj_new) graph-alloc crash at np>=128.
- nsys at np=64: GPU-busy time and kernel mix unchanged (392 vs 404 t/s);
  sampling kernels negligible. GPU utilization did not rise.
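
As a quick sanity check on the np=64 figures above, the relative change
between the two nsys throughput readings can be computed directly. This is
a minimal sketch: the function name is illustrative, and mapping 392 t/s to
the CPU sampler and 404 t/s to the GPU sampler is an assumption based on
the order in which the numbers are listed.

```python
def relative_gain_percent(baseline_tps: float, candidate_tps: float) -> float:
    """Return the throughput change of candidate over baseline, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


# Assumed mapping: 392 t/s = CPU sampler, 404 t/s = GPU (-bs) sampler.
gain = relative_gain_percent(392.0, 404.0)
print(f"np=64 gain: {gain:+.1f}%")  # prints "np=64 gain: +3.1%"
```

Under that assumption the result agrees with the +3% reported for np=64 in
the A/B run.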

Clean negative result: the fix does not lift throughput off the ~540 t/s
plateau toward the ~2700 t/s ceiling or past vLLM's 667 t/s, and it is
unusable at the multi-user parallelism in question (crash at np>=128).

Adoption: code arrives via LLAMA_VERSION bump (prepare.sh vendors the
modified upstream server-context.cpp), but grpc-server must set
params.sampling.backend_sampling to enable it; grammar/tool-call/logprobs
requests fall back to CPU. Defer adoption until #18547/#18550 stabilise it.
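
The fallback rule above can be sketched as a per-request decision. This is
only an illustration of the logic, written in Python, and not the
grpc-server code: the `Request` fields and `use_backend_sampling` are
hypothetical names, and where the decision is actually made (llama.cpp or
grpc-server) is not stated here. The real switch is the
params.sampling.backend_sampling field named above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request model; field names are illustrative only.
    grammar: str = ""
    has_tool_calls: bool = False
    logprobs: int = 0


def use_backend_sampling(req: Request, enabled: bool) -> bool:
    """Decide whether a request may use the GPU/backend sampler.

    Follows the rule in the text: grammar, tool-call and logprobs
    requests fall back to CPU sampling even when the feature is on.
    """
    if not enabled:
        return False
    needs_cpu = bool(req.grammar) or req.has_tool_calls or req.logprobs > 0
    return not needs_cpu
```

A plain completion request would take the backend path only when the
feature is enabled; any request carrying a grammar, tool calls, or
logprobs would stay on the CPU sampler.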

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 15:44:21 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378)

2026-06-18 00:32:13 +02:00

grpc

fix: speed up git submodule update with --single-branch (#2847)

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395)

2026-06-19 00:03:37 +02:00

llama-cpp

docs(paged): evaluate llama.cpp PR #17004 (GPU/backend sampling) on GB10