mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-20 06:39:01 -04:00
A larger physical batch (n_batch/n_ubatch) materially lifts MoE prefill on NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10 / DGX Spark). Measured on a GB10 with Qwen3-Coder-30B-A3B, the prefill ceiling rises (ub512 ~2994 -> ub2048 ~3316 t/s) and saturates around 2048.

The heuristic lives in core/config alongside the other config overriders (ApplyInferenceDefaults, guessDefaultsFromFile/NGPULayers). They all fill the ModelConfig from heuristics, so hardware tuning is the same domain and stays in one place.

It is parameterized on a GPU descriptor (not direct detection) so it works in both deployment shapes:

- Single host: SetDefaults applies it with the LocalGPU.
- Distributed: only the worker sees the GPU, so the worker reports its compute capability on registration (gpu_compute_capability -> BackendNode), and the router re-applies the SAME core/config heuristic for the SELECTED node before loading, fixing the case where the frontend has no GPU at all.

Explicit `batch:` always wins (only managed default values are touched).

xsysinfo gains NVIDIAComputeCapability() (detection only); all interpretation lives in core/config.

Tests: core/config, pkg/xsysinfo, core/services/nodes.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
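The precedence rule described in the commit message (an explicit `batch:` wins, otherwise a Blackwell consumer GPU gets a larger default) can be illustrated with a minimal sketch. This is not the code in core/config: the `GPUDescriptor` type, the `physicalBatchDefault` function, the `"nvidia"` vendor string, and the use of 2048 as the raised value are all assumptions made for illustration, with 2048 chosen only because the message reports saturation around that size.

```go
package main

import "fmt"

// GPUDescriptor is a hypothetical stand-in for the GPU descriptor the
// heuristic is parameterized on; the real type in LocalAI may differ.
type GPUDescriptor struct {
	Vendor       string
	ComputeMajor int
	ComputeMinor int
}

// assumedBlackwellBatch is an assumption taken from the reported
// saturation point, not a value confirmed from the source tree.
const assumedBlackwellBatch = 2048

// physicalBatchDefault (hypothetical) picks a physical batch size.
// An explicit user value (> 0) always wins. Otherwise NVIDIA GPUs with
// compute capability 12.x (sm_120/121) get the larger batch, and
// everything else keeps the caller-supplied fallback.
func physicalBatchDefault(explicit int, gpu *GPUDescriptor, fallback int) int {
	if explicit > 0 {
		return explicit
	}
	if gpu != nil && gpu.Vendor == "nvidia" && gpu.ComputeMajor == 12 {
		return assumedBlackwellBatch
	}
	return fallback
}

func main() {
	gb10 := &GPUDescriptor{Vendor: "nvidia", ComputeMajor: 12, ComputeMinor: 1}
	fmt.Println(physicalBatchDefault(0, gb10, 512))
	fmt.Println(physicalBatchDefault(1024, gb10, 512))
	fmt.Println(physicalBatchDefault(0, nil, 512))
}
```

Passing the descriptor as an argument, instead of detecting the GPU inside the function, is what would let a router call the same logic for a remote worker whose compute capability was reported at registration.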
24 lines
561 B
Go
package xsysinfo

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("parseComputeCap", func() {
	DescribeTable("splits major.minor",
		func(in string, maj, min int) {
			m, n := parseComputeCap(in)
			Expect(m).To(Equal(maj))
			Expect(n).To(Equal(min))
		},
		Entry("GB10 / DGX Spark", "12.1", 12, 1),
		Entry("RTX 50-series", "12.0", 12, 0),
		Entry("Hopper", "9.0", 9, 0),
		Entry("major only", "12", 12, 0),
		Entry("whitespace", " 12.1 ", 12, 1),
		Entry("empty", "", -1, -1),
		Entry("garbage", "abc", -1, -1),
	)
})
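The function under test is not shown on this page. As a reading aid, here is one possible implementation that satisfies every entry in the table above. It is a sketch, not the actual code in pkg/xsysinfo. Its behavior for inputs the table does not cover (for example a trailing dot such as "12.") is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseComputeCap splits a "major.minor" compute capability string.
// Surrounding whitespace is ignored, a missing minor part is read as 0,
// and any unparsable input yields (-1, -1).
func parseComputeCap(s string) (int, int) {
	s = strings.TrimSpace(s)
	if s == "" {
		return -1, -1
	}
	majStr, minStr, hasMinor := strings.Cut(s, ".")
	major, err := strconv.Atoi(majStr)
	if err != nil {
		return -1, -1
	}
	if !hasMinor {
		return major, 0
	}
	// Assumption: a present but invalid minor part (e.g. "12.") is an error.
	minor, err := strconv.Atoi(minStr)
	if err != nil {
		return -1, -1
	}
	return major, minor
}

func main() {
	for _, in := range []string{"12.1", "12.0", "9.0", "12", " 12.1 ", "", "abc"} {
		major, minor := parseComputeCap(in)
		fmt.Printf("%q -> major=%d minor=%d\n", in, major, minor)
	}
}
```

Returning (-1, -1) as a sentinel keeps the signature to two plain ints, which matches how the table entries compare the results. A caller would need to check for a negative major before interpreting the value.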