mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 09:26:55 -04:00
The hardware-tuned defaults from #10411 were measured on a GB10 / DGX Spark (128 GiB unified memory) and over-provisioned multi-GPU consumer Blackwell (e.g. 2x16 GiB RTX 50-series) into CUDA OOM during model init:

- The Blackwell physical batch (512 -> 2048) sets both n_batch and n_ubatch. The compute buffer scales ~n_ubatch * n_ctx and is allocated PER DEVICE (it can't be split across GPUs), so a large context turns ub2048 into multi-GiB of scratch that must fit one 16 GiB card.
- The VRAM-scaled parallel-slot default tiered off TotalAvailableVRAM(), which SUMS all GPUs (2x16 -> "32 GiB" -> 8 slots), but the allocations are per-device.

Make both decisions per-device and context-aware:

- xsysinfo.MinPerGPUVRAM() reports the smallest device's VRAM; localGPU() uses it so the parallel tier and batch guard reason about one card.
- PhysicalBatchForContext(gpu, ctx) raises the batch only when the extra compute buffer fits VRAM/4 at this model's context (16 GiB crosses over ~174k ctx, 32 GiB ~349k; GB10 reports system RAM so it still clears it).
- Apply hardware defaults AFTER runBackendHooks in SetDefaults so the GGUF-guessed context is resolved before the batch decision.
- The distributed router gates the node batch the same way.

Unified-memory devices (GB10, Apple) report system RAM as their single device's VRAM, so they keep the prefill win.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
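The VRAM/4 guard described in the commit message can be illustrated with a small, hedged sketch. It is not the repository's PhysicalBatchForContext: the function name `sketchBatchForContext`, its signature, and the zero-VRAM fallback are hypothetical. The per-token cost of 24 KiB is an assumption back-derived from the quoted crossovers (4 GiB / 24 KiB is about 174,762 tokens, 8 GiB / 24 KiB is about 349,525), not a value taken from the source.

```go
package main

import "fmt"

const (
	gib          = uint64(1) << 30
	defaultBatch = 512  // physical batch before the Blackwell bump
	raisedBatch  = 2048 // physical batch after the Blackwell bump

	// ASSUMPTION: extra compute-buffer bytes per context token when going
	// from 512 to 2048. Back-derived from the "~174k at 16 GiB" and
	// "~349k at 32 GiB" crossovers; the real constant may differ.
	extraBytesPerCtxToken = uint64(24) << 10
)

// sketchBatchForContext is a hypothetical stand-in for the guard: it raises
// the batch only if the extra compute buffer fits in a quarter of the
// smallest device's VRAM at the given context length.
func sketchBatchForContext(perGPUVRAM, ctx uint64) int {
	if perGPUVRAM == 0 {
		// ASSUMPTION: unknown VRAM keeps the conservative default.
		return defaultBatch
	}
	extra := ctx * extraBytesPerCtxToken
	if extra > perGPUVRAM/4 {
		return defaultBatch
	}
	return raisedBatch
}

func main() {
	fmt.Println(sketchBatchForContext(16*gib, 8192))      // 2048
	fmt.Println(sketchBatchForContext(16*gib, 200000))    // 512
	fmt.Println(sketchBatchForContext(32*gib, 200000))    // 2048
	fmt.Println(sketchBatchForContext(128*gib, 1000000))  // 2048
}
```

Under these assumptions a 16 GiB card keeps the larger batch for ordinary context sizes and only falls back past roughly 174k tokens, while a unified-memory device reporting 128 GiB clears the guard even at a one-million-token context.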
38 lines
1.0 KiB
Go
package xsysinfo

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("minNonZeroVRAM", func() {
	const gib = uint64(1) << 30

	It("returns the smallest device on a multi-GPU host", func() {
		// Two unequal cards (e.g. RTX 5070 Ti + 5060 Ti, both 16 GiB, or a
		// mixed pair): the smallest device is the per-card allocation ceiling.
		infos := []GPUMemoryInfo{
			{TotalVRAM: 16 * gib},
			{TotalVRAM: 12 * gib},
		}
		Expect(minNonZeroVRAM(infos)).To(Equal(12 * gib))
	})

	It("ignores devices that report zero VRAM", func() {
		infos := []GPUMemoryInfo{
			{TotalVRAM: 0},
			{TotalVRAM: 24 * gib},
		}
		Expect(minNonZeroVRAM(infos)).To(Equal(24 * gib))
	})

	It("returns the single device's VRAM on a one-GPU host", func() {
		Expect(minNonZeroVRAM([]GPUMemoryInfo{{TotalVRAM: 16 * gib}})).To(Equal(16 * gib))
	})

	It("returns 0 when no device reports VRAM", func() {
		Expect(minNonZeroVRAM(nil)).To(BeZero())
		Expect(minNonZeroVRAM([]GPUMemoryInfo{{TotalVRAM: 0}})).To(BeZero())
	})
})
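For reference, a minimal sketch of a function that would satisfy the four cases in the spec above. The actual minNonZeroVRAM in xsysinfo is not shown here, so this is an assumption about its shape; GPUMemoryInfo is reduced to the single field the tests touch, and the real struct presumably carries more.

```go
package main

import "fmt"

// GPUMemoryInfo is a reduced stand-in (assumption): only TotalVRAM, in bytes.
type GPUMemoryInfo struct {
	TotalVRAM uint64
}

// minNonZeroVRAM returns the smallest non-zero TotalVRAM across devices,
// or 0 when the slice is empty or every device reports zero.
func minNonZeroVRAM(infos []GPUMemoryInfo) uint64 {
	var smallest uint64
	for _, info := range infos {
		if info.TotalVRAM == 0 {
			continue // skip devices whose VRAM could not be detected
		}
		if smallest == 0 || info.TotalVRAM < smallest {
			smallest = info.TotalVRAM
		}
	}
	return smallest
}

func main() {
	const gib = uint64(1) << 30
	mixed := []GPUMemoryInfo{{TotalVRAM: 16 * gib}, {TotalVRAM: 12 * gib}}
	withZero := []GPUMemoryInfo{{TotalVRAM: 0}, {TotalVRAM: 24 * gib}}

	fmt.Println(minNonZeroVRAM(mixed) / gib)    // 12
	fmt.Println(minNonZeroVRAM(withZero) / gib) // 24
	fmt.Println(minNonZeroVRAM(nil))            // 0
}
```

Treating zero as "unknown" rather than as the minimum keeps a device with undetected memory from collapsing the per-card ceiling to zero, which is the behavior the second test case pins down.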