Decisive DGX experiment: rebuilt with -DGGML_CUDA_FORCE_CUBLAS (it's a compile
#ifdef, not the runtime env we'd been setting - so prior 'cuBLAS no-op' tests
never engaged it). Real result: cuBLAS is SLOWER than MMQ for dense Q4 (pp2048
690 vs 750) and runs an Ampere cutlass_80_tensorop kernel - CUDA-13 has no sm_121
GEMM, falls back to sm_80. So both MMQ and cuBLAS sit at ~46 TFLOP/s; no library
shortcut to the 213 ceiling on GB10. Confirms a hand-tuned sm_120a kernel is
required. Added the phased W4A16 Marlin-style implementation plan (P0 harness ->
P5 enable) as the committed multi-week build; corrected the cuBLAS note.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Per-user decode is at parity without spec-dec (10.2 vs 11.7, bandwidth-bound).
vLLM's per-user speed = speculative decoding (lossless, target-verified). GB10 is
best-case (bandwidth-bound + idle compute); llama.cpp spec-dec measured 2.9x on
dense Qwen2.5-32B. Qwen3-32B has no native MTP - use Qwen3-1.7B draft or EAGLE3
head. Recommendation: make spec-dec easy for dense >=14B on Blackwell (keeps
Q4_K_M quality, no kernel). Prefill-kernel + continuous-batching are separate
(TTFT / aggregate). Our own DGX run pending (box rebooted, llama-cli hangs).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
MXFP4 dense moves prefill off int8-MMQ onto the FP4-MMA path (existing kernel) for
a free 1.44x - shippable as a Blackwell dense-quant recommendation. But it's ~17%
of the FP4 roofline, so the FP4-MMA kernel is itself untuned: ~4-6x still in the
kernel. Sharpens the target to TUNING the FP4-MMA (serves dense+MoE, only path to
beat vLLM). Marlin-style W4A16 BF16 is the alt to match on the BF16 ceiling.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Key corrections: (1) vLLM 24k is AGGREGATE; single-stream roofline ~3300 t/s
(BF16) / 6600 (FP4). (2) GB10 is 1:1:2 BF16:INT8:FP4 - INT8 == BF16, only FP4 is
2x. (3) Measured: dense int8-MMQ at 21% of ceiling, MoE FP4-MMQ at ~5% - both
EXIST, just untuned for Blackwell. Strategy: to MATCH vLLM, tune MMQ or build a
Marlin-style W4A16 BF16 GEMM (FP4 NOT required); to BEAT, fix the existing FP4
MMA on sm_121 (build/miscompile, not greenfield). Dropped the tcgen05 grouped
GEMM rewrite. Cheap next test: dense MXFP4 quant + existing FP4-MMA.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for
sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on
consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward
pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16
(4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-
apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would
be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer
#2577/#3294, cutlass #3096.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel;
no config/flag shortcut remains.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a
non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both
share the e2m1 block-scaled collective; dense is grouped-with-one-group.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765,
vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at
B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision:
the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just
the MoE grouped GEMM (dense GEMM is the simpler first kernel to land).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37%
prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven).
Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu}
entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always
falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation
phases + parity harness + the dense-path follow-up (#28).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both
phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM
gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix).
Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the
engine's domain (0004 pool + 0005 continuous batching target it). Baseline for
the series to improve on.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by
one aligned index; n_kv compaction; gated so stock stays byte-identical) with
the token-identical gate and the known risks (mask transpose layout, v_trans).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy
generation is token-identical to stock (verified on Qwen3-0.6B at the pin),
branch confirmed firing. Default off. The placement substrate for the gather-read.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the
CPU-verified vLLM-parity block manager) + CMake entry. No behavior change.
Generated against the pinned LLAMA_VERSION; applies clean.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against
the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one
small, independently-buildable patch so the work rebases cleanly across llama.cpp
bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix
caching) + the regen workflow.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/
draft); CUTLASS not a dep; no fork has one; activation-quant gather already
fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months,
maintainers deferring to cuTile). No tractable patch closes the 27x.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
On NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10/DGX Spark) a larger
physical batch (n_ubatch) materially lifts MoE prefill throughput - measured on
a GB10 with Qwen3-30B-A3B to lift the prefill ceiling and saturate at ~2048.
When a model config leaves `batch:` unset, EffectiveBatchSize now picks 2048 on
Blackwell instead of 512; explicit `batch:` always overrides. Detection is a
shared, cached Go helper (xsysinfo.IsNVIDIABlackwell, nvidia-smi compute_cap
>= 12). Logic is isolated in core/backend/hardware_defaults.go and applied at
the common ModelOptions builder, so it covers the C++ llama.cpp backend too.
Measured (GB10, Qwen3-Coder-30B-A3B MXFP4): prefill ub512 2994 -> ub2048 3316
t/s; saturates past 2048. Also recorded in the DGX gap plan: 4-bit quant alone
captures the decode win (Q4_K_M 93.5 >= MXFP4 86.4 t/s), MXFP4's only edge is
prefill via Blackwell FP4 tensor cores.
Tests: hardware_defaults_internal_test.go; existing NBatch specs pinned to the
no-Blackwell branch for determinism.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency
sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8
MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the
lever plan.
Measured wins:
- Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill
plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48),
near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap).
- Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch.
Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8
GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and
the complete paged attention (on-demand alloc + gather-read + continuous
batching + prefix sharing). Honest scope: each is multi-week kernel/systems
work.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Wire paged, non-contiguous fixed-size BLOCK placement into the real
llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate
Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to
the contiguous cache while its KV is physically scattered across permuted
blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via
LLAMA_KV_PAGED_DEBUG, not a silent fallback.
This retires the correctness premise of paged attention IN THE MODEL (not
just at the ggml-op level): attention is invariant to physical KV placement,
because reads use per-cell pos/seq metadata for masking. The patch lives at
patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f).
Scope: storage/placement layer, single sequence. Remaining (P4): the
gather-read compute path (attend only a seq's own blocks) for the throughput
win, and the multi-sequence driver. README updated with repro + status.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Capture verified state (P0 manager parity, P1 ggml write/gather, P2 attention
numerics 7.5e-08, P3 capacity 9.2x + prefix-sharing 11.3x) and the exact
remaining work: wire build_attn_paged into llama-graph.cpp and validate
token-identical generation on Qwen3-0.6B (Gate 0), then win-2 throughput.
Records the integration seams (create_memory, find_slot, get_k/get_v,
build_attn, mask) and the honest caveats (unified cache already shares a
pool; vLLM's classic kernel is deprecated) so the next session starts warm.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Quantify the two multi-tenant wins that are properties of the host-side
block model (vLLM-parity), independent of the in-model compute path:
WIN 1 concurrency capacity @ 512-block budget
contiguous (reserve n_ctx/seq): 4 sequences
paged (on-demand blocks): 37 sequences
--> 9.2x more concurrent sequences
WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix)
prefix-cache OFF: 2176 physical blocks
prefix-cache ON: 192 physical blocks
--> 11.3x less KV memory
WIN 2 (throughput) is deliberately reported as PENDING: it requires the
paged gather-read path wired into llama-graph.cpp (Gate 0) and is not
measurable at the allocation layer. The win-1 baseline is per-sequence
n_ctx reservation (stream mode); llama.cpp's unified cache already shares
one pool, so the honest win there is on-demand sizing + prefix dedup.
Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Retire the central numeric risk from the design: feeding gather-to-scratch
KV (a sequence whose blocks are non-contiguous in the shared pool, [2,1,5])
into ggml's standard attention ops produces correct attention.
Path under test: set_rows write -> get_rows gather (K and V) ->
mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared
against an independent host-computed softmax attention over the same K/V/Q.
Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4).
This proves the paged read path is numerically sound on CPU with no new
ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate
Gate 0 (token-identical greedy generation in a real model).
Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Validate the paged KV read/write path at the ggml-op level, driven by
PagedKVManager:
- write: ggml_set_rows(pool, k_src, slot_mapping) scatter K rows by slot
- read: ggml_get_rows(pool, gather_idx) gather a seq's slots into
contiguous scratch (the tensor an attention kernel consumes)
The test forces a non-contiguous, out-of-order physical block layout
(allocate seqA+seqB, free seqA, reallocate seqC -> blocks [2,1,5]) and
proves gather(write(x)) == x plus cross-sequence isolation in the shared
pool. This de-risks the central question (does slot-addressed paged storage
round-trip correctly through ggml) before the llama-graph integration.
Pool is statically allocated via ggml_backend_alloc_ctx_tensors, mirroring
how llama.cpp allocates its KV cache. CPU backend, no new ggml op.
Built against ggml from the vendored llama.cpp checkout.
Phase 1 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): legible Usage charts - distinct prompt/completion hues + chart a11y
Prompt and completion were the same color (primary at 0.35 opacity), so the
stacked token charts read as one blurry blob. Completion now uses a distinct
data-viz hue (--color-data-3) at full opacity across the time chart, the
per-model distribution bars, and the tooltip. The source-mix chart is no longer
aria-hidden: it exposes role="img" with a label.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): sortable Users table
The admin Users table is now sortable by name, email, provider, role, status,
and created date - clickable headers with an aria-sort state, a direction
caret, and keyboard activation (Enter/Space). Permissions and Actions stay
non-sortable.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): unsaved-changes guard on Settings and Agent create/edit
Add a reusable UnsavedChangesGuard (router useBlocker + beforeunload) that
prompts before navigating away or closing the tab with unsaved edits. Wired to
Settings (existing isDirty) and AgentCreate (snapshot the loaded form, compare;
suppressed while saving so the post-save redirect is not blocked). Adds the
common.unsaved i18n keys.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): sortable Traces tables
Both trace tables are now sortable: the API table by method/path/status and the
backend table by type/time/model/duration, with aria-sort, a direction caret,
and keyboard activation. Sort and the expanded row reset when switching tabs
(the two tables have different columns).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): responsive table reflow (cards on mobile), applied to Users
Dense admin tables sideways-scroll on phones. Add a reusable ResponsiveTable
that mirrors the <thead> labels onto each body cell (data-label) and a
<=640px stylesheet that stacks rows into label/value cards. Wired to both
Users tables; reusable for the other dense tables next.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): roll responsive table reflow to Traces, Models, Manage, Nodes
Apply ResponsiveTable to the remaining dense tables so they stack into
label/value cards on phones instead of scrolling sideways. Harden the
component for these tables: scope label-mirroring and the card CSS to direct
children (nested detail tables render normally), override inline min-width on
mobile, and pass through table/container inline styles. Nested expansion
tables in Nodes/Models/Manage are intentionally left as-is.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): unsaved-changes guard on the Fine-Tuning form
Editing the long fine-tune job form and navigating away silently discarded
everything. Snapshot the assembled getFormConfig() as a baseline, treat the
open form as dirty when it diverges, and reuse UnsavedChangesGuard to prompt
before leaving. The baseline is rebased after a job is submitted so leaving
afterward does not warn.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add Fraunces variable serif + --font-serif token
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): serif display tier + section-heading typography scale
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): un-overload accent — nav rail, stronger focus ring, neutral hover
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): orchestrated page reveal + stagger motion primitives
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): fix dead token refs + dedupe toggle to one primitive
Migrate all .toggle-slider consumers (Users, Chat, AgentChat) to the
canonical BEM toggle primitive and delete the legacy duplicate CSS block.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): route boot fallback through the LoadingSpinner primitive
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): EmptyState primitive with serif title
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Skeleton shimmer primitive
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): PageHeader + SectionHeading editorial primitives
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): StatusPill primitive + time-of-day greeting helper
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Home editorial header + status line (north-star redesign)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Home loaded-models skeleton list, button hierarchy, EmptyState wizard
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): single focus ring (no double-ring) + neutralize stagger delay under reduced motion
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* refactor(ui): all-sans editorial headings + tint-only active nav
Per design review, pivot the heading strategy from hybrid-serif to a
refined grotesk: drop the Fraunces dependency, token, and import; page
titles, the Home greeting, and section/empty-state titles now use Geist
at semibold with the editorial fluid sizing and tight tracking. No serif
anywhere.
Active sidebar item is now a tint-only treatment (accent text + tinted
background); the left accent rail is removed and the shared base
.nav-item.active inset bar is suppressed in the sidebar (as the console
rail already does). Update the design-system e2e specs to assert the
sans display font and the tinted-background active state.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(e2e): add --host flag to ui-test-server
Allow binding the e2e/preview server to an arbitrary address (e.g.
0.0.0.0 to review the UI from another device on the LAN). Defaults to
127.0.0.1 so existing e2e behavior is unchanged.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): declutter Home - discoverable + dismissable API, vertical balance
Home felt overloaded and top-heavy. Three changes from review:
- The API endpoint catalog (12 endpoints) is collapsed by default behind a
"Browse the API" disclosure; only the base URL + copy stay visible, so the
catalog is discoverable without dominating the page.
- The whole connect card is dismissable (x): dismissing unmounts it so the
vertical space is recovered, and the choice is remembered (localStorage).
- .home-page now fills its column and vertically centers its content when
there is slack, so sparse states (no models / card dismissed) read as a
balanced launcher instead of content jammed at the top. Overflow-safe:
tall content flows from the top and scrolls.
Adds connect.browse / connect.hide / connect.dismiss i18n keys to all locales.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): editorial PageHeader with section eyebrow + scroll-to-top on nav
PageHeader now derives its eyebrow from the route's section/console (Build /
Operate / Create) via sectionKeyForPath, so pages get a consistent, meaningful
eyebrow with no per-page wiring (override with the eyebrow prop, suppress with
eyebrow={null}). Settings adopts it as the first consumer.
Also fix a navigation scroll bug: the default layout uses the document as its
scroll container and route changes did not reset it, so navigating the console
rail from a scrolled page landed mid-view. App now scrolls to top on pathname
change.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(ui): adopt PageHeader on agent/media/import/backend pages (batch A)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* refactor(ui): adopt PageHeader on ops/admin/media pages (batch B)
Replace hand-rolled .page-header title blocks with the shared editorial
PageHeader component across 14 pages (Manage, Middleware, Models,
NodeBackendLogs, Nodes, P2P, SkillEdit, Skills, Sound, Traces, TTS, Usage,
Users, VideoGen). Title/subtitle move into PageHeader; header-own action
clusters (Models stats+buttons, Skills search+buttons) move into the actions
slot. Tabs, filters, stat cards, ResourceMonitor and page body stay as
siblings. Eyebrow is left to auto-derive from the route.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(ui): home greeting asserts sans font, not the dropped serif
The greeting render-smoke still asserted Fraunces; update it to assert the
Geist sans display font (and not Fraunces), matching the all-sans direction.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): ThemeToggle i18n + animated icon, drop transition:all
The theme toggle hard-coded its English tooltip; route it through the existing
nav switchToLightMode/switchToDarkMode keys and add an aria-label. The sun/moon
icon now replays a small rotate+fade on theme change (keyed remount; honored by
the global reduced-motion block). Replace the .theme-toggle `transition: all`
with explicit properties.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): canvas drag-to-resize + slide-in, fix hooks order, typed download
Canvas was a fixed pane; make it a workbench:
- Drag the panel's left edge to resize (clamped 360px..75vw), persisted to
localStorage, double-click to reset; hidden and full-width on narrow screens.
- Slide-in/fade on open via canvasSlideIn (honored by reduced-motion).
- Fix a rules-of-hooks bug: the `if (!current) return null` early return sat
above useEffect, so the hook count changed when artifacts emptied. All hooks
now run unconditionally before the guard.
- Downloads use the artifact language's real extension + MIME (a Python
artifact saves as .py, not .txt) via extensionForLanguage.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): per-message code blocks get a language header + copy button
Chat code blocks now render inside a framed block with a header showing the
language and a copy button (delegated handler, copies the block and flips to a
check briefly). Decoration + highlighting run from a MutationObserver scoped to
the messages container, which fires reliably for streamed responses AND for
chats loaded/switched from storage - the prior render-keyed effect missed the
load path (code was left unhighlighted on reload). The observer disconnects
while mutating so it does not retrigger on its own edits.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): image attachments show a thumbnail in the composer
Staged image attachments now preview as a 28px thumbnail (from their data URL)
instead of a bare file icon; other types keep the icon. File names truncate and
the remove button gets an aria-label.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): jump-to-latest pill when scrolled up in chat
When the user scrolls away from the bottom of a conversation, a floating
"Jump to latest" pill appears (sticky, centered above the composer); clicking
it smooth-scrolls to the newest message and re-pins auto-scroll. Resets on
chat switch. Adds the chat.actions.jumpToLatest i18n key to all locales.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): canvas fullscreen toggle + keyboard tab navigation
The canvas header gains a fullscreen toggle (expands the panel to cover the
viewport; resize handle hidden while fullscreen). The artifact tab strip is now
a proper ARIA tablist with roving tabindex and Left/Right arrow-key navigation.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): image result lightbox (zoom, prev/next, download, keyboard)
Generated/history images on the Image page are now clickable, opening a
fullscreen Lightbox with a download button, prev/next navigation, an N/M
counter, and keyboard control (Esc to close, Left/Right to navigate). Adds a
reusable `Lightbox` component (usable later for Video) and the media.image
.actions.view i18n key.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): generation progress with placeholder tiles + elapsed timer
Image generation replaces the bare spinner with a GenerationProgress scaffold:
shimmer placeholder tiles matching the requested count plus a live elapsed-time
readout, so the (often slow) wait feels accountable. Reusable for the other
media generation pages.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): generation progress on Video, TTS, and Sound pages
Reuse GenerationProgress (placeholder tile + elapsed timer) in place of the
bare spinner on the remaining media generation pages, so every slow generation
gives the same accountable feedback.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): agent chat gets per-message code-copy + reliable highlighting
AgentChat now shares Chat's code-block treatment: it runs highlightAll +
enhanceCodeBlocks from a MutationObserver on its messages container (the same
proven path), so agent responses get language headers, copy buttons, and
highlighting that fires for both streamed and loaded messages - closing the
divergence with the main chat without a large refactor.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): Talk voice visualizer
Add a hero frequency-bar visualizer at the top of the Talk page so users get
ambient feedback that they are heard and that the assistant is speaking - the
audit's main Talk gap (the only prior feedback was a small status pill; the
waveform was buried in the dev diagnostics panel).
VoiceVisualizer is self-contained: it builds its own AudioContext + analysers
from the output <audio> stream (speaking) and the mic stream (listening) so it
does not touch the existing WebRTC/diagnostics graph. Bars are status-tinted
(idle/connected/listening/speaking/error) and animate with a gentle idle wave
when not connected. Live mic/output animation is exercised on a real session.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
chore: bump localrecall to include PostgreSQL table-name sanitization fix
Pulls mudler/localrecall#48, which makes sanitizeTableName allowlist valid
identifier characters so collection names containing ':' (e.g. the per-user
"legacy-api-key:<agent>" namespace) no longer break PostgreSQL CREATE TABLE
with "syntax error at or near ':'".
Fixes#10375
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
docs: document the privacy-filter.cpp backend in README and compatibility table
The privacy-filter.cpp backend (#10360) was registered in backend/index.yaml
and referenced from the PII feature docs, but was missing from the backend
catalog surfaces. Add it to the README "Backends built by us" table, the
compatibility table (Utilities & Other, CPU/CUDA 13/Vulkan), and the backend
type list in the backends feature doc.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see
backup/pii-ner-tier-engine-prerebase). Net change:
- privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter
PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan).
TokenClassify moves off the patched llama.cpp path onto this backend.
- PII filter reworked to be NER-centric (encoder/NER detection tier scanning
whole conversations as one document), with a recreated bounded restricted-
regex secret-matching pattern detector tier alongside it (per-model
pii_detection.builtins / .patterns + core/services/routing/piipattern).
- Detection labelled by source (ner vs pattern); backend trace / confidence /
debug observability; analyze/redact exposed as a synchronous API.
- Instance-wide default detector policy + per-usecase default-on; request
filtering extended to completions, embeddings, edits & Ollama.
- React UI: NER-centric PII editor, detector-models table, pattern/builtins
editor, middleware default-policy UI.
- Gallery: privacy-filter-multilingual token-classify model + NER install
filter; token_classify known_usecase; batch sized to context for NER models.
privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13
meta + image entries with a capabilities map) matching its CI matrix jobs,
and an /import-model auto-detect importer (PrivacyFilterImporter, narrow
privacy-filter GGUF detection) replacing the prior pref-only registration.
Reconciled against master's independent evolution:
- Dropped master's PIIPatternOverrides feature (global-pattern runtime
overrides + /api/pii/patterns API + runtime_settings.json persistence). The
per-model NER + pattern-detector design supersedes it; it was built on the
global redactor pattern set this branch replaced.
- Reverted the llama.cpp Score carry-patch (0006-server-task-type-score):
removed the patch and restored master's grpc-server.cpp Score RPC (direct
llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's
model_config validation forbidding score + chat/completion/embeddings on
llama-cpp. token_classify is unaffected (it runs on the privacy-filter
backend, not llama-cpp).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The sidebar collapse toggle silently no-op'd in dev builds. toggleCollapse
ran its side effects (localStorage write + sidebar-collapse dispatch) inside
the setCollapsed updater. StrictMode double-invokes updaters in dev to surface
impurity, and the synchronous dispatch re-entered setState from the
App/Sidebar listeners mid-update, so the toggle never committed. Production
builds don't double-invoke, which is why only the dev server was affected.
Compute next from current state and move the persist + broadcast into the
handler body so the updater is pure.
Also fix the Talk page anchoring to the transcript box on load. The transcript
is its own overflow container, but scrollIntoView bubbles to every scrollable
ancestor including the window, yanking the whole page down on mount. Scroll
the transcript container directly instead.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): restructure sidebar into Create/Recognition/Build tiers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): preserve exact sidebar gating for agent items and fine-tune/quantize
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* i18n(ui): add nav tier + console keys to all locales
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add grouped admin console via pathless layout route
Wrap the existing admin pages in a pathless AdminConsoleLayout route so
they keep their exact flat URLs while gaining a grouped left rail
(Inference / Cluster / Observability / Access / System). Rail item gating
mirrors the sidebar (adminOnly / authOnly / feature + /api/features). The
layout forwards the App-level outlet context (addToast) to the wrapped
pages, which read it via useOutletContext().
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): fold Audio Transform into Studio as a tab
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(ui): update e2e specs for tiered nav + admin console
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): gate embedded Studio transform view on audio_transform feature
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): visual polish + console-ize Build/Recognition tiers
Generalize the one-off admin console into a reusable ConsoleLayout driven by
a shared consoleConfig (single source of truth for the rail, its gating, and
the sidebar entry that opens it — removes the prior rail/sidebar drift).
- Promote Install Models to the top menu next to Home.
- Build and Operate are now console tiers (secondary rail); Create stays inline.
- Fold Recognition (Faces/Voices) into the Build console as a group alongside
Automation and Training so it no longer feels split off.
- Style the console rail as a panel (header, grouped dividers, rounded active
pills) with a hover nudge; sidebar items become inset rounded pills. The rail
slide-in plays only when entering a console, not on item-to-item sub-nav
(which remounts the layout), so switching no longer flashes the menu. All
token-based (light + dark), respects reduced-motion.
- Add a delayed RouteFallback loader so lazy routes no longer flash blank;
scoped inside ConsoleLayout so the rail stays put while the body loads.
- Update e2e specs for the new structure (.console-* classes, console entries).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): persist console layout across sub-nav + add drop-in endpoint section
- Keep the page-transition key stable within a console (derived from the
shared console config) so the ConsoleLayout and its rail persist across
item-to-item navigation instead of remounting — fixes the submenu flash.
Cache /api/features across mounts and play the rail entrance animation only
when actually entering a console.
- Add a "One endpoint, every API" section to Home: leads with LocalAI's own
native API (images, video, realtime voice over WebRTC/WS, depth, object
detection, rerank, audio/TTS, face & voice recognition) plus a Full API
reference link, then the drop-in compatibility layer (OpenAI, Anthropic,
Ollama, OpenAI Responses) with the live copyable base URL. All 7 locales.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): revert Middleware nav label rename (keep Middleware in all locales)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Bring the Backend & Model Compatibility Table up to the full set of
backends published in backend/index.yaml (60+), organized by modality
with per-backend acceleration targets. Add an "Available Backends"
pointer and expand the backend-type list in the backends feature doc.
Update the README backend count to 60+ and add a "Backends built by us"
section listing the native C/C++/GGML engines maintained by the LocalAI
project (parakeet.cpp, voxtral.c, vibevoice.cpp, rf-detr.cpp,
locate-anything.cpp, depth-anything.cpp, LocalVQE, local-store).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
feat(ds4): wire SSD streaming + quality engine options, add 128GB DeepSeek gallery models
The ds4 backend zero-initialized ds4_engine_options and exposed none of the
engine's tunable knobs, so SSD streaming (run a model larger than RAM by
streaming routed MoE experts from the GGUF on SSD) and the quality/perf knobs
were unreachable from LocalAI model YAMLs.
Map ModelOptions.Options onto ds4_engine_options through a declarative table
(kEngineOptSpecs + apply_engine_option) instead of per-field branches: the
struct is fixed C with no reflection, so the field set is enumerated once and a
future knob is a one-line table row. Two fields use ds4's own typed parsers
(GiB budgets, cache-experts count-or-NGB). Bare flags (e.g. "ssd_streaming")
mean true; path-type options (mtp_path, expert_profile_path,
directional_steering_file) resolve relative to the model directory so a gallery
entry can reference a companion file by bare filename. mtp_draft/mtp_margin are
now validated rather than parsed with throwing std::stoi/std::stof.
Add gallery entries for the 128 GB class:
- deepseek-v4-flash-q2-q4 (~91 GB, mixed q2/q4, fits RAM, higher quality)
- deepseek-v4-flash-q4-ssd (~153 GB full 4-bit, runs on 128 GB via SSD streaming)
- deepseek-v4-flash-q2-mtp (~81 GB + MTP speculative draft weights)
- deepseek-v4-pro-q2-ssd (~433 GB Pro, experimental SSD streaming)
SSD streaming is Metal (Darwin) only; the options are inert on CUDA/CPU.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>