Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention

# Conflicts:
#	gallery/index.yaml
2026-06-27 09:57:14 -04:00 · 2026-06-26 21:38:56 +00:00
parent 6dd8a3d895 56600eec3e
commit c1f1d1e8ea
11 changed files with 330 additions and 50 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch
@@ -1,4 +1,4 @@
-From 944636cf34b486d4035575e48845840368de0743 Mon Sep 17 00:00:00 2001
+From fafe8785c8595f53a51efec20cf84f9146437e0c Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Fri, 26 Jun 2026 22:58:47 +0200
 Subject: [PATCH] feat(paged): qwen35 recurrent-state gather fusion (patch
@@ -46,22 +46,56 @@ MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s.
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
- LEVER1_GATHER_RESULTS.md       | 110 +++++++++++++++++++++++
- ggml/include/ggml.h            |  20 +++++
- ggml/src/ggml-cpu/ops.cpp      |  90 ++++++++++++++++++-
- ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++++-
+ LEVER1_GATHER_PROGRESS.md      |  26 ++++++
+ LEVER1_GATHER_RESULTS.md       | 163 +++++++++++++++++++++++++++++++++
+ ggml/include/ggml.h            |  20 ++++
+ ggml/src/ggml-cpu/ops.cpp      |  90 +++++++++++++++++-
+ ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++-
 ggml/src/ggml.c                |  62 +++++++++++++
 src/models/delta-net-base.cpp  |  26 ++++--
- tests/test-backend-ops.cpp     |  69 +++++++++++++++
- 7 files changed, 521 insertions(+), 11 deletions(-)
+ tests/test-backend-ops.cpp     |  69 ++++++++++++++
+ 8 files changed, 600 insertions(+), 11 deletions(-)
+ create mode 100644 LEVER1_GATHER_PROGRESS.md
 create mode 100644 LEVER1_GATHER_RESULTS.md

+diff --git a/LEVER1_GATHER_PROGRESS.md b/LEVER1_GATHER_PROGRESS.md
+new file mode 100644
+index 0000000..e4d14b9
+--- /dev/null
+++ b/LEVER1_GATHER_PROGRESS.md
+@@ -0,0 +1,26 @@
+# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
+
+STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.
+
+## What
+Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
+update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
+0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
+(read path gather -> indexed in-kernel read; values + reduction order unchanged).
+
+## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
+- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
+  MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
+- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+  GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
+- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
+- MoE   npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
+- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
+  step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
+
+## Artifacts
+- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
+- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
+- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
 diff --git a/LEVER1_GATHER_RESULTS.md b/LEVER1_GATHER_RESULTS.md
 new file mode 100644
-index 0000000..c78e3c0
+index 0000000..afced02
 --- /dev/null
 +++ b/LEVER1_GATHER_RESULTS.md
-@@ -0,0 +1,110 @@
+@@ -0,0 +1,163 @@
 +# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact)
 +
 +The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest
@@ -172,6 +206,59 @@ index 0000000..c78e3c0
 ++ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
 +
 +Assisted-by: Claude:opus-4.8 [Claude Code]
+
+## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
+
+Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
+NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
+
+### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
+
+| model             | base (0026)                      | lever1 (0028)                    | recorded baseline                |
+|-------------------|----------------------------------|----------------------------------|----------------------------------|
+| q36-27b-nvfp4     | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
+
+test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
+
+dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
+
+| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 391    |
+|-----|-----------|-------------|--------|----------------|
+| 32  | 208.56    | 209.39      | +0.40% | -              |
+| 128 | 369.95    | 377.83      | +2.13% | 94.6% -> 96.6% |
+
+MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
+
+| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 901    |
+|-----|-----------|-------------|--------|----------------|
+| 32  | 456.85    | 459.56      | +0.59% | -              |
+| 128 | 763.47    | 777.95      | +1.90% | 84.7% -> 86.3% |
+
+Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
+
+### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
+
+| kernel                          | base (0026)            | lever1 (0028)                                |
+|---------------------------------|------------------------|----------------------------------------------|
+| k_get_rows_float<float,float>   | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms                       |
+| delta                           |                        | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
+| ssm_conv_update(_ids)_f32       | 219.71 ms (update)     | 225.75 ms (update_ids, +6 ms)                |
+| ssm_conv_gather_nonident_kernel | -                      | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
+
+The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
+into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
+the -3.13 ms/step throughput delta at npl128.
+
+### Verdict (gather-bench)
+
+Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
+gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
+throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
+MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
 diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
 index 2a5cbce..5fa220a 100644
 --- a/ggml/include/ggml.h
--- a/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_PROGRESS.md
@@ -1,42 +1,26 @@
-# LEVER1_GATHER_PROGRESS.md - gather-build GPU agent checkpoint
+# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE

-Status: **DONE.** Residual k_get_rows fused in-place, bit-exact, both gates pass. Patch 0028.
+STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.

-## Lever
-Fuse the residual `k_get_rows_float` in the GDN decode path (the biggest single kernel vLLM lacks,
-~5.2 ms/step MoE per MOE_GAP_VS_VLLM.md). 0019 fused the SSM-state gather; 0021 fused the conv
-compute but kept a `build_rs` gather for the conv taps. This patch closes that last gather.
+## What
+Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
+update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
+0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
+(read path gather -> indexed in-kernel read; values + reduction order unchanged).

-## Located (nsys, DGX GB10, MoE q36-35b-a3b-nvfp4, npp128 ntg24 npl128)
-The residual is the **conv-state tap gather** in `build_conv_state_fused`
-(`src/models/delta-net-base.cpp`): the plain 4-arg `build_rs` -> `ggml_get_rows` of n_embd_r = 24576
-floats (= (d_conv-1)*(d_inner + 2*n_group*d_state) = 3*8192) x 128 seqs, once per GDN layer per step.
-Decode-window `k_get_rows_float<float,float>` had a BIG cluster of ~720 instances (30 GDN x 24) at
-~115 us = ~3.4 ms/step (5.2 ms/step at steady ntg=128). grid (ne10=128, block_num_y=96) confirmed
-ne00=24576 == n_embd_r (the SSM n_embd_s=524288 gather is already fused by 0019).
+## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
+- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
+  MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
+- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+  GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.

-## Built (paged branch f32 default = 0026 hybrid default is f32)
-New op `ggml_ssm_conv_update_inplace_ids` (src[4]=ids, op_params[1]=rs_head): reads each seq's prior
-taps from cache[ids[s]] in-kernel (identity -> in place from conv_state_dst; non-identity -> disjoint
-scratch via ssm_conv_gather_nonident_kernel). Mirrors 0019. Files: ggml.h, ggml.c, ssm-conv.cu,
-ggml-cpu/ops.cpp, delta-net-base.cpp, tests/test-backend-ops.cpp. Build EXIT=0.
-
-## GATE - PASS
- test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS OK (new), SSM_CONV_UPDATE OK, SSM_CONV OK,
-  GATED_DELTA_NET OK, GET_ROWS OK.
- greedy md5 (-temp 0 -seed 1 -n 48) BYTE-IDENTICAL both models:
-  dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd (== baseline).
- nsys: k_get_rows<float,float> 10174 -> 9454 (720 fewer), 186.3 -> 102.8 ms; conv gathers replaced
-  by 720 x ~1.1 us no-op gather. MoE npl128 783.9 t/s (step 163.3 ms vs 169.8 @0025), dense 377.3 t/s.
+## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
+- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
+- MoE   npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
+- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
+  step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.

 ## Artifacts
- DGX: commit `944636c` on branch `paged`; LEVER1_GATHER_RESULTS.md in llama tree; nsys
-  `/tmp/kgr_moe.nsys-rep` (before) + `/tmp/kgr_moe_after.nsys-rep` (after).
- LocalAI worktree: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch + LEVER1_GATHER_RESULTS.md.
- BOTH trees committed (-s). NOT pushed.
-
-## Next
-Ready for the rigorous same-session A/B decode bench (npl 32/128, dense + MoE, before/after on the
-same 0026 base). The kernel-elimination and bit-exactness are proven; the bench quantifies the lift.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
+- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
+- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
+- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
--- a/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_RESULTS.md
@@ -108,3 +108,56 @@ shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path
 + 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.

 Assisted-by: Claude:opus-4.8 [Claude Code]
+
+## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
+
+Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
+NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
+
+### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
+
+| model             | base (0026)                      | lever1 (0028)                    | recorded baseline                |
+|-------------------|----------------------------------|----------------------------------|----------------------------------|
+| q36-27b-nvfp4     | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
+
+test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
+
+dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
+
+| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 391    |
+|-----|-----------|-------------|--------|----------------|
+| 32  | 208.56    | 209.39      | +0.40% | -              |
+| 128 | 369.95    | 377.83      | +2.13% | 94.6% -> 96.6% |
+
+MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
+
+| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 901    |
+|-----|-----------|-------------|--------|----------------|
+| 32  | 456.85    | 459.56      | +0.59% | -              |
+| 128 | 763.47    | 777.95      | +1.90% | 84.7% -> 86.3% |
+
+Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
+
+### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
+
+| kernel                          | base (0026)            | lever1 (0028)                                |
+|---------------------------------|------------------------|----------------------------------------------|
+| k_get_rows_float<float,float>   | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms                       |
+| delta                           |                        | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
+| ssm_conv_update(_ids)_f32       | 219.71 ms (update)     | 225.75 ms (update_ids, +6 ms)                |
+| ssm_conv_gather_nonident_kernel | -                      | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
+
+The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
+into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
+the -3.13 ms/step throughput delta at npl128.
+
+### Verdict (gather-bench)
+
+Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
+gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
+throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
+MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
--- a/backend/go/piper/package.sh
+++ b/backend/go/piper/package.sh
@@ -16,7 +16,15 @@ cp -rfv $CURDIR/run.sh $CURDIR/package/
 cp -rfLv $CURDIR/sources/go-piper/piper-phonemize/pi/lib/* $CURDIR/package/lib/

 # Detect architecture and copy appropriate libraries
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+if [ "$(uname)" = "Darwin" ]; then
+    # macOS has no glibc loader to bundle. The piper binary links its bundled
+    # libs (libucd, libespeak-ng, libpiper_phonemize, libonnxruntime) via
+    # @rpath but ships with no LC_RPATH, so dyld aborts at launch with
+    # "Library not loaded: @rpath/libucd.dylib ... no LC_RPATH's found".
+    # Add an @loader_path/lib rpath so @rpath resolves to package/lib/.
+    echo "Detected macOS; adding @loader_path/lib rpath so bundled libs resolve via @rpath..."
+    install_name_tool -add_rpath @loader_path/lib "$CURDIR/package/piper"
+elif [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
    echo "Detected x86_64 architecture, copying x86_64 libraries..."
    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
--- a/backend/go/piper/run.sh
+++ b/backend/go/piper/run.sh
@@ -4,7 +4,12 @@ set -ex
 CURDIR=$(dirname "$(realpath "$0")")

 export ESPEAK_NG_DATA="$CURDIR"/espeak-ng-data
-export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+
+if [ "$(uname)" = "Darwin" ]; then
+	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
+else
+	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+fi

 # If there is a lib/ld.so, use it
 if [ -f "$CURDIR"/lib/ld.so ]; then
--- a/backend/go/silero-vad/package.sh
+++ b/backend/go/silero-vad/package.sh
@@ -15,7 +15,14 @@ cp -avf $CURDIR/run.sh $CURDIR/package/
 cp -rfLv $CURDIR/backend-assets/lib/* $CURDIR/package/lib/

 # Detect architecture and copy appropriate libraries
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+if [ "$(uname)" = "Darwin" ]; then
+    # macOS has no glibc loader to bundle. silero-vad links its bundled
+    # libonnxruntime via @rpath but ships with no LC_RPATH, so dyld can't find
+    # it at runtime. Add an @loader_path/lib rpath so @rpath resolves to
+    # package/lib/ (matching the piper darwin fix, #10525).
+    echo "Detected macOS; adding @loader_path/lib rpath so bundled libs resolve via @rpath..."
+    install_name_tool -add_rpath @loader_path/lib "$CURDIR/package/silero-vad"
+elif [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
    echo "Detected x86_64 architecture, copying x86_64 libraries..."
    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
--- a/backend/go/silero-vad/run.sh
+++ b/backend/go/silero-vad/run.sh
@@ -3,7 +3,11 @@ set -ex

 CURDIR=$(dirname "$(realpath "$0")")

-export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+if [ "$(uname)" = "Darwin" ]; then
+	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
+else
+	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+fi

 # If there is a lib/ld.so, use it
 if [ -f "$CURDIR"/lib/ld.so ]; then
--- a/core/http/endpoints/localai/nodes.go
+++ b/core/http/endpoints/localai/nodes.go
@@ -60,7 +60,10 @@ func GetNodeEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
 	return func(c echo.Context) error {
 		ctx := c.Request().Context()
 		id := c.Param("id")
-		node, err := registry.Get(ctx, id)
+		// GetWithExtras (not Get) so the response carries the node's labels,
+		// loaded-model count, and in-flight total — the bare BackendNode keeps
+		// labels in a separate table, leaving the detail view's label list empty.
+		node, err := registry.GetWithExtras(ctx, id)
 		if err != nil {
 			return c.JSON(http.StatusNotFound, nodeError(http.StatusNotFound, "node not found"))
 		}
--- a/core/services/nodes/registry.go
+++ b/core/services/nodes/registry.go
@@ -673,6 +673,49 @@ func (r *NodeRegistry) Get(ctx context.Context, nodeID string) (*BackendNode, er
 	return &node, nil
 }

+// GetWithExtras returns a single node enriched with the same computed fields as
+// ListWithExtras (labels, loaded-model count, in-flight total). The plain Get
+// returns a bare BackendNode whose Labels live in a separate table, so the node
+// detail view needs this to show a node's existing labels and live counts.
+func (r *NodeRegistry) GetWithExtras(ctx context.Context, nodeID string) (*NodeWithExtras, error) {
+	node, err := r.Get(ctx, nodeID)
+	if err != nil {
+		return nil, err
+	}
+
+	labels := make(map[string]string)
+	nodeLabels, err := r.GetNodeLabels(ctx, nodeID)
+	if err != nil {
+		xlog.Warn("GetWithExtras: failed to get labels", "node", nodeID, "error", err)
+	} else {
+		for _, l := range nodeLabels {
+			labels[l.Key] = l.Value
+		}
+	}
+
+	var modelCount int64
+	if err := r.db.WithContext(ctx).Model(&NodeModel{}).
+		Where("node_id = ? AND state = ?", nodeID, "loaded").
+		Count(&modelCount).Error; err != nil {
+		xlog.Warn("GetWithExtras: failed to get model count", "node", nodeID, "error", err)
+	}
+
+	var inFlight struct{ Total int }
+	if err := r.db.WithContext(ctx).Model(&NodeModel{}).
+		Select("COALESCE(SUM(in_flight), 0) as total").
+		Where("node_id = ? AND state IN ?", nodeID, []string{"loaded", "unloading"}).
+		Scan(&inFlight).Error; err != nil {
+		xlog.Warn("GetWithExtras: failed to get in-flight count", "node", nodeID, "error", err)
+	}
+
+	return &NodeWithExtras{
+		BackendNode:   *node,
+		ModelCount:    int(modelCount),
+		InFlightCount: inFlight.Total,
+		Labels:        labels,
+	}, nil
+}
+
 // GetByName returns a single node by name.
 func (r *NodeRegistry) GetByName(ctx context.Context, name string) (*BackendNode, error) {
 	var node BackendNode
--- a/core/services/nodes/registry_test.go
+++ b/core/services/nodes/registry_test.go
@@ -646,6 +646,38 @@ var _ = Describe("NodeRegistry", func() {
 		})
 	})

+	Describe("GetWithExtras", func() {
+		It("returns the node enriched with its labels map", func() {
+			node := makeNode("extras-node", "10.0.0.80:50051", 8_000_000_000)
+			Expect(registry.Register(context.Background(), node, true)).To(Succeed())
+			Expect(registry.SetNodeLabel(context.Background(), node.ID, "env", "prod")).To(Succeed())
+			Expect(registry.SetNodeLabel(context.Background(), node.ID, "region", "us-east")).To(Succeed())
+
+			got, err := registry.GetWithExtras(context.Background(), node.ID)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(got).ToNot(BeNil())
+			Expect(got.ID).To(Equal(node.ID))
+			Expect(got.Name).To(Equal("extras-node"))
+			Expect(got.Labels).To(Equal(map[string]string{"env": "prod", "region": "us-east"}))
+		})
+
+		It("returns an empty (non-nil) labels map when the node has none", func() {
+			node := makeNode("extras-no-labels", "10.0.0.81:50051", 8_000_000_000)
+			Expect(registry.Register(context.Background(), node, true)).To(Succeed())
+
+			got, err := registry.GetWithExtras(context.Background(), node.ID)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(got).ToNot(BeNil())
+			Expect(got.Labels).ToNot(BeNil())
+			Expect(got.Labels).To(BeEmpty())
+		})
+
+		It("returns an error for an unknown node", func() {
+			_, err := registry.GetWithExtras(context.Background(), "does-not-exist")
+			Expect(err).To(HaveOccurred())
+		})
+	})
+
 	Describe("FindNodesBySelector", func() {
 		It("returns nodes matching all labels in selector", func() {
 			n1 := makeNode("sel-match", "10.0.0.80:50051", 8_000_000_000)
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -209,6 +209,60 @@
    - filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF/Qwen3.6-35B-A3B-NVFP4-MTP-TURBO.gguf
      sha256: f3d2fdc74e3ef19925ccbf794b04d7f6f11fb12eba7722b7749219d0cc5c36ed
      uri: https://huggingface.co/michaelw9999/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-NVFP4-MTP-TURBO.gguf
+- name: "ornith-1.0-9b"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/deepreinforce-ai/Ornith-1.0-9B-GGUF
+  description: |
+    [](https://deep-reinforce.com/ornith.html)
+
+    # Ornith-1.0-9B-GGUF
+
+    Aloha! 🌺 Today, we are releasing Ornith-1.0, a self-improving family of open-source models for agentic coding.
+
+    Highlights:
+
+      - **State-of-the-Art Coding Agents**: Available in 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE (post-trained on top of Gemma 4 and Qwen 3.5), achieving state-of-the-art performance among open-source models of comparable size on coding benchmarks such as Terminal-Bench 2.1, SWE-Bench, NL2Repo and OpenClaw.
+      - **Self-Improving Training Framework**: Ornith-1.0 employs RL to learn to generate not only solution rollouts, but also the scaffold that drives those rollouts. By jointly optimizing the scaffold and the resulting solution, the model discovers better search trajectories and generates higher-quality solutions.
+      - **Licence**: MIT licensed, globally accessible, and free from regional limitations.
+
+    ## Ornith 1.0 9B
+
+    This model card documents **Ornith-1.0-9B**, the most lightweight member of the Ornith family, designed for efficient single-GPU deployment.
+
+    ### Benchmarks
+
+    Ornith-1.0-9B
+    Qwen3.5-9B
+    Qwen3.5-35B
+    Gemma4-12B
+    Gemma4-31B
+
+    Agentic Coding
+
+    ...
+  license: "mit"
+  tags:
+    - llm
+    - gguf
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+    parameters:
+      model: llama-cpp/models/Ornith-1.0-9B-GGUF/ornith-1.0-9b-Q4_K_M.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Ornith-1.0-9B-GGUF/ornith-1.0-9b-Q4_K_M.gguf
+      sha256: 5720d1f671b4996481274fffe01868c3c36e87c135cc8538471cc7bd6087b106
+      uri: https://huggingface.co/deepreinforce-ai/Ornith-1.0-9B-GGUF/resolve/main/ornith-1.0-9b-Q4_K_M.gguf
 - name: "ornith-1.0-35b"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls: