Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention

# Conflicts:
#	gallery/index.yaml
This commit is contained in:
Ettore Di Giacinto
2026-06-26 21:38:56 +00:00
11 changed files with 330 additions and 50 deletions

View File

@@ -1,4 +1,4 @@
From 944636cf34b486d4035575e48845840368de0743 Mon Sep 17 00:00:00 2001
From fafe8785c8595f53a51efec20cf84f9146437e0c Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 26 Jun 2026 22:58:47 +0200
Subject: [PATCH] feat(paged): qwen35 recurrent-state gather fusion (patch
@@ -46,22 +46,56 @@ MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
LEVER1_GATHER_RESULTS.md | 110 +++++++++++++++++++++++
ggml/include/ggml.h | 20 +++++
ggml/src/ggml-cpu/ops.cpp | 90 ++++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++++-
LEVER1_GATHER_PROGRESS.md | 26 ++++++
LEVER1_GATHER_RESULTS.md | 163 +++++++++++++++++++++++++++++++++
ggml/include/ggml.h | 20 ++++
ggml/src/ggml-cpu/ops.cpp | 90 +++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++-
ggml/src/ggml.c | 62 +++++++++++++
src/models/delta-net-base.cpp | 26 ++++--
tests/test-backend-ops.cpp | 69 +++++++++++++++
7 files changed, 521 insertions(+), 11 deletions(-)
tests/test-backend-ops.cpp | 69 ++++++++++++++
8 files changed, 600 insertions(+), 11 deletions(-)
create mode 100644 LEVER1_GATHER_PROGRESS.md
create mode 100644 LEVER1_GATHER_RESULTS.md
diff --git a/LEVER1_GATHER_PROGRESS.md b/LEVER1_GATHER_PROGRESS.md
new file mode 100644
index 0000000..e4d14b9
--- /dev/null
+++ b/LEVER1_GATHER_PROGRESS.md
@@ -0,0 +1,26 @@
+# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
+
+STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.
+
+## What
+Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
+update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
+0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
+(read path gather -> indexed in-kernel read; values + reduction order unchanged).
+
+## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
+- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
+ MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
+- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+ GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
+- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
+- MoE npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
+- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
+ step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
+
+## Artifacts
+- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
+- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
+- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
diff --git a/LEVER1_GATHER_RESULTS.md b/LEVER1_GATHER_RESULTS.md
new file mode 100644
index 0000000..c78e3c0
index 0000000..afced02
--- /dev/null
+++ b/LEVER1_GATHER_RESULTS.md
@@ -0,0 +1,110 @@
@@ -0,0 +1,163 @@
+# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact)
+
+The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest
@@ -172,6 +206,59 @@ index 0000000..c78e3c0
++ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
+
+Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
+NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
+
+### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
+
+| model | base (0026) | lever1 (0028) | recorded baseline |
+|-------------------|----------------------------------|----------------------------------|----------------------------------|
+| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
+
+test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
+
+dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
+
+| npl | base S_TG | lever1 S_TG | delta | vs vLLM 391 |
+|-----|-----------|-------------|--------|----------------|
+| 32 | 208.56 | 209.39 | +0.40% | - |
+| 128 | 369.95 | 377.83 | +2.13% | 94.6% -> 96.6% |
+
+MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
+
+| npl | base S_TG | lever1 S_TG | delta | vs vLLM 901 |
+|-----|-----------|-------------|--------|----------------|
+| 32 | 456.85 | 459.56 | +0.59% | - |
+| 128 | 763.47 | 777.95 | +1.90% | 84.7% -> 86.3% |
+
+Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
+
+### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
+
+| kernel | base (0026) | lever1 (0028) |
+|---------------------------------|------------------------|----------------------------------------------|
+| k_get_rows_float<float,float> | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms |
+| delta | | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
+| ssm_conv_update(_ids)_f32 | 219.71 ms (update) | 225.75 ms (update_ids, +6 ms) |
+| ssm_conv_gather_nonident_kernel | - | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
+
+The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
+into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
+the -3.13 ms/step throughput delta at npl128.
+
+### Verdict (gather-bench)
+
+Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
+gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
+throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
+MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 2a5cbce..5fa220a 100644
--- a/ggml/include/ggml.h

View File

@@ -1,42 +1,26 @@
# LEVER1_GATHER_PROGRESS.md - gather-build GPU agent checkpoint
# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
Status: **DONE.** Residual k_get_rows fused in-place, bit-exact, both gates pass. Patch 0028.
STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.
## Lever
Fuse the residual `k_get_rows_float` in the GDN decode path (the biggest single kernel vLLM lacks,
~5.2 ms/step MoE per MOE_GAP_VS_VLLM.md). 0019 fused the SSM-state gather; 0021 fused the conv
compute but kept a `build_rs` gather for the conv taps. This patch closes that last gather.
## What
Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
(read path gather -> indexed in-kernel read; values + reduction order unchanged).
## Located (nsys, DGX GB10, MoE q36-35b-a3b-nvfp4, npp128 ntg24 npl128)
The residual is the **conv-state tap gather** in `build_conv_state_fused`
(`src/models/delta-net-base.cpp`): the plain 4-arg `build_rs` -> `ggml_get_rows` of n_embd_r = 24576
floats (= (d_conv-1)*(d_inner + 2*n_group*d_state) = 3*8192) x 128 seqs, once per GDN layer per step.
Decode-window `k_get_rows_float<float,float>` had a BIG cluster of ~720 instances (30 GDN x 24) at
~115 us = ~3.4 ms/step (5.2 ms/step at steady ntg=128). grid (ne10=128, block_num_y=96) confirmed
ne00=24576 == n_embd_r (the SSM n_embd_s=524288 gather is already fused by 0019).
## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
## Built (paged branch f32 default = 0026 hybrid default is f32)
New op `ggml_ssm_conv_update_inplace_ids` (src[4]=ids, op_params[1]=rs_head): reads each seq's prior
taps from cache[ids[s]] in-kernel (identity -> in place from conv_state_dst; non-identity -> disjoint
scratch via ssm_conv_gather_nonident_kernel). Mirrors 0019. Files: ggml.h, ggml.c, ssm-conv.cu,
ggml-cpu/ops.cpp, delta-net-base.cpp, tests/test-backend-ops.cpp. Build EXIT=0.
## GATE - PASS
- test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS OK (new), SSM_CONV_UPDATE OK, SSM_CONV OK,
GATED_DELTA_NET OK, GET_ROWS OK.
- greedy md5 (-temp 0 -seed 1 -n 48) BYTE-IDENTICAL both models:
dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd (== baseline).
- nsys: k_get_rows<float,float> 10174 -> 9454 (720 fewer), 186.3 -> 102.8 ms; conv gathers replaced
by 720 x ~1.1 us no-op gather. MoE npl128 783.9 t/s (step 163.3 ms vs 169.8 @0025), dense 377.3 t/s.
## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
- MoE npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
## Artifacts
- DGX: commit `944636c` on branch `paged`; LEVER1_GATHER_RESULTS.md in llama tree; nsys
`/tmp/kgr_moe.nsys-rep` (before) + `/tmp/kgr_moe_after.nsys-rep` (after).
- LocalAI worktree: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch + LEVER1_GATHER_RESULTS.md.
- BOTH trees committed (-s). NOT pushed.
## Next
Ready for the rigorous same-session A/B decode bench (npl 32/128, dense + MoE, before/after on the
same 0026 base). The kernel-elimination and bit-exactness are proven; the bench quantifies the lift.
Assisted-by: Claude:opus-4.8 [Claude Code]
- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt

View File

@@ -108,3 +108,56 @@ shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path
+ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
Assisted-by: Claude:opus-4.8 [Claude Code]
## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
| model | base (0026) | lever1 (0028) | recorded baseline |
|-------------------|----------------------------------|----------------------------------|----------------------------------|
| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
| npl | base S_TG | lever1 S_TG | delta | vs vLLM 391 |
|-----|-----------|-------------|--------|----------------|
| 32 | 208.56 | 209.39 | +0.40% | - |
| 128 | 369.95 | 377.83 | +2.13% | 94.6% -> 96.6% |
MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
| npl | base S_TG | lever1 S_TG | delta | vs vLLM 901 |
|-----|-----------|-------------|--------|----------------|
| 32 | 456.85 | 459.56 | +0.59% | - |
| 128 | 763.47 | 777.95 | +1.90% | 84.7% -> 86.3% |
Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
| kernel | base (0026) | lever1 (0028) |
|---------------------------------|------------------------|----------------------------------------------|
| k_get_rows_float<float,float> | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms |
| delta | | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
| ssm_conv_update(_ids)_f32 | 219.71 ms (update) | 225.75 ms (update_ids, +6 ms) |
| ssm_conv_gather_nonident_kernel | - | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
the -3.13 ms/step throughput delta at npl128.
### Verdict (gather-bench)
Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.

View File

@@ -16,7 +16,15 @@ cp -rfv $CURDIR/run.sh $CURDIR/package/
cp -rfLv $CURDIR/sources/go-piper/piper-phonemize/pi/lib/* $CURDIR/package/lib/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
if [ "$(uname)" = "Darwin" ]; then
# macOS has no glibc loader to bundle. The piper binary links its bundled
# libs (libucd, libespeak-ng, libpiper_phonemize, libonnxruntime) via
# @rpath but ships with no LC_RPATH, so dyld aborts at launch with
# "Library not loaded: @rpath/libucd.dylib ... no LC_RPATH's found".
# Add an @loader_path/lib rpath so @rpath resolves to package/lib/.
echo "Detected macOS; adding @loader_path/lib rpath so bundled libs resolve via @rpath..."
install_name_tool -add_rpath @loader_path/lib "$CURDIR/package/piper"
elif [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so

View File

@@ -4,7 +4,12 @@ set -ex
CURDIR=$(dirname "$(realpath "$0")")
export ESPEAK_NG_DATA="$CURDIR"/espeak-ng-data
export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
if [ "$(uname)" = "Darwin" ]; then
export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
else
export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
fi
# If there is a lib/ld.so, use it
if [ -f "$CURDIR"/lib/ld.so ]; then

View File

@@ -15,7 +15,14 @@ cp -avf $CURDIR/run.sh $CURDIR/package/
cp -rfLv $CURDIR/backend-assets/lib/* $CURDIR/package/lib/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
if [ "$(uname)" = "Darwin" ]; then
# macOS has no glibc loader to bundle. silero-vad links its bundled
# libonnxruntime via @rpath but ships with no LC_RPATH, so dyld can't find
# it at runtime. Add an @loader_path/lib rpath so @rpath resolves to
# package/lib/ (matching the piper darwin fix, #10525).
echo "Detected macOS; adding @loader_path/lib rpath so bundled libs resolve via @rpath..."
install_name_tool -add_rpath @loader_path/lib "$CURDIR/package/silero-vad"
elif [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so

View File

@@ -3,7 +3,11 @@ set -ex
CURDIR=$(dirname "$(realpath "$0")")
export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
if [ "$(uname)" = "Darwin" ]; then
export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
else
export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
fi
# If there is a lib/ld.so, use it
if [ -f "$CURDIR"/lib/ld.so ]; then

View File

@@ -60,7 +60,10 @@ func GetNodeEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
return func(c echo.Context) error {
ctx := c.Request().Context()
id := c.Param("id")
node, err := registry.Get(ctx, id)
// GetWithExtras (not Get) so the response carries the node's labels,
// loaded-model count, and in-flight total — the bare BackendNode keeps
// labels in a separate table, leaving the detail view's label list empty.
node, err := registry.GetWithExtras(ctx, id)
if err != nil {
return c.JSON(http.StatusNotFound, nodeError(http.StatusNotFound, "node not found"))
}

View File

@@ -673,6 +673,49 @@ func (r *NodeRegistry) Get(ctx context.Context, nodeID string) (*BackendNode, er
return &node, nil
}
// GetWithExtras returns a single node enriched with the same computed fields as
// ListWithExtras (labels, loaded-model count, in-flight total). The plain Get
// returns a bare BackendNode whose Labels live in a separate table, so the node
// detail view needs this to show a node's existing labels and live counts.
func (r *NodeRegistry) GetWithExtras(ctx context.Context, nodeID string) (*NodeWithExtras, error) {
node, err := r.Get(ctx, nodeID)
if err != nil {
return nil, err
}
labels := make(map[string]string)
nodeLabels, err := r.GetNodeLabels(ctx, nodeID)
if err != nil {
xlog.Warn("GetWithExtras: failed to get labels", "node", nodeID, "error", err)
} else {
for _, l := range nodeLabels {
labels[l.Key] = l.Value
}
}
var modelCount int64
if err := r.db.WithContext(ctx).Model(&NodeModel{}).
Where("node_id = ? AND state = ?", nodeID, "loaded").
Count(&modelCount).Error; err != nil {
xlog.Warn("GetWithExtras: failed to get model count", "node", nodeID, "error", err)
}
var inFlight struct{ Total int }
if err := r.db.WithContext(ctx).Model(&NodeModel{}).
Select("COALESCE(SUM(in_flight), 0) as total").
Where("node_id = ? AND state IN ?", nodeID, []string{"loaded", "unloading"}).
Scan(&inFlight).Error; err != nil {
xlog.Warn("GetWithExtras: failed to get in-flight count", "node", nodeID, "error", err)
}
return &NodeWithExtras{
BackendNode: *node,
ModelCount: int(modelCount),
InFlightCount: inFlight.Total,
Labels: labels,
}, nil
}
// GetByName returns a single node by name.
func (r *NodeRegistry) GetByName(ctx context.Context, name string) (*BackendNode, error) {
var node BackendNode

View File

@@ -646,6 +646,38 @@ var _ = Describe("NodeRegistry", func() {
})
})
Describe("GetWithExtras", func() {
It("returns the node enriched with its labels map", func() {
node := makeNode("extras-node", "10.0.0.80:50051", 8_000_000_000)
Expect(registry.Register(context.Background(), node, true)).To(Succeed())
Expect(registry.SetNodeLabel(context.Background(), node.ID, "env", "prod")).To(Succeed())
Expect(registry.SetNodeLabel(context.Background(), node.ID, "region", "us-east")).To(Succeed())
got, err := registry.GetWithExtras(context.Background(), node.ID)
Expect(err).ToNot(HaveOccurred())
Expect(got).ToNot(BeNil())
Expect(got.ID).To(Equal(node.ID))
Expect(got.Name).To(Equal("extras-node"))
Expect(got.Labels).To(Equal(map[string]string{"env": "prod", "region": "us-east"}))
})
It("returns an empty (non-nil) labels map when the node has none", func() {
node := makeNode("extras-no-labels", "10.0.0.81:50051", 8_000_000_000)
Expect(registry.Register(context.Background(), node, true)).To(Succeed())
got, err := registry.GetWithExtras(context.Background(), node.ID)
Expect(err).ToNot(HaveOccurred())
Expect(got).ToNot(BeNil())
Expect(got.Labels).ToNot(BeNil())
Expect(got.Labels).To(BeEmpty())
})
It("returns an error for an unknown node", func() {
_, err := registry.GetWithExtras(context.Background(), "does-not-exist")
Expect(err).To(HaveOccurred())
})
})
Describe("FindNodesBySelector", func() {
It("returns nodes matching all labels in selector", func() {
n1 := makeNode("sel-match", "10.0.0.80:50051", 8_000_000_000)

View File

@@ -209,6 +209,60 @@
- filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF/Qwen3.6-35B-A3B-NVFP4-MTP-TURBO.gguf
sha256: f3d2fdc74e3ef19925ccbf794b04d7f6f11fb12eba7722b7749219d0cc5c36ed
uri: https://huggingface.co/michaelw9999/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-NVFP4-MTP-TURBO.gguf
- name: "ornith-1.0-9b"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/deepreinforce-ai/Ornith-1.0-9B-GGUF
description: |
[](https://deep-reinforce.com/ornith.html)
# Ornith-1.0-9B-GGUF
Aloha! 🌺 Today, we are releasing Ornith-1.0, a self-improving family of open-source models for agentic coding.
Highlights:
- **State-of-the-Art Coding Agents**: Available in 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE (post-trained on top of Gemma 4 and Qwen 3.5), achieving state-of-the-art performance among open-source models of comparable size on coding benchmarks such as Terminal-Bench 2.1, SWE-Bench, NL2Repo and OpenClaw.
- **Self-Improving Training Framework**:  Ornith-1.0 employs RL to learn to generate not only solution rollouts, but also the scallfold that drive those rollouts. By jointly optimizing the scaffold and the resulting solution, the model discovers better search trajectories and generates higher-quality solutions.
- **Licence**: MIT licensed, globally accessible, and free from regional limitations.
## Ornith 1.0 9B
This model card documents **Ornith-1.0-9B**, the most lightweight member of the Ornith family, designed for efficient single-GPU deployment.
### Benchmarks
Ornith-1.0-9B
Qwen3.5-9B
Qwen3.5-35B
Gemma4-12B
Gemma4-31B
Agentic Coding
...
license: "mit"
tags:
- llm
- gguf
overrides:
backend: llama-cpp
function:
automatic_tool_parsing_fallback: true
grammar:
disable: true
known_usecases:
- chat
options:
- use_jinja:true
parameters:
model: llama-cpp/models/Ornith-1.0-9B-GGUF/ornith-1.0-9b-Q4_K_M.gguf
template:
use_tokenizer_template: true
files:
- filename: llama-cpp/models/Ornith-1.0-9B-GGUF/ornith-1.0-9b-Q4_K_M.gguf
sha256: 5720d1f671b4996481274fffe01868c3c36e87c135cc8538471cc7bd6087b106
uri: https://huggingface.co/deepreinforce-ai/Ornith-1.0-9B-GGUF/resolve/main/ornith-1.0-9b-Q4_K_M.gguf
- name: "ornith-1.0-35b"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls: