Commit Graph

5030 Commits

Parth Sareen
ee25219edd cmd: claude launch improvements (#14064) 2026-02-03 19:33:58 -08:00
Jeffrey Morgan
b1fccabb34 Revert "Update vendored llama.cpp to b7847" (#14061) 2026-02-03 18:39:36 -08:00
Bruce MacDonald
a6355329bf cmd: open browser on ollama signin when available (#14055)
When a browser is available, open it to the connect URL automatically when running the `ollama signin` command. The browser is not opened in any other unauthorized scenario.
2026-02-03 16:42:09 -08:00
Parth Sareen
0398b24b42 cmd: launch defaults (#14035) v0.15.5-rc1 2026-02-02 23:19:11 -08:00
Parth Sareen
75b1dddf91 cmd: launch extra params (#14039) 2026-02-03 02:03:33 -05:00
Parth Sareen
e1e80ffc3e cmd/config: move config location (#14034) 2026-02-02 22:48:51 -05:00
Aleksandr Vukmirovich
71896485fd anthropic: add InputTokens to streaming response (#13934)

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2026-02-02 18:29:37 -08:00
Jeffrey Morgan
ef00199fb4 Update vendor ggml code to a5bb8ba4 (#13832)
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2026-02-02 17:31:59 -08:00
Jeffrey Morgan
8f4a008139 Add GLM-OCR vision model support (#14024) v0.15.5-rc0 2026-02-02 15:39:18 -08:00
Patrick Devine
d8cc798c2b glm 4.7 flash support on experimental engine (#13838) 2026-02-02 15:22:11 -08:00
Richard Lyons
6582f6da5c llm: Make "do load request" error message more informative 2026-02-02 11:13:21 -08:00
Jesse Gross
0334ffa625 server: use tiered VRAM-based default context length
Replace binary low VRAM mode with tiered VRAM thresholds that set
default context lengths for all models:

- < 24 GiB VRAM: 4,096 context
- 24-48 GiB VRAM: 32,768 context
- >= 48 GiB VRAM: 262,144 context
2026-02-02 10:47:09 -08:00
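The tiering amounts to a threshold lookup; a minimal Go sketch of the idea, using the thresholds above (names are illustrative, not the actual server code):

```go
package main

import "fmt"

// defaultContextLength picks a default context length from available VRAM,
// using the tiers listed in the commit message. Illustrative sketch only.
func defaultContextLength(vramBytes uint64) int {
	const GiB = 1 << 30
	switch {
	case vramBytes < 24*GiB:
		return 4096
	case vramBytes < 48*GiB:
		return 32768
	default:
		return 262144
	}
}

func main() {
	fmt.Println(defaultContextLength(16 << 30)) // 4096
	fmt.Println(defaultContextLength(32 << 30)) // 32768
	fmt.Println(defaultContextLength(80 << 30)) // 262144
}
```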
Jesse Gross
d11fbd2c60 server: fix ollama ps showing configured instead of actual context length
When context length is clamped to the model's trained context length,
ollama ps now shows the actual clamped value instead of the originally
configured value.
2026-02-02 10:47:09 -08:00
Jeffrey Morgan
6a7c3f188e openclaw: run onboarding for fresh installs (#14006)
When launching OpenClaw without prior onboarding, run the onboarding
wizard instead of going straight to gateway. This ensures proper
gateway configuration (mode, token, etc.) before first use.

- Add onboarded() to check for wizard.lastRunAt marker in config
- Run onboard with --auth-choice skip --gateway-token ollama for fresh installs
- Existing installs (onboarding completed) run gateway directly
v0.15.4
2026-02-01 13:46:45 -08:00
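A rough Go sketch of the fresh-install check; only the wizard.lastRunAt marker comes from the commit, while the config path and JSON shape are assumptions for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// onboarded reports whether the onboarding wizard has completed by looking
// for a wizard.lastRunAt marker in the config. Path and layout are assumed.
func onboarded(configPath string) bool {
	data, err := os.ReadFile(configPath)
	if err != nil {
		return false // no config yet: treat as a fresh install
	}
	var cfg struct {
		Wizard struct {
			LastRunAt string `json:"lastRunAt"`
		} `json:"wizard"`
	}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return false
	}
	return cfg.Wizard.LastRunAt != ""
}

func main() {
	if onboarded("openclaw.json") { // hypothetical config filename
		fmt.Println("run gateway directly")
	} else {
		fmt.Println("run onboard --auth-choice skip --gateway-token ollama")
	}
}
```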
Jeffrey Morgan
427e2c962a docs: add redirect from clawdbot to openclaw (#14004) 2026-01-31 20:50:42 -08:00
Thanh Nguyen
27db7f806f cmd/config: rename integration to openclaw (#13979)

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
v0.15.3
2026-01-31 18:31:13 -05:00
Dhiraj Lochib
3590fbfa76 runner: fix typo 'baackend' -> 'backend' in error messages (#13645)
Fix typo in three error messages where 'baackend' was written instead
of 'backend' in the /health endpoint handler when initializing the
dummy model load.
2026-01-31 13:26:20 -08:00
noureldin-azzab
cd0094f772 added stakpak to web & desktop (#13961) 2026-01-31 13:04:34 -08:00
Louis Beaumont
06bc8e6712 docs: add Screenpipe to Community Integrations (#13906)
Screenpipe is a 24/7 screen & mic recording tool that uses Ollama
for local LLM-powered search and AI features. 16k+ GitHub stars.
2026-01-31 12:49:52 -08:00
frob
fc5f9bb448 docs: remove unsupported quantizations (#13982) 2026-01-31 12:46:20 -08:00
frob
a0740f7ef7 docs: add GB10 to supported devices (#13987) 2026-01-31 12:45:27 -08:00
Parth Sareen
a0923cbdd0 cmd: ollama launch add placeholder text for selector (#13966) 2026-01-29 09:48:49 -08:00
Seokrin Taron Sung
f92e362b2e cmd: capitalize Ollama in serve command help text (#13965) 2026-01-29 09:47:53 -08:00
Tincho
aa23d8ecd2 docs: update installation command for OpenCode CLI (#13971) 2026-01-29 09:47:02 -08:00
Gabe Goodhart
7b62c41060 cmd/config: use envconfig.Host() for base API in launch config packages (#13937) 2026-01-27 13:30:00 -08:00
Parth Sareen
26acab64b7 docs: add clawdbot (#13925) 2026-01-26 18:32:54 -08:00
Gyungrai Wang
e0f03790b1 parsers/ministral: fix nested tool call parsing by counting brace nesting (#13905)
* parsers/ministral: fix nested tool call parsing by counting brace nesting

* fix lint error

* parsers: refactor ministral parser

The old one was tightly tied to seeing only one token at a time, an
assumption I'd rather not make (who knows what the future might hold
with respect to speculative decoding, etc.). This new one follows a
similar structure to qwen3-coder's parser, which incidentally makes it
easier to test as well (since we can test the individual events that
come out when given particular inputs).


Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
2026-01-26 15:03:43 -08:00
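Brace counting for tool calls has one classic pitfall, braces inside JSON strings; a hedged Go sketch of the counting idea (the actual parser is event-based, per the message above):

```go
package main

import "fmt"

// toolCallEnd returns the index just past the brace that closes the first
// JSON object in s, counting nesting and ignoring braces inside strings.
// A sketch of the brace-counting idea only.
func toolCallEnd(s string) (int, bool) {
	depth, inString, escaped := 0, false, false
	for i, r := range s {
		switch {
		case escaped:
			escaped = false
		case inString:
			if r == '\\' {
				escaped = true
			} else if r == '"' {
				inString = false
			}
		case r == '"':
			inString = true
		case r == '{':
			depth++
		case r == '}':
			depth--
			if depth == 0 {
				return i + 1, true
			}
		}
	}
	return 0, false // incomplete: wait for more tokens
}

func main() {
	end, ok := toolCallEnd(`{"name": "get", "arguments": {"q": "{}"}} trailing`)
	fmt.Println(end, ok) // 41 true: nested and quoted braces handled
}
```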
Parth Sareen
3ab842b0f5 cmd: clawdbot config fixes (#13922) v0.15.2 2026-01-26 14:34:29 -08:00
Parth Sareen
b8e8ef8929 cmd: ollama launch clawdbot (#13921) 2026-01-26 13:40:59 -08:00
Parth Sareen
465d124183 cmd: fix opencode config (#13894) v0.15.1 2026-01-24 18:42:56 -08:00
Parth Sareen
d310e56fa3 cmd: add fallback for claude (#13892) 2026-01-24 18:26:01 -08:00
Jeffrey Morgan
a1ca428c90 glm4moelite: fix attention scale calculation (#13893)
Use the original key dimension (qkNopeHeadDim + qkRopeHeadDim = 256) for
the attention scale instead of the MLA absorbed dimension (kvLoraRank +
qkRopeHeadDim = 576).

MLA absorption is a mathematically equivalent reorganization of the
attention computation - it should not change the effective attention
scale. The scale should match training, which uses 1/sqrt(256).

This improves tool calling and model looping issues.
2026-01-24 17:48:09 -08:00
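A worked check of the two scales, using the dimensions given in the message:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Scale must match training: 1/sqrt(d_k) over the original key dimension.
	fmt.Println(1 / math.Sqrt(256)) // qkNopeHeadDim+qkRopeHeadDim = 256 -> 0.0625
	// Using the MLA-absorbed dimension under-scales attention:
	fmt.Println(1 / math.Sqrt(576)) // kvLoraRank+qkRopeHeadDim = 576 -> ~0.0417
}
```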
Jeffrey Morgan
16750865d1 glm4moelite: quantize more tensors to q8_0 and avoid double BOS token (#13891) v0.15.1-rc1 2026-01-24 16:33:54 -08:00
Jeffrey Morgan
f3b476c592 build: add -O3 optimization to CGO flags (#13877)
CGO_CFLAGS and CGO_CXXFLAGS were being set without optimization flags,
which overrides Go's default -O2 and results in unoptimized C++ code.

This caused significant performance degradation in release builds
compared to local `go build` which uses the default optimization.

- build_darwin.sh: add -O3 to CGO_CFLAGS and CGO_CXXFLAGS exports
- Dockerfile: preserve CGO_CFLAGS/CGO_CXXFLAGS from build args instead
  of overwriting them
- app/README.md: update documentation to include -O3
v0.15.1-rc0
2026-01-24 10:55:38 -08:00
Parth Sareen
5267d31d56 docs: ollama launch (#13852) 2026-01-23 23:18:50 -08:00
Stillhart
b44f56319f README: Update the "Ollama for ruby" link to the most popular and maintained Ruby gem. (#13855)
* update README ruby link

The ollama-ai Ruby gem is vastly less popular and appears unmaintained:
https://rubygems.org/gems/ollama-ai

The de facto standard with the most downloads in the Ruby ecosystem is ruby_llm:
https://rubygems.org/gems/ruby_llm

I would link to that to avoid complication and to guarantee feature compatibility with Ollama.

* Update gem link ruby_llm from website to GitHub

Ollama links mostly to GitHub rather than to project websites, hence the link to the ruby_llm GitHub repository.
2026-01-24 01:24:52 -05:00
Jeffrey Morgan
0209c268bb llama: fix CUDA MMA errors in release build (#13874) v0.15.0 v0.15.0-rc6 2026-01-23 20:10:04 -08:00
Jeffrey Morgan
912d984346 llama: fix fattn-tile shared memory overflow on sm_50/52 (#13872)
Use nthreads=128 for ncols=4 configurations in flash attention tile
kernel to reduce shared memory usage below 48KB limit on Maxwell
architectures (sm_50/52).

With nthreads=256 and ncols=4, np=2 which caused shared memory to
exceed 48KB. With nthreads=128 and ncols=4, np=1 keeps shared memory
under the limit.
v0.15.0-rc5
2026-01-23 19:22:32 -08:00
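The two configurations in the message are consistent with np = nthreads / (ncols * warp_size) for a 32-thread warp; this relation is inferred from the message's data points, not taken from the kernel source:

```go
package main

import "fmt"

func main() {
	const warpSize = 32
	// Inferred relation; the kernel's real formula may differ.
	np := func(nthreads, ncols int) int { return nthreads / (ncols * warpSize) }
	fmt.Println(np(256, 4)) // 2 -> shared memory overflows the 48KB sm_50/52 limit
	fmt.Println(np(128, 4)) // 1 -> stays under the limit
}
```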
Parth Sareen
aae6ecbaff cmd: rename ollama config to ollama launch (#13871) v0.15.0-rc4 2026-01-23 18:40:40 -08:00
Jeffrey Morgan
64737330a4 Re-apply "model: add MLA absorption for glm4moelite" with fix (#13870)
The nvidia_fp32 config for (576, 512) head sizes had nbatch_fa=32,
which caused zero-sized arrays when computing array dimensions:
  nbatch_fa / (np * warp_size) = 32 / (2 * 32) = 0

This resulted in CUDA compilation failures on CUDA 12 (Windows and
Linux arm64):
- "static assertion failed with nbatch_fa % (np*warp_size) != 0"
- "the size of an array must be greater than zero"

Fix by changing nbatch_fa from 32 to 64 for all (576, 512) configs
in the nvidia_fp32 function, matching the nvidia_fp16 and AMD configs.
2026-01-23 18:40:28 -08:00
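The failing arithmetic from the message, spelled out:

```go
package main

import "fmt"

func main() {
	const (
		np       = 2
		warpSize = 32
	)
	// Integer division shows how nbatch_fa=32 produced a zero-sized array:
	fmt.Println(32 / (np * warpSize)) // 0 -> "size of an array must be greater than zero"
	fmt.Println(64 / (np * warpSize)) // 1 -> valid, matching the fp16 and AMD configs
}
```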
Jeffrey Morgan
2eda97f1c3 Revert "model: add MLA absorption for glm4moelite (#13810)" (#13869)
This reverts commit 1044b0419a.
v0.15.0-rc3
2026-01-23 17:14:15 -08:00
Jeffrey Morgan
66831dcf70 x/imagegen: fix image editing support (#13866)
- Fix panic in ollama show for image gen models (safe type assertion)
- Add vision capability for Flux2KleinPipeline models at create time
- Flatten transparent PNG images onto white background for better results
v0.15.0-rc2
2026-01-23 15:37:17 -08:00
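The panic fix is Go's comma-ok type assertion; a self-contained sketch with illustrative types (not the actual ollama show code):

```go
package main

import "fmt"

type textConfig struct{ ContextLength int }

func show(v any) {
	// cfg := v.(*textConfig) would panic when v holds an image gen model's
	// info; the comma-ok form degrades gracefully instead.
	cfg, ok := v.(*textConfig)
	if !ok {
		fmt.Println("no text config (e.g. an image gen model)")
		return
	}
	fmt.Println("context length:", cfg.ContextLength)
}

func main() {
	show(&textConfig{ContextLength: 4096})
	show("imagegen model info") // a plain assertion would have panicked here
}
```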
Jeffrey Morgan
1044b0419a model: add MLA absorption for glm4moelite (#13810)
* model: add MLA absorption for glm4moelite

Split the combined KV_B tensor into separate K_B and V_B tensors
during conversion, enabling MLA (Multi-head Latent Attention)
absorption which compresses the KV cache for improved efficiency.

* ggml: enable MLA flash attention for GLM-4.7-flash

Add support for gqa_ratio 4 in MLA flash attention kernels. GLM-4.7-flash
uses head size 576 with gqa_ratio 4, which was previously only supported
for gqa_ratio 16 (DeepSeek).

Metal changes:
- Enable head size 576 for flash attention
- Increase simdgroups to 8 for large heads (>=512)
- Add case 8 kernel dispatch for 8 simdgroups

CUDA changes:
- Add gqa_ratio 4 support for head 576/512
- Add tile configs for (576, 512, 4) and (576, 512, 8)
- Add MMA config cases for ncols 4
- Add template instances for ncols2=4

* model: add compatibility validation for glm4moelite architecture
2026-01-23 14:47:42 -08:00
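Conceptually, the conversion-time split slices a fused per-head projection into its K and V parts; a rough sketch assuming a row-major fused layout (the converter's real tensor layout may differ):

```go
package main

import "fmt"

// splitKVB slices a fused KV_B weight into separate K_B and V_B tensors.
// Assumed layout for illustration: kvb is row-major
// [kvLoraRank][nHeads*(nopeDim+vDim)], with K before V within each head.
func splitKVB(kvb []float32, kvLoraRank, nHeads, nopeDim, vDim int) (kb, vb []float32) {
	stride := nHeads * (nopeDim + vDim)
	for r := 0; r < kvLoraRank; r++ {
		row := kvb[r*stride : (r+1)*stride]
		for h := 0; h < nHeads; h++ {
			head := row[h*(nopeDim+vDim) : (h+1)*(nopeDim+vDim)]
			kb = append(kb, head[:nopeDim]...)
			vb = append(vb, head[nopeDim:]...)
		}
	}
	return kb, vb
}

func main() {
	kb, vb := splitKVB(make([]float32, 2*3*(4+2)), 2, 3, 4, 2)
	fmt.Println(len(kb), len(vb)) // 24 12
}
```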
Parth Sareen
771d9280ec cmd: ollama config fix droid model name configuration (#13856) 2026-01-23 11:44:22 -08:00
Jeffrey Morgan
862bc0a3bf x/imagegen: respect stream=false in /api/generate (#13853)
When stream=false is set for image generation requests, return a single
JSON response instead of streaming multiple ndjson progress updates.
2026-01-22 22:16:39 -08:00
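For example (model name hypothetical; the endpoint and stream field are Ollama's standard /api/generate interface):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// With "stream": false the server returns one JSON object instead of
	// an ndjson stream of progress updates.
	body := []byte(`{"model": "my-imagegen-model", "prompt": "a red bicycle", "stream": false}`)
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // a single JSON response
}
```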
Jeffrey Morgan
c01608b6a1 x/imagegen: add image edit capabilities (#13846) v0.15.0-rc1 2026-01-22 20:35:08 -08:00
Parth Sareen
199c41e16e cmd: ollama config command to help configure integrations to use Ollama (#13712) 2026-01-22 20:17:11 -08:00
Jeffrey Morgan
3b3bf6c217 x/imagegen: replace memory estimation with actual weight size (#13848)
Remove static VRAM estimation (EstimateVRAM, CheckMemoryRequirements)
which wasn't helpful. Instead, report the actual tensor weight size
from the manifest for ollama ps.

- Remove memory estimation check from runner startup
- Remove EstimateVRAM, CheckMemoryRequirements, modelVRAMEstimates
- Add TotalTensorSize() to get actual weight size from manifest
- Use weight size for Server.vramSize instead of estimates

Note: This is better than showing 0 or inaccurate estimates, but the
weight size is a drastic underestimation of actual memory usage since
it doesn't account for activations, intermediate tensors, or MLX
overhead. Future work should query real-time memory from MLX
(e.g., MetalGetActiveMemory) for accurate reporting.
2026-01-22 18:32:41 -08:00
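A simplified sketch of the weight-size idea; the manifest types and media-type filter here are assumptions for illustration, not Ollama's actual manifest code:

```go
package main

import (
	"fmt"
	"strings"
)

// Layer is a simplified stand-in for a manifest layer.
type Layer struct {
	MediaType string
	Size      int64
}

// totalTensorSize sums the sizes of tensor-weight layers, mirroring the
// idea behind the TotalTensorSize() described in the commit.
func totalTensorSize(layers []Layer) int64 {
	var total int64
	for _, l := range layers {
		if strings.Contains(l.MediaType, "tensor") { // hypothetical filter
			total += l.Size
		}
	}
	return total
}

func main() {
	fmt.Println(totalTensorSize([]Layer{
		{MediaType: "application/vnd.ollama.image.tensor", Size: 6 << 30}, // hypothetical media type
		{MediaType: "application/vnd.ollama.image.template", Size: 512},
	})) // weight bytes only: still underestimates runtime memory
}
```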
Parth Sareen
f52c21f457 fix: handle Enter key pressed during model loading (#13839) 2026-01-22 18:32:02 -08:00
Jeffrey Morgan
b5d0f72f16 x/imagegen: remove qwen_image and qwen_image_edit models (#13827)
Remove the Qwen image generation and image editing model packages
to clean up the codebase. These models will be reintroduced later.

- Delete x/imagegen/models/qwen_image/ (10 files)
- Delete x/imagegen/models/qwen_image_edit/ (5 files)
- Remove related CLI flags and imports from cmd/engine/main.go
- Update comments in cache/step.go to remove Qwen-specific references
v0.15.0-rc0
2026-01-21 13:37:08 -08:00